Re: [PATCH v4 1/3] resource: Use list_head to link sibling resource

2018-05-07 Thread kbuild test robot
Hi Baoquan,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc4 next-20180504]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Baoquan-He/resource-Use-list_head-to-link-sibling-resource/20180507-144345
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   arch/powerpc/kernel/pci-common.c: In function 'pci_process_bridge_OF_ranges':
>> arch/powerpc/kernel/pci-common.c:764:44: error: incompatible types when 
>> assigning to type 'struct list_head' from type 'void *'
   res->parent = res->child = res->sibling = NULL;
   ^
   arch/powerpc/kernel/pci-common.c: In function 'reparent_resources':
>> arch/powerpc/kernel/pci-common.c:1100:10: error: assignment from 
>> incompatible pointer type [-Werror=incompatible-pointer-types]
 for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
 ^
   arch/powerpc/kernel/pci-common.c:1100:50: error: assignment from 
incompatible pointer type [-Werror=incompatible-pointer-types]
 for (pp = &parent->child; (p = *pp) != NULL; pp = &p->sibling) {
 ^
>> arch/powerpc/kernel/pci-common.c:1113:13: error: incompatible types when 
>> assigning to type 'struct list_head' from type 'struct resource *'
 res->child = *firstpp;
^
   arch/powerpc/kernel/pci-common.c:1114:15: error: incompatible types when 
assigning to type 'struct list_head' from type 'struct resource *'
 res->sibling = *pp;
  ^
>> arch/powerpc/kernel/pci-common.c:1117:9: error: incompatible types when 
>> assigning to type 'struct resource *' from type 'struct list_head'
 for (p = res->child; p != NULL; p = p->sibling) {
^
   arch/powerpc/kernel/pci-common.c:1117:36: error: incompatible types when 
assigning to type 'struct resource *' from type 'struct list_head'
 for (p = res->child; p != NULL; p = p->sibling) {
   ^
   cc1: all warnings being treated as errors

vim +764 arch/powerpc/kernel/pci-common.c

13dccb9e Benjamin Herrenschmidt 2007-12-11  642  
13dccb9e Benjamin Herrenschmidt 2007-12-11  643  /**
13dccb9e Benjamin Herrenschmidt 2007-12-11  644   * 
pci_process_bridge_OF_ranges - Parse PCI bridge resources from device tree
13dccb9e Benjamin Herrenschmidt 2007-12-11  645   * @hose: newly allocated 
pci_controller to be setup
13dccb9e Benjamin Herrenschmidt 2007-12-11  646   * @dev: device node of the 
host bridge
13dccb9e Benjamin Herrenschmidt 2007-12-11  647   * @primary: set if primary 
bus (32 bits only, soon to be deprecated)
13dccb9e Benjamin Herrenschmidt 2007-12-11  648   *
13dccb9e Benjamin Herrenschmidt 2007-12-11  649   * This function will parse 
the "ranges" property of a PCI host bridge device
13dccb9e Benjamin Herrenschmidt 2007-12-11  650   * node and setup the resource 
mapping of a pci controller based on its
13dccb9e Benjamin Herrenschmidt 2007-12-11  651   * content.
13dccb9e Benjamin Herrenschmidt 2007-12-11  652   *
13dccb9e Benjamin Herrenschmidt 2007-12-11  653   * Life would be boring if it 
wasn't for a few issues that we have to deal
13dccb9e Benjamin Herrenschmidt 2007-12-11  654   * with here:
13dccb9e Benjamin Herrenschmidt 2007-12-11  655   *
13dccb9e Benjamin Herrenschmidt 2007-12-11  656   *   - We can only cope with 
one IO space range and up to 3 Memory space
13dccb9e Benjamin Herrenschmidt 2007-12-11  657   * ranges. However, some 
machines (thanks Apple !) tend to split their
13dccb9e Benjamin Herrenschmidt 2007-12-11  658   * space into lots of 
small contiguous ranges. So we have to coalesce.
13dccb9e Benjamin Herrenschmidt 2007-12-11  659   *
13dccb9e Benjamin Herrenschmidt 2007-12-11  660   *   - Some busses have IO 
space not starting at 0, which causes trouble with
13dccb9e Benjamin Herrenschmidt 2007-12-11  661   * the way we do our IO 
resource renumbering. The code somewhat deals with
13dccb9e Benjamin Herrenschmidt 2007-12-11  662   * it for 64 bits but I 
would expect problems on 32 bits.
13dccb9e Benjamin Herrenschmidt 2007-12-11  663   *
13dccb9e Benjamin Herrenschmidt 2007-12-11  664   *   - Some 32 bits platforms 
such as 4xx can have physical space larger than
13dccb9e Benj

Re: [PATCH v4 1/3] resource: Use list_head to link sibling resource

2018-05-07 Thread kbuild test robot
Hi Baoquan,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc4 next-20180504]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Baoquan-He/resource-Use-list_head-to-link-sibling-resource/20180507-144345
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   arch/arm/plat-samsung/pm-check.c: In function 's3c_pm_run_res':
>> arch/arm/plat-samsung/pm-check.c:49:18: error: invalid operands to binary != 
>> (have 'struct list_head' and 'void *')
  if (ptr->child != NULL)
  ~~ ^~
>> arch/arm/plat-samsung/pm-check.c:50:19: error: incompatible type for 
>> argument 1 of 's3c_pm_run_res'
   s3c_pm_run_res(ptr->child, fn, arg);
  ^~~
   arch/arm/plat-samsung/pm-check.c:46:13: note: expected 'struct resource *' 
but argument is of type 'struct list_head'
static void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
^~
>> arch/arm/plat-samsung/pm-check.c:60:7: error: incompatible types when 
>> assigning to type 'struct resource *' from type 'struct list_head'
  ptr = ptr->sibling;
  ^

vim +49 arch/arm/plat-samsung/pm-check.c

549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  45  
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  46  static 
void s3c_pm_run_res(struct resource *ptr, run_fn_t fn, u32 *arg)
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  47  {
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  48 while 
(ptr != NULL) {
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12 @49 
if (ptr->child != NULL)
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12 @50 
s3c_pm_run_res(ptr->child, fn, arg);
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  51  
05fee7cf arch/arm/plat-samsung/pm-check.c Toshi Kani 2016-01-26  52 
if ((ptr->flags & IORESOURCE_SYSTEM_RAM)
05fee7cf arch/arm/plat-samsung/pm-check.c Toshi Kani 2016-01-26  53 
== IORESOURCE_SYSTEM_RAM) {
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  54 
S3C_PMDBG("Found system RAM at %08lx..%08lx\n",
840eeeb8 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  55 
  (unsigned long)ptr->start,
840eeeb8 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  56 
  (unsigned long)ptr->end);
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  57 
arg = (fn)(ptr, arg);
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  58 
}
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  59  
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12 @60 
ptr = ptr->sibling;
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  61 }
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  62  }
549c7e33 arch/arm/plat-s3c/pm-check.c Ben Dooks  2008-12-12  63  

:: The code at line 49 was first introduced by commit
:: 549c7e33aeb9bfe441ecf68639d2227bb90978e7 [ARM] S3C: Split the resume 
memory check code from pm.c

:: TO: Ben Dooks 
:: CC: Ben Dooks 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


转发linux-nvdimm:本年度最热议题,合为一伙该如何分配利益

2018-05-07 Thread 侯经理
linux-nvdimm 见附 % 件 
本年度最热议题,合为一伙该如何分配利益
从基础原理到模式设计到实操案例,既是思维的提升,又是实务的落地。
我们在一天课程当中为您解密 “股东合伙人,事业合伙人,生态合伙人”。
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Matthew Wilcox
On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
> DEVICE zone, which is a virtual zone and both its start and end of pfn 
> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
> corresponding drivers, which locate at \drivers\nvdimm\ and 
> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
> memory hot plug implementation.

You probably want to let linux-nvdimm know about this patch set.
Adding to the cc.  Also, I only received patch 0 and 4.  What happened
to 1-3,5 and 6?

> With current kernel, many mm’s classical features like the buddy
> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
> What we are doing is to expand kernel mm’s capacity to make it to handle
> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
> separately, that means mm can only put the critical pages to NVDIMM
> zone, here we created a new zone type as NVM zone. That is to say for 
> traditional(or normal) pages which would be stored at DRAM scope like
> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
> them could be recovered from power fail or system crash, we make them
> to be persistent by storing them to NVM zone.
> 
> We installed two NVDIMMs to Lenovo Thinksystem product as development
> platform, which has 125GB storage capacity respectively. With these 
> patches below, mm can create NVM zones for NVDIMMs.
> 
> Here is dmesg info,
>  Initmem setup node 0 [mem 0x1000-0x00237fff]
>  On node 0 totalpages: 36879666
>DMA zone: 64 pages used for memmap
>DMA zone: 23 pages reserved
>DMA zone: 3999 pages, LIFO batch:0
>  mminit::memmap_init Initialising map node 0 zone 0 pfns 1 -> 4096 
>DMA32 zone: 10935 pages used for memmap
>DMA32 zone: 699795 pages, LIFO batch:31
>  mminit::memmap_init Initialising map node 0 zone 1 pfns 4096 -> 1048576
>Normal zone: 53248 pages used for memmap
>Normal zone: 3407872 pages, LIFO batch:31
>  mminit::memmap_init Initialising map node 0 zone 2 pfns 1048576 -> 4456448
>NVM zone: 512000 pages used for memmap
>NVM zone: 32768000 pages, LIFO batch:31
>  mminit::memmap_init Initialising map node 0 zone 3 pfns 4456448 -> 37224448
>  Initmem setup node 1 [mem 0x00238000-0x0046bfff]
>  On node 1 totalpages: 36962304
>Normal zone: 65536 pages used for memmap
>Normal zone: 4194304 pages, LIFO batch:31
>  mminit::memmap_init Initialising map node 1 zone 2 pfns 37224448 -> 41418752
>NVM zone: 512000 pages used for memmap
>NVM zone: 32768000 pages, LIFO batch:31
>  mminit::memmap_init Initialising map node 1 zone 3 pfns 41418752 -> 74186752
> 
> This comes /proc/zoneinfo
> Node 0, zone  NVM
>   pages free 32768000
> min  15244
> low  48012
> high 80780
> spanned  32768000
> present  32768000
> managed  32768000
> protection: (0, 0, 0, 0, 0, 0)
> nr_free_pages 32768000
> Node 1, zone  NVM
>   pages free 32768000
> min  15244
> low  48012
> high 80780
> spanned  32768000
> present  32768000
> managed  32768000
> 
> Huaisheng Ye (6):
>   mm/memblock: Expand definition of flags to support NVDIMM
>   mm/page_alloc.c: get pfn range with flags of memblock
>   mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE
>   arch/x86/kernel: mark NVDIMM regions from e820_table
>   mm: get zone spanned pages separately for DRAM and NVDIMM
>   arch/x86/mm: create page table mapping for DRAM and NVDIMM both
> 
>  arch/x86/include/asm/e820/api.h |  3 +++
>  arch/x86/kernel/e820.c  | 20 +-
>  arch/x86/kernel/setup.c |  8 ++
>  arch/x86/mm/init_64.c   | 16 +++
>  include/linux/gfp.h | 57 ---
>  include/linux/memblock.h| 19 +
>  include/linux/mm.h  |  4 +++
>  include/linux/mmzone.h  |  3 +++
>  mm/Kconfig  | 16 +++
>  mm/memblock.c   | 46 +++
>  mm/nobootmem.c  |  5 ++--
>  mm/page_alloc.c | 60 
> -
>  12 files changed, 245 insertions(+), 12 deletions(-)
> 
> -- 
> 1.8.3.1
> 
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Dan Williams
On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox  wrote:
> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
>> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
>> DEVICE zone, which is a virtual zone and both its start and end of pfn
>> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
>> corresponding drivers, which locate at \drivers\nvdimm\ and
>> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
>> memory hot plug implementation.
>
> You probably want to let linux-nvdimm know about this patch set.
> Adding to the cc.

Yes, thanks for that!

> Also, I only received patch 0 and 4.  What happened
> to 1-3,5 and 6?
>
>> With current kernel, many mm’s classical features like the buddy
>> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
>> What we are doing is to expand kernel mm’s capacity to make it to handle
>> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
>> separately, that means mm can only put the critical pages to NVDIMM
>> zone, here we created a new zone type as NVM zone. That is to say for
>> traditional(or normal) pages which would be stored at DRAM scope like
>> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
>> them could be recovered from power fail or system crash, we make them
>> to be persistent by storing them to NVM zone.
>>
>> We installed two NVDIMMs to Lenovo Thinksystem product as development
>> platform, which has 125GB storage capacity respectively. With these
>> patches below, mm can create NVM zones for NVDIMMs.
>>
>> Here is dmesg info,
>>  Initmem setup node 0 [mem 0x1000-0x00237fff]
>>  On node 0 totalpages: 36879666
>>DMA zone: 64 pages used for memmap
>>DMA zone: 23 pages reserved
>>DMA zone: 3999 pages, LIFO batch:0
>>  mminit::memmap_init Initialising map node 0 zone 0 pfns 1 -> 4096
>>DMA32 zone: 10935 pages used for memmap
>>DMA32 zone: 699795 pages, LIFO batch:31
>>  mminit::memmap_init Initialising map node 0 zone 1 pfns 4096 -> 1048576
>>Normal zone: 53248 pages used for memmap
>>Normal zone: 3407872 pages, LIFO batch:31
>>  mminit::memmap_init Initialising map node 0 zone 2 pfns 1048576 -> 4456448
>>NVM zone: 512000 pages used for memmap
>>NVM zone: 32768000 pages, LIFO batch:31
>>  mminit::memmap_init Initialising map node 0 zone 3 pfns 4456448 -> 37224448
>>  Initmem setup node 1 [mem 0x00238000-0x0046bfff]
>>  On node 1 totalpages: 36962304
>>Normal zone: 65536 pages used for memmap
>>Normal zone: 4194304 pages, LIFO batch:31
>>  mminit::memmap_init Initialising map node 1 zone 2 pfns 37224448 -> 41418752
>>NVM zone: 512000 pages used for memmap
>>NVM zone: 32768000 pages, LIFO batch:31
>>  mminit::memmap_init Initialising map node 1 zone 3 pfns 41418752 -> 74186752
>>
>> This comes /proc/zoneinfo
>> Node 0, zone  NVM
>>   pages free 32768000
>> min  15244
>> low  48012
>> high 80780
>> spanned  32768000
>> present  32768000
>> managed  32768000
>> protection: (0, 0, 0, 0, 0, 0)
>> nr_free_pages 32768000
>> Node 1, zone  NVM
>>   pages free 32768000
>> min  15244
>> low  48012
>> high 80780
>> spanned  32768000
>> present  32768000
>> managed  32768000

I think adding yet one more mm-zone is the wrong direction. Instead,
what we have been considering is a mechanism to allow a device-dax
instance to be given back to the kernel as a distinct numa node
managed by the VM. It seems it times to dust off those patches.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Jeff Moyer
Dan Williams  writes:

> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox  wrote:
>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
>>> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
>>> DEVICE zone, which is a virtual zone and both its start and end of pfn
>>> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
>>> corresponding drivers, which locate at \drivers\nvdimm\ and
>>> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
>>> memory hot plug implementation.
>>
>> You probably want to let linux-nvdimm know about this patch set.
>> Adding to the cc.
>
> Yes, thanks for that!
>
>> Also, I only received patch 0 and 4.  What happened
>> to 1-3,5 and 6?
>>
>>> With current kernel, many mm’s classical features like the buddy
>>> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
>>> What we are doing is to expand kernel mm’s capacity to make it to handle
>>> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
>>> separately, that means mm can only put the critical pages to NVDIMM

Please define "critical pages."

>>> zone, here we created a new zone type as NVM zone. That is to say for
>>> traditional(or normal) pages which would be stored at DRAM scope like
>>> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
>>> them could be recovered from power fail or system crash, we make them
>>> to be persistent by storing them to NVM zone.

[...]

> I think adding yet one more mm-zone is the wrong direction. Instead,
> what we have been considering is a mechanism to allow a device-dax
> instance to be given back to the kernel as a distinct numa node
> managed by the VM. It seems it times to dust off those patches.

What's the use case?  The above patch description seems to indicate an
intent to recover contents after a power loss.  Without seeing the whole
series, I'm not sure how that's accomplished in a safe or meaningful
way.

Huaisheng, could you provide a bit more background?

Thanks!
Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Dan Williams
On Mon, May 7, 2018 at 12:08 PM, Jeff Moyer  wrote:
> Dan Williams  writes:
>
>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox  wrote:
>>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
 Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
 DEVICE zone, which is a virtual zone and both its start and end of pfn
 are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
 corresponding drivers, which locate at \drivers\nvdimm\ and
 \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
 memory hot plug implementation.
>>>
>>> You probably want to let linux-nvdimm know about this patch set.
>>> Adding to the cc.
>>
>> Yes, thanks for that!
>>
>>> Also, I only received patch 0 and 4.  What happened
>>> to 1-3,5 and 6?
>>>
 With current kernel, many mm’s classical features like the buddy
 system, swap mechanism and page cache couldn’t be supported to NVDIMM.
 What we are doing is to expand kernel mm’s capacity to make it to handle
 NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
 separately, that means mm can only put the critical pages to NVDIMM
>
> Please define "critical pages."
>
 zone, here we created a new zone type as NVM zone. That is to say for
 traditional(or normal) pages which would be stored at DRAM scope like
 Normal, DMA32 and DMA zones. But for the critical pages, which we hope
 them could be recovered from power fail or system crash, we make them
 to be persistent by storing them to NVM zone.
>
> [...]
>
>> I think adding yet one more mm-zone is the wrong direction. Instead,
>> what we have been considering is a mechanism to allow a device-dax
>> instance to be given back to the kernel as a distinct numa node
>> managed by the VM. It seems it times to dust off those patches.
>
> What's the use case?

Use NVDIMMs as System-RAM given their potentially higher capacity than
DDR. The expectation in that case is that data is forfeit (not
persisted) after a crash. Any persistent use case would need to go
through the pmem driver, filesystem-dax or device-dax.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Matthew Wilcox
On Mon, May 07, 2018 at 11:57:10AM -0700, Dan Williams wrote:
> I think adding yet one more mm-zone is the wrong direction. Instead,
> what we have been considering is a mechanism to allow a device-dax
> instance to be given back to the kernel as a distinct numa node
> managed by the VM. It seems it times to dust off those patches.

I was wondering how "safe" we think that ability is.  NV-DIMM pages
(obviously) differ from normal pages by their non-volatility.  Do we
want their contents from the previous boot to be observable?  If not,
then we need the BIOS to clear them at boot-up, which means we would
want no kernel changes at all; rather the BIOS should just describe
those pages as if they were DRAM (after zeroing them).
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Jeff Moyer
Dan Williams  writes:

> On Mon, May 7, 2018 at 12:08 PM, Jeff Moyer  wrote:
>> Dan Williams  writes:
>>
>>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox  wrote:
 On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
> DEVICE zone, which is a virtual zone and both its start and end of pfn
> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
> corresponding drivers, which locate at \drivers\nvdimm\ and
> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
> memory hot plug implementation.

 You probably want to let linux-nvdimm know about this patch set.
 Adding to the cc.
>>>
>>> Yes, thanks for that!
>>>
 Also, I only received patch 0 and 4.  What happened
 to 1-3,5 and 6?

> With current kernel, many mm’s classical features like the buddy
> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
> What we are doing is to expand kernel mm’s capacity to make it to handle
> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
> separately, that means mm can only put the critical pages to NVDIMM
>>
>> Please define "critical pages."
>>
> zone, here we created a new zone type as NVM zone. That is to say for
> traditional(or normal) pages which would be stored at DRAM scope like
> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
> them could be recovered from power fail or system crash, we make them
> to be persistent by storing them to NVM zone.
>>
>> [...]
>>
>>> I think adding yet one more mm-zone is the wrong direction. Instead,
>>> what we have been considering is a mechanism to allow a device-dax
>>> instance to be given back to the kernel as a distinct numa node
>>> managed by the VM. It seems it times to dust off those patches.
>>
>> What's the use case?
>
> Use NVDIMMs as System-RAM given their potentially higher capacity than
> DDR. The expectation in that case is that data is forfeit (not
> persisted) after a crash. Any persistent use case would need to go
> through the pmem driver, filesystem-dax or device-dax.

OK, but that sounds different from what was being proposed, here.  I'll
quote from above:

> But for the critical pages, which we hope them could be recovered
  ^
> from power fail or system crash, we make them to be persistent by
  ^^^
> storing them to NVM zone.

Hence my confusion.

Cheers,
Jeff
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Dan Williams
On Mon, May 7, 2018 at 12:28 PM, Jeff Moyer  wrote:
> Dan Williams  writes:
[..]
>>> What's the use case?
>>
>> Use NVDIMMs as System-RAM given their potentially higher capacity than
>> DDR. The expectation in that case is that data is forfeit (not
>> persisted) after a crash. Any persistent use case would need to go
>> through the pmem driver, filesystem-dax or device-dax.
>
> OK, but that sounds different from what was being proposed, here.  I'll
> quote from above:
>
>> But for the critical pages, which we hope them could be recovered
>   ^
>> from power fail or system crash, we make them to be persistent by
>   ^^^
>> storing them to NVM zone.
>
> Hence my confusion.

Yes, now mine too, I overlooked that.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Dan Williams
On Mon, May 7, 2018 at 12:18 PM, Matthew Wilcox  wrote:
> On Mon, May 07, 2018 at 11:57:10AM -0700, Dan Williams wrote:
>> I think adding yet one more mm-zone is the wrong direction. Instead,
>> what we have been considering is a mechanism to allow a device-dax
>> instance to be given back to the kernel as a distinct numa node
>> managed by the VM. It seems it times to dust off those patches.
>
> I was wondering how "safe" we think that ability is.  NV-DIMM pages
> (obviously) differ from normal pages by their non-volatility.  Do we
> want their contents from the previous boot to be observable?  If not,
> then we need the BIOS to clear them at boot-up, which means we would
> want no kernel changes at all; rather the BIOS should just describe
> those pages as if they were DRAM (after zeroing them).

Certainly the BIOS could do it, but the impetus for having a kernel
mechanism to do the same is for supporting the configuration
flexibility afforded by namespaces, or otherwise having the capability
when the BIOS does not offer it. However, you are right that there are
extra security implications when System-RAM is persisted, perhaps
requiring the capacity to be explicitly locked / unlocked could
address that concern?
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v4 01/14] PCI/P2PDMA: Support peer-to-peer memory

2018-05-07 Thread Bjorn Helgaas
On Mon, Apr 23, 2018 at 05:30:33PM -0600, Logan Gunthorpe wrote:
> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in peer-to-peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
> 
> Add an interface for other subsystems to find and allocate chunks of P2P
> memory as necessary to facilitate transfers between two PCI peers:
> 
> int pci_p2pdma_add_client();
> struct pci_dev *pci_p2pmem_find();
> void *pci_alloc_p2pmem();
> 
> The new interface requires a driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary (adding incompatible clients
> will fail). With a suitable p2pmem device, memory can then be
> allocated with pci_alloc_p2pmem() for use in DMA transactions.
> 
> Depending on hardware, using peer-to-peer memory may reduce the bandwidth
> of the transfer but can significantly reduce pressure on system memory.
> This may be desirable in many cases: for example a system could be designed
> with a small CPU connected to a PCI switch by a small number of lanes

s/PCI/PCIe/

> which would maximize the number of lanes available to connect to NVMe
> devices.
> 
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same root port (typically through

s/root port/PCI bridge/

> a network of PCIe switches). This is because we have no way of knowing
> whether peer-to-peer routing between PCIe Root Ports is supported
> (PCIe r4.0, sec 1.3.1).  Additionally, the benefits of P2P transfers that
> go through the RC is limited to only reducing DRAM usage and, in some
> cases, coding convenience. The PCI-SIG may be exploring adding a new
> capability bit to advertise whether this is possible for future
> hardware.
> 
> This commit includes significant rework and feedback from Christoph
> Hellwig.
> 
> Signed-off-by: Christoph Hellwig 
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/pci/Kconfig|  17 ++
>  drivers/pci/Makefile   |   1 +
>  drivers/pci/p2pdma.c   | 694 
> +
>  include/linux/memremap.h   |  18 ++
>  include/linux/pci-p2pdma.h | 100 +++
>  include/linux/pci.h|   4 +
>  6 files changed, 834 insertions(+)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 34b56a8f8480..b2396c22b53e 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -124,6 +124,23 @@ config PCI_PASID
>  
> If unsure, say N.
>  
> +config PCI_P2PDMA
> + bool "PCI peer-to-peer transfer support"
> + depends on PCI && ZONE_DEVICE && EXPERT
> + select GENERIC_ALLOCATOR
> + help
> +   Enableѕ drivers to do PCI peer-to-peer transactions to and from
> +   BARs that are exposed in other devices that are the part of
> +   the hierarchy where peer-to-peer DMA is guaranteed by the PCI
> +   specification to work (ie. anything below a single PCI bridge).
> +
> +   Many PCIe root complexes do not support P2P transactions and
> +   it's hard to tell which support it at all, so at this time, DMA
> +   transations must be between devices behind the same root port.

s/DMA transactions/PCIe DMA transactions/

(Theoretically P2P should work on conventional PCI, and this sentence only
applies to PCIe.)

> +   (Typically behind a network of PCIe switches).

Not sure this last sentence adds useful information.

> +++ b/drivers/pci/p2pdma.c
> @@ -0,0 +1,694 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCI Peer 2 Peer DMA support.
> + *
> + * Copyright (c) 2016-2018, Logan Gunthorpe
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig
> + * Copyright (c) 2018, Eideticom Inc.
> + *

Nit: unnecessary blank line.

> +/*
> + * If a device is behind a switch, we try to find the upstream bridge
> + * port of the switch. This requires two calls to pci_upstream_bridge():
> + * one for the upstream port on the switch, one on the upstream port
> + * for the next level in the hierarchy. Because of this, devices connected
> + * to the root port will be rejected.
> + */
> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)

This function doesn't seem to be used anymore.  Thanks for all your hard
work to get rid of it!

> +{
> + struct pci_dev *up1, *up2;
> +
> + if (!pdev)
> + return NULL;
> +
> + up1 = pci_dev_get(pci_upstream_bridge(pdev));
> + if (!up1)
> + return NULL;
> +
> + up2 = pci_dev_get(pci_upstream_bridge(up1));
> + pci_dev_put(up1);
> +
> + 

Re: [PATCH v4 03/14] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset

2018-05-07 Thread Bjorn Helgaas
s/dma/DMA/ (in subject)

On Mon, Apr 23, 2018 at 05:30:35PM -0600, Logan Gunthorpe wrote:
> The DMA address used when mapping PCI P2P memory must be the PCI bus
> address. Thus, introduce pci_p2pmem_[un]map_sg() to map the correct
> addresses when using P2P memory.
> 
> For this, we assume that an SGL passed to these functions contain all
> P2P memory or no P2P memory.
> 
> Signed-off-by: Logan Gunthorpe 
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v4 01/14] PCI/P2PDMA: Support peer-to-peer memory

2018-05-07 Thread Logan Gunthorpe
Thanks for the review. I'll apply all of these for the changes for next
version of the set.
>> +/*
>> + * If a device is behind a switch, we try to find the upstream bridge
>> + * port of the switch. This requires two calls to pci_upstream_bridge():
>> + * one for the upstream port on the switch, one on the upstream port
>> + * for the next level in the hierarchy. Because of this, devices connected
>> + * to the root port will be rejected.
>> + */
>> +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> 
> This function doesn't seem to be used anymore.  Thanks for all your hard
> work to get rid of it!

Oops, I thought I had gotten rid of it entirely, but I guess I messed it
up a bit and it gets removed in patch 4. I'll fix it for v5.

Logan
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-07 Thread Bjorn Helgaas
[+to Alex]

Alex,

Are you happy with this strategy of turning off ACS based on
CONFIG_PCI_P2PDMA?  We only check this at enumeration-time and 
I don't know if there are other places we would care?

On Mon, Apr 23, 2018 at 05:30:36PM -0600, Logan Gunthorpe wrote:
> For peer-to-peer transactions to work the downstream ports in each
> switch must not have the ACS flags set. At this time there is no way
> to dynamically change the flags and update the corresponding IOMMU
> groups so this is done at enumeration time before the groups are
> assigned.
> 
> This effectively means that if CONFIG_PCI_P2PDMA is selected then
> all devices behind any PCIe switch heirarchy will be in the same IOMMU
> group. Which implies that individual devices behind any switch
> heirarchy will not be able to be assigned to separate VMs because
> there is no isolation between them. Additionally, any malicious PCIe
> devices will be able to DMA to memory exposed by other EPs in the same
> domain as TLPs will not be checked by the IOMMU.
> 
> Given that the intended use case of P2P Memory is for users with
> custom hardware designed for purpose, we do not expect distributors
> to ever need to enable this option. Users that want to use P2P
> must have compiled a custom kernel with this configuration option
> and understand the implications regarding ACS. They will either
> not require ACS or will have design the system in such a way that
> devices that require isolation will be separate from those using P2P
> transactions.
> 
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/pci/Kconfig|  9 +
>  drivers/pci/p2pdma.c   | 45 ++---
>  drivers/pci/pci.c  |  6 ++
>  include/linux/pci-p2pdma.h |  5 +
>  4 files changed, 50 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index b2396c22b53e..b6db41d4b708 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -139,6 +139,15 @@ config PCI_P2PDMA
> transations must be between devices behind the same root port.
> (Typically behind a network of PCIe switches).
>  
> +   Enabling this option will also disable ACS on all ports behind
> +   any PCIe switch. This effectively puts all devices behind any
> +   switch heirarchy into the same IOMMU group. Which implies that

s/heirarchy/hierarchy/ (also above in changelog)

> +   individual devices behind any switch will not be able to be
> +   assigned to separate VMs because there is no isolation between
> +   them. Additionally, any malicious PCIe devices will be able to
> +   DMA to memory exposed by other EPs in the same domain as TLPs
> +   will not be checked by the IOMMU.
> +
> If unsure, say N.
>  
>  config PCI_LABEL
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ed9dce8552a2..e9f43b43acac 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -240,27 +240,42 @@ static struct pci_dev *find_parent_pci_dev(struct 
> device *dev)
>  }
>  
>  /*
> - * If a device is behind a switch, we try to find the upstream bridge
> - * port of the switch. This requires two calls to pci_upstream_bridge():
> - * one for the upstream port on the switch, one on the upstream port
> - * for the next level in the hierarchy. Because of this, devices connected
> - * to the root port will be rejected.
> + * pci_p2pdma_disable_acs - disable ACS flags for all PCI bridges
> + * @pdev: device to disable ACS flags for
> + *
> + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need
> + * to be disabled on any PCI bridge in order for the TLPS to not be forwarded
> + * up to the RC which is not what we want for P2P.

s/PCI bridge/PCIe switch/ (ACS doesn't apply to conventional PCI)

> + *
> + * This function is called when the devices are first enumerated and
> + * will result in all devices behind any bridge to be in the same IOMMU
> + * group. At this time, there is no way to "hotplug" IOMMU groups so we rely
> + * on this largish hammer. If you need the devices to be in separate groups
> + * don't enable CONFIG_PCI_P2PDMA.
> + *
> + * Returns 1 if the ACS bits for this device was cleared, otherwise 0.
>   */
> -static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
>  {
> - struct pci_dev *up1, *up2;
> + int pos;
> + u16 ctrl;
>  
> - if (!pdev)
> - return NULL;
> + if (!pci_is_bridge(pdev))
> + return 0;
>  
> - up1 = pci_dev_get(pci_upstream_bridge(pdev));
> - if (!up1)
> - return NULL;
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> + if (!pos)
> + return 0;
> +
> + pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n");
> +
> + pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
> +
> + ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
>  
> - up2 = pci_dev_ge

Re: [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-05-07 Thread Bjorn Helgaas
On Mon, Apr 23, 2018 at 05:30:38PM -0600, Logan Gunthorpe wrote:
> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
> 
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.
> 
> Signed-off-by: Logan Gunthorpe 
> Cc: Jonathan Corbet 
> ---
>  Documentation/PCI/index.rst |  14 +++
>  Documentation/driver-api/pci/index.rst  |   1 +
>  Documentation/driver-api/pci/p2pdma.rst | 166 
> 
>  Documentation/index.rst |   3 +-
>  4 files changed, 183 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> 
> diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
> new file mode 100644
> index ..2fdc4b3c291d
> --- /dev/null
> +++ b/Documentation/PCI/index.rst
> @@ -0,0 +1,14 @@
> +==
> +Linux PCI Driver Developer's Guide
> +==
> +
> +.. toctree::
> +
> +   p2pdma
> +
> +.. only::  subproject and html
> +
> +   Indices
> +   ===
> +
> +   * :ref:`genindex`
> diff --git a/Documentation/driver-api/pci/index.rst 
> b/Documentation/driver-api/pci/index.rst
> index 03b57cbf8cc2..d12eeafbfc90 100644
> --- a/Documentation/driver-api/pci/index.rst
> +++ b/Documentation/driver-api/pci/index.rst
> @@ -10,6 +10,7 @@ The Linux PCI driver implementer's API guide
> :maxdepth: 2
>  
> pci
> +   p2pdma
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/driver-api/pci/p2pdma.rst 
> b/Documentation/driver-api/pci/p2pdma.rst
> new file mode 100644
> index ..49a512c405b2
> --- /dev/null
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -0,0 +1,166 @@
> +
> +PCI Peer-to-Peer DMA Support
> +
> +
> +The PCI bus has pretty decent support for performing DMA transfers
> +between two endpoints on the bus. This type of transaction is

s/endpoints/devices/

> +henceforth called Peer-to-Peer (or P2P). However, there are a number of
> +issues that make P2P transactions tricky to do in a perfectly safe way.
> +
> +One of the biggest issues is that PCI Root Complexes are not required

s/PCI Root Complexes .../
  PCI doesn't require forwarding transactions between hierarchy domains,
and in PCIe, each Root Port defines a separate hierarchy domain./

> +to support forwarding packets between Root Ports. To make things worse,
> +there is no simple way to determine if a given Root Complex supports
> +this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
> +the kernel only supports doing P2P when the endpoints involved are all
> +behind the same PCIe root port as the spec guarantees that all
> +packets will always be routable but does not require routing between
> +root ports.

s/endpoints involved .../
  devices involved are all behind the same PCI bridge, as such devices are
  all in the same PCI hierarchy domain, and the spec guarantees that all
  transactions within the hierarchy will be routable, but it does not
  require routing between hierarchies./

> +
> +The second issue is that to make use of existing interfaces in Linux,
> +memory that is used for P2P transactions needs to be backed by struct
> +pages. However, PCI BARs are not typically cache coherent so there are
> +a few corner case gotchas with these pages so developers need to
> +be careful about what they do with them.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-07 Thread Bjorn Helgaas
On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> Hi Everyone,
> 
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
> 
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> ...

> Logan Gunthorpe (14):
>   PCI/P2PDMA: Support peer-to-peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>   docs-rst: Add a new directory for PCI documentation
>   PCI/P2PDMA: Add P2P DMA driver writer's documentation
>   block: Introduce PCI P2P flags for request and request queue
>   IB/core: Ensure we map P2P memory correctly in
> rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Introduce helper functions to allocate and free request SGLs
>   nvmet-rdma: Use new SGL alloc/free helper for requests
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci|  25 +
>  Documentation/PCI/index.rst|  14 +
>  Documentation/driver-api/index.rst |   2 +-
>  Documentation/driver-api/pci/index.rst |  20 +
>  Documentation/driver-api/pci/p2pdma.rst| 166 ++
>  Documentation/driver-api/{ => pci}/pci.rst |   0
>  Documentation/index.rst|   3 +-
>  block/blk-core.c   |   3 +
>  drivers/infiniband/core/rw.c   |  13 +-
>  drivers/nvme/host/core.c   |   4 +
>  drivers/nvme/host/nvme.h   |   8 +
>  drivers/nvme/host/pci.c| 118 +++--
>  drivers/nvme/target/configfs.c |  67 +++
>  drivers/nvme/target/core.c | 143 -
>  drivers/nvme/target/io-cmd.c   |   3 +
>  drivers/nvme/target/nvmet.h|  15 +
>  drivers/nvme/target/rdma.c |  22 +-
>  drivers/pci/Kconfig|  26 +
>  drivers/pci/Makefile   |   1 +
>  drivers/pci/p2pdma.c   | 814 
> +
>  drivers/pci/pci.c  |   6 +
>  include/linux/blk_types.h  |  18 +-
>  include/linux/blkdev.h |   3 +
>  include/linux/memremap.h   |  19 +
>  include/linux/pci-p2pdma.h | 118 +
>  include/linux/pci.h|   4 +
>  26 files changed, 1579 insertions(+), 56 deletions(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h

How do you envison merging this?  There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.

If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?

Bjorn
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-05-07 Thread Logan Gunthorpe

> How do you envison merging this?  There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
> 
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?

Honestly, I don't know. I guess with your ACK on the PCI parts, the vast
balance is NVMe stuff so we could look at merging it through that tree.
The block patch and IB patch are pretty small.

Thanks,

Logan
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch

2018-05-07 Thread Darrick J. Wong
On Thu, May 03, 2018 at 04:53:18PM -0700, Dan Williams wrote:
> On Tue, Apr 24, 2018 at 4:33 PM, Dan Williams  
> wrote:
> > Changes since v8 [1]:
> > * Rebase on v4.17-rc2
> >
> > * Fix get_user_pages_fast() for ZONE_DEVICE pages to revalidate the pte,
> >   pmd, pud after taking references (Jan)
> >
> > * Kill dax_layout_lock(). With get_user_pages_fast() for ZONE_DEVICE
> >   fixed we can then rely on the {pte,pmd}_lock to synchronize
> >   dax_layout_busy_page() vs new page references (Jan)
> >
> > * Hold the iolock over repeated invocations of dax_layout_busy_page() to
> >   enable truncate/hole-punch to make forward progress in the presence of
> >   a constant stream of new direct-I/O requests (Jan).
> >
> > [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-March/015058.html
> 
> I'll push this for soak time in -next if there are no further comments...

I don't have any. :D

--D
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Huaisheng HS1 Ye

> 
> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
> > Traditionally, NVDIMMs are treated by mm(memory management)
> subsystem as
> > DEVICE zone, which is a virtual zone and both its start and end of pfn
> > are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel
> uses
> > corresponding drivers, which locate at \drivers\nvdimm\ and
> > \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
> > memory hot plug implementation.
> 
> You probably want to let linux-nvdimm know about this patch set.
> Adding to the cc.  Also, I only received patch 0 and 4.  What happened
> to 1-3,5 and 6?

Sorry, It could be something wrong with my git-sendemail, but my mailbox has 
received all of them.
Anyway, I will send them again and CC linux-nvdimm.

Thanks
Huaisheng
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


RE: [External] [RFC PATCH v1 1/6] mm/memblock: Expand definition of flags to support NVDIMM

2018-05-07 Thread Huaisheng HS1 Ye
This patch makes mm to have capability to get special regions
from memblock.

During boot process, memblock marks NVDIMM regions with flag
MEMBLOCK_NVDIMM, also expands the interface of functions and
macros with flags.

Signed-off-by: Huaisheng Ye 
Signed-off-by: Ocean He 
---
 include/linux/memblock.h | 19 +++
 mm/memblock.c| 46 +-
 2 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f92ea77..cade5c8d 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -26,6 +26,8 @@ enum {
MEMBLOCK_HOTPLUG= 0x1,  /* hotpluggable region */
MEMBLOCK_MIRROR = 0x2,  /* mirrored region */
MEMBLOCK_NOMAP  = 0x4,  /* don't add to kernel direct mapping */
+   MEMBLOCK_NVDIMM = 0x8,  /* NVDIMM region */
+   MEMBLOCK_MAX_TYPE   = 0x10  /* all regions */
 };
 
 struct memblock_region {
@@ -89,6 +91,8 @@ bool memblock_overlaps_region(struct memblock_type *type,
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
 int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
 int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
+int memblock_mark_nvdimm(phys_addr_t base, phys_addr_t size);
+int memblock_clear_nvdimm(phys_addr_t base, phys_addr_t size);
 ulong choose_memblock_flags(void);
 
 /* Low level functions */
@@ -167,6 +171,11 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t 
*out_start,
 i != (u64)ULLONG_MAX;  \
 __next_reserved_mem_region(&i, p_start, p_end))
 
+static inline bool memblock_is_nvdimm(struct memblock_region *m)
+{
+   return m->flags & MEMBLOCK_NVDIMM;
+}
+
 static inline bool memblock_is_hotpluggable(struct memblock_region *m)
 {
return m->flags & MEMBLOCK_HOTPLUG;
@@ -187,6 +196,11 @@ int memblock_search_pfn_nid(unsigned long pfn, unsigned 
long *start_pfn,
unsigned long  *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
  unsigned long *out_end_pfn, int *out_nid);
+void __next_mem_pfn_range_with_flags(int *idx, int nid,
+unsigned long *out_start_pfn,
+unsigned long *out_end_pfn,
+int *out_nid,
+unsigned long flags);
 
 /**
  * for_each_mem_pfn_range - early memory pfn range iterator
@@ -201,6 +215,11 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long 
*out_start_pfn,
 #define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)  \
for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
 i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
+
+#define for_each_mem_pfn_range_with_flags(i, nid, p_start, p_end, p_nid, 
flags) \
+   for (i = -1, __next_mem_pfn_range_with_flags(&i, nid, p_start, p_end, 
p_nid, flags);\
+i >= 0; __next_mem_pfn_range_with_flags(&i, nid, p_start, p_end, 
p_nid, flags))
+
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 /**
diff --git a/mm/memblock.c b/mm/memblock.c
index 48376bd..7699637 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -771,6 +771,16 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t 
base, phys_addr_t size)
return memblock_setclr_flag(base, size, 0, MEMBLOCK_HOTPLUG);
 }
 
+int __init_memblock memblock_mark_nvdimm(phys_addr_t base, phys_addr_t size)
+{
+   return memblock_setclr_flag(base, size, 1, MEMBLOCK_NVDIMM);
+}
+
+int __init_memblock memblock_clear_nvdimm(phys_addr_t base, phys_addr_t size)
+{
+   return memblock_setclr_flag(base, size, 0, MEMBLOCK_NVDIMM);
+}
+
 /**
  * memblock_mark_mirror - Mark mirrored memory with flag MEMBLOCK_MIRROR.
  * @base: the base phys addr of the region
@@ -891,6 +901,10 @@ void __init_memblock __next_mem_range(u64 *idx, int nid, 
ulong flags,
if (nid != NUMA_NO_NODE && nid != m_nid)
continue;
 
+   /* skip nvdimm memory regions if needed */
+   if (!(flags & MEMBLOCK_NVDIMM) && memblock_is_nvdimm(m))
+   continue;
+
/* skip hotpluggable memory regions if needed */
if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
continue;
@@ -1007,6 +1021,10 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int 
nid, ulong flags,
if (nid != NUMA_NO_NODE && nid != m_nid)
continue;
 
+   /* skip nvdimm memory regions if needed */
+   if (!(flags & MEMBLOCK_NVDIMM) && memblock_is_nvdimm(m))
+   continue;
+
/* skip hotpluggable memory regions if needed */
if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
 

[External] [RFC PATCH v1 2/6] mm/page_alloc.c: get pfn range with flags of memblock

2018-05-07 Thread Huaisheng HS1 Ye
This is used to expand the interface of get_pfn_range_for_nid with
flags of memblock, so mm can get pfn range with special flags.

Signed-off-by: Huaisheng Ye 
Signed-off-by: Ocean He 
---
 include/linux/mm.h |  4 
 mm/page_alloc.c| 17 -
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42..8abf9c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2046,6 +2046,10 @@ extern unsigned long absent_pages_in_range(unsigned long 
start_pfn,
unsigned long end_pfn);
 extern void get_pfn_range_for_nid(unsigned int nid,
unsigned long *start_pfn, unsigned long *end_pfn);
+extern void get_pfn_range_for_nid_with_flags(unsigned int nid,
+unsigned long *start_pfn,
+unsigned long *end_pfn,
+unsigned long flags);
 extern unsigned long find_min_pfn_with_active_regions(void);
 extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1741dd2..266c065 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5705,13 +5705,28 @@ void __init 
sparse_memory_present_with_active_regions(int nid)
 void __meminit get_pfn_range_for_nid(unsigned int nid,
unsigned long *start_pfn, unsigned long *end_pfn)
 {
+   get_pfn_range_for_nid_with_flags(nid, start_pfn, end_pfn,
+MEMBLOCK_MAX_TYPE);
+}
+
+/*
+ * If MAX_NUMNODES, includes all node memmory regions.
+ * If MEMBLOCK_MAX_TYPE, includes all memory regions with or without Flags.
+ */
+
+void __meminit get_pfn_range_for_nid_with_flags(unsigned int nid,
+   unsigned long *start_pfn,
+   unsigned long *end_pfn,
+   unsigned long flags)
+{
unsigned long this_start_pfn, this_end_pfn;
int i;
 
*start_pfn = -1UL;
*end_pfn = 0;
 
-   for_each_mem_pfn_range(i, nid, &this_start_pfn, &this_end_pfn, NULL) {
+   for_each_mem_pfn_range_with_flags(i, nid, &this_start_pfn,
+ &this_end_pfn, NULL, flags) {
*start_pfn = min(*start_pfn, this_start_pfn);
*end_pfn = max(*end_pfn, this_end_pfn);
}
-- 
1.8.3.1

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[External] [RFC PATCH v1 3/6] mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE

2018-05-07 Thread Huaisheng HS1 Ye
Expand ZONE_NVM into enum zone_type, and create GFP_NVM
which represents gfp_t flag for NVM zone.

Because there is no lower plain integer GFP bitmask can be
used for ___GFP_NVM, a workable way is to get space from
GFP_ZONE_BAD to fill ZONE_NVM into GFP_ZONE_TABLE.

Signed-off-by: Huaisheng Ye 
Signed-off-by: Ocean He 
---
 include/linux/gfp.h| 57 +++---
 include/linux/mmzone.h |  3 +++
 mm/Kconfig | 16 ++
 mm/page_alloc.c|  3 +++
 4 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a4582b..9e4d867 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,6 +39,9 @@
 #define ___GFP_DIRECT_RECLAIM  0x40u
 #define ___GFP_WRITE   0x80u
 #define ___GFP_KSWAPD_RECLAIM  0x100u
+#ifdef CONFIG_ZONE_NVM
+#define ___GFP_NVM 0x400u
+#endif
 #ifdef CONFIG_LOCKDEP
 #define ___GFP_NOLOCKDEP   0x200u
 #else
@@ -57,7 +60,12 @@
 #define __GFP_HIGHMEM  ((__force gfp_t)___GFP_HIGHMEM)
 #define __GFP_DMA32((__force gfp_t)___GFP_DMA32)
 #define __GFP_MOVABLE  ((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE 
allowed */
+#ifdef CONFIG_ZONE_NVM
+#define __GFP_NVM  ((__force gfp_t)___GFP_NVM)  /* ZONE_NVM allowed */
+#define GFP_ZONEMASK   
(__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE|__GFP_NVM)
+#else
 #define GFP_ZONEMASK   (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
+#endif
 
 /*
  * Page mobility and placement hints
@@ -205,7 +213,8 @@
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP) + \
+   (IS_ENABLED(CONFIG_ZONE_NVM) << 1))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
@@ -283,6 +292,9 @@
 #define GFP_TRANSHUGE_LIGHT((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 #define GFP_TRANSHUGE  (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
+#ifdef CONFIG_ZONE_NVM
+#define GFP_NVM__GFP_NVM
+#endif
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
@@ -342,7 +354,7 @@ static inline bool gfpflags_allow_blocking(const gfp_t 
gfp_flags)
  *   0x0=> NORMAL
  *   0x1=> DMA or NORMAL
  *   0x2=> HIGHMEM or NORMAL
- *   0x3=> BAD (DMA+HIGHMEM)
+ *   0x3=> NVM (DMA+HIGHMEM), now used by the NVDIMM zone
  *   0x4=> DMA32 or DMA or NORMAL
  *   0x5=> BAD (DMA+DMA32)
  *   0x6=> BAD (HIGHMEM+DMA32)
@@ -370,6 +382,29 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 #error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
 #endif
 
+#ifdef CONFIG_ZONE_NVM
+#define ___GFP_NVM_BIT (___GFP_DMA | ___GFP_HIGHMEM)
+#define GFP_ZONE_TABLE ( \
+   ((__force unsigned long)ZONE_NORMAL << \
+   0 * GFP_ZONES_SHIFT)   \
+   | ((__force unsigned long)OPT_ZONE_DMA <<  \
+   ___GFP_DMA * GFP_ZONES_SHIFT)  \
+   | ((__force unsigned long)OPT_ZONE_HIGHMEM <<  \
+   ___GFP_HIGHMEM * GFP_ZONES_SHIFT)  \
+   | ((__force unsigned long)OPT_ZONE_DMA32 <<\
+   ___GFP_DMA32 * GFP_ZONES_SHIFT)\
+   | ((__force unsigned long)ZONE_NORMAL <<   \
+   ___GFP_MOVABLE * GFP_ZONES_SHIFT)  \
+   | ((__force unsigned long)OPT_ZONE_DMA <<  \
+   (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT)   \
+   | ((__force unsigned long)ZONE_MOVABLE <<  \
+   (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)   \
+   | ((__force unsigned long)OPT_ZONE_DMA32 <<\
+   (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT) \
+   | ((__force unsigned long)ZONE_NVM <<  \
+   ___GFP_NVM_BIT * GFP_ZONES_SHIFT)  \
+)
+#else
 #define GFP_ZONE_TABLE ( \
(ZONE_NORMAL << 0 * GFP_ZONES_SHIFT)   \
| (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT)   \
@@ -380,6 +415,7 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
| (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)\
| (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)\
 )
+#endif
 
/*
 * GFP_ZONE_BAD is a bitmap for all combinations of __GFP_DMA, __GFP_DMA32,
 * __GFP_HIGHMEM and __GFP_MOVABLE that are not permitted in one flag.
 */
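
For reference, the mainline consumer of this table, gfp_zone(), extracts
GFP_ZONES_SHIFT bits at the index formed by the zone-modifier bits, so with
this patch the formerly-bad index 0x3 (___GFP_DMA | ___GFP_HIGHMEM) decodes
to ZONE_NVM. Note this also requires clearing bit 0x3 from GFP_ZONE_BAD,
whose hunk is truncated above:

static inline enum zone_type gfp_zone(gfp_t flags)
{
	enum zone_type z;
	int bit = (__force int) (flags & GFP_ZONEMASK);

	/* Pick GFP_ZONES_SHIFT bits of the table at index 'bit'. */
	z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
					 ((1 << GFP_ZONES_SHIFT) - 1);
	VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
	return z;
}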

[External] [RFC PATCH v1 5/6] mm: get zone spanned pages separately for DRAM and NVDIMM

2018-05-07 Thread Huaisheng HS1 Ye
DRAM and NVDIMM are divided into separate zones, so the NVM
zone is dedicated to NVDIMMs.

In zone_spanned_pages_in_node(), the spanned pages of each zone
are calculated separately for DRAM and NVDIMM, using the flags
MEMBLOCK_NONE and MEMBLOCK_NVDIMM respectively.

Signed-off-by: Huaisheng Ye 
Signed-off-by: Ocean He 
---
 mm/nobootmem.c  |  5 +++--
 mm/page_alloc.c | 40 
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 9b02fda..19b5291 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -143,8 +143,9 @@ static unsigned long __init free_low_memory_core_early(void)
 *  because in some case like Node0 doesn't have RAM installed
 *  low ram will be on Node1
 */
-   for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
-   NULL)
+   for_each_free_mem_range(i, NUMA_NO_NODE,
+   MEMBLOCK_NONE | MEMBLOCK_NVDIMM,
+   &start, &end, NULL)
count += __free_memory_core(start, end);
 
return count;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d8bd20d..3fd0d95 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4221,6 +4221,11 @@ static inline void finalise_ac(gfp_t gfp_mask,
 * also used as the starting point for the zonelist iterator. It
 * may get reset for allocations that ignore memory policies.
 */
+#ifdef CONFIG_ZONE_NVM
+   /* Bypass ZONE_NVM for normal allocations */
+   if (ac->high_zoneidx > ZONE_NVM)
+   ac->high_zoneidx = ZONE_NORMAL;
+#endif
ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
ac->high_zoneidx, ac->nodemask);
 }
@@ -5808,6 +5813,10 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
unsigned long *zone_end_pfn,
unsigned long *ignored)
 {
+#ifdef CONFIG_ZONE_NVM
+   unsigned long start_pfn, end_pfn;
+#endif
+
/* When hotadd a new node from cpu_up(), the node should be empty */
if (!node_start_pfn && !node_end_pfn)
return 0;
@@ -5815,6 +5824,26 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
/* Get the start and end of the zone */
*zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
*zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+
+#ifdef CONFIG_ZONE_NVM
+   /*
+    * Adjust the zone span by type again: the NVM zone spans only
+    * NVDIMM regions; all other zones span only DRAM regions.
+    */
+   if (zone_type == ZONE_NVM) {
+   get_pfn_range_for_nid_with_flags(nid, &start_pfn, &end_pfn,
+   MEMBLOCK_NVDIMM);
+   } else {
+   get_pfn_range_for_nid_with_flags(nid, &start_pfn, &end_pfn,
+   MEMBLOCK_NONE);
+   }
+
+   if (*zone_end_pfn < start_pfn || *zone_start_pfn > end_pfn)
+   return 0;
+   /* Move the zone boundaries inside the possible pfn range if necessary */
+   *zone_end_pfn = min(*zone_end_pfn, end_pfn);
+   *zone_start_pfn = max(*zone_start_pfn, start_pfn);
+#endif
+
adjust_zone_range_for_zone_movable(nid, zone_type,
node_start_pfn, node_end_pfn,
zone_start_pfn, zone_end_pfn);
@@ -6680,6 +6709,17 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
start_pfn = end_pfn;
}
 
+#ifdef CONFIG_ZONE_NVM
+   /*
+    * Carve out the NVM zone, whose pfn range is included in the
+    * normal zone's range.
+    */
+   get_pfn_range_for_nid_with_flags(MAX_NUMNODES, &start_pfn, &end_pfn,
+   MEMBLOCK_NVDIMM);
+
+   arch_zone_lowest_possible_pfn[ZONE_NVM] = start_pfn;
+   arch_zone_highest_possible_pfn[ZONE_NVM] = end_pfn;
+#endif
+
/* Find the PFNs that ZONE_MOVABLE begins at in each node */
memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
find_zone_movable_pfns_for_nodes();
-- 
1.8.3.1
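
To make the clamping in zone_spanned_pages_in_node() concrete, a worked
example with invented numbers:

	/*
	 * Suppose arch_zone_{lowest,highest}_possible_pfn say ZONE_NORMAL
	 * spans [0x100000, 0x500000), but this node's DRAM regions
	 * (flags == MEMBLOCK_NONE) only cover [0x100000, 0x300000). Then:
	 *
	 *	*zone_end_pfn   = min(0x500000, 0x300000) = 0x300000
	 *	*zone_start_pfn = max(0x100000, 0x100000) = 0x100000
	 *
	 * ZONE_NORMAL shrinks to the DRAM range, leaving the NVDIMM pfns in
	 * [0x300000, 0x500000) to be claimed by ZONE_NVM, whose bounds
	 * free_area_init_nodes() takes from the MEMBLOCK_NVDIMM regions.
	 */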


[External] [RFC PATCH v1 6/6] arch/x86/mm: create page table mapping for DRAM and NVDIMM both

2018-05-07 Thread Huaisheng HS1 Ye
Create page table mappings at the PTE, PMD, PUD and P4D levels for
the physical addresses of both DRAM and NVDIMM. Here E820_TYPE_PMEM
identifies the NVDIMM regions in the e820_table.

Signed-off-by: Huaisheng Ye 
Signed-off-by: Ocean He 
---
 arch/x86/mm/init_64.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index af11a28..c03c2091 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -420,6 +420,10 @@ void __init cleanup_highmap(void)
if (!after_bootmem &&
!e820__mapped_any(paddr & PAGE_MASK, paddr_next,
 E820_TYPE_RAM) &&
+#ifdef CONFIG_ZONE_NVM
+   !e820__mapped_any(paddr & PAGE_MASK, paddr_next,
+E820_TYPE_PMEM) &&
+#endif
!e820__mapped_any(paddr & PAGE_MASK, paddr_next,
 E820_TYPE_RESERVED_KERN))
set_pte(pte, __pte(0));
@@ -475,6 +479,10 @@ void __init cleanup_highmap(void)
if (!after_bootmem &&
!e820__mapped_any(paddr & PMD_MASK, paddr_next,
 E820_TYPE_RAM) &&
+#ifdef CONFIG_ZONE_NVM
+   !e820__mapped_any(paddr & PMD_MASK, paddr_next,
+E820_TYPE_PMEM) &&
+#endif
!e820__mapped_any(paddr & PMD_MASK, paddr_next,
 E820_TYPE_RESERVED_KERN))
set_pmd(pmd, __pmd(0));
@@ -561,6 +569,10 @@ void __init cleanup_highmap(void)
if (!after_bootmem &&
!e820__mapped_any(paddr & PUD_MASK, paddr_next,
 E820_TYPE_RAM) &&
+#ifdef CONFIG_ZONE_NVM
+   !e820__mapped_any(paddr & PUD_MASK, paddr_next,
+E820_TYPE_PMEM) &&
+#endif
!e820__mapped_any(paddr & PUD_MASK, paddr_next,
 E820_TYPE_RESERVED_KERN))
set_pud(pud, __pud(0));
@@ -647,6 +659,10 @@ void __init cleanup_highmap(void)
if (!after_bootmem &&
!e820__mapped_any(paddr & P4D_MASK, paddr_next,
 E820_TYPE_RAM) &&
+#ifdef CONFIG_ZONE_NVM
+   !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+E820_TYPE_PMEM) &&
+#endif
!e820__mapped_any(paddr & P4D_MASK, paddr_next,
 E820_TYPE_RESERVED_KERN))
set_p4d(p4d, __p4d(0));
-- 
1.8.3.1
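
The same three-part test now appears at all four page-table levels; it could
be folded into a helper. A sketch (this consolidation is editorial, not part
of the posted patch):

static bool __init e820_any_backed(u64 start, u64 end)
{
	/* RAM is always considered backed. */
	if (e820__mapped_any(start, end, E820_TYPE_RAM))
		return true;
#ifdef CONFIG_ZONE_NVM
	/* With ZONE_NVM, persistent memory is kept mapped like RAM. */
	if (e820__mapped_any(start, end, E820_TYPE_PMEM))
		return true;
#endif
	return e820__mapped_any(start, end, E820_TYPE_RESERVED_KERN);
}

Each level's loop would then read, e.g.:

	if (!after_bootmem &&
	    !e820_any_backed(paddr & PMD_MASK, paddr_next))
		set_pmd(pmd, __pmd(0));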


RE: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Huaisheng HS1 Ye
>
>Dan Williams  writes:
>
>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox wrote:
>>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
 Traditionally, NVDIMMs are treated by mm (memory management) subsystem as
 DEVICE zone, which is a virtual zone and both its start and end of pfn
 are equal to 0; mm wouldn't manage NVDIMM directly as DRAM, kernel uses
 corresponding drivers, which locate at drivers/nvdimm/ and
 drivers/acpi/nfit/ and fs, to realize NVDIMM memory alloc and free with
 memory hot plug implementation.
>>>
>>> You probably want to let linux-nvdimm know about this patch set.
>>> Adding to the cc.
>>
>> Yes, thanks for that!
>>
>>> Also, I only received patch 0 and 4.  What happened
>>> to 1-3,5 and 6?
>>>
 With current kernel, many mm's classical features like the buddy
 system, swap mechanism and page cache couldn't be supported for NVDIMM.
 What we are doing is to expand kernel mm's capacity to make it handle
 NVDIMM like DRAM. Furthermore we make mm treat DRAM and NVDIMM
 separately, that means mm can put only the critical pages to NVDIMM
>
>Please define "critical pages."
>
 zone, here we created a new zone type as NVM zone. That is to say,
 traditional (or normal) pages would be stored at DRAM scope like the
 Normal, DMA32 and DMA zones. But for the critical pages, which we hope
 can be recovered after power failure or system crash, we make them
 persistent by storing them in the NVM zone.
>
>[...]
>
>> I think adding yet one more mm-zone is the wrong direction. Instead,
>> what we have been considering is a mechanism to allow a device-dax
>> instance to be given back to the kernel as a distinct numa node
>> managed by the VM. It seems it is time to dust off those patches.
>
>What's the use case?  The above patch description seems to indicate an
>intent to recover contents after a power loss.  Without seeing the whole
>series, I'm not sure how that's accomplished in a safe or meaningful
>way.
>
>Huaisheng, could you provide a bit more background?
>

Currently, in our mind, an ideal use scenario is to put all page caches into
zone_nvm. Without any doubt, the page cache is an efficient and common cache
implementation, but it has the disadvantage that all dirty data within it
risks being lost on power failure or system crash. If we put all page caches
into NVDIMMs, all dirty data will be safe.

And most importantly, the page cache is different from dm-cache or bcache:
the page cache lives in mm, so it has much better performance than other
write caches, which sit at the storage level.

At present we have implemented the NVM zone on a two-socket (NUMA) product
based on the Lenovo Purley platform, and we can expand the NVM flag into the
page cache allocation interface, so that all page caches of the system are
stored to NVDIMM safely.

Now we are focusing on how to recover data from the page cache after
power-on. That way the dirty pages stay safe and a lot of cache-warming time
is saved, because many pages were already stored in ZONE_NVM before the
power failure.
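
As a sketch of what expanding the NVM flag into the page cache allocation
interface could look like (the helper below is hypothetical and assumes the
GFP_NVM flag from patch 3/6; it is not code from this series):

/* Hypothetical: allocate a mapping's page cache pages from ZONE_NVM. */
static inline struct page *nvm_page_cache_alloc(struct address_space *mapping)
{
	gfp_t gfp = mapping_gfp_mask(mapping) | GFP_NVM;

	return __page_cache_alloc(gfp);
}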

Thanks,
Huaisheng Ye


Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Matthew Wilcox
On Tue, May 08, 2018 at 02:59:40AM +, Huaisheng HS1 Ye wrote:
> Currently, in our mind, an ideal use scenario is to put all page caches
> into zone_nvm. Without any doubt, the page cache is an efficient and
> common cache implementation, but it has the disadvantage that all dirty
> data within it risks being lost on power failure or system crash. If we
> put all page caches into NVDIMMs, all dirty data will be safe.

That's a common misconception.  Some dirty data will still be in the
CPU caches.  Are you planning on building servers which have enough
capacitance to allow the CPU to flush all dirty data from LLC to NV-DIMM?

Then there's the problem of reconnecting the page cache (which is
pointed to by ephemeral data structures like inodes and dentries) to
the new inodes.

And then you have to convince customers that what you're doing is safe
enough for them to trust it ;-)
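
For context on the first point: making data durable on an NV-DIMM requires
explicitly writing cache lines back and fencing, which the page cache never
does for ordinary pages. A minimal x86-64 sketch (editorial illustration,
assuming 64-byte cache lines; clwb() is the kernel's cache-line write-back
helper from <asm/special_insns.h>):

static void flush_range_to_pmem(void *addr, size_t len)
{
	char *p = (char *)((unsigned long)addr & ~63UL);  /* align to line */
	char *end = (char *)addr + len;

	for (; p < end; p += 64)
		clwb(p);			/* write back one cache line */
	asm volatile("sfence" ::: "memory");	/* order the write-backs */
}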


Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Dan Williams
On Mon, May 7, 2018 at 7:59 PM, Huaisheng HS1 Ye  wrote:
>>
>>Dan Williams  writes:
>>
>>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox wrote:
 On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
> Traditionally, NVDIMMs are treated by mm (memory management) subsystem as
> DEVICE zone, which is a virtual zone and both its start and end of pfn
> are equal to 0; mm wouldn't manage NVDIMM directly as DRAM, kernel uses
> corresponding drivers, which locate at drivers/nvdimm/ and
> drivers/acpi/nfit/ and fs, to realize NVDIMM memory alloc and free with
> memory hot plug implementation.

 You probably want to let linux-nvdimm know about this patch set.
 Adding to the cc.
>>>
>>> Yes, thanks for that!
>>>
 Also, I only received patch 0 and 4.  What happened
 to 1-3,5 and 6?

> With current kernel, many mm's classical features like the buddy
> system, swap mechanism and page cache couldn't be supported for NVDIMM.
> What we are doing is to expand kernel mm's capacity to make it handle
> NVDIMM like DRAM. Furthermore we make mm treat DRAM and NVDIMM
> separately, that means mm can put only the critical pages to NVDIMM
>>
>>Please define "critical pages."
>>
> zone, here we created a new zone type as NVM zone. That is to say,
> traditional (or normal) pages would be stored at DRAM scope like the
> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
> can be recovered after power failure or system crash, we make them
> persistent by storing them in the NVM zone.
>>
>>[...]
>>
>>> I think adding yet one more mm-zone is the wrong direction. Instead,
>>> what we have been considering is a mechanism to allow a device-dax
>>> instance to be given back to the kernel as a distinct numa node
>>> managed by the VM. It seems it is time to dust off those patches.
>>
>>What's the use case?  The above patch description seems to indicate an
>>intent to recover contents after a power loss.  Without seeing the whole
>>series, I'm not sure how that's accomplished in a safe or meaningful
>>way.
>>
>>Huaisheng, could you provide a bit more background?
>>
>
> Currently, in our mind, an ideal use scenario is to put all page caches
> into zone_nvm. Without any doubt, the page cache is an efficient and
> common cache implementation, but it has the disadvantage that all dirty
> data within it risks being lost on power failure or system crash. If we
> put all page caches into NVDIMMs, all dirty data will be safe.
>
> And most importantly, the page cache is different from dm-cache or
> bcache: the page cache lives in mm, so it has much better performance
> than other write caches, which sit at the storage level.

Can you be more specific? I think the only fundamental performance
difference between page cache and a block caching driver is that page
cache pages can be DMA'ed directly to lower level storage. However, I
believe that problem is solvable, i.e. we can teach dm-cache to
perform the equivalent of in-kernel direct-I/O when transferring data
between the cache and the backing storage when the cache is comprised
of persistent memory.

>
> At present we have implemented the NVM zone on a two-socket (NUMA)
> product based on the Lenovo Purley platform, and we can expand the NVM
> flag into the page cache allocation interface, so that all page caches
> of the system are stored to NVDIMM safely.
>
> Now we are focusing on how to recover data from the page cache after
> power-on. That way the dirty pages stay safe and a lot of cache-warming
> time is saved, because many pages were already stored in ZONE_NVM
> before the power failure.

I don't see how ZONE_NVM fits into a persistent page cache solution.
All of the mm structures to maintain the page cache are built to be
volatile. Once you build the infrastructure to persist and restore the
state of the page cache it is no longer the traditional page cache.
I.e. it will become something much closer to dm-cache or a filesystem.

One nascent idea from Dave Chinner is to teach xfs how to be a block
server for an upper level filesystem. His aim is sub-volume and
snapshot support, but I wonder if caching could be adapted into that
model?

In any event I think persisting and restoring cache state needs to be
designed before deciding if changes to the mm are needed.


Re: [External] [RFC PATCH v1 3/6] mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE

2018-05-07 Thread Randy Dunlap
On 05/07/2018 07:33 PM, Huaisheng HS1 Ye wrote:
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c782e8f..5fe1f63 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -687,6 +687,22 @@ config ZONE_DEVICE
>  
> +config ZONE_NVM
> + bool "Manage NVDIMM (pmem) by memory management (EXPERIMENTAL)"
> + depends on NUMA && X86_64

Hi,
I'm curious why this depends on NUMA. Couldn't it be useful in non-NUMA
(i.e., UMA) configs?

Thanks.

> + depends on HAVE_MEMBLOCK_NODE_MAP
> + depends on HAVE_MEMBLOCK
> + depends on !IA32_EMULATION
> + default n
> +
> + help
> +   This option allows the memory management subsystem to manage
> +   NVDIMM (pmem). With it, mm can arrange NVDIMMs into real physical
> +   zones like NORMAL and DMA32, which means the buddy system and swap
> +   can be used directly on the NVDIMM zone. This feature helps recover
> +   dirty pages after power failure or system crash by storing the
> +   write cache in the NVDIMM zone.



-- 
~Randy