[PATCH] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions

2021-01-05 Thread Dan Williams
While pfn_to_online_page() is able to determine pfn_valid() at
subsection granularity it is not able to reliably determine if a given
pfn is also online if the section is mixed with ZONE_DEVICE pfns.

Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
section that mixes ZONE_DEVICE pfns with other online pfns. With
SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall
back to a slow-path check for ZONE_DEVICE pfns in an online section.

With this implementation of pfn_to_online_page() pfn-walkers mostly only
need to check section metadata to determine pfn validity. In the
rare case of mixed-zone sections the pfn-walker will skip offline
ZONE_DEVICE pfns as expected.

Other notes:

Because the collision case is rare, and for simplicity, the
SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.

pfn_to_online_page() was already borderline too large to be a macro /
inline function, but the additional logic certainly pushed it over that
threshold, and so it is moved to an out-of-line helper.

Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Cc: Andrew Morton 
Reported-by: Michal Hocko 
Reported-by: David Hildenbrand 
Signed-off-by: Dan Williams 
---

This compiles and passes the nvdimm unit tests, but I have not tested
with pfn walkers in the presence of ZONE_DEVICE collisions.


 include/linux/memory_hotplug.h |   17 +-
 include/linux/mmzone.h |   22 
 mm/memory_hotplug.c|   71 
 3 files changed, 87 insertions(+), 23 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 15acce5ab106..3d99de0db2dd 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -16,22 +16,7 @@ struct resource;
 struct vmem_altmap;
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-/*
- * Return page for the valid pfn only if the page is online. All pfn
- * walkers which rely on the fully initialized page->flags and others
- * should use this rather than pfn_valid && pfn_to_page
- */
-#define pfn_to_online_page(pfn)   \
-({\
-   struct page *___page = NULL;   \
-   unsigned long ___pfn = pfn;\
-   unsigned long ___nr = pfn_to_section_nr(___pfn);   \
-  \
-   if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
-   pfn_valid_within(___pfn))  \
-   ___page = pfn_to_page(___pfn); \
-   ___page;   \
-})
+struct page *pfn_to_online_page(unsigned long pfn);
 
 /*
  * Types for free bootmem stored in page->lru.next. These have to be in
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b593316bff3d..0b5c44f730b4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1273,13 +1273,14 @@ extern size_t mem_section_usage_size(void);
  *  which results in PFN_SECTION_SHIFT equal 6.
  * To sum it up, at least 6 bits are available.
  */
-#defineSECTION_MARKED_PRESENT  (1UL<<0)
-#define SECTION_HAS_MEM_MAP(1UL<<1)
-#define SECTION_IS_ONLINE  (1UL<<2)
-#define SECTION_IS_EARLY   (1UL<<3)
-#define SECTION_MAP_LAST_BIT   (1UL<<4)
-#define SECTION_MAP_MASK   (~(SECTION_MAP_LAST_BIT-1))
-#define SECTION_NID_SHIFT  3
+#define SECTION_MARKED_PRESENT (1UL<<0)
+#define SECTION_HAS_MEM_MAP(1UL<<1)
+#define SECTION_IS_ONLINE  (1UL<<2)
+#define SECTION_IS_EARLY   (1UL<<3)
+#define SECTION_TAINT_ZONE_DEVICE  (1UL<<4)
+#define SECTION_MAP_LAST_BIT   (1UL<<5)
+#define SECTION_MAP_MASK   (~(SECTION_MAP_LAST_BIT-1))
+#define SECTION_NID_SHIFT  3
 
 static inline struct page *__section_mem_map_addr(struct mem_section *section)
 {
@@ -1318,6 +1319,13 @@ static inline int online_section(struct mem_section 
*section)
return (section && (section->section_mem_map & SECTION_IS_ONLINE));
 }
 
+static inline int online_device_section(struct mem_section *section)
+{
+   unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE;
+
+   return section && ((section->section_mem_map & flags) == flags);
+}
+
 static inline int online_section_nr(unsigned long nr)
 {
return online_section(__nr_to_section(nr));
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f9d57b9be8c7..9f36968e6188 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -300,6 +300,47 @@ static int check_hotplug_memory_addressable(unsigned long 
pfn,
return 0;
 }
 
+/*
+ * Return page for the valid pfn only if t

Re: [PATCH] ACPI: NFIT: Fix flexible_array.cocci warnings

2021-01-05 Thread Dan Williams
On Tue, Jan 5, 2021 at 1:28 PM Verma, Vishal L  wrote:
>
> On Tue, 2021-01-05 at 13:03 -0800, Dan Williams wrote:
> > Julia and 0day report:
> >
> > Zero-length and one-element arrays are deprecated, see
> > Documentation/process/deprecated.rst
> > Flexible-array members should be used instead.
> >
> > However, a straight conversion to flexible arrays yields:
> >
> > drivers/acpi/nfit/core.c:2276:4: error: flexible array member in a 
> > struct with no named members
> > drivers/acpi/nfit/core.c:2287:4: error: flexible array member in a 
> > struct with no named members
> >
> > Instead, just use plain arrays not embedded a flexible arrays.
>
> This reads a bit awkwardly, maybe:
>
>   "Just use plain arrays instead of embedded flexible arrays."

yeah, umm, I left that extra "a" in there as a test... you passed! :)

>
> Other than that, the patch looks looks good:
> Reviewed-by: Vishal Verma 

Thanks.


[PATCH] ACPI: NFIT: Fix flexible_array.cocci warnings

2021-01-05 Thread Dan Williams
Julia and 0day report:

Zero-length and one-element arrays are deprecated, see
Documentation/process/deprecated.rst
Flexible-array members should be used instead.

However, a straight conversion to flexible arrays yields:

drivers/acpi/nfit/core.c:2276:4: error: flexible array member in a struct 
with no named members
drivers/acpi/nfit/core.c:2287:4: error: flexible array member in a struct 
with no named members

Instead, just use plain arrays not embedded a flexible arrays.

Cc: Denis Efremov 
Reported-by: kernel test robot 
Reported-by: Julia Lawall 
Signed-off-by: Dan Williams 
---
 drivers/acpi/nfit/core.c |   75 +-
 1 file changed, 28 insertions(+), 47 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index b11b08a60684..8c5dde628405 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2269,40 +2269,24 @@ static const struct attribute_group 
*acpi_nfit_region_attribute_groups[] = {
 
 /* enough info to uniquely specify an interleave set */
 struct nfit_set_info {
-   struct nfit_set_info_map {
-   u64 region_offset;
-   u32 serial_number;
-   u32 pad;
-   } mapping[0];
+   u64 region_offset;
+   u32 serial_number;
+   u32 pad;
 };
 
 struct nfit_set_info2 {
-   struct nfit_set_info_map2 {
-   u64 region_offset;
-   u32 serial_number;
-   u16 vendor_id;
-   u16 manufacturing_date;
-   u8  manufacturing_location;
-   u8  reserved[31];
-   } mapping[0];
+   u64 region_offset;
+   u32 serial_number;
+   u16 vendor_id;
+   u16 manufacturing_date;
+   u8 manufacturing_location;
+   u8 reserved[31];
 };
 
-static size_t sizeof_nfit_set_info(int num_mappings)
-{
-   return sizeof(struct nfit_set_info)
-   + num_mappings * sizeof(struct nfit_set_info_map);
-}
-
-static size_t sizeof_nfit_set_info2(int num_mappings)
-{
-   return sizeof(struct nfit_set_info2)
-   + num_mappings * sizeof(struct nfit_set_info_map2);
-}
-
 static int cmp_map_compat(const void *m0, const void *m1)
 {
-   const struct nfit_set_info_map *map0 = m0;
-   const struct nfit_set_info_map *map1 = m1;
+   const struct nfit_set_info *map0 = m0;
+   const struct nfit_set_info *map1 = m1;
 
return memcmp(>region_offset, >region_offset,
sizeof(u64));
@@ -2310,8 +2294,8 @@ static int cmp_map_compat(const void *m0, const void *m1)
 
 static int cmp_map(const void *m0, const void *m1)
 {
-   const struct nfit_set_info_map *map0 = m0;
-   const struct nfit_set_info_map *map1 = m1;
+   const struct nfit_set_info *map0 = m0;
+   const struct nfit_set_info *map1 = m1;
 
if (map0->region_offset < map1->region_offset)
return -1;
@@ -2322,8 +2306,8 @@ static int cmp_map(const void *m0, const void *m1)
 
 static int cmp_map2(const void *m0, const void *m1)
 {
-   const struct nfit_set_info_map2 *map0 = m0;
-   const struct nfit_set_info_map2 *map1 = m1;
+   const struct nfit_set_info2 *map0 = m0;
+   const struct nfit_set_info2 *map1 = m1;
 
if (map0->region_offset < map1->region_offset)
return -1;
@@ -2361,22 +2345,22 @@ static int acpi_nfit_init_interleave_set(struct 
acpi_nfit_desc *acpi_desc,
return -ENOMEM;
import_guid(_set->type_guid, spa->range_guid);
 
-   info = devm_kzalloc(dev, sizeof_nfit_set_info(nr), GFP_KERNEL);
+   info = devm_kcalloc(dev, nr, sizeof(*info), GFP_KERNEL);
if (!info)
return -ENOMEM;
 
-   info2 = devm_kzalloc(dev, sizeof_nfit_set_info2(nr), GFP_KERNEL);
+   info2 = devm_kcalloc(dev, nr, sizeof(*info2), GFP_KERNEL);
if (!info2)
return -ENOMEM;
 
for (i = 0; i < nr; i++) {
struct nd_mapping_desc *mapping = _desc->mapping[i];
-   struct nfit_set_info_map *map = >mapping[i];
-   struct nfit_set_info_map2 *map2 = >mapping[i];
struct nvdimm *nvdimm = mapping->nvdimm;
struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
-   struct acpi_nfit_memory_map *memdev = memdev_from_spa(acpi_desc,
-   spa->range_index, i);
+   struct nfit_set_info *map = [i];
+   struct nfit_set_info2 *map2 = [i];
+   struct acpi_nfit_memory_map *memdev =
+   memdev_from_spa(acpi_desc, spa->range_index, i);
struct acpi_nfit_control_region *dcr = nfit_mem->dcr;
 
if (!memdev || !nfit_mem->dcr) {
@@ -2395,23 +2379,20 @@ static int acpi_nfit_init_interleave_set(struct 
acpi_nfit_desc *acpi_desc,
}
 
/* v1.1 namespaces */
-   sort(>mapping[0], nr, sizeof(stru

[tip: x86/urgent] x86/mm: Fix leak of pmd ptlock

2021-01-05 Thread tip-bot2 for Dan Williams
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: d1c5246e08eb64991001d97a3bd119c93edbc79a
Gitweb:
https://git.kernel.org/tip/d1c5246e08eb64991001d97a3bd119c93edbc79a
Author:Dan Williams 
AuthorDate:Wed, 02 Dec 2020 22:28:12 -08:00
Committer: Borislav Petkov 
CommitterDate: Tue, 05 Jan 2021 11:40:23 +01:00

x86/mm: Fix leak of pmd ptlock

Commit

  28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")

introduced a new location where a pmd was released, but neglected to
run the pmd page destructor. In fact, this happened previously for a
different pmd release path and was fixed by commit:

  c283610e44ec ("x86, mm: do not leak page->ptl for pmd page tables").

This issue was hidden until recently because the failure mode is silent,
but commit:

  b2b29d6d0119 ("mm: account PMD tables like PTE tables")

turns the failure mode into this signature:

 BUG: Bad page state in process lt-pmem-ns  pfn:15943d
 page:7262ed7b refcount:0 mapcount:-1024 mapping: 
index:0x0 pfn:0x15943d
 flags: 0xa8()
 raw: 00a8 dead0100  
 raw:  913a029bcc08 fbff 
 page dumped because: nonzero mapcount
 [..]
  dump_stack+0x8b/0xb0
  bad_page.cold+0x63/0x94
  free_pcp_prepare+0x224/0x270
  free_unref_page+0x18/0xd0
  pud_free_pmd_page+0x146/0x160
  ioremap_pud_range+0xe3/0x350
  ioremap_page_range+0x108/0x160
  __ioremap_caller.constprop.0+0x174/0x2b0
  ? memremap+0x7a/0x110
  memremap+0x7a/0x110
  devm_memremap+0x53/0xa0
  pmem_attach_disk+0x4ed/0x530 [nd_pmem]
  ? __devm_release_region+0x52/0x80
  nvdimm_bus_probe+0x85/0x210 [libnvdimm]

Given this is a repeat occurrence it seemed prudent to look for other
places where this destructor might be missing and whether a better
helper is needed. try_to_free_pmd_page() looks like a candidate, but
testing with setting up and tearing down pmd mappings via the dax unit
tests is thus far not triggering the failure.

As for a better helper pmd_free() is close, but it is a messy fit
due to requiring an @mm arg. Also, ___pmd_free_tlb() wants to call
paravirt_tlb_remove_table() instead of free_page(), so open-coded
pgtable_pmd_page_dtor() seems the best way forward for now.

Debugged together with Matthew Wilcox .

Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Dan Williams 
Signed-off-by: Borislav Petkov 
Tested-by: Yi Zhang 
Acked-by: Peter Zijlstra (Intel) 
Cc: 
Link: 
https://lkml.kernel.org/r/160697689204.605323.17629854984697045602.st...@dwillia2-desk3.amr.corp.intel.com
---
 arch/x86/mm/pgtable.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index dfd82f5..f6a9e2e 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -829,6 +829,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
}
 
free_page((unsigned long)pmd_sv);
+
+   pgtable_pmd_page_dtor(virt_to_page(pmd));
free_page((unsigned long)pmd);
 
return 1;


[tip: x86/urgent] x86/mm: Fix leak of pmd ptlock

2021-01-05 Thread tip-bot2 for Dan Williams
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 1bd3d9c593211e09771562b464028d3ab7e05b3a
Gitweb:
https://git.kernel.org/tip/1bd3d9c593211e09771562b464028d3ab7e05b3a
Author:Dan Williams 
AuthorDate:Wed, 02 Dec 2020 22:28:12 -08:00
Committer: Borislav Petkov 
CommitterDate: Tue, 05 Jan 2021 11:17:21 +01:00

x86/mm: Fix leak of pmd ptlock

Commit

  28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")

introduced a new location where a pmd was released, but neglected to
run the pmd page destructor. In fact, this happened previously for a
different pmd release path and was fixed by commit:

  c283610e44ec ("x86, mm: do not leak page->ptl for pmd page tables").

This issue was hidden until recently because the failure mode is silent,
but commit:

  b2b29d6d0119 ("mm: account PMD tables like PTE tables")

turns the failure mode into this signature:

 BUG: Bad page state in process lt-pmem-ns  pfn:15943d
 page:7262ed7b refcount:0 mapcount:-1024 mapping: 
index:0x0 pfn:0x15943d
 flags: 0xa8()
 raw: 00a8 dead0100  
 raw:  913a029bcc08 fbff 
 page dumped because: nonzero mapcount
 [..]
  dump_stack+0x8b/0xb0
  bad_page.cold+0x63/0x94
  free_pcp_prepare+0x224/0x270
  free_unref_page+0x18/0xd0
  pud_free_pmd_page+0x146/0x160
  ioremap_pud_range+0xe3/0x350
  ioremap_page_range+0x108/0x160
  __ioremap_caller.constprop.0+0x174/0x2b0
  ? memremap+0x7a/0x110
  memremap+0x7a/0x110
  devm_memremap+0x53/0xa0
  pmem_attach_disk+0x4ed/0x530 [nd_pmem]
  ? __devm_release_region+0x52/0x80
  nvdimm_bus_probe+0x85/0x210 [libnvdimm]

Given this is a repeat occurrence it seemed prudent to look for other
places where this destructor might be missing and whether a better
helper is needed. try_to_free_pmd_page() looks like a candidate, but
testing with setting up and tearing down pmd mappings via the dax unit
tests is thus far not triggering the failure.

As for a better helper pmd_free() is close, but it is a messy fit
due to requiring an @mm arg. Also, ___pmd_free_tlb() wants to call
paravirt_tlb_remove_table() instead of free_page(), so open-coded
pgtable_pmd_page_dtor() seems the best way forward for now.

Debugged together with Matthew Wilcox .

Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Signed-off-by: Dan Williams 
Signed-off-by: Borislav Petkov 
Tested-by: Yi Zhang 
Cc: 
Link: 
https://lkml.kernel.org/r/160697689204.605323.17629854984697045602.st...@dwillia2-desk3.amr.corp.intel.com
---
 arch/x86/mm/pgtable.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index dfd82f5..f6a9e2e 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -829,6 +829,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
}
 
free_page((unsigned long)pmd_sv);
+
+   pgtable_pmd_page_dtor(virt_to_page(pmd));
free_page((unsigned long)pmd);
 
return 1;


Re: uninitialized pmem struct pages

2021-01-05 Thread Dan Williams
On Tue, Jan 5, 2021 at 1:37 AM David Hildenbrand  wrote:
>
> >> Yeah, obviously the first one. Being able to add+use PMEM is more
> >> important than using each and every last MB of main memory.
> >>
> >> I wonder if we can just stop adding any system RAM like
> >>
> >> [ Memory Section]
> >> [ RAM ] [  Hole ]
> >>
> >> When there could be the possibility that the hole might actually be
> >> PMEM. (e.g., with CONFIG_ZONE_DEVICE and it being the last section in a
> >> sequence of sections, not just a tiny hole)
> >
> > I like the simplicity of it... I worry that the capacity loss
> > regression is easy to notice by looking at the output of free(1) from
> > one kernel to the next and someone screams.
>
> Well, you can always make it configurable and then simply fail to add
> PMEM later if impossible (trying to sub-section hot-add into early
> section). It's in the hands of the sysadmin then ("max out system ram"
> vs. "support any PMEM device that could eventually be there at
> runtime"). Distros would go for the second.
>
> I agree that it's not optimal, but sometimes simplicity has to win.

Here's where we left it last time, open to pfn_to_online_page hacks...

https://lore.kernel.org/linux-mm/CAPcyv4ivq=epuepxix2ercvyf7+dv9yv215oue7x_y2x_jf...@mail.gmail.com

I don't think a slow-path flag in the mem-section is too onerous, but
I'll withhold judgement until I have the patch I'm thinking of
in-hand. Let me give it a shot, you can always nack the final result.


Re: uninitialized pmem struct pages

2021-01-05 Thread Dan Williams
On Tue, Jan 5, 2021 at 1:25 AM David Hildenbrand  wrote:
>
> On 05.01.21 06:17, Dan Williams wrote:
> > On Mon, Jan 4, 2021 at 2:45 AM David Hildenbrand  wrote:
> >>
> >> On 04.01.21 11:03, Michal Hocko wrote:
> >>> Hi,
> >>> back in March [1] you have recommended 53cdc1cb29e8
> >>> ("drivers/base/memory.c: indicate all memory blocks as removable") to be
> >>> backported to stable trees and that has led to a more general discussion
> >>> about the current state of pfn walkers wrt. uninitialized pmem struct
> >>> pages. We haven't concluded any specific solution for that except for a
> >>> general sentiment that pfn_to_online_page should be able to catch those.
> >>> I might have missed any follow ups on that but I do not think we have
> >>> landed on any actual solution in the end. Have I just missed any 
> >>> followups?
> >>
> >> Thanks for raising this issue. It's still on my list of "broken and
> >> non-trivial to fix" things.
> >>
> >> So, as far as I recall, we still have the following two issues remaining:
> >>
> >> 1. pfn_to_online_page() false positives
> >>
> >> The semantics of pfn_to_online_page() were broken with sub-section
> >> hot-add in corner cases.
> >
> > The motivation for that backport was for pre-subsection kernels. When
> > onlining pmem that collides with the same section as System-RAM we may
> > have a situation like:
> >
> > |--   SECTION   -- |
> > |-- System RAM  --PMEM  -- |
> > |-- pfn_valid() -- PMEM metadata-- |
> >
> > So problem 0. is just System RAM + PMEM collisions independent of
> > sub-section support.
>
> IIRC, you were not able to hot-add the PMEM device before sub-section
> support (which was bad and fixed by sub-section hot-add). How was this a
> problem before?

The details are in the subsection changelog but the tl;dr is that I
clumsily hacked around it by throwing away PMEM, but that breaks when
PMEM shifts around in physical address space from one boot to the
next. The moving around is on 64MB granularity, so subsection hotplug
lets those shifts be covered and enables smaller granularity /
alignment support for smaller things like p2pdma mappings.

>
> >
> >>
> >> If we have ZONE_DEVICE hot-added memory that overlaps in a section with
> >> boot memory, this memory section will contain parts ZONE_DEVICE memory
> >> and parts !ZONE_DEVICE memory. This can happen in sub-section
> >> granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
> >> memory parts as the whole section is marked as online. Bad.
> >>
> >> One instance where this is still an issue is
> >> mm/memory-failure.c:memory_failure() and
> >> mm/memory-failure.c:soft_offline_page(). I thought for a while about
> >> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
> >> actually the right approach.
> >
> > This is complicated by MEMORY_DEVICE_PRIVATE that I believe wants to
> > say "yes" to pfn_to_online_page().
> >
> >>
> >> But worse, before ZONE_DEVICE hot-add
> >> 1. The whole section memmap does already exist (early sections always
> >> have a full memmap for the whole section)
> >> 2. The whole section memmap is initialized (although eventually with
> >> dummy node/zone 0/0 for memory holes until that part is fixed) and might
> >> be accessed by pfn walkers.
> >>
> >> So when hotadding ZONE_DEVICE we are modifying already existing and
> >> visible memmaps. Bad.
> >
> > Where does the rewrite of a dummy page entry matter in practice? It
> > would certainly be exceedingly Bad if in-use 'struct page' instances
> > we're rewritten. You're only alleging the former, correct?
>
> Yes, just another piece of the puzzle.
>
> >> 2. Deferred init of ZONE_DEVICE ranges
> >>
> >> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
> >> and outside the memhp lock. I did not follow if the use of
> >> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
> >> pagemap_range() actually completed. I don't think it does.
> >>
> >>>
> >>> Is anybody working on that?
> >>>
> >>
> >> I believe Dan mentioned somewhere that he wants to see a real instance
> >> of this producing a BUG before actually moving forward with a fix. I
> >> might be wron

Re: uninitialized pmem struct pages

2021-01-05 Thread Dan Williams
On Tue, Jan 5, 2021 at 12:42 AM Michal Hocko  wrote:
>
> On Tue 05-01-21 00:27:34, Dan Williams wrote:
> > On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko  wrote:
> > >
> > > On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> > > > On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> > > > > On 04.01.21 16:43, David Hildenbrand wrote:
> > > > > > On 04.01.21 16:33, Michal Hocko wrote:
> > > > > >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > > > > >>> On 04.01.21 16:10, Michal Hocko wrote:
> > > > > >> [...]
> > > > > >>> Do the physical addresses you see fall into the same section as 
> > > > > >>> boot
> > > > > >>> memory? Or what's around these addresses?
> > > > > >>
> > > > > >> Yes I am getting a garbage for the first struct page belonging to 
> > > > > >> the
> > > > > >> pmem section [1]
> > > > > >> [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 
> > > > > >> 0x1-0x603fff]
> > > > > >> [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 
> > > > > >> 0x606000-0x11d5fff] non-volatile
> > > > > >>
> > > > > >> The pfn without the initialized struct page is 0x606. This is a
> > > > > >> first pfn in a section.
> > > > > >
> > > > > > Okay, so we're not dealing with the "early section" mess I 
> > > > > > described,
> > > > > > different story.
> > > > > >
> > > > > > Due to [1], is_mem_section_removable() called
> > > > > > pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as 
> > > > > > not
> > > > > > initialized.
> > > > > >
> > > > > > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > > > > > actual address of the memmap?
> > > > > >
> > > > > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it 
> > > > > > actually
> > > > > > part of the actual altmap (i.e. > 0x606) or maybe even 
> > > > > > self-hosted?
> > > > > >
> > > > > > If it's not self-hosted, initializing the relevant memmaps should 
> > > > > > work
> > > > > > just fine I guess. Otherwise things get more complicated.
> > > > >
> > > > > Oh, I forgot: pfn_to_online_page() should at least in your example 
> > > > > make
> > > > > sure other pfn walkers are safe. It was just an issue of
> > > > > is_mem_section_removable().
> > > >
> > > > Hmm, I suspect you are right. I haven't put this together, thanks! The 
> > > > memory
> > > > section is indeed marked offline so pfn_to_online_page would indeed bail
> > > > out:
> > > > crash> p (0x606>>15)
> > > > $3 = 3084
> > > > crash> p mem_section[3084/128][3084 & 127]
> > > > $4 = {
> > > >   section_mem_map = 18446736128020054019,
> > > >   usage = 0x902dcf956680,
> > > >   page_ext = 0x0,
> > > >   pad = 0
> > > > }
> > > > crash> p 18446736128020054019 & (1UL<<2)
> > > > $5 = 0
> > > >
> > > > That makes it considerably less of a problem than I thought!
> > >
> > > Forgot to add that those who are running kernels without 53cdc1cb29e8
> > > ("drivers/base/memory.c: indicate all memory blocks as removable") for
> > > some reason can fix the crash by the following simple patch.
> > >
> > > Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > > ===
> > > --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> > > +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> > > @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> > > goto out;
> > >
> > > for (i = 0; i < sections_per_block; i++) {
> > > -   if (!present_section_nr(mem->start_section_nr + i))
> > > +   unsigned long nr = mem->start_section_nr + i;
> > > +   if (!p

Re: uninitialized pmem struct pages

2021-01-05 Thread Dan Williams
On Tue, Jan 5, 2021 at 12:17 AM Michal Hocko  wrote:
>
> On Tue 05-01-21 09:01:00, Michal Hocko wrote:
> > On Mon 04-01-21 16:44:52, David Hildenbrand wrote:
> > > On 04.01.21 16:43, David Hildenbrand wrote:
> > > > On 04.01.21 16:33, Michal Hocko wrote:
> > > >> On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > > >>> On 04.01.21 16:10, Michal Hocko wrote:
> > > >> [...]
> > > >>> Do the physical addresses you see fall into the same section as boot
> > > >>> memory? Or what's around these addresses?
> > > >>
> > > >> Yes I am getting a garbage for the first struct page belonging to the
> > > >> pmem section [1]
> > > >> [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x603fff]
> > > >> [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 
> > > >> 0x606000-0x11d5fff] non-volatile
> > > >>
> > > >> The pfn without the initialized struct page is 0x606. This is a
> > > >> first pfn in a section.
> > > >
> > > > Okay, so we're not dealing with the "early section" mess I described,
> > > > different story.
> > > >
> > > > Due to [1], is_mem_section_removable() called
> > > > pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as not
> > > > initialized.
> > > >
> > > > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > > > actual address of the memmap?
> > > >
> > > > I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it actually
> > > > part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?
> > > >
> > > > If it's not self-hosted, initializing the relevant memmaps should work
> > > > just fine I guess. Otherwise things get more complicated.
> > >
> > > Oh, I forgot: pfn_to_online_page() should at least in your example make
> > > sure other pfn walkers are safe. It was just an issue of
> > > is_mem_section_removable().
> >
> > Hmm, I suspect you are right. I haven't put this together, thanks! The 
> > memory
> > section is indeed marked offline so pfn_to_online_page would indeed bail
> > out:
> > crash> p (0x606>>15)
> > $3 = 3084
> > crash> p mem_section[3084/128][3084 & 127]
> > $4 = {
> >   section_mem_map = 18446736128020054019,
> >   usage = 0x902dcf956680,
> >   page_ext = 0x0,
> >   pad = 0
> > }
> > crash> p 18446736128020054019 & (1UL<<2)
> > $5 = 0
> >
> > That makes it considerably less of a problem than I thought!
>
> Forgot to add that those who are running kernels without 53cdc1cb29e8
> ("drivers/base/memory.c: indicate all memory blocks as removable") for
> some reason can fix the crash by the following simple patch.
>
> Index: linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> ===
> --- linux-5.3-users_mhocko_SLE15-SP2_for-next.orig/drivers/base/memory.c
> +++ linux-5.3-users_mhocko_SLE15-SP2_for-next/drivers/base/memory.c
> @@ -152,9 +152,14 @@ static ssize_t removable_show(struct dev
> goto out;
>
> for (i = 0; i < sections_per_block; i++) {
> -   if (!present_section_nr(mem->start_section_nr + i))
> +   unsigned long nr = mem->start_section_nr + i;
> +   if (!present_section_nr(nr))
> continue;
> -   pfn = section_nr_to_pfn(mem->start_section_nr + i);
> +   if (!online_section_nr()) {

I assume that's onlince_section_nr(nr) in the version that compiles?

This makes sense because the memory block size is larger than the
section size. I suspect you have 1GB memory block size on this system,
but since the System RAM and PMEM collide at a 512MB alignment in a
memory block you end up walking the back end of the last 512MB of the
System RAM memory block and run into the offline PMEM section.

So, I don't think it's pfn_to_online_page that necessarily needs to
know how to disambiguate each page, it's things that walk sections and
memory blocks and expects them to be consistent over the span.


Re: [PATCH] arm64: add pmem module for kernel update

2021-01-04 Thread Dan Williams
Hi Zhuling,

On Wed, Dec 30, 2020 at 10:18 PM Zhuling  wrote:
>
> Category: feature
> Bugzilla: NA
> CVE: NA

These tags can be dropped.

>
> Use reserved memory to create a pmem device to store the
> processes information that dumped before kernel update.
> When you want to use this feature you need to declare by
> "pmemmem=pmem_size:pmem_phystart" in cmdline.
> (exp: pmemmem=100M:0x2020)
>

Interesting. I like the feature, but it's not clear to me that a new
command line based configuration scheme is needed. There is the
existing memmap= parameter that on x86 describes a
IORES_DESC_PERSISTENT_MEMORY_LEGACY range. The driver/nvdimm/e820.c
driver could be reworked to attach to the same thing on ARM64.

Then as far as assigning memory to different kernel usages there is
the existing capability in libnvdimm to attach a "personality" to an
nvdimm namespace. I imagine you could write a special signature to the
namespace that libnvdimm would recognize as a KUP reservation
namespace and work generically across any arch.


Re: uninitialized pmem struct pages

2021-01-04 Thread Dan Williams
On Mon, Jan 4, 2021 at 7:59 AM Michal Hocko  wrote:
>
> On Mon 04-01-21 16:43:49, David Hildenbrand wrote:
> > On 04.01.21 16:33, Michal Hocko wrote:
> > > On Mon 04-01-21 16:15:23, David Hildenbrand wrote:
> > >> On 04.01.21 16:10, Michal Hocko wrote:
> > > [...]
> > >> Do the physical addresses you see fall into the same section as boot
> > >> memory? Or what's around these addresses?
> > >
> > > Yes I am getting a garbage for the first struct page belonging to the
> > > pmem section [1]
> > > [0.020161] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x603fff]
> > > [0.020163] ACPI: SRAT: Node 4 PXM 4 [mem 0x606000-0x11d5fff] 
> > > non-volatile
> > >
> > > The pfn without the initialized struct page is 0x606. This is a
> > > first pfn in a section.
> >
> > Okay, so we're not dealing with the "early section" mess I described,
> > different story.
> >
> > Due to [1], is_mem_section_removable() called
> > pfn_to_page(PHYS_PFN(0x606)). page_zone(page) made it crash, as not
> > initialized.
>
> Exactly!
>
> > Let's assume this is indeed a reserved pfn in the altmap. What's the
> > actual address of the memmap?
>
> Not sure what exactly you are asking for but crash says
> crash> kmem -p 606
>   PAGE  PHYSICAL  MAPPING   INDEX CNT FLAGS
> f8c600181800 60600  0 fc000
>
> > I do wonder what hosts pfn_to_page(PHYS_PFN(0x606)) - is it actually
> > part of the actual altmap (i.e. > 0x606) or maybe even self-hosted?
>
> I am not really familiar with the pmem so I would need more assistance
> here. I've tried this (shot into the dark):
> crash> struct page.pgmap f8c600181800
>   pgmap = 0xf8c600181808

Does /proc/iomem show an active namespace in the range? You should be
able to skip ahead to the first pfn in that namespace to find the
first dev_pagemap. I would have expected pfn_to_online_page() to have
saved you here. This address range is section aligned.


Re: uninitialized pmem struct pages

2021-01-04 Thread Dan Williams
On Mon, Jan 4, 2021 at 2:45 AM David Hildenbrand  wrote:
>
> On 04.01.21 11:03, Michal Hocko wrote:
> > Hi,
> > back in March [1] you have recommended 53cdc1cb29e8
> > ("drivers/base/memory.c: indicate all memory blocks as removable") to be
> > backported to stable trees and that has led to a more general discussion
> > about the current state of pfn walkers wrt. uninitialized pmem struct
> > pages. We haven't concluded any specific solution for that except for a
> > general sentiment that pfn_to_online_page should be able to catch those.
> > I might have missed any follow ups on that but I do not think we have
> > landed on any actual solution in the end. Have I just missed any followups?
>
> Thanks for raising this issue. It's still on my list of "broken and
> non-trivial to fix" things.
>
> So, as far as I recall, we still have the following two issues remaining:
>
> 1. pfn_to_online_page() false positives
>
> The semantics of pfn_to_online_page() were broken with sub-section
> hot-add in corner cases.

The motivation for that backport was for pre-subsection kernels. When
onlining pmem that collides with the same section as System-RAM we may
have a situation like:

|--   SECTION   -- |
|-- System RAM  --PMEM  -- |
|-- pfn_valid() -- PMEM metadata-- |

So problem 0. is just System RAM + PMEM collisions independent of
sub-section support.

>
> If we have ZONE_DEVICE hot-added memory that overlaps in a section with
> boot memory, this memory section will contain parts ZONE_DEVICE memory
> and parts !ZONE_DEVICE memory. This can happen in sub-section
> granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE
> memory parts as the whole section is marked as online. Bad.
>
> One instance where this is still an issue is
> mm/memory-failure.c:memory_failure() and
> mm/memory-failure.c:soft_offline_page(). I thought for a while about
> "fixing" these, but to me it felt like fixing pfn_to_online_page() is
> actually the right approach.

This is complicated by MEMORY_DEVICE_PRIVATE that I believe wants to
say "yes" to pfn_to_online_page().

>
> But worse, before ZONE_DEVICE hot-add
> 1. The whole section memmap does already exist (early sections always
> have a full memmap for the whole section)
> 2. The whole section memmap is initialized (although eventually with
> dummy node/zone 0/0 for memory holes until that part is fixed) and might
> be accessed by pfn walkers.
>
> So when hotadding ZONE_DEVICE we are modifying already existing and
> visible memmaps. Bad.

Where does the rewrite of a dummy page entry matter in practice? It
would certainly be exceedingly Bad if in-use 'struct page' instances
we're rewritten. You're only alleging the former, correct?

>
>
> 2. Deferred init of ZONE_DEVICE ranges
>
> memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized
> and outside the memhp lock. I did not follow if the use of
> get_dev_pagemap() actually makes sure that memmap_init_zone_device() in
> pagemap_range() actually completed. I don't think it does.
>
> >
> > Is anybody working on that?
> >
>
> I believe Dan mentioned somewhere that he wants to see a real instance
> of this producing a BUG before actually moving forward with a fix. I
> might be wrong.

I think I'm missing an argument for the user-visible effects of the
"Bad." statements above. I think soft_offline_page() is a candidate
for a local fix because mm/memory-failure.c already has a significant
amount of page-type specific knowledge. So teaching it "yes" for
MEMORY_DEVICE_PRIVATE-ZONE_DEVICE and "no" for other ZONE_DEVICE seems
ok to me.

>
>
> We might tackle 1. by:
>
> a) [worked-around] doing get_dev_pagemap() before pfn_to_online_page() -
> just plain ugly.
>
> b) [worked-around] using a sub-section online-map and extending
> pfn_to_online_page(). IMHO ugly to fix this corner case.
>
> c) [worked-around] extending pfn_to_online_page() by zone checks. IMHO racy.
>
> d) fixed by not allowing ZONE_DEVICE and !ZONE_DEVICE within a single
> section. In the worst case, don't add partially present sections that
> have big holes in the beginning/end. Like, if there is a 128MB section
> with 126MB of memory followed by a 2MB hole, don't add that memory to
> Linux with CONFIG_ZONE_DEVICE. Might turn some memory unusable, but
> well, it would be the price to pay for simplicity. Being able to hotadd
> PMEM is more important than using each and every last MB of memory.

The collisions that are out there in the wild are 64MB System RAM
followed by 64MB of PMEM. If you're suggesting reducing System RAM
that collides with PMEM that's a consideration. It can't go the other
way because there are deployed configurations that have persistent
data there. Reducing System RAM runs into the problem of how early
does the kernel know that it's bordering ZONE_DEVICE. It's not just
PMEM, it's also EFI_MEMORY_SP (Linux Soft Reserved) memory.

> e) decrease the section size and drop 

Re: [PATCH] x86/mm: Fix leak of pmd ptlock

2021-01-04 Thread Dan Williams
Ping, this bug is still present on v5.11-rc2, need a resend?

On Wed, Dec 2, 2020 at 10:28 PM Dan Williams  wrote:
>
> Commit 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
> introduced a new location where a pmd was released, but neglected to run
> the pmd page destructor. In fact, this happened previously for a
> different pmd release path and was fixed by commit:
>
> c283610e44ec ("x86, mm: do not leak page->ptl for pmd page tables").
>
> This issue was hidden until recently because the failure mode is silent,
> but commit:
>
> b2b29d6d0119 ("mm: account PMD tables like PTE tables")
>
> ...turns the failure mode into this signature:
>
>  BUG: Bad page state in process lt-pmem-ns  pfn:15943d
>  page:7262ed7b refcount:0 mapcount:-1024 mapping: 
> index:0x0 pfn:0x15943d
>  flags: 0xa8()
>  raw: 00a8 dead0100  
>  raw:  913a029bcc08 fbff 
>  page dumped because: nonzero mapcount
>  [..]
>   dump_stack+0x8b/0xb0
>   bad_page.cold+0x63/0x94
>   free_pcp_prepare+0x224/0x270
>   free_unref_page+0x18/0xd0
>   pud_free_pmd_page+0x146/0x160
>   ioremap_pud_range+0xe3/0x350
>   ioremap_page_range+0x108/0x160
>   __ioremap_caller.constprop.0+0x174/0x2b0
>   ? memremap+0x7a/0x110
>   memremap+0x7a/0x110
>   devm_memremap+0x53/0xa0
>   pmem_attach_disk+0x4ed/0x530 [nd_pmem]
>   ? __devm_release_region+0x52/0x80
>   nvdimm_bus_probe+0x85/0x210 [libnvdimm]
>
> Given this is a repeat occurrence it seemed prudent to look for other
> places where this destructor might be missing and whether a better
> helper is needed. try_to_free_pmd_page() looks like a candidate, but
> testing with setting up and tearing down pmd mappings via the dax unit
> tests is thus far not triggering the failure. As for a better helper
> pmd_free() is close, but it is a messy fit due to requiring an @mm arg.
> Also, ___pmd_free_tlb() wants to call paravirt_tlb_remove_table()
> instead of free_page(), so open-coded pgtable_pmd_page_dtor() seems the
> best way forward for now.
>
> Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
> Cc: 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: x...@kernel.org
> Cc: "H. Peter Anvin" 
> Co-debugged-by: Matthew Wilcox 
> Tested-by: Yi Zhang 
> Signed-off-by: Dan Williams 
> ---
>  arch/x86/mm/pgtable.c |2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index dfd82f51ba66..f6a9e2e36642 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -829,6 +829,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
> }
>
> free_page((unsigned long)pmd_sv);
> +
> +   pgtable_pmd_page_dtor(virt_to_page(pmd));
> free_page((unsigned long)pmd);
>
> return 1;
>


Re: [PATCH v2] fs/dax: include to fix build error on ARC

2021-01-04 Thread Dan Williams
On Mon, Jan 4, 2021 at 7:47 PM Randy Dunlap  wrote:
>
> On 1/4/21 12:13 PM, Dan Williams wrote:
> > On Thu, Dec 31, 2020 at 8:29 PM Randy Dunlap  wrote:
> >>
> >> fs/dax.c uses copy_user_page() but ARC does not provide that interface,
> >> resulting in a build error.
> >>
> >> Provide copy_user_page() in  (beside copy_page()) and
> >> add  to fs/dax.c to fix the build error.
> >>
> >> ../fs/dax.c: In function 'copy_cow_page_dax':
> >> ../fs/dax.c:702:2: error: implicit declaration of function 
> >> 'copy_user_page'; did you mean 'copy_to_user_page'? 
> >> [-Werror=implicit-function-declaration]
> >>
> >> Fixes: cccbce671582 ("filesystem-dax: convert to dax_direct_access()")
> >> Reported-by: kernel test robot 
> >> Signed-off-by: Randy Dunlap 
> >> Cc: Vineet Gupta 
> >> Cc: linux-snps-...@lists.infradead.org
> >> Cc: Dan Williams 
> >> Acked-by: Vineet Gupta 
> >> Cc: Andrew Morton 
> >> Cc: Matthew Wilcox 
> >> Cc: Jan Kara 
> >> Cc: linux-fsde...@vger.kernel.org
> >> Cc: linux-nvd...@lists.01.org
> >> ---
> >> v2: rebase, add more Cc:
> >>
> >>  arch/arc/include/asm/page.h |1 +
> >>  fs/dax.c|1 +
> >>  2 files changed, 2 insertions(+)
> >>
> >> --- lnx-511-rc1.orig/fs/dax.c
> >> +++ lnx-511-rc1/fs/dax.c
> >> @@ -25,6 +25,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >
> > I would expect this to come from one of the linux/ includes like
> > linux/mm.h. asm/ headers are implementation linux/ headers are api.
> >
> > Once you drop that then the subject of this patch can just be "arc:
> > add a copy_user_page() implementation", and handled by the arc
> > maintainer (or I can take it with Vineet's ack).
>
> Got it. Thanks.
> Vineet is copied. I expect that he will take the v3 patch.
>
> >>  #include 
> >
> > Yes, this one should have a linux/ api header to front it, but that's
> > a cleanup for another day.
>
> That line is only part of the contextual diff in this patch.
> I guess you are just commenting in general, along with your earlier
> paragraph.

Yes, I was prefetching a "...hey, but, there's an asm/ include right
below this one?"


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2021-01-04 Thread Dan Williams
On Mon, Jan 4, 2021 at 5:53 PM Jason Gunthorpe  wrote:
>
> On Mon, Jan 04, 2021 at 04:51:51PM -0800, Dan Williams wrote:
> > On Mon, Jan 4, 2021 at 4:14 PM Jason Gunthorpe  wrote:
> > >
> > > On Mon, Jan 04, 2021 at 09:19:30PM +, Mark Brown wrote:
> > >
> > >
> > > > > Regardless of the shortcut to make everything a struct
> > > > > platform_device, I think it was a mistake to put OF devices on
> > > > > platform_bus. Those should have remained on some of_bus even if they
> > > >
> > > > Like I keep saying the same thing applies to all non-enumerable buses -
> > > > exactly the same considerations exist for all the other buses like I2C
> > > > (including the ACPI naming issue you mention below), and for that matter
> > > > with enumerable buses which can have firmware info.
> > >
> > > And most busses do already have their own bus type. ACPI, I2C, PCI,
> > > etc. It is just a few that have been squished into platform, notably
> > > OF.
> > >
> >
> > I'll note that ACPI is an outlier that places devices on 2 buses,
> > where new acpi_driver instances are discouraged [1] in favor of
> > platform_drivers. ACPI scan handlers are awkwardly integrated into the
> > Linux device model.
> >
> > So while I agree with sentiment that an "ACPI bus" should
> > theoretically stand on its own there is legacy to unwind.
> >
> > I only bring that up to keep the focus on how to organize drivers
> > going forward, because trying to map some of these arguments backwards
> > runs into difficulties.
> >
> > [1]: 
> > http://lore.kernel.org/r/cajz5v0j_rek3agddw7flvmw_7kneccg2u_hukgjzqelcy8s...@mail.gmail.com
>
> Well, this is the exact kind of thing I think we are talking about
> here..
>
> > > It should be split up based on the unique naming scheme and any bus
> > > specific API elements - like raw access to ACPI or OF data or what
> > > have you for other FW bus types.
> >
> > I agree that the pendulum may have swung too far towards "reuse
> > existing bus_type", and auxiliary-bus unwinds some of that, but does
> > the bus_type really want to be an indirection for driver apis outside
> > of bus-specific operations?
>
> If the bus is the "enumeration entity" and we define that things like
> name, resources, gpio's, regulators, etc are a generic part of what is
> enumerated, then it makes sense that the bus would have methods
> to handle those things too.
>
> In other words, the only way to learn what GPIO 'resource' is to ask
> the enumeration mechnism that is providing the bus. If the enumeration
> and bus are 1:1 then you can use a function pointer on the bus type
> instead of open coding a dispatch based on an indirect indication.
>

I get that, but I'm fearing a gigantic bus_ops structure that has
narrow helpers like ->gpio_count() that mean nothing to the many other
clients of the bus. Maybe I'm overestimating the pressure there will
be to widen the ops structure at the bus level.


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2021-01-04 Thread Dan Williams
On Mon, Jan 4, 2021 at 4:14 PM Jason Gunthorpe  wrote:
>
> On Mon, Jan 04, 2021 at 09:19:30PM +, Mark Brown wrote:
>
>
> > > Regardless of the shortcut to make everything a struct
> > > platform_device, I think it was a mistake to put OF devices on
> > > platform_bus. Those should have remained on some of_bus even if they
> >
> > Like I keep saying the same thing applies to all non-enumerable buses -
> > exactly the same considerations exist for all the other buses like I2C
> > (including the ACPI naming issue you mention below), and for that matter
> > with enumerable buses which can have firmware info.
>
> And most busses do already have their own bus type. ACPI, I2C, PCI,
> etc. It is just a few that have been squished into platform, notably
> OF.
>

I'll note that ACPI is an outlier that places devices on 2 buses,
where new acpi_driver instances are discouraged [1] in favor of
platform_drivers. ACPI scan handlers are awkwardly integrated into the
Linux device model.

So while I agree with sentiment that an "ACPI bus" should
theoretically stand on its own there is legacy to unwind.

I only bring that up to keep the focus on how to organize drivers
going forward, because trying to map some of these arguments backwards
runs into difficulties.

[1]: 
http://lore.kernel.org/r/cajz5v0j_rek3agddw7flvmw_7kneccg2u_hukgjzqelcy8s...@mail.gmail.com

> > > are represented by struct platform_device and fiddling in the core
> > > done to make that work OK.
> >
> > What exactly is the fiddling in the core here, I'm a bit unclear?
>
> I'm not sure, but I bet there is a small fall out to making bus_type
> not 1:1 with the struct device type.. Would have to attempt it to see
>
> > > This feels like a good conference topic someday..
> >
> > We should have this discussion *before* we get too far along with trying
> > to implement things, we should at least have some idea where we want to
> > head there.
>
> Well, auxillary bus is clearly following the original bus model
> intention with a dedicated bus type with a controlled naming
> scheme. The debate here seems to be "what about platform bus" and
> "what to do with mfd"?
>
> > Those APIs all take a struct device for lookup so it's the same call for
> > looking things up regardless of the bus the device is on or what
> > firmware the system is using - where there are firmware specific lookup
> > functions they're generally historical and shouldn't be used for new
> > code.  It's generally something in the form
> >
> >   api_type *api_get(struct device *dev, const char *name);
>
> Well, that is a nice improvement since a few years back when I last
> worked on this stuff.
>
> But now it begs the question, why not push harder to make 'struct
> device' the generic universal access point and add some resource_get()
> API along these lines so even a platform_device * isn't needed?
>
> Then the path seems much clearer, add a multi-bus-type device_driver
> that has a probe(struct device *) and uses the 'universal api_get()'
> style interface to find the generic 'resources'.
>
> The actual bus types and bus structs can then be split properly
> without the boilerplate that caused them all to be merged to platform,
> even PCI could be substantially merged like this.
>
> Bonus points to replace the open coded method disptach:
>
> int gpiod_count(struct device *dev, const char *con_id)
> {
> int count = -ENOENT;
>
> if (IS_ENABLED(CONFIG_OF) && dev && dev->of_node)
> count = of_gpio_get_count(dev, con_id);
> else if (IS_ENABLED(CONFIG_ACPI) && dev && ACPI_HANDLE(dev))
> count = acpi_gpio_count(dev, con_id);
>
> if (count < 0)
> count = platform_gpio_count(dev, con_id);
>
> With an actual bus specific virtual function:
>
> return dev->bus->gpio_count(dev);
>
> > ...and then do the same thing for every other bus with firmware
> > bindings.  If it's about the firmware interfaces it really isn't a
> > platform bus specific thing.  It's not clear to me if that's what it is
> > though or if this is just some tangent.
>
> It should be split up based on the unique naming scheme and any bus
> specific API elements - like raw access to ACPI or OF data or what
> have you for other FW bus types.

I agree that the pendulum may have swung too far towards "reuse
existing bus_type", and auxiliary-bus unwinds some of that, but does
the bus_type really want to be an indirection for driver apis outside
of bus-specific operations?


Re: [RFC v2 PATCH 4/4] mm: pre zero out free pages to speed up page allocation for __GFP_ZERO

2021-01-04 Thread Dan Williams
On Mon, Jan 4, 2021 at 12:11 PM David Hildenbrand  wrote:
>
>
> > Am 04.01.2021 um 20:52 schrieb Dave Hansen :
> >
> > On 1/4/21 11:27 AM, Matthew Wilcox wrote:
> >>> On Mon, Jan 04, 2021 at 11:19:13AM -0800, Dave Hansen wrote:
> >>> On 12/21/20 8:30 AM, Liang Li wrote:
>  --- a/include/linux/page-flags.h
>  +++ b/include/linux/page-flags.h
>  @@ -137,6 +137,9 @@ enum pageflags {
>  #endif
>  #ifdef CONFIG_64BIT
> PG_arch_2,
>  +#endif
>  +#ifdef CONFIG_PREZERO_PAGE
>  +PG_zero,
>  #endif
> __NR_PAGEFLAGS,
> >>>
> >>> I don't think this is worth a generic page->flags bit.
> >>>
> >>> There's a ton of space in 'struct page' for pages that are in the
> >>> allocator.  Can't we use some of that space?
> >>
> >> I was going to object to that too, but I think the entire approach is
> >> flawed and needs to be thrown out.  It just nukes the caches in extremely
> >> subtle and hard to measure ways, lowering overall system performance.
> >
> > Yeah, it certainly can't be the default, but it *is* useful for thing
> > where we know that there are no cache benefits to zeroing close to where
> > the memory is allocated.
> >
> > The trick is opting into it somehow, either in a process or a VMA.
> >
>
> The patch set is mostly trying to optimize starting a new process. So 
> process/vma doesn‘t really work.
>
> I still wonder if using tmpfs/shmem cannot somehow be used to cover the 
> original use case of starting a new vm fast (or rebooting an existing one 
> involving restarting the process).

If it's rebooting a VM then file-backed should be able to skip the
zeroing because the stale data exposure is only to itself. If the
memory is being repurposed to a new VM then the file needs to be
reallocated / re-zeroed just like the anonymous case.


Re: [PATCH v2] fs/dax: include to fix build error on ARC

2021-01-04 Thread Dan Williams
On Thu, Dec 31, 2020 at 8:29 PM Randy Dunlap  wrote:
>
> fs/dax.c uses copy_user_page() but ARC does not provide that interface,
> resulting in a build error.
>
> Provide copy_user_page() in  (beside copy_page()) and
> add  to fs/dax.c to fix the build error.
>
> ../fs/dax.c: In function 'copy_cow_page_dax':
> ../fs/dax.c:702:2: error: implicit declaration of function 'copy_user_page'; 
> did you mean 'copy_to_user_page'? [-Werror=implicit-function-declaration]
>
> Fixes: cccbce671582 ("filesystem-dax: convert to dax_direct_access()")
> Reported-by: kernel test robot 
> Signed-off-by: Randy Dunlap 
> Cc: Vineet Gupta 
> Cc: linux-snps-...@lists.infradead.org
> Cc: Dan Williams 
> Acked-by: Vineet Gupta 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Jan Kara 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-nvd...@lists.01.org
> ---
> v2: rebase, add more Cc:
>
>  arch/arc/include/asm/page.h |1 +
>  fs/dax.c|1 +
>  2 files changed, 2 insertions(+)
>
> --- lnx-511-rc1.orig/fs/dax.c
> +++ lnx-511-rc1/fs/dax.c
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

I would expect this to come from one of the linux/ includes like
linux/mm.h. asm/ headers are implementation linux/ headers are api.

Once you drop that then the subject of this patch can just be "arc:
add a copy_user_page() implementation", and handled by the arc
maintainer (or I can take it with Vineet's ack).

>  #include 

Yes, this one should have a linux/ api header to front it, but that's
a cleanup for another day.

>
>  #define CREATE_TRACE_POINTS
> --- lnx-511-rc1.orig/arch/arc/include/asm/page.h
> +++ lnx-511-rc1/arch/arc/include/asm/page.h
> @@ -10,6 +10,7 @@
>  #ifndef __ASSEMBLY__
>
>  #define clear_page(paddr)  memset((paddr), 0, PAGE_SIZE)
> +#define copy_user_page(to, from, vaddr, pg)copy_page(to, from)
>  #define copy_page(to, from)memcpy((to), (from), PAGE_SIZE)
>
>  struct vm_area_struct;


Re: [RFC v2 PATCH 4/4] mm: pre zero out free pages to speed up page allocation for __GFP_ZERO

2021-01-04 Thread Dan Williams
On Mon, Jan 4, 2021 at 11:28 AM Matthew Wilcox  wrote:
>
> On Mon, Jan 04, 2021 at 11:19:13AM -0800, Dave Hansen wrote:
> > On 12/21/20 8:30 AM, Liang Li wrote:
> > > --- a/include/linux/page-flags.h
> > > +++ b/include/linux/page-flags.h
> > > @@ -137,6 +137,9 @@ enum pageflags {
> > >  #endif
> > >  #ifdef CONFIG_64BIT
> > > PG_arch_2,
> > > +#endif
> > > +#ifdef CONFIG_PREZERO_PAGE
> > > +   PG_zero,
> > >  #endif
> > > __NR_PAGEFLAGS,
> >
> > I don't think this is worth a generic page->flags bit.
> >
> > There's a ton of space in 'struct page' for pages that are in the
> > allocator.  Can't we use some of that space?
>
> I was going to object to that too, but I think the entire approach is
> flawed and needs to be thrown out.  It just nukes the caches in extremely
> subtle and hard to measure ways, lowering overall system performance.

At a minimum the performance analysis should at least try to quantify
that externalized cost. Certainly that overhead went somewhere. Maybe
if this overhead was limited to run when the CPU would otherwise go
idle, but that might mean it never runs in practice...


[GIT PULL] libnvdimm for v5.11

2020-12-24 Thread Dan Williams
Hi Linus,

Twas the day before Christmas and the only thing stirring in libnvdimm
/ device-dax land is a pile of miscellaneous fixups and cleanups. If
this is too late for -rc1 I'll pull out the fixes and resubmit after
the holidays. The bulk of it has appeared in -next save the last two
patches to device-dax that are passed my build and unit tests.

Please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-for-5.11

---

The following changes since commit 09162bc32c880a791c6c0668ce0745cf7958f576:

  Linux 5.10-rc4 (2020-11-15 16:44:31 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-for-5.11

for you to fetch changes up to 127c3d2e7e8a79628160e56e54d2be099bdd47c6:

  Merge branch 'for-5.11/dax' into for-5.11/libnvdimm (2020-12-24
10:14:04 -0800)


libnvdimm for 5.11

- Fix a long standing block-window-namespace issue surfaced by the ndctl
  change to attempt to preserve the kernel device name over a
  'reconfigure'

- Fix a few error path memory leaks in nfit and device-dax

- Silence a smatch warning in the ioctl path

- Miscellaneous cleanups


Dan Williams (4):
  libnvdimm/namespace: Fix reaping of invalidated
block-window-namespace labels
  ACPI: NFIT: Fix input validation of bus-family
  device-dax: Fix range release
  Merge branch 'for-5.11/dax' into for-5.11/libnvdimm

Enrico Weigelt (1):
  libnvdimm: Cleanup include of badblocks.h

Wang Hai (1):
  device-dax/core: Fix memory leak when rmmod dax.ko

Zhang Qilong (1):
  libnvdimm/label: Return -ENXIO for no slot in __blk_label_update

Zhen Lei (3):
  ACPI/nfit: avoid accessing uninitialized memory in acpi_nfit_ctl()
  device-dax: delete a redundancy check in dev_dax_validate_align()
  device-dax: Avoid an unnecessary check in alloc_dev_dax_range()

Zheng Yongjun (1):
  device-dax/pmem: Convert comma to semicolon

 drivers/acpi/nfit/core.c | 15 ++
 drivers/dax/bus.c| 71 ++--
 drivers/dax/pmem/core.c  |  2 +-
 drivers/dax/super.c  |  1 +
 drivers/nvdimm/btt.h |  3 +-
 drivers/nvdimm/claim.c   |  1 +
 drivers/nvdimm/core.c|  1 -
 drivers/nvdimm/label.c   | 13 -
 8 files changed, 54 insertions(+), 53 deletions(-)


Re: [PATCH] ACPI: NFIT: fix flexible_array.cocci warnings

2020-12-24 Thread Dan Williams
On Tue, Dec 22, 2020 at 1:10 PM Julia Lawall  wrote:
>
> From: kernel test robot 
>
> Zero-length and one-element arrays are deprecated, see
> Documentation/process/deprecated.rst
> Flexible-array members should be used instead.
>
> Generated by: scripts/coccinelle/misc/flexible_array.cocci
>
> Fixes: 7b36c1398fb6 ("coccinelle: misc: add flexible_array.cocci script")
> CC: Denis Efremov 
> Reported-by: kernel test robot 
> Signed-off-by: kernel test robot 
> Signed-off-by: Julia Lawall 

Hmm, this triggers:

drivers/acpi/nfit/core.c:2276:4: error: flexible array member in a
struct with no named members
drivers/acpi/nfit/core.c:2287:4: error: flexible array member in a
struct with no named members

$ gcc --version
gcc (GCC) 10.2.1 20201016 (Red Hat 10.2.1-6)

I'll need to circle back to this later.


Re: [PATCH] device-dax: Fix range release

2020-12-19 Thread Dan Williams
On Fri, Dec 18, 2020 at 11:46 PM Leizhen (ThunderTown)
 wrote:
>
>
>
> On 2020/12/19 10:41, Dan Williams wrote:
> > There are multiple locations that open-code the release of the last
> > range in a device-dax instance. Consolidate this into a new
> > dev_dax_trim_range() helper.
> >
> > This also addresses a kmemleak report:
> >
> > # cat /sys/kernel/debug/kmemleak
> > [..]
> > unreferenced object 0x976bd46f6240 (size 64):
> >comm "ndctl", pid 23556, jiffies 4299514316 (age 5406.733s)
> >hex dump (first 32 bytes):
> >  00 00 00 00 00 00 00 00 00 00 20 c3 37 00 00 00  .. .7...
> >  ff ff ff 7f 38 00 00 00 00 00 00 00 00 00 00 00  8...
> >backtrace:
> >  [<064003cf>] __kmalloc_track_caller+0x136/0x379
> >  [<d85e3c52>] krealloc+0x67/0x92
> >  [<d7d3ba8a>] __alloc_dev_dax_range+0x73/0x25c
> >  [<27d58626>] devm_create_dev_dax+0x27d/0x416
> >  [<434abd43>] __dax_pmem_probe+0x1c9/0x1000 [dax_pmem_core]
> >  [<83726c1c>] dax_pmem_probe+0x10/0x1f [dax_pmem]
> >  [<b5f2319c>] nvdimm_bus_probe+0x9d/0x340 [libnvdimm]
> >  [<c055e544>] really_probe+0x230/0x48d
> >  [<6cabd38e>] driver_probe_device+0x122/0x13b
> >  [<29c7b95a>] device_driver_attach+0x5b/0x60
> >  [<53e5659b>] bind_store+0xb7/0xc3
> >  [<d3bdaadc>] drv_attr_store+0x27/0x31
> >  [<949069c5>] sysfs_kf_write+0x4a/0x57
> >  [<4a8b5adf>] kernfs_fop_write+0x150/0x1e5
> >  [<bded60f0>] __vfs_write+0x1b/0x34
> >  [<b92900f0>] vfs_write+0xd8/0x1d1
> >
> > Reported-by: Jane Chu 
> > Cc: Zhen Lei 
> > Signed-off-by: Dan Williams 
> > ---
> >  drivers/dax/bus.c |   44 +---
> >  1 file changed, 21 insertions(+), 23 deletions(-)
> >
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 9761cb40d4bb..720cd140209f 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -367,19 +367,28 @@ void kill_dev_dax(struct dev_dax *dev_dax)
> >  }
> >  EXPORT_SYMBOL_GPL(kill_dev_dax);
> >
> > -static void free_dev_dax_ranges(struct dev_dax *dev_dax)
> > +static void trim_dev_dax_range(struct dev_dax *dev_dax)
> >  {
> > + int i = dev_dax->nr_range - 1;
> > + struct range *range = _dax->ranges[i].range;
> >   struct dax_region *dax_region = dev_dax->region;
> > - int i;
> >
> >   device_lock_assert(dax_region->dev);
> > - for (i = 0; i < dev_dax->nr_range; i++) {
> > - struct range *range = _dax->ranges[i].range;
> > -
> > - __release_region(_region->res, range->start,
> > - range_len(range));
> > + dev_dbg(_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
> > + (unsigned long long)range->start,
> > + (unsigned long long)range->end);
> > +
> > + __release_region(_region->res, range->start, range_len(range));
> > + if (--dev_dax->nr_range == 0) {
> > + kfree(dev_dax->ranges);
> > + dev_dax->ranges = NULL;
> >   }
> > - dev_dax->nr_range = 0;
> > +}
> > +
> > +static void free_dev_dax_ranges(struct dev_dax *dev_dax)
> > +{
> > + while (dev_dax->nr_range)
> It's better to use READ_ONCE to get the value of dev_dax->nr_range,
> to prevent compiler optimization.

...only in the case where the compiler might try to turn this into an
infinite loop, but I don't think that can happen here outside of a
compiler bug. Usually READ_ONCE() is contending with SMP effects that
the compiler can't see.


[PATCH] device-dax: Fix range release

2020-12-18 Thread Dan Williams
There are multiple locations that open-code the release of the last
range in a device-dax instance. Consolidate this into a new
dev_dax_trim_range() helper.

This also addresses a kmemleak report:

# cat /sys/kernel/debug/kmemleak
[..]
unreferenced object 0x976bd46f6240 (size 64):
   comm "ndctl", pid 23556, jiffies 4299514316 (age 5406.733s)
   hex dump (first 32 bytes):
 00 00 00 00 00 00 00 00 00 00 20 c3 37 00 00 00  .. .7...
 ff ff ff 7f 38 00 00 00 00 00 00 00 00 00 00 00  8...
   backtrace:
 [<064003cf>] __kmalloc_track_caller+0x136/0x379
 [<d85e3c52>] krealloc+0x67/0x92
 [<d7d3ba8a>] __alloc_dev_dax_range+0x73/0x25c
 [<27d58626>] devm_create_dev_dax+0x27d/0x416
 [<434abd43>] __dax_pmem_probe+0x1c9/0x1000 [dax_pmem_core]
 [<83726c1c>] dax_pmem_probe+0x10/0x1f [dax_pmem]
 [<b5f2319c>] nvdimm_bus_probe+0x9d/0x340 [libnvdimm]
 [<c055e544>] really_probe+0x230/0x48d
 [<6cabd38e>] driver_probe_device+0x122/0x13b
 [<29c7b95a>] device_driver_attach+0x5b/0x60
 [<53e5659b>] bind_store+0xb7/0xc3
 [<d3bdaadc>] drv_attr_store+0x27/0x31
 [<949069c5>] sysfs_kf_write+0x4a/0x57
 [<4a8b5adf>] kernfs_fop_write+0x150/0x1e5
 [<bded60f0>] __vfs_write+0x1b/0x34
     [<0000b92900f0>] vfs_write+0xd8/0x1d1

Reported-by: Jane Chu 
Cc: Zhen Lei 
Signed-off-by: Dan Williams 
---
 drivers/dax/bus.c |   44 +---
 1 file changed, 21 insertions(+), 23 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 9761cb40d4bb..720cd140209f 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -367,19 +367,28 @@ void kill_dev_dax(struct dev_dax *dev_dax)
 }
 EXPORT_SYMBOL_GPL(kill_dev_dax);
 
-static void free_dev_dax_ranges(struct dev_dax *dev_dax)
+static void trim_dev_dax_range(struct dev_dax *dev_dax)
 {
+   int i = dev_dax->nr_range - 1;
+   struct range *range = _dax->ranges[i].range;
struct dax_region *dax_region = dev_dax->region;
-   int i;
 
device_lock_assert(dax_region->dev);
-   for (i = 0; i < dev_dax->nr_range; i++) {
-   struct range *range = _dax->ranges[i].range;
-
-   __release_region(_region->res, range->start,
-   range_len(range));
+   dev_dbg(_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
+   (unsigned long long)range->start,
+   (unsigned long long)range->end);
+
+   __release_region(_region->res, range->start, range_len(range));
+   if (--dev_dax->nr_range == 0) {
+   kfree(dev_dax->ranges);
+   dev_dax->ranges = NULL;
}
-   dev_dax->nr_range = 0;
+}
+
+static void free_dev_dax_ranges(struct dev_dax *dev_dax)
+{
+   while (dev_dax->nr_range)
+   trim_dev_dax_range(dev_dax);
 }
 
 static void unregister_dev_dax(void *dev)
@@ -804,15 +813,10 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, 
u64 start,
return 0;
 
rc = devm_register_dax_mapping(dev_dax, dev_dax->nr_range - 1);
-   if (rc) {
-   dev_dbg(dev, "delete range[%d]: %pa:%pa\n", dev_dax->nr_range - 
1,
-   >start, >end);
-   dev_dax->nr_range--;
-   __release_region(res, alloc->start, resource_size(alloc));
-   return rc;
-   }
+   if (rc)
+   trim_dev_dax_range(dev_dax);
 
-   return 0;
+   return rc;
 }
 
 static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, 
resource_size_t size)
@@ -885,12 +889,7 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, 
resource_size_t size)
if (shrink >= range_len(range)) {
devm_release_action(dax_region->dev,
unregister_dax_mapping, >dev);
-   __release_region(_region->res, range->start,
-   range_len(range));
-   dev_dax->nr_range--;
-   dev_dbg(dev, "delete range[%d]: %#llx:%#llx\n", i,
-   (unsigned long long) range->start,
-   (unsigned long long) range->end);
+   trim_dev_dax_range(dev_dax);
to_shrink -= shrink;
if (!to_shrink)
break;
@@ -1267,7 +1266,6 @@ static void dev_dax_release(struct device *dev)
put_dax(dax_dev);
free_dev_dax_id(dev_dax);
dax_region_put(dax_region);
-   kfree(dev_dax->ranges);
kfree(dev_dax->pgmap);
kfree(dev_dax);
 }



Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-18 Thread Dan Williams
On Fri, Dec 18, 2020 at 1:17 PM Alexandre Belloni
 wrote:
>
> On 18/12/2020 16:58:56-0400, Jason Gunthorpe wrote:
> > On Fri, Dec 18, 2020 at 08:32:11PM +, Mark Brown wrote:
> >
> > > > So, I strongly suspect, MFD should create mfd devices on a MFD bus
> > > > type.
> > >
> > > Historically people did try to create custom bus types, as I have
> > > pointed out before there was then pushback that these were duplicating
> > > the platform bus so everything uses platform bus.
> >
> > Yes, I vaugely remember..
> >
> > I don't know what to say, it seems Greg doesn't share this view of
> > platform devices as a universal device.
> >
> > Reading between the lines, I suppose things would have been happier
> > with some kind of inheritance scheme where platform device remained as
> > only instantiated directly in board files, while drivers could bind to
> > OF/DT/ACPI/FPGA/etc device instantiations with minimal duplication &
> > boilerplate.
> >
> > And maybe that is exactly what we have today with platform devices,
> > though the name is now unfortunate.
> >
> > > I can't tell the difference between what it's doing and what SOF is
> > > doing, the code I've seen is just looking at the system it's running
> > > on and registering a fixed set of client devices.  It looks slightly
> > > different because it's registering a device at a time with some wrapper
> > > functions involved but that's what the code actually does.
> >
> > SOF's aux bus usage in general seems weird to me, but if you think
> > it fits the mfd scheme of primarily describing HW to partition vs
> > describing a SW API then maybe it should use mfd.
> >
> > The only problem with mfd as far as SOF is concerned was Greg was not
> > happy when he saw PCI stuff in the MFD subsystem.
> >
>
> But then again, what about non-enumerable devices on the PCI device? I
> feel this would exactly fit MFD. This is a collection of IPs that exist
> as standalone but in this case are grouped in a single device.
>
> Note that I then have another issue because the kernel doesn't support
> irq controllers on PCI and this is exactly what my SoC has. But for now,
> I can just duplicate the irqchip driver in the MFD driver.
>
> > This whole thing started when Intel first proposed to directly create
> > platform_device's in their ethernet driver and Greg had a quite strong
> > NAK to that.
>
> Let me point to drivers/net/ethernet/cadence/macb_pci.c which is a
> fairly recent example. It does exactly that and I'm not sure you could
> do it otherwise while still not having to duplicate most of macb_probe.
>

This still feels an orthogonal example to the problem auxiliary-bus is
solving. If a platform-device and a pci-device surface an IP with a
shared programming model that's an argument for a shared library, like
libata to house the commonality. In contrast auxiliary-bus is a
software model for software-defined sub-functionality to be wrapped in
a driver model. It assumes a parent-device / parent-driver hierarchy
that platform-bus and pci-bus do not imply.


Re: [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-18 Thread Dan Williams
On Fri, Dec 18, 2020 at 1:06 PM Thomas Gleixner  wrote:
>
> On Fri, Dec 18 2020 at 11:20, Dan Williams wrote:
> > On Fri, Dec 18, 2020 at 5:58 AM Thomas Gleixner  wrote:
> > [..]
> >>   5) The DAX case which you made "work" with dev_access_enable() and
> >>  dev_access_disable(), i.e. with yet another lazy approach of
> >>  avoiding to change a handful of usage sites.
> >>
> >>  The use cases are strictly context local which means the global
> >>  magic is not used at all. Why does it exist in the first place?
> >>
> >>  Aside of that this global thing would never work at all because the
> >>  refcounting is per thread and not global.
> >>
> >>  So that DAX use case is just a matter of:
> >>
> >> grant/revoke_access(DEV_PKS_KEY, READ/WRITE)
> >>
> >>  which is effective for the current execution context and really
> >>  wants to be a distinct READ/WRITE protection and not the magic
> >>  global thing which just has on/off. All usage sites know whether
> >>  they want to read or write.
> >
> > I was tracking and nodding until this point. Yes, kill the global /
> > kmap() support, but if grant/revoke_access is not integrated behind
> > kmap_{local,atomic}() then it's not a "handful" of sites that need to
> > be instrumented it's 100s. Are you suggesting that "relaxed" mode
> > enforcement is a way to distribute the work of teaching driver writers
> > that they need to incorporate explicit grant/revoke-read/write in
> > addition to kmap? The entire reason PTE_DEVMAP exists was to allow
> > get_user_pages() for PMEM and not require every downstream-GUP code
> > path to specifically consider whether it was talking to PMEM or RAM
> > pages, and certainly not whether they were reading or writing to it.
>
> kmap_local() is fine. That can work automatically because it's strict
> local to the context which does the mapping.
>
> kmap() is dubious because it's a 'global' mapping as dictated per
> HIGHMEM. So doing the RELAXED mode for kmap() is sensible I think to
> identify cases where the mapped address is really handed to a different
> execution context. We want to see those cases and analyse whether this
> can't be solved in a different way. That's why I suggested to do a
> warning in that case.
>
> Also vs. the DAX use case I really meant the code in fs/dax and
> drivers/dax/ itself which is handling this via dax_read_[un]lock.
>
> Does that make more sense?

Yup, got it. The dax code can be precise wrt to PKS in a way that
kmap_local() cannot.


Re: [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-18 Thread Dan Williams
On Fri, Dec 18, 2020 at 5:58 AM Thomas Gleixner  wrote:
[..]
>   5) The DAX case which you made "work" with dev_access_enable() and
>  dev_access_disable(), i.e. with yet another lazy approach of
>  avoiding to change a handful of usage sites.
>
>  The use cases are strictly context local which means the global
>  magic is not used at all. Why does it exist in the first place?
>
>  Aside of that this global thing would never work at all because the
>  refcounting is per thread and not global.
>
>  So that DAX use case is just a matter of:
>
> grant/revoke_access(DEV_PKS_KEY, READ/WRITE)
>
>  which is effective for the current execution context and really
>  wants to be a distinct READ/WRITE protection and not the magic
>  global thing which just has on/off. All usage sites know whether
>  they want to read or write.

I was tracking and nodding until this point. Yes, kill the global /
kmap() support, but if grant/revoke_access is not integrated behind
kmap_{local,atomic}() then it's not a "handful" of sites that need to
be instrumented it's 100s. Are you suggesting that "relaxed" mode
enforcement is a way to distribute the work of teaching driver writers
that they need to incorporate explicit grant/revoke-read/write in
addition to kmap? The entire reason PTE_DEVMAP exists was to allow
get_user_pages() for PMEM and not require every downstream-GUP code
path to specifically consider whether it was talking to PMEM or RAM
pages, and certainly not whether they were reading or writing to it.


Re: [PATCH V3 10/10] x86/pks: Add PKS test code

2020-12-18 Thread Dan Williams
On Thu, Dec 17, 2020 at 8:05 PM Ira Weiny  wrote:
>
> On Thu, Dec 17, 2020 at 12:55:39PM -0800, Dave Hansen wrote:
> > On 11/6/20 3:29 PM, ira.we...@intel.com wrote:
> > > +   /* Arm for context switch test */
> > > +   write(fd, "1", 1);
> > > +
> > > +   /* Context switch out... */
> > > +   sleep(4);
> > > +
> > > +   /* Check msr restored */
> > > +   write(fd, "2", 1);
> >
> > These are always tricky.  What you ideally want here is:
> >
> > 1. Switch away from this task to a non-PKS task, or
> > 2. Switch from this task to a PKS-using task, but one which has a
> >different PKS value
>
> Or both...
>
> >
> > then, switch back to this task and make sure PKS maintained its value.
> >
> > *But*, there's no absolute guarantee that another task will run.  It
> > would not be totally unreasonable to have the kernel just sit in a loop
> > without context switching here if no other tasks can run.
> >
> > The only way you *know* there is a context switch is by having two tasks
> > bound to the same logical CPU and make sure they run one after another.
>
> Ah...  We do that.
>
> ...
> +   CPU_ZERO();
> +   CPU_SET(0, );
> +   /* Two processes run on CPU 0 so that they go through context switch. 
>  */
> +   sched_setaffinity(getpid(), sizeof(cpu_set_t), );
> ...
>
> I think this should be ensuring that both the parent and the child are
> running on CPU 0.  At least according to the man page they should be.
>
> 
> A child created via fork(2) inherits its parent's CPU affinity mask.
> 
>
> Perhaps a better method would be to synchronize the 2 threads more to ensure
> that we are really running at the 'same time' and forcing the context switch.
>
> >  This just gets itself into a state where it *CAN* context switch and
> > prays that one will happen.
>
> Not sure what you mean by 'This'?  Do you mean that running on the same CPU
> will sometimes not force a context switch?  Or do you mean that the sleeps
> could be badly timed and the 2 threads could run 1 after the other on the same
> CPU?  The latter is AFAICT the most likely case.
>

One way to guarantee that both threads run is to just pass a message
between them over a pipe and wait for the submitter to receive an ack
from the other end.


Re: [RFC PATCH] badblocks: Improvement badblocks_set() for handling multiple ranges

2020-12-17 Thread Dan Williams
[ add Neil, original gooodguy who wrote badblocks ]


On Thu, Dec 3, 2020 at 9:16 AM Coly Li  wrote:
>
> Recently I received a bug report that current badblocks code does not
> properly handle multiple ranges. For example,
> badblocks_set(bb, 32, 1, true);
> badblocks_set(bb, 34, 1, true);
> badblocks_set(bb, 36, 1, true);
> badblocks_set(bb, 32, 12, true);
> Then indeed badblocks_show() reports,
> 32 3
> 36 1
> But the expected bad blocks table should be,
> 32 12
> Obviously only the first 2 ranges are merged and badblocks_set() returns
> and ignores the rest setting range.
>
> This behavior is improper, if the caller of badblocks_set() wants to set
> a range of blocks into bad blocks table, all of the blocks in the range
> should be handled even the previous part encountering failure.
>
> The desired way to set bad blocks range by badblocks_set() is,
> - Set as many as blocks in the setting range into bad blocks table.
> - Merge the bad blocks ranges and occupy as less as slots in the bad
>   blocks table.
> - Fast.
>
> Indeed the above proposal is complicated, especially with the following
> restrictions,
> - The setting bad blocks range can be ackknowledged or not acknowledged.

s/ackknowledged/acknowledged/

I'd run checkpatch --codespell for future versions...

> - The bad blocks table size is limited.
> - Memory allocation should be avoided.
>
> This patch is an initial effort to improve badblocks_set() for setting
> bad blocks range when it covers multiple already set bad ranges in the
> bad blocks table, and to do it as fast as possible.
>
> The basic idea of the patch is to categorize all possible bad blocks
> range setting combinationsinto to much less simplified and more less
> special conditions. Inside badblocks_set() there is an implicit loop
> composed by jumping between labels 're_insert' and 'update_sectors'. No
> matter how large the setting bad blocks range is, in every loop just a
> minimized range from the head is handled by a pre-defined behavior from
> one of the categorized conditions. The logic is simple and code flow is
> manageable.
>
> This patch is unfinished yet, it only improves badblocks_set() and not
> touch badblocks_clear() and badblocks_show() yet. I post it earlier
> because this patch will be large (more then 1000 lines of change), I
> want more people to give me comments earlier before I go too far away.
>

I wonder if this isn't indication that the base data structure should
be replaced... but I have not had a chance to devote deeper thought to
this.


> The code logic is tested as user space programmer, this patch passes
> compiling but not tested in kernel mode yet. Right now it is only for
> RFC purpose. I will post tested patch in further versions.
>
> Thank you in advance for any review or comments on this patch.
>
> Signed-off-by: Coly Li 
> ---
>  block/badblocks.c | 1041 ++---
>  include/linux/badblocks.h |   33 ++
>  2 files changed, 881 insertions(+), 193 deletions(-)
>
> diff --git a/block/badblocks.c b/block/badblocks.c
> index d39056630d9c..04ccae95777d 100644
> --- a/block/badblocks.c
> +++ b/block/badblocks.c
> @@ -5,6 +5,8 @@
>   * - Heavily based on MD badblocks code from Neil Brown
>   *
>   * Copyright (c) 2015, Intel Corporation.
> + *
> + * Improvement for handling multiple ranges by Coly Li 
>   */
>
>  #include 
> @@ -16,114 +18,612 @@
>  #include 
>  #include 
>
> -/**
> - * badblocks_check() - check a given range for bad sectors
> - * @bb:the badblocks structure that holds all badblock 
> information
> - * @s: sector (start) at which to check for badblocks
> - * @sectors:   number of sectors to check for badblocks
> - * @first_bad: pointer to store location of the first badblock
> - * @bad_sectors: pointer to store number of badblocks after @first_bad
> +/*
> + * The purpose of badblocks set/clear is to manage bad blocks ranges which 
> are
> + * identified by LBA addresses.
>   *
> - * We can record which blocks on each device are 'bad' and so just
> - * fail those blocks, or that stripe, rather than the whole device.
> - * Entries in the bad-block table are 64bits wide.  This comprises:
> - * Length of bad-range, in sectors: 0-511 for lengths 1-512
> - * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
> - *  A 'shift' can be set so that larger blocks are tracked and
> - *  consequently larger devices can be covered.
> - * 'Acknowledged' flag - 1 bit. - the most significant bit.
> + * When the caller of badblocks_set() wants to set a range of bad blocks, the
> + * setting range can be acked or unacked. And the setting range may merge,
> + * overwrite, skip the overlaypped already set range, depends on who they are
> + * overlapped or adjacent, and the acknowledgment type of the ranges. It can 
> be
> + * more complicated when the setting range covers multiple already set bad 
> block
> + * ranges, with 

Re: [PATCH 1/1] device-dax: avoid an unnecessary check in alloc_dev_dax_range()

2020-12-17 Thread Dan Williams
On Fri, Nov 20, 2020 at 1:23 AM Zhen Lei  wrote:
>
> Swap the calling sequence of krealloc() and __request_region(), call the
> latter first. In this way, the value of dev_dax->nr_range does not need to
> be considered when __request_region() failed.

This looks ok, but I think I want to see another cleanup go in first
before this to add a helper for trimming the last range off the set of
ranges:

static void dev_dax_trim_range(struct dev_dax *dev_dax)
{
int i = dev_dax->nr_range - 1;
struct range *range = _dax->ranges[i].range;
struct dax_region *dax_region = dev_dax->region;

dev_dbg(dev, "delete range[%d]: %#llx:%#llx\n", i,
(unsigned long long)range->start,
(unsigned long long)range->end);

__release_region(_region->res, range->start, range_len(range));
if (--dev_dax->nr_range == 0) {
kfree(dev_dax->ranges);
dev_dax->ranges = NULL;
}
}

Care to do a lead in patch with that cleanup, then do this one?

I think that might also cleanup a memory leak report from Jane in
addition to not needing the "goto" as well.

http://lore.kernel.org/r/c8a8a260-34c6-dbfc-1f19-25c23d01c...@oracle.com


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-17 Thread Dan Williams
On Thu, Dec 17, 2020 at 1:20 PM Alexandre Belloni
 wrote:
>
> Hello,
>
> On 05/12/2020 16:51:36+0100, Greg KH wrote:
> > > To me, the documentation was written, and reviewed, more from the
> > > perspective of "why not open code a custom bus instead". So I can see
> > > after the fact how that is a bit too much theory and justification and
> > > not enough practical application. Before the fact though this was a
> > > bold mechanism to propose and it was not clear that everyone was
> > > grokking the "why" and the tradeoffs.
> >
> > Understood, I guess I read this from the "of course you should do this,
> > now how do I use it?" point of view.  Which still needs to be addressed
> > I feel.
> >
> > > I also think it was a bit early to identify consistent design patterns
> > > across the implementations and codify those. I expect this to evolve
> > > convenience macros just like other parts of the driver-core gained
> > > over time. Now that it is in though, another pass through the
> > > documentation to pull in more examples seems warranted.
> >
> > A real, working, example would be great to have, so that people can know
> > how to use this.  Trying to dig through the sound or IB patches to view
> > how it is being used is not a trivial thing to do, which is why
> > reviewing this took so much work.  Having a simple example test module,
> > that creates a number of devices on a bus, ideally tied into the ktest
> > framework, would be great.  I'll attach below a .c file that I used for
> > some basic local testing to verify some of this working, but it does not
> > implement a aux bus driver, which needs to be also tested.
> >
>
> There is something I don't get from the documentation and it is what is
> this introducing that couldn't already be done using platform drivers
> and platform devices?

There is room for documentation improvement here. I realize reading it
back now that much of the justification for "why not platform bus?"
happened on the list, but only a small mention made it into the
document. It turns out that platform-bus has some special integrations
and hacks with platform-firmware implementations. For example, the
ACPI companion magic and specific platform firmware integrations in
platform_match(). It's also an awkward bus name to use because these
devices do not belong to the platform. The platform bus is for devices
that do not have an enumeration mechanism besides board files or
firmware descriptions.

So while many of the auxiliary device use cases might be able to be
squeezed into a platform-bus scheme it further overloads what is
already a wide responsibility.

In comparison, the auxiliary-bus is tailored to the "sub-function of a
parent device/driver" use case. It lets the host driver be the root of
a namespace of sub-functionality in a standard template way.

> We already have a bunch of drivers in tree that have to share a state
> and register other drivers from other subsystems for the same device.
> How is the auxiliary bus different?

There's also custom subsystem buses that do this. Why not other
alternatives? They didn't capture the simultaneous mindshare of RDMA,
SOF, and NETDEV developers. Personally my plans for using
auxiliary-bus do not map cleanly to anything else in the tree. I want
to use it for attaching an NPEM driver (Native PCIE Enclosure
Management) to any PCI device driver that opts-in, but it would be
overkill to go create an "npem" bus for this.


Re: [PATCH 1/1] device-dax: delete a redundancy check in dev_dax_validate_align()

2020-12-17 Thread Dan Williams
On Fri, Nov 20, 2020 at 1:27 AM Leizhen (ThunderTown)
 wrote:
>
>
>
> On 2020/11/20 17:20, Zhen Lei wrote:
> > After we have done the alignment check for the length of each range, the
> > alignment check for dev_dax_size(dev_dax) is no longer needed, because it
> > get the sum of the length of each range.
>
> For example:
> x/M + y/M = (x + y)/M
> If x/M is a integer and y/M is also a integer, then (x + y)/M must be a 
> integer.
>

True... I was going to say that the different error messages might be
useful, but those are debug statements anyways, so I'll apply this.


Re: [RFC PATCH 0/8] x86: Support Intel Key Locker

2020-12-17 Thread Dan Williams
On Thu, Dec 17, 2020 at 11:11 AM Eric Biggers  wrote:
>
> On Wed, Dec 16, 2020 at 09:41:38AM -0800, Chang S. Bae wrote:
> > [1] Intel Architecture Instruction Set Extensions Programming Reference:
> > 
> > https://software.intel.com/content/dam/develop/external/us/en/documents/architecture-instruction-set-$

https://software.intel.com/content/dam/develop/external/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf

> > [2] Intel Key Locker Specification:
> > 
> > https://software.intel.com/content/dam/develop/external/us/en/documents/343965-intel-key-locker-speci$

https://software.intel.com/content/dam/develop/external/us/en/documents/343965-intel-key-locker-specification.pdf


Re: [PATCH] dax: fix memory leak when rmmod dax.ko

2020-12-15 Thread Dan Williams
On Tue, Dec 1, 2020 at 5:54 AM Wang Hai  wrote:
>
> When I repeatedly modprobe and rmmod dax.ko, kmemleak report a
> memory leak as follows:
>
> unreferenced object 0x9a5588c05088 (size 8):
>   comm "modprobe", pid 261, jiffies 4294693644 (age 42.063s)
> ...
>   backtrace:
> [] kstrdup+0x35/0x70
> [<2ae73897>] kstrdup_const+0x3d/0x50
> [<2b00c9c3>] kvasprintf_const+0xbc/0xf0
> [<8023282f>] kobject_set_name_vargs+0x3b/0xd0
> [] kobject_set_name+0x62/0x90
> [<202e7a22>] bus_register+0x7f/0x2b0
> [<0b77792c>] 0xc02840f7
> [<2d5be5ac>] 0xc02840b4
> [] do_one_initcall+0x58/0x240
> [<049fe480>] do_init_module+0x56/0x1e2
> [<22671491>] load_module+0x2517/0x2840
> [<1a2201cb>] __do_sys_finit_module+0x9c/0xe0
> [<3eb304e7>] do_syscall_64+0x33/0x40
> [<51c5fd06>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> When rmmod dax is executed, dax_bus_exit() is missing. This patch
> can fix this bug.
>
> Fixes: 9567da0b408a ("device-dax: Introduce bus + driver model")
> Reported-by: Hulk Robot 
> Signed-off-by: Wang Hai 

Looks good, applied.

...with some fixups to the changelog to add Cc: stable and change the
title to "device-dax/core: Fix..."


Re: [PATCH -next] dax: pmem: convert comma to semicolon

2020-12-15 Thread Dan Williams
On Mon, Dec 14, 2020 at 5:45 AM Zheng Yongjun  wrote:
>
> Replace a comma between expression statements by a semicolon.
>
> Signed-off-by: Zheng Yongjun 

Thanks, applied.


Re: [PATCH] drivers: nvdimm: cleanup include of badblocks.h

2020-12-15 Thread Dan Williams
On Tue, Dec 15, 2020 at 8:36 AM Enrico Weigelt, metux IT consult
 wrote:
>
> * drivers/nvdimm/core.c doesn't use anything from badblocks.h on its own,
>   thus including it isn't needed. There's indeed indirect use, via funcs
>   in nd.h, but this one already includes badblocks.h.
>
> * drivers/nvdimm/claim.c calls stuff from badblocks.h and therefore should
>   include it on its own (instead of relying any other header doing that)
>
> * drivers/nvdimm/btt.h doesn't really need anything from badblocks.h and
>   can easily live with a forward declaration of struct badblocks (just having
>   pointers to it, but not dereferencing it anywhere)

Thanks, looks ok to me.

It was commit aa9ad44a42b4 ("libnvdimm: move poison list functions to
a new 'badrange' file") that left the straggling include in core.c.


Re: [PATCH] x86/mm: Fix leak of pmd ptlock

2020-12-15 Thread Dan Williams
Might I tempt an x86/mm maintainer to ack this, or a x86-tip
maintainer to apply it outright?

On Wed, Dec 2, 2020 at 10:28 PM Dan Williams  wrote:
>
> Commit 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
> introduced a new location where a pmd was released, but neglected to run
> the pmd page destructor. In fact, this happened previously for a
> different pmd release path and was fixed by commit:
>
> c283610e44ec ("x86, mm: do not leak page->ptl for pmd page tables").
>
> This issue was hidden until recently because the failure mode is silent,
> but commit:
>
> b2b29d6d0119 ("mm: account PMD tables like PTE tables")
>
> ...turns the failure mode into this signature:
>
>  BUG: Bad page state in process lt-pmem-ns  pfn:15943d
>  page:7262ed7b refcount:0 mapcount:-1024 mapping: 
> index:0x0 pfn:0x15943d
>  flags: 0xa8()
>  raw: 00a8 dead0100  
>  raw:  913a029bcc08 fbff 
>  page dumped because: nonzero mapcount
>  [..]
>   dump_stack+0x8b/0xb0
>   bad_page.cold+0x63/0x94
>   free_pcp_prepare+0x224/0x270
>   free_unref_page+0x18/0xd0
>   pud_free_pmd_page+0x146/0x160
>   ioremap_pud_range+0xe3/0x350
>   ioremap_page_range+0x108/0x160
>   __ioremap_caller.constprop.0+0x174/0x2b0
>   ? memremap+0x7a/0x110
>   memremap+0x7a/0x110
>   devm_memremap+0x53/0xa0
>   pmem_attach_disk+0x4ed/0x530 [nd_pmem]
>   ? __devm_release_region+0x52/0x80
>   nvdimm_bus_probe+0x85/0x210 [libnvdimm]
>
> Given this is a repeat occurrence it seemed prudent to look for other
> places where this destructor might be missing and whether a better
> helper is needed. try_to_free_pmd_page() looks like a candidate, but
> testing with setting up and tearing down pmd mappings via the dax unit
> tests is thus far not triggering the failure. As for a better helper
> pmd_free() is close, but it is a messy fit due to requiring an @mm arg.
> Also, ___pmd_free_tlb() wants to call paravirt_tlb_remove_table()
> instead of free_page(), so open-coded pgtable_pmd_page_dtor() seems the
> best way forward for now.
>
> Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
> Cc: 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: x...@kernel.org
> Cc: "H. Peter Anvin" 
> Co-debugged-by: Matthew Wilcox 
> Tested-by: Yi Zhang 
> Signed-off-by: Dan Williams 
> ---
>  arch/x86/mm/pgtable.c |2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index dfd82f51ba66..f6a9e2e36642 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -829,6 +829,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
> }
>
> free_page((unsigned long)pmd_sv);
> +
> +   pgtable_pmd_page_dtor(virt_to_page(pmd));
> free_page((unsigned long)pmd);
>
> return 1;
>


Re: [RFC PATCH 10/14] cxl/mem: Add send command

2020-12-15 Thread Dan Williams
On Tue, Dec 15, 2020 at 1:44 PM Ben Widawsky  wrote:
[..]
> > > +static int handle_mailbox_cmd_from_user(struct cxl_memdev *cxlmd,
> > > +   const struct cxl_mem_command *cmd,
> > > +   struct cxl_send_command __user *u)
> > > +{
> > > +   struct mbox_cmd mbox_cmd;
> > > +   ssize_t payload_size;
> > > +   void *payload;
> > > +   u32 size_out;
> > > +   int rc;
> > > +
> > > +   if (get_user(size_out, >size_out))
> > > +   return -EFAULT;
> > > +
> > > +   payload_size = max_t(ssize_t, cmd->info.size_in, size_out);
> > > +   if (payload_size) {
> > > +   payload =
> > > +   memdup_user(u64_to_user_ptr(u->payload), 
> > > payload_size);
> >
> > Me thinks this should be vmemdup_user() for payloads that exceed the
> > kmalloc() max, and I think it would be worthwhile to clamp @size_out
> > to some maximum and not let userspace ask for gigantic payloads.
> > Return EINVAL for payloads greater than... 4MB? At least 4MB is the
> > arbitrary max that libnvdimm picked.
> >
>
> This is handled in cxl_validate_cmd_from_user() currently. The current API
> allows the user to specify as large as they like with @size_out but the kernel
> only writes out as much as the hardware returns More on this below [1].
[..]
> > > + *  * %-EPERM  - Protected command used by the RAW interface.
> > > + *  * %-ENOMEM - Input or output buffer wasn't large enough.
> > > + *
> > > + */
> > > +static int cxl_validate_cmd_from_user(struct cxl_send_command __user 
> > > *user_cmd,
> > > + struct cxl_mem_command *out_cmd)
> > > +{
> > > +   const struct cxl_command_info *info;
> > > +   struct cxl_send_command cmd;
> > > +   struct cxl_mem_command *c;
> > > +
> > > +   if (copy_from_user(, user_cmd, sizeof(cmd)))
> > > +   return -EFAULT;
> > > +
> > > +   if (cmd.id == 0 || cmd.id >= CXL_MEM_COMMAND_ID_MAX)
> > > +   return -EINVAL;
> >
> > I wonder if the "cmd.id >= CXL_MEM_COMMAND_ID_MAX" case should return
> > -ENOTTY. The command might be perfectly valid, just the kernel does
> > not have that command specified which would be the case with older
> > kernel with newer userspace.
> >
>
> I think ENOTTY could be a bit confusing here as it could be confused for the
> SEND ioctl not being present.
>
> What about ENOENT, or something else?

"No such file or directory" seems more confusing to me than
"Inappropriate I/O control operation". What version of this driver,
once it's upstream, will not support the SEND ioctl? I think it is
safe to assume that the mere fact that the driver is attached means
SEND is present.

> > > +
> > > +   c = _commands[cmd.id];
> > > +   info = >info;
> > > +
> > > +   if (cmd.flags & CXL_MEM_COMMAND_FLAG_MASK)
> > > +   return -EINVAL;
> > > +
> > > +   if (cmd.rsvd)
> > > +   return -EINVAL;
> > > +
> > > +   /* Check the input buffer is the expected size */
> > > +   if (info->size_in >= 0 && info->size_in != cmd.size_in)
> > > +   return -ENOMEM;
> > > +
> > > +   /* Check the output buffer is at least large enough */
> > > +   if (info->size_out >= 0 && cmd.size_out < info->size_out)
> > > +   return -ENOMEM;
> > > +
> > > +   memcpy(out_cmd, c, sizeof(*c));
> >
> > Why not do the copy_from_user() directly into out_cmd to save a copy 
> > operation?
> >
>
> [1]
> cxl_validate_cmd_from_user() essentially translates a user's command into the
> internal kernel representation of the command via the lookup into the array.
> This will do things like adjust the output size or flags, to prevent userspace
> from doing nefarious things. An actual command, especially now that you had me
> remove the name[32] will be relatively small and I see the extra copy as being
> well worth being able to isolate this command sanitization.

I can buy that, but it does not answer my other question above about
cases where a valid command has size_{in,out} larger than the kmalloc
max. vmemdup_user() lets you support arbitrary command payload sizes
larger than KMALLOC_MAX_SIZE. For example, I might want to overwrite
the entire label area at once in a single write.

>
> > > +
> > > +   return 0;
> > >  }
> > >
> > >  static long cxl_mem_ioctl(struct file *file, unsigned int cmd, unsigned 
> > > long arg)
> > > @@ -357,6 +471,19 @@ static long cxl_mem_ioctl(struct file *file, 
> > > unsigned int cmd, unsigned long arg
> > >
> > > j++;
> > > }
> > > +
> > > +   return 0;
> > > +   } else if (cmd == CXL_MEM_SEND_COMMAND) {
> > > +   struct cxl_send_command __user *u = (void __user *)arg;
> > > +   struct cxl_memdev *cxlmd = file->private_data;
> > > +   struct cxl_mem_command c;
> > > +   int rc;
> > > +
> > > +

Re: [PATCH v17 3/3] bus: mhi: Add userspace client interface driver

2020-12-11 Thread Dan Williams
On Fri, 2020-12-11 at 08:44 +0100, Greg KH wrote:
> On Thu, Dec 10, 2020 at 11:04:11PM -0800, Hemant Kumar wrote:
> > This MHI client driver allows userspace clients to transfer
> > raw data between MHI device and host using standard file
> > operations.
> > Driver instantiates UCI device object which is associated to device
> > file node. UCI device object instantiates UCI channel object when
> > device
> > file node is opened. UCI channel object is used to manage MHI
> > channels
> > by calling MHI core APIs for read and write operations. MHI
> > channels
> > are started as part of device open(). MHI channels remain in start
> > state until last release() is called on UCI device file node.
> > Device
> > file node is created with format
> > 
> > /dev/
> > 
> > Currently it supports QMI channel. libqmi is userspace MHI client
> > which
> > communicates to a QMI service using QMI channel. libqmi is a glib-
> > based
> > library for talking to WWAN modems and devices which speaks QMI
> > protocol.
> > For more information about libqmi please refer
> > https://www.freedesktop.org/wiki/Software/libqmi/
> 
> This says _what_ this is doing, but not _why_.
> 
> Why do you want to circumvent the normal user/kernel apis for this
> type
> of device and move the normal network handling logic out to
> userspace?
> What does that help with?  What does the current in-kernel api lack
> that
> this userspace interface is going to solve, and why can't the in-
> kernel
> api solve it instead?
> 
> You are pushing a common user/kernel api out of the kernel here, to
> become very device-specific, with no apparent justification as to why
> this is happening.
> 
> Also, because you are going around the existing network api, I will
> need
> the networking maintainers to ack this type of patch.

Just to re-iterate: QMI ~= AT commands ~= MBIM (not quite, but same
level)

We already do QMI-over-USB, or AT-over-CDC-ACM. This is QMI-over-MHI.

It's not networking data plane. It's WWAN device configuration.

There are no current kernel APIs for this, and I really don't think we
want there to be. The API surface is *huge* and we definitely don't
want that in-kernel.

Dan



Re: [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL

2020-12-09 Thread Dan Williams
On Wed, Dec 9, 2020 at 6:07 PM Logan Gunthorpe  wrote:
>
>
>
> On 2020-12-09 6:22 p.m., Dan Williams wrote:
> > On Mon, Nov 9, 2020 at 8:47 AM Logan Gunthorpe  wrote:
> >>
> >>
> >>
> >> On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
> >>> On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
> >>>> We make use of the top bit of the dma_length to indicate a P2PDMA
> >>>> segment.
> >>>
> >>> I don't think "we" can.  There is nothing limiting the size of a SGL
> >>> segment.
> >>
> >> Yes, I expected this would be the unacceptable part. Any alternative ideas?
> >
> > Why is the SG_P2PDMA_FLAG needed as compared to checking the SGL
> > segment-pages for is_pci_p2pdma_page()?
>
> Because the DMA and page segments in the SGL aren't necessarily aligned...
>
> The IOMMU implementations can coalesce multiple pages into fewer DMA
> address ranges, so the page pointed to by sg->page_link may not be the
> one that corresponds to the address in sg->dma_address for a given segment.
>
> If that makes sense -- it's not the easiest thing to explain.

It does...

Did someone already grab, or did you already consider the 3rd
available bit in page_link? AFAICS only SG_CHAIN and SG_END are
reserved. However, if you have a CONFIG_64BIT dependency for
user-directed p2pdma that would seem to allow SG_P2PDMA_FLAG to be
(0x4) in page_link.


Re: [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL

2020-12-09 Thread Dan Williams
On Mon, Nov 9, 2020 at 8:47 AM Logan Gunthorpe  wrote:
>
>
>
> On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
> > On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
> >> We make use of the top bit of the dma_length to indicate a P2PDMA
> >> segment.
> >
> > I don't think "we" can.  There is nothing limiting the size of a SGL
> > segment.
>
> Yes, I expected this would be the unacceptable part. Any alternative ideas?

Why is the SG_P2PDMA_FLAG needed as compared to checking the SGL
segment-pages for is_pci_p2pdma_page()?


Re: [RFC PATCH 14/14] WIP/cxl/mem: Add get firmware for testing

2020-12-09 Thread Dan Williams
On Tue, Dec 8, 2020 at 4:25 PM Ben Widawsky  wrote:
>
> This also serves as an example how to add a new command
>
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/mem.c| 22 ++
>  include/uapi/linux/cxl_mem.h |  3 ++-
>  2 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 6b2f8d3776b5..76aa1e6e4117 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -116,6 +116,7 @@ struct cxl_mem_command {
>  static struct cxl_mem_command mem_commands[] = {
> CXL_CMD(INVALID, NONE, 0, 0, "Reserved", false, 0),
> CXL_CMD(RAW, TAINT, ~0, ~0, "Raw", true, 0),
> +   CXL_CMD(GET_FW_INFO, NONE, 0, 0x50, "Get FW Info", false, 0x0200),

I think CXL_CMD() arguments can be put on a diet if the
mem-command-id-to-name was moved to its own table, and the opcode was
defined as something like:

#define CXL_MBOX_OP_GET_FW_INFO 0x200

...so that CXL_CMD can just do:

.opcode = CXL_MBOX_OP_##_id

That @_enable arg wants a flag #define like CMD_ENABLE which reads
better than 'true'. I would then add CMD_KERNEL_EXCL alongside that
flag to indicate the commands that may be exclusive to the kernel when
the device is active in a PMEM memory region, or ones that have an
alternate ABI wrapped around them like the keys subsystem for security
or the firwmare activation sysfs interface.

>  };
>
>  static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
> @@ -827,6 +828,23 @@ static int cxl_mem_add_memdev(struct cxl_mem *cxlm)
> return cxl_register(dev);
>  }
>
> +static int cxl_mem_enable_commands(struct cxl_mem *cxlm)
> +{
> +   struct cxl_mem_command *c;
> +
> +   /*
> +* For now we pretend Get FW info is supported.
> +*
> +* FIXME: Invoke GET LOG to get the Command Effect Logs (CEL).
> +*/
> +   c = cxl_mem_find_command(0x200);
> +   if (!c)
> +   return -ENOENT;
> +
> +   c->enable = true;
> +   return 0;
> +}
> +
>  /**
>   * cxl_mem_identify() - Send the IDENTIFY command to the device.
>   * @cxlm: The device to identify.
> @@ -936,6 +954,10 @@ static int cxl_mem_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
> if (rc)
> return rc;
>
> +   rc = cxl_mem_enable_commands(cxlm);
> +   if (rc)
> +   return rc;
> +
> rc = cxl_mem_identify(cxlm);
> if (rc)
> return rc;
> diff --git a/include/uapi/linux/cxl_mem.h b/include/uapi/linux/cxl_mem.h
> index f2fbb0dcda06..3ac39acf8fa7 100644
> --- a/include/uapi/linux/cxl_mem.h
> +++ b/include/uapi/linux/cxl_mem.h
> @@ -50,7 +50,8 @@ struct cxl_command_info {
> __u32 id;
>  #define CXL_MEM_COMMAND_ID_INVALID 0
>  #define CXL_MEM_COMMAND_ID_RAW 1
> -#define CXL_MEM_COMMAND_ID_MAX (CXL_MEM_COMMAND_ID_RAW + 1)
> +#define CXL_MEM_COMMAND_ID_GET_FW_INFO 2
> +#define CXL_MEM_COMMAND_ID_MAX (CXL_MEM_COMMAND_ID_GET_FW_INFO + 1)

Seems like CXL_MEM_COMMAND_ID definitions want to be an enum so that
CXL_MEM_COMMAND_ID_MAX does not need to be touched every time.

>
> __u32 flags;
>  #define CXL_MEM_COMMAND_FLAG_NONE 0
> --
> 2.29.2
>


Re: [RFC PATCH 11/14] cxl/mem: Add a "RAW" send command

2020-12-09 Thread Dan Williams
On Tue, Dec 8, 2020 at 4:24 PM Ben Widawsky  wrote:
>
> The CXL memory device send interface will have a number of supported
> commands. The raw command is not such a command. Raw commands allow
> userspace to send a specified opcode to the underlying hardware and
> bypass all driver checks on the command. This is useful for a couple of
> usecases, mainly:
> 1. Undocumented vendor specific hardware commands
> 2. Prototyping new hardware commands not yet supported by the driver
>
> While this all sounds very powerful it comes with a couple of caveats:
> 1. Bug reports using raw commands will not get the same level of
>attention as bug reports using supported commands (via taint).
> 2. Supported commands will be rejected by the RAW command.
>
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/mem.c| 32 
>  include/uapi/linux/cxl_mem.h | 14 --
>  2 files changed, 44 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 0bf03afc0c80..a2cea7ac7cc6 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -115,6 +115,7 @@ struct cxl_mem_command {
>
>  static struct cxl_mem_command mem_commands[] = {
> CXL_CMD(INVALID, NONE, 0, 0, "Reserved", false, 0),
> +   CXL_CMD(RAW, TAINT, ~0, ~0, "Raw", true, 0),

Why is the taint indication in the ABI? It seems like it only needs to
be documented.

>  };
>
>  static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
> @@ -326,6 +327,20 @@ static int cxl_mem_count_commands(void)
> return n;
>  };
>
> +static struct cxl_mem_command *cxl_mem_find_command(u16 opcode)
> +{
> +   int i;
> +
> +   for (i = 0; i < ARRAY_SIZE(mem_commands); i++) {
> +   struct cxl_mem_command *c = _commands[i];
> +
> +   if (c->opcode == opcode)
> +   return c;
> +   }
> +
> +   return NULL;
> +};
> +
>  /**
>   * handle_mailbox_cmd_from_user() - Dispatch a mailbox command.
>   * @cxlmd: The CXL memory device to communicate with.
> @@ -421,6 +436,23 @@ static int cxl_validate_cmd_from_user(struct 
> cxl_send_command __user *user_cmd,
> c = _commands[cmd.id];
> info = >info;
>
> +   /* Checks are bypassed for raw commands but along comes the taint! */
> +   if (cmd.id == CXL_MEM_COMMAND_ID_RAW) {
> +   struct cxl_mem_command temp =
> +   CXL_CMD(RAW, NONE, cmd.size_in, cmd.size_out, "Raw",
> +   true, cmd.raw.opcode);

Oh, I thought CXL_CMD() was only used to populate the mem_commands
array. Feels out of place to use it here when all it is doing is
updating the size_{in,out} and opcode fields. Mainly I'm interested in
CXL_CMD() enforcing that the command-id is the mem_commands index.

> +
> +   if (cmd.raw.rsvd)
> +   return -EINVAL;
> +
> +   if (cxl_mem_find_command(cmd.raw.opcode))
> +   return -EPERM;
> +
> +   add_taint(TAINT_WARN, LOCKDEP_STILL_OK);

TAINT_WARN seems the wrong value, especially since no WARN has
occurred. I feel that this is more in the spirit of
TAINT_PROPRIETARY_MODULE, TAINT_OVERRIDDEN_ACPI_TABLE, and
TAINT_OOT_MODULE. How about a new TAINT_RAW_PASSTHROUGH? I could use
this for the acpi/nfit driver as well to disclaim responsibility for
system errors that can result from not using the nominal
kernel-provided commands.


Re: [RFC PATCH 10/14] cxl/mem: Add send command

2020-12-09 Thread Dan Williams
On Tue, Dec 8, 2020 at 4:24 PM Ben Widawsky  wrote:
>
> The send command allows userspace to issue mailbox commands directly to
> the hardware. The driver will verify basic properties of the command but
> otherwise pass any input payload untouched to the hardware, and return
> the output payload to userspace.
>
> The caller of this IOCTL is required to allocate enough space for
> max(size_in, size_out) of the payload. The payload input data will be
> wiped out if any output payload exists.

Seems awkward for the kernel to overwrite input payloads. I can see
that potentially complicating userspace if it has the same payload to
send to multiple devices. What's the rationale for overwriting input?

>
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/mem.c| 127 +++
>  include/uapi/linux/cxl_mem.h |  35 ++
>  2 files changed, 162 insertions(+)
>
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 2c4aadcea0e4..0bf03afc0c80 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -324,6 +324,120 @@ static int cxl_mem_count_commands(void)
> }
>
> return n;
> +};
> +
> +/**
> + * handle_mailbox_cmd_from_user() - Dispatch a mailbox command.
> + * @cxlmd: The CXL memory device to communicate with.
> + * @cmd: The validated command
> + * @u: The command submitted by userspace. Only useful for RAW commands.
> + *
> + * Return: 0 on success.
> + *
> + * This function packages up a  mbox_cmd on behalf of userspace,
> + * dispatches the command, and returns the results.
> + */
> +static int handle_mailbox_cmd_from_user(struct cxl_memdev *cxlmd,
> +   const struct cxl_mem_command *cmd,
> +   struct cxl_send_command __user *u)
> +{
> +   struct mbox_cmd mbox_cmd;
> +   ssize_t payload_size;
> +   void *payload;
> +   u32 size_out;
> +   int rc;
> +
> +   if (get_user(size_out, >size_out))
> +   return -EFAULT;
> +
> +   payload_size = max_t(ssize_t, cmd->info.size_in, size_out);
> +   if (payload_size) {
> +   payload =
> +   memdup_user(u64_to_user_ptr(u->payload), 
> payload_size);

Me thinks this should be vmemdup_user() for payloads that exceed the
kmalloc() max, and I think it would be worthwhile to clamp @size_out
to some maximum and not let userspace ask for gigantic payloads.
Return EINVAL for payloads greater than... 4MB? At least 4MB is the
arbitrary max that libnvdimm picked.

> +   if (IS_ERR(payload))
> +   return PTR_ERR(payload);
> +   }
> +
> +   rc = cxl_mem_mbox_get(cxlmd->cxlm);
> +   if (rc)
> +   return rc;
> +
> +   mbox_cmd = (struct mbox_cmd){

I'd expect a space between: ){ ...looks like clang-format does not
have that rule.

> +   .opcode = cmd->opcode,
> +   .payload = payload,
> +   .size_in = cmd->info.size_in,
> +   .size_out = size_out,
> +   };
> +   rc = cxl_mem_mbox_send_cmd(cxlmd->cxlm, _cmd);
> +   cxl_mem_mbox_put(cxlmd->cxlm);
> +   if (rc)
> +   goto out;
> +
> +   rc = put_user(mbox_cmd.return_code, >retval);
> +   if (rc)
> +   goto out;
> +
> +   rc = put_user(mbox_cmd.size_out, >size_out);
> +   if (rc)
> +   goto out;
> +
> +   if (mbox_cmd.size_out)
> +   if (copy_to_user(u64_to_user_ptr(u->payload), payload,
> +mbox_cmd.size_out))
> +   rc = -EFAULT;
> +
> +out:
> +   if (payload_size)
> +   kfree(payload);
> +   return rc;
> +}
> +
> +/**
> + * cxl_validate_cmd_from_user() - Check fields for CXL_MEM_SEND_COMMAND.
> + * @user_cmd:  cxl_send_command from userspace.

Ah cool, I did not realize kernel doc recognized type specifiers like
that, nice.

> + * @out_cmd: Sanitized and populared  cxl_mem_command.

s/populared/populated/

> + *
> + * Return:
> + *  * %0   - Command dispatched successfully.
> + *  * %-EFAULT - Something happened with copy_to/from_user.
> + *  * %-EINVAL - Rerserved fields were used.

s/Rerserved/Reserved/

> + *  * %-EPERM  - Protected command used by the RAW interface.
> + *  * %-ENOMEM - Input or output buffer wasn't large enough.
> + *
> + */
> +static int cxl_validate_cmd_from_user(struct cxl_send_command __user 
> *user_cmd,
> + struct cxl_mem_command *out_cmd)
> +{
> +   const struct cxl_command_info *info;
> +   struct cxl_send_command cmd;
> +   struct cxl_mem_command *c;
> +
> +   if (copy_from_user(, user_cmd, sizeof(cmd)))
> +   return -EFAULT;
> +
> +   if (cmd.id == 0 || cmd.id >= CXL_MEM_COMMAND_ID_MAX)
> +   return -EINVAL;

I wonder if the "cmd.id >= CXL_MEM_COMMAND_ID_MAX" case should return
-ENOTTY. The command might be perfectly valid, just the kernel does
not have 

Re: [PATCH V2 2/2] mm/highmem: Lift memcpy_[to|from]_page to core

2020-12-09 Thread Dan Williams
On Wed, Dec 9, 2020 at 12:14 PM Matthew Wilcox  wrote:
[..]
> If we put in into a separate patch, someone will suggest backing out the
> patch which tells us that there's a problem.  You know, like this guy ...
> https://lore.kernel.org/linux-mm/CAPcyv4jNVroYmirzKw_=cseixoeacdl3m1wc4xjd_tfv3h+...@mail.gmail.com/

...and that person, like in this case, will be told 'no' and go on to
find / fix the real problem.


Re: [PATCH V2 2/2] mm/highmem: Lift memcpy_[to|from]_page to core

2020-12-09 Thread Dan Williams
On Tue, Dec 8, 2020 at 8:03 PM Matthew Wilcox  wrote:
>
> On Tue, Dec 08, 2020 at 06:22:50PM -0800, Ira Weiny wrote:
> > Right now we have a mixed bag.  zero_user() [and it's variants, circa 2008]
> > does a BUG_ON.[0]  While the other ones do nothing; clear_highpage(),
> > clear_user_highpage(), copy_user_highpage(), and copy_highpage().
>
> Erm, those functions operate on the entire PAGE_SIZE.  There's nothing
> for them to check.
>
> > While continuing to audit the code I don't see any users who would violating
> > the API with a simple conversion of the code.  The calls which I have 
> > worked on
> > [which is many at this point] all have checks in place which are well aware 
> > of
> > page boundaries.
>
> Oh good, then this BUG_ON won't trigger.
>
> > Therefore, I tend to agree with Dan that if anything is to be done it 
> > should be
> > a WARN_ON() which is only going to throw an error that something has 
> > probably
> > been wrong all along and should be fixed but continue running as before.
>
> Silent data corruption is for ever.  Are you absolutely sure nobody has
> done:
>
> page = alloc_pages(GFP_HIGHUSER_MOVABLE, 3);
> memcpy_to_page(page, PAGE_SIZE * 2, p, PAGE_SIZE * 2);
>
> because that will work fine if the pages come from ZONE_NORMAL and fail
> miserably if they came from ZONE_HIGHMEM.

...and violently regress with the BUG_ON.

The question to me is: which is more likely that any bad usages have
been covered up by being limited to ZONE_NORMAL / 64-bit only, or that
silent data corruption has been occurring with no ill effects?

> > FWIW I think this is a 'bad BUG_ON' use because we are "checking something 
> > that
> > we know we might be getting wrong".[1]  And because, "BUG() is only good for
> > something that never happens and that we really have no other option 
> > for".[2]
>
> BUG() is our only option here.  Both limiting how much we copy or
> copying the requested amount result in data corruption or leaking
> information to a process that isn't supposed to see it.

At a minimum I think this should be debated in a follow on patch to
add assertion checking where there was none before. There is no
evidence of a page being overrun in the audit Ira performed.


Re: [Ksummit-discuss] crediting bug reports and fixes folded into original patch

2020-12-09 Thread Dan Williams
On Wed, Dec 9, 2020 at 2:31 AM Dan Carpenter  wrote:
>
> On Wed, Dec 09, 2020 at 12:54:30AM -0800, Joe Perches wrote:
> > On Wed, 2020-12-09 at 10:58 +0300, Dan Carpenter wrote:
> > > On Tue, Dec 08, 2020 at 09:01:49PM -0800, Joe Perches wrote:
> > > > On Tue, 2020-12-08 at 16:34 -0800, Kees Cook wrote:
> > > >
> > > > > If not "Adjusted-by", what about "Tweaked-by", "Helped-by",
> > > > > "Corrected-by"?
> > > >
> > > > Improved-by: / Enhanced-by: / Revisions-by:
> > > >
> > >
> > > I don't think we should give any credit for improvements or enhancements,
> > > only for fixes.
> >
> > Hey Dan.
> >
> > I do.  If a patch isn't comprehensive and a reviewer notices some
> > missing coverage or algorithmic performance enhancement, I think that
> > should be noted.
> >
> > > Complaining about style is its own reward.
> >
> > 
> >
> > I believe I've said multiple times that style changes shouldn't require
> > additional commentary added to a patch.
> >
> > I'm not making any suggestion to comment for style, only logic or defect
> > reduction/improvements as described above.
>
> How about we make the standard, "Would this fix be backported to stable?"
>
> >
> > > Having to redo a patch is already a huge headache.  Normally, I already
> > > considered the reviewer's prefered style and decided I didn't like it.
> >
> > Example please.  We both seem to prefer consistent style.
> >
>
> For example, if you have a signedness bugs:
>
> ret = frob(unsigned_long_size);
> -   if (ret < unsigned_long_size)
> +   if (ret < 0 || ret < unsigned_long_size)
> vs:
> +   if (ret < (int)unsigned_long_size)
> goto whatever;
>
> To me, whoever fixes the bug gets to choose their prefered style but
> maybe some reviewers have strong feelings one way or the other.
>
> > > Then to make me redo the patch in an ugly style and say thank you on
> > > top of that???  Forget about it.
> >
> > Not a thing I've asked for.
> >
> > >  Plus, as a reviewer I hate reviewing patches over and over.
> >
> > interdiff could be improved.
> >
> > > I've argued for years that we should have a Fixes-from: tag.  The zero
> > > day bot is already encouraging people to add Reported-by tags for this
> > > and a lot of people do.
> >
> > It's still a question of what fixes means in any context.
> >
> > https://www.google.com/search?q=%27fixes-from%3A%27%20carpenter%20site%3Alore.kernel.org
> > gives:
> > It looks like there aren't many great matches for your search
> >
>
> No, I mean people add Reported-by tags for fixes to the original commit
> like in commit f026d8ca2904 ("igc: add support to eeprom, registers and
> link self-tests").

For after the fact post-processing of tags to generate summary
reports, is there a significant difference between Reported-by for
"original commit motivation" and Reported-by for "follow-on fixups"?
Especially since this practice of Reported-by for fixups is already
deployed in the tree (at least kbuild-robot credit reports and my
subsystems operate this way).

If the fix is a slightly late and needs to come as a follow-on patch
the tag will be Reported-by on that fix. I fail to perceive a benefit
in augmenting the tag to indicate the resolution of the race condition
of the commit making it upstream.


Re: [RFC PATCH 09/14] cxl/mem: Add basic IOCTL interface

2020-12-08 Thread Dan Williams
On Tue, Dec 8, 2020 at 6:13 PM Ben Widawsky  wrote:
>
> On 20-12-08 17:37:50, Dan Williams wrote:
> > On Tue, Dec 8, 2020 at 4:24 PM Ben Widawsky  wrote:
> > >
> > > Add a straightforward IOCTL that provides a mechanism for userspace to
> > > query the supported memory device commands.
> > >
> > > Memory device commands are specified in 8.2.9 of the CXL 2.0
> > > specification. They are submitted through a mailbox mechanism specified
> > > in 8.2.8.4.
> > >
> > > Signed-off-by: Ben Widawsky 
> > >
> > > ---
> > >
> > > I did attempt to use the same struct for querying commands as well as
> > > sending commands (upcoming patch). The number of unused fields between
> > > the two made for a bad fit IMO.
> > >
> > > Signed-off-by: Ben Widawsky 
> > > ---
> > >  Documentation/cxl/memory-devices.rst |   9 +++
> > >  drivers/cxl/mem.c|  89 +++
> > >  include/uapi/linux/cxl_mem.h | 102 +++
> > >  3 files changed, 200 insertions(+)
> > >  create mode 100644 include/uapi/linux/cxl_mem.h
> > >
> > > diff --git a/Documentation/cxl/memory-devices.rst 
> > > b/Documentation/cxl/memory-devices.rst
> > > index 5f723c25382b..ec54674b3822 100644
> > > --- a/Documentation/cxl/memory-devices.rst
> > > +++ b/Documentation/cxl/memory-devices.rst
> > > @@ -32,6 +32,15 @@ CXL Memory Device
> > >  .. kernel-doc:: drivers/cxl/mem.c
> > > :internal:
> > >
> > > +CXL IOCTL Interface
> > > +---
> > > +
> > > +.. kernel-doc:: include/uapi/linux/cxl_mem.h
> > > +   :doc: UAPI
> > > +
> > > +.. kernel-doc:: include/uapi/linux/cxl_mem.h
> > > +   :internal:
> > > +
> > >  External Interfaces
> > >  ===
> > >
> > > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > > index bb6ea58f6c7b..2c4aadcea0e4 100644
> > > --- a/drivers/cxl/mem.c
> > > +++ b/drivers/cxl/mem.c
> > > @@ -7,6 +7,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include "acpi.h"
> > >  #include "pci.h"
> > >  #include "cxl.h"
> > > @@ -73,6 +74,49 @@ static DEFINE_IDR(cxl_mem_idr);
> > >  /* protect cxl_mem_idr allocations */
> > >  static DEFINE_MUTEX(cxl_memdev_lock);
> > >
> > > +/*
> > > + * This table defines the supported mailboxes commands for the driver. 
> > > The id is
> > > + * ordinal and thus gaps in this table aren't allowed. This table is 
> > > made up of
> > > + * a UAPI structure. Non-negative values in the table will be validated 
> > > against
> > > + * the user's input. For example, if size_in is 0, and the user passed 
> > > in 1, it
> > > + * is an error.
> > > + */
> > > +#define CXL_CMD(_id, _flags, sin, sout, _name, _enable, op)  
> > >   \
> > > +   { 
> > >  \
> > > +   { .id = CXL_MEM_COMMAND_ID_##_id, 
> > >  \
> > > + .flags = CXL_MEM_COMMAND_FLAG_##_flags, 
> > >  \
> > > + .size_in = sin, 
> > >  \
> > > + .size_out = sout,   
> > >  \
> > > + .name = _name },
> > >  \
> > > +   .enable = _enable, .opcode = op   
> > >  \
> > > +   }
> >
> > Seems the ordinality requirement could be dropped if the definition was:
> >
> > #define CXL_CMD(_id, _flags, sin, sout, _name, _enable, op) 
> >\
> >[CXL_MEM_COMMAND_ID_##_id] = {
> >  \
> >{ .id = CXL_MEM_COMMAND_ID_##_id,
> >   \
> > ...
> >
> > Then command 0 and 42 could be defined out of order in the table.
> > Especially if we need to config-disable or deprecate commands, I think
> > it would be useful if this table was tolerant to being sparse.
> >
>
> How sparse are we talking? The current form does support sparseness, but
> obviously gets quite la

Re: [RFC PATCH 12/14] cxl: Add basic debugging

2020-12-08 Thread Dan Williams
On Tue, Dec 8, 2020 at 6:04 PM Ben Widawsky  wrote:
>
> On 20-12-08 17:17:36, Dan Williams wrote:
> > On Tue, Dec 8, 2020 at 4:24 PM Ben Widawsky  wrote:
> > >
> > > Provide a standard debug function for use throughout the driver.
> > >
> > > Signed-off-by: Ben Widawsky 
> > > ---
> > >  drivers/cxl/cxl.h |  3 +++
> > >  drivers/cxl/mem.c | 26 +-
> > >  2 files changed, 28 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > > index 77c2dee6843c..e5afb89dab0b 100644
> > > --- a/drivers/cxl/cxl.h
> > > +++ b/drivers/cxl/cxl.h
> > > @@ -9,6 +9,9 @@
> > >  #include 
> > >  #include 
> > >
> > > +#define cxl_debug(fmt, ...)  
> > >   \
> > > +   pr_debug("CXL DEBUG: %s: " fmt, __func__, ##__VA_ARGS__)
> > > +
> >
> > This should be dev_dbg(), then you don't need the CXL DEBUG prefix. In
> > fact you don't need a cxl_debug() macro at all in that case. cxl_mem
> > might need a ->dev attribute for this purpose.
> >
>
> I really like the ability to turn specific messages on and off at will. (FWIW,
> __func__ is also redundant because pr_debug allows you to specify a flag to
> always print the function name). While it's not very frequent events here, in
> the future it likely will be and I think it can be really helpful to be able 
> to
> have that level of control.
>
> If you want to avoid creating a new debug functionality, I'm okay with that, 
> but
> I'd really like to use pr_debug instead of dev_dbg for those messages going
> forward. Once you take that step, it seems giving contributors a macro named
> 'cxl_debug' so they don't have to figure out when to use what, makes sense. My
> mental separation is, dev_* is useful primarily for errors and initialization
> debug messaging, pr_debug/trace_printk is for runtime things.
>
> I probably should have put that in the commit message...

I suspect you haven't taken a look at the backed and of pr_debug() and
dev_dbg() in a while? They both use _dynamic_func_call and enjoy all
the benefits afforded by /sys/debug/dynamic_debug/control and the
"dyndbg" module option for adding __func__ and enable / disable by
line number or format message . pr_debug() and this cxl_debug() macro
are completely superseded by dev_dbg(). Even if a driver wanted a
common prefix on all prints there is the "#define dev_fmt ..."
mechanism for that.

>
> > >  #define CXL_SET_FIELD(value, field)  
> > >   \
> > > ({
> > >  \
> > > WARN_ON(!FIELD_FIT(field##_MASK, value)); 
> > >  \
> > > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > > index a2cea7ac7cc6..6b2f8d3776b5 100644
> > > --- a/drivers/cxl/mem.c
> > > +++ b/drivers/cxl/mem.c
> > > @@ -122,9 +122,12 @@ static int cxl_mem_wait_for_doorbell(struct cxl_mem 
> > > *cxlm)
> > >  {
> > > const int timeout = msecs_to_jiffies(2000);
> > > const unsigned long start = jiffies;
> > > +   unsigned long end = start;
> > >
> > > while (cxl_doorbell_busy(cxlm)) {
> > > -   if (time_after(jiffies, start + timeout)) {
> > > +   end = jiffies;
> > > +
> > > +   if (time_after(end, start + timeout)) {
> > > /* Check again in case preempted before timeout 
> > > test */
> > > if (!cxl_doorbell_busy(cxlm))
> > > break;
> > > @@ -133,6 +136,8 @@ static int cxl_mem_wait_for_doorbell(struct cxl_mem 
> > > *cxlm)
> > > cpu_relax();
> > > }
> > >
> > > +   cxl_debug("Doorbell wait took %dms",
> > > + jiffies_to_msecs(end) - jiffies_to_msecs(start));
> > > return 0;
> > >  }
> > >
> > > @@ -180,6 +185,8 @@ static int cxl_mem_mbox_send_cmd(struct cxl_mem *cxlm,
> > > }
> > >
> > > /* #4 */
> > > +   cxl_debug("Sending command to %s\n",
> > > + dev_driver_string(>pdev->dev));
> >
> > dev_dbg() already includes dev_driver_string().
> >
> > > cxl_write_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET,
> > >   

Re: [RFC PATCH 09/14] cxl/mem: Add basic IOCTL interface

2020-12-08 Thread Dan Williams
On Tue, Dec 8, 2020 at 4:24 PM Ben Widawsky  wrote:
>
> Add a straightforward IOCTL that provides a mechanism for userspace to
> query the supported memory device commands.
>
> Memory device commands are specified in 8.2.9 of the CXL 2.0
> specification. They are submitted through a mailbox mechanism specified
> in 8.2.8.4.
>
> Signed-off-by: Ben Widawsky 
>
> ---
>
> I did attempt to use the same struct for querying commands as well as
> sending commands (upcoming patch). The number of unused fields between
> the two made for a bad fit IMO.
>
> Signed-off-by: Ben Widawsky 
> ---
>  Documentation/cxl/memory-devices.rst |   9 +++
>  drivers/cxl/mem.c|  89 +++
>  include/uapi/linux/cxl_mem.h | 102 +++
>  3 files changed, 200 insertions(+)
>  create mode 100644 include/uapi/linux/cxl_mem.h
>
> diff --git a/Documentation/cxl/memory-devices.rst 
> b/Documentation/cxl/memory-devices.rst
> index 5f723c25382b..ec54674b3822 100644
> --- a/Documentation/cxl/memory-devices.rst
> +++ b/Documentation/cxl/memory-devices.rst
> @@ -32,6 +32,15 @@ CXL Memory Device
>  .. kernel-doc:: drivers/cxl/mem.c
> :internal:
>
> +CXL IOCTL Interface
> +---
> +
> +.. kernel-doc:: include/uapi/linux/cxl_mem.h
> +   :doc: UAPI
> +
> +.. kernel-doc:: include/uapi/linux/cxl_mem.h
> +   :internal:
> +
>  External Interfaces
>  ===
>
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index bb6ea58f6c7b..2c4aadcea0e4 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "acpi.h"
>  #include "pci.h"
>  #include "cxl.h"
> @@ -73,6 +74,49 @@ static DEFINE_IDR(cxl_mem_idr);
>  /* protect cxl_mem_idr allocations */
>  static DEFINE_MUTEX(cxl_memdev_lock);
>
> +/*
> + * This table defines the supported mailboxes commands for the driver. The 
> id is
> + * ordinal and thus gaps in this table aren't allowed. This table is made up 
> of
> + * a UAPI structure. Non-negative values in the table will be validated 
> against
> + * the user's input. For example, if size_in is 0, and the user passed in 1, 
> it
> + * is an error.
> + */
> +#define CXL_CMD(_id, _flags, sin, sout, _name, _enable, op)  
>   \
> +   { 
>  \
> +   { .id = CXL_MEM_COMMAND_ID_##_id, 
>  \
> + .flags = CXL_MEM_COMMAND_FLAG_##_flags, 
>  \
> + .size_in = sin, 
>  \
> + .size_out = sout,   
>  \
> + .name = _name },
>  \
> +   .enable = _enable, .opcode = op   
>  \
> +   }

Seems the ordinality requirement could be dropped if the definition was:

#define CXL_CMD(_id, _flags, sin, sout, _name, _enable, op)\
   [CXL_MEM_COMMAND_ID_##_id] = {
 \
   { .id = CXL_MEM_COMMAND_ID_##_id,  \
...

Then command 0 and 42 could be defined out of order in the table.
Especially if we need to config-disable or deprecate commands, I think
it would be useful if this table was tolerant to being sparse.

> +
> +/**
> + * struct cxl_mem_command - Driver representation of a memory device command
> + * @info: Command information as it exists for the UAPI
> + * @opcode: The actual bits used for the mailbox protocol
> + * @enable: Whether the command is enabled. The driver may support a large 
> set
> + * of commands that may not be enabled. The primary reason a command
> + * would not be enabled is for commands that are specified as 
> optional
> + * and the hardware doesn't support the command.
> + *
> + * The cxl_mem_command is the driver's internal representation of commands 
> that
> + * are supported by the driver. Some of these commands may not be supported 
> by
> + * the hardware (!@enable). The driver will use @info to validate the fields
> + * passed in by the user then submit the @opcode to the hardware.
> + *
> + * See struct cxl_command_info.
> + */
> +struct cxl_mem_command {
> +   const struct cxl_command_info info;
> +   const u16 opcode;
> +   bool enable;
> +};
> +
> +static struct cxl_mem_command mem_commands[] = {
> +   CXL_CMD(INVALID, NONE, 0, 0, "Reserved", false, 0),
> +};
> +
>  static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
>  {
> const int timeout = msecs_to_jiffies(2000);
> @@ -268,8 +312,53 @@ static int cxl_mem_open(struct inode *inode, struct file 
> *file)
> return 0;
>  }
>
> +static int cxl_mem_count_commands(void)
> +{
> +   int i, n = 0;
> +
> +   for (i = 0; i < ARRAY_SIZE(mem_commands); i++) {
> +   struct 

Re: [RFC PATCH 12/14] cxl: Add basic debugging

2020-12-08 Thread Dan Williams
On Tue, Dec 8, 2020 at 4:24 PM Ben Widawsky  wrote:
>
> Provide a standard debug function for use throughout the driver.
>
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/cxl.h |  3 +++
>  drivers/cxl/mem.c | 26 +-
>  2 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 77c2dee6843c..e5afb89dab0b 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -9,6 +9,9 @@
>  #include 
>  #include 
>
> +#define cxl_debug(fmt, ...)  
>   \
> +   pr_debug("CXL DEBUG: %s: " fmt, __func__, ##__VA_ARGS__)
> +

This should be dev_dbg(), then you don't need the CXL DEBUG prefix. In
fact you don't need a cxl_debug() macro at all in that case. cxl_mem
might need a ->dev attribute for this purpose.

>  #define CXL_SET_FIELD(value, field)  
>   \
> ({
>  \
> WARN_ON(!FIELD_FIT(field##_MASK, value)); 
>  \
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index a2cea7ac7cc6..6b2f8d3776b5 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -122,9 +122,12 @@ static int cxl_mem_wait_for_doorbell(struct cxl_mem 
> *cxlm)
>  {
> const int timeout = msecs_to_jiffies(2000);
> const unsigned long start = jiffies;
> +   unsigned long end = start;
>
> while (cxl_doorbell_busy(cxlm)) {
> -   if (time_after(jiffies, start + timeout)) {
> +   end = jiffies;
> +
> +   if (time_after(end, start + timeout)) {
> /* Check again in case preempted before timeout test 
> */
> if (!cxl_doorbell_busy(cxlm))
> break;
> @@ -133,6 +136,8 @@ static int cxl_mem_wait_for_doorbell(struct cxl_mem *cxlm)
> cpu_relax();
> }
>
> +   cxl_debug("Doorbell wait took %dms",
> + jiffies_to_msecs(end) - jiffies_to_msecs(start));
> return 0;
>  }
>
> @@ -180,6 +185,8 @@ static int cxl_mem_mbox_send_cmd(struct cxl_mem *cxlm,
> }
>
> /* #4 */
> +   cxl_debug("Sending command to %s\n",
> + dev_driver_string(>pdev->dev));

dev_dbg() already includes dev_driver_string().

> cxl_write_mbox_reg32(cxlm, CXLDEV_MB_CTRL_OFFSET,
>  CXLDEV_MB_CTRL_DOORBELL);
>
> @@ -308,6 +315,8 @@ static int cxl_mem_open(struct inode *inode, struct file 
> *file)
> if (!cxlmd)
> return -ENXIO;
>
> +   cxl_debug("Opened %pD\n", file);
> +
> file->private_data = cxlmd;
>
> return 0;
> @@ -383,6 +392,10 @@ static int handle_mailbox_cmd_from_user(struct 
> cxl_memdev *cxlmd,
> .size_in = cmd->info.size_in,
> .size_out = size_out,
> };
> +   cxl_debug("Submitting command for user\n"
> + "\topcode: %x\n"
> + "\tsize: %zub/%zub\n",
> + mbox_cmd.opcode, mbox_cmd.size_in, mbox_cmd.size_out);
> rc = cxl_mem_mbox_send_cmd(cxlmd->cxlm, _cmd);
> cxl_mem_mbox_put(cxlmd->cxlm);
> if (rc)
> @@ -479,6 +492,8 @@ static long cxl_mem_ioctl(struct file *file, unsigned int 
> cmd, unsigned long arg
> u32 n_commands;
> int i, j;
>
> +   cxl_debug("Query IOCTL\n");
> +
> if (get_user(n_commands, (u32 __user *)arg))
> return -EFAULT;
>
> @@ -511,6 +526,8 @@ static long cxl_mem_ioctl(struct file *file, unsigned int 
> cmd, unsigned long arg
> struct cxl_mem_command c;
> int rc;
>
> +   cxl_debug("Send IOCTL\n");
> +
> rc = cxl_validate_cmd_from_user(u, );
> if (rc)
> return rc;
> @@ -843,6 +860,13 @@ static int cxl_mem_identify(struct cxl_mem *cxlm)
>
> id = (struct cxl_mbox_identify *)mbox_cmd.payload;
>
> +   cxl_debug("Driver identify command\n"
> + "\tFirmware Version: %s\n"
> + "\tTotal Capacity: %llu (%llu persistent)\n"
> + "\tLSA size: %u\n",
> + id->fw_revision, id->total_capacity, 
> id->persistent_capacity,
> + id->lsa_size);
> +

Seems not necessary for details that are published in sysfs?


Re: [PATCH V2 2/2] mm/highmem: Lift memcpy_[to|from]_page to core

2020-12-08 Thread Dan Williams
On Tue, Dec 8, 2020 at 2:49 PM Darrick J. Wong  wrote:
[..]
> > So what's your preferred poison?
> >
> > 1. Corrupt random data in whatever's been mapped into the next page (which
> >is what the helpers currently do)
>
> Please no.

My assertion is that the kernel can't know it's corruption, it can
only know that the driver is abusing the API. So over-copy and WARN
seems better than violently regress by crashing what might have been
working silently before.


Re: [PATCH V2 2/2] mm/highmem: Lift memcpy_[to|from]_page to core

2020-12-08 Thread Dan Williams
On Tue, Dec 8, 2020 at 1:51 PM Matthew Wilcox  wrote:
>
> On Tue, Dec 08, 2020 at 01:32:55PM -0800, Ira Weiny wrote:
> > On Mon, Dec 07, 2020 at 03:49:55PM -0800, Dan Williams wrote:
> > > On Mon, Dec 7, 2020 at 3:40 PM Matthew Wilcox  wrote:
> > > >
> > > > On Mon, Dec 07, 2020 at 03:34:44PM -0800, Dan Williams wrote:
> > > > > On Mon, Dec 7, 2020 at 3:27 PM Matthew Wilcox  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Dec 07, 2020 at 02:57:03PM -0800, ira.we...@intel.com wrote:
> > > > > > > +static inline void memcpy_page(struct page *dst_page, size_t 
> > > > > > > dst_off,
> > > > > > > +struct page *src_page, size_t 
> > > > > > > src_off,
> > > > > > > +size_t len)
> > > > > > > +{
> > > > > > > + char *dst = kmap_local_page(dst_page);
> > > > > > > + char *src = kmap_local_page(src_page);
> > > > > >
> > > > > > I appreciate you've only moved these, but please add:
> > > > > >
> > > > > > BUG_ON(dst_off + len > PAGE_SIZE || src_off + len > 
> > > > > > PAGE_SIZE);
> > > > >
> > > > > I imagine it's not outside the realm of possibility that some driver
> > > > > on CONFIG_HIGHMEM=n is violating this assumption and getting away with
> > > > > it because kmap_atomic() of contiguous pages "just works (TM)".
> > > > > Shouldn't this WARN rather than BUG so that the user can report the
> > > > > buggy driver and not have a dead system?
> > > >
> > > > As opposed to (on a HIGHMEM=y system) silently corrupting data that
> > > > is on the next page of memory?
> > >
> > > Wouldn't it fault in HIGHMEM=y case? I guess not necessarily...
> > >
> > > > I suppose ideally ...
> > > >
> > > > if (WARN_ON(dst_off + len > PAGE_SIZE))
> > > > len = PAGE_SIZE - dst_off;
> > > > if (WARN_ON(src_off + len > PAGE_SIZE))
> > > > len = PAGE_SIZE - src_off;
> > > >
> > > > and then we just truncate the data of the offending caller instead of
> > > > corrupting innocent data that happens to be adjacent.  Although that's
> > > > not ideal either ... I dunno, what's the least bad poison to drink here?
> > >
> > > Right, if the driver was relying on "corruption" for correct operation.
> > >
> > > If corruption actual were happening in practice wouldn't there have
> > > been screams by now? Again, not necessarily...
> > >
> > > At least with just plain WARN the kernel will start screaming on the
> > > user's behalf, and if it worked before it will keep working.
> >
> > So I decided to just sleep on this because I was recently told to not 
> > introduce
> > new WARN_ON's[1]
> >
> > I don't think that truncating len is worth the effort.  The conversions 
> > being
> > done should all 'work'  At least corrupting users data in the same way as it
> > used to...  ;-)  I'm ok with adding the WARN_ON's and I have modified the 
> > patch
> > to do so while I work through the 0-day issues.  (not sure what is going on
> > there.)
> >
> > However, are we ok with adding the WARN_ON's given what Greg KH told me?  
> > This
> > is a bit more critical than the PKS API in that it could result in corrupt
> > data.
>
> zero_user_segments contains:
>
> BUG_ON(end1 > page_size(page) || end2 > page_size(page));
>
> These should be consistent.  I think we've demonstrated that there is
> no good option here.

True, but these helpers are being deployed to many new locations where
they were not used before.


Re: [PATCH V2 2/2] mm/highmem: Lift memcpy_[to|from]_page to core

2020-12-08 Thread Dan Williams
On Tue, Dec 8, 2020 at 1:33 PM Ira Weiny  wrote:
>
> On Mon, Dec 07, 2020 at 03:49:55PM -0800, Dan Williams wrote:
> > On Mon, Dec 7, 2020 at 3:40 PM Matthew Wilcox  wrote:
> > >
> > > On Mon, Dec 07, 2020 at 03:34:44PM -0800, Dan Williams wrote:
> > > > On Mon, Dec 7, 2020 at 3:27 PM Matthew Wilcox  
> > > > wrote:
> > > > >
> > > > > On Mon, Dec 07, 2020 at 02:57:03PM -0800, ira.we...@intel.com wrote:
> > > > > > +static inline void memcpy_page(struct page *dst_page, size_t 
> > > > > > dst_off,
> > > > > > +struct page *src_page, size_t src_off,
> > > > > > +size_t len)
> > > > > > +{
> > > > > > + char *dst = kmap_local_page(dst_page);
> > > > > > + char *src = kmap_local_page(src_page);
> > > > >
> > > > > I appreciate you've only moved these, but please add:
> > > > >
> > > > > BUG_ON(dst_off + len > PAGE_SIZE || src_off + len > 
> > > > > PAGE_SIZE);
> > > >
> > > > I imagine it's not outside the realm of possibility that some driver
> > > > on CONFIG_HIGHMEM=n is violating this assumption and getting away with
> > > > it because kmap_atomic() of contiguous pages "just works (TM)".
> > > > Shouldn't this WARN rather than BUG so that the user can report the
> > > > buggy driver and not have a dead system?
> > >
> > > As opposed to (on a HIGHMEM=y system) silently corrupting data that
> > > is on the next page of memory?
> >
> > Wouldn't it fault in HIGHMEM=y case? I guess not necessarily...
> >
> > > I suppose ideally ...
> > >
> > > if (WARN_ON(dst_off + len > PAGE_SIZE))
> > > len = PAGE_SIZE - dst_off;
> > > if (WARN_ON(src_off + len > PAGE_SIZE))
> > > len = PAGE_SIZE - src_off;
> > >
> > > and then we just truncate the data of the offending caller instead of
> > > corrupting innocent data that happens to be adjacent.  Although that's
> > > not ideal either ... I dunno, what's the least bad poison to drink here?
> >
> > Right, if the driver was relying on "corruption" for correct operation.
> >
> > If corruption actual were happening in practice wouldn't there have
> > been screams by now? Again, not necessarily...
> >
> > At least with just plain WARN the kernel will start screaming on the
> > user's behalf, and if it worked before it will keep working.
>
> So I decided to just sleep on this because I was recently told to not 
> introduce
> new WARN_ON's[1]
>
> I don't think that truncating len is worth the effort.  The conversions being
> done should all 'work'  At least corrupting users data in the same way as it
> used to...  ;-)  I'm ok with adding the WARN_ON's and I have modified the 
> patch
> to do so while I work through the 0-day issues.  (not sure what is going on
> there.)
>
> However, are we ok with adding the WARN_ON's given what Greg KH told me?  This
> is a bit more critical than the PKS API in that it could result in corrupt
> data.

That situation was a bit different in my mind because the default
fallback stub path has typically never had WARN_ON even if it's never
supposed to be called.


Re: [PATCH V2 2/2] mm/highmem: Lift memcpy_[to|from]_page to core

2020-12-07 Thread Dan Williams
On Mon, Dec 7, 2020 at 3:40 PM Matthew Wilcox  wrote:
>
> On Mon, Dec 07, 2020 at 03:34:44PM -0800, Dan Williams wrote:
> > On Mon, Dec 7, 2020 at 3:27 PM Matthew Wilcox  wrote:
> > >
> > > On Mon, Dec 07, 2020 at 02:57:03PM -0800, ira.we...@intel.com wrote:
> > > > +static inline void memcpy_page(struct page *dst_page, size_t dst_off,
> > > > +struct page *src_page, size_t src_off,
> > > > +size_t len)
> > > > +{
> > > > + char *dst = kmap_local_page(dst_page);
> > > > + char *src = kmap_local_page(src_page);
> > >
> > > I appreciate you've only moved these, but please add:
> > >
> > > BUG_ON(dst_off + len > PAGE_SIZE || src_off + len > PAGE_SIZE);
> >
> > I imagine it's not outside the realm of possibility that some driver
> > on CONFIG_HIGHMEM=n is violating this assumption and getting away with
> > it because kmap_atomic() of contiguous pages "just works (TM)".
> > Shouldn't this WARN rather than BUG so that the user can report the
> > buggy driver and not have a dead system?
>
> As opposed to (on a HIGHMEM=y system) silently corrupting data that
> is on the next page of memory?

Wouldn't it fault in HIGHMEM=y case? I guess not necessarily...

> I suppose ideally ...
>
> if (WARN_ON(dst_off + len > PAGE_SIZE))
> len = PAGE_SIZE - dst_off;
> if (WARN_ON(src_off + len > PAGE_SIZE))
> len = PAGE_SIZE - src_off;
>
> and then we just truncate the data of the offending caller instead of
> corrupting innocent data that happens to be adjacent.  Although that's
> not ideal either ... I dunno, what's the least bad poison to drink here?

Right, if the driver was relying on "corruption" for correct operation.

If corruption actual were happening in practice wouldn't there have
been screams by now? Again, not necessarily...

At least with just plain WARN the kernel will start screaming on the
user's behalf, and if it worked before it will keep working.


Re: [PATCH V2 2/2] mm/highmem: Lift memcpy_[to|from]_page to core

2020-12-07 Thread Dan Williams
On Mon, Dec 7, 2020 at 3:27 PM Matthew Wilcox  wrote:
>
> On Mon, Dec 07, 2020 at 02:57:03PM -0800, ira.we...@intel.com wrote:
> > +static inline void memcpy_page(struct page *dst_page, size_t dst_off,
> > +struct page *src_page, size_t src_off,
> > +size_t len)
> > +{
> > + char *dst = kmap_local_page(dst_page);
> > + char *src = kmap_local_page(src_page);
>
> I appreciate you've only moved these, but please add:
>
> BUG_ON(dst_off + len > PAGE_SIZE || src_off + len > PAGE_SIZE);

I imagine it's not outside the realm of possibility that some driver
on CONFIG_HIGHMEM=n is violating this assumption and getting away with
it because kmap_atomic() of contiguous pages "just works (TM)".
Shouldn't this WARN rather than BUG so that the user can report the
buggy driver and not have a dead system?


Re: [RFC V2 00/37] Enhance memory utilization with DMEMFS

2020-12-07 Thread Dan Williams
On Mon, Dec 7, 2020 at 4:03 AM David Hildenbrand  wrote:
>
> On 07.12.20 12:30, yulei.ker...@gmail.com wrote:
> > From: Yulei Zhang 
> >
> > In current system each physical memory page is assocaited with
> > a page structure which is used to track the usage of this page.
> > But due to the memory usage rapidly growing in cloud environment,
> > we find the resource consuming for page structure storage becomes
> > more and more remarkable. So is it possible that we could reclaim
> > such memory and make it reusable?
> >
> > This patchset introduces an idea about how to save the extra
> > memory through a new virtual filesystem -- dmemfs.
> >
> > Dmemfs (Direct Memory filesystem) is device memory or reserved
> > memory based filesystem. This kind of memory is special as it
> > is not managed by kernel and most important it is without 'struct page'.
> > Therefore we can leverage the extra memory from the host system
> > to support more tenants in our cloud service.
>
> "is not managed by kernel" well, it's obviously is managed by the
> kernel. It's not managed by the buddy ;)
>
> How is this different to using "mem=X" and mapping the relevant memory
> directly into applications? Is this "simply" a control instance on top
> that makes sure unprivileged process can access it and not step onto
> each others feet? Is that the reason why it's called  a "file system"?
> (an example would have helped here, showing how it's used)
>
> It's worth noting that memory hotunplug, memory poisoning and probably
> more is currently fundamentally incompatible with this approach - which
> should better be pointed out in the cover letter.
>
> Also, I think something similar can be obtained by using dax/hmat
> infrastructure with "memmap=", at least I remember a talk where this was
> discussed (but not sure if they modified the firmware to expose selected
> memory as soft-reserved - we would only need a cmdline parameter to
> achieve the same - Dan might know more).

There is currently the efi_fake_mem parameter that can add the
"EFI_MEMORY_SP" attribute on EFI platforms:

efi_fake_mem=4G@9G:0x4

...this results in a /dev/dax instance that can be further partitioned
via the device-dax sub-division facility merged for 5.10. That could
be generalized to something else for non-EFI platforms, but there has
not been a justification to go that route yet.

Joao pointed this out in a previous posting of DMEMFS, and I have yet
to see an explanation of incremental benefit the kernel gains from
having yet another parallel memory management interface.


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-05 Thread Dan Williams
On Sat, Dec 5, 2020 at 4:24 PM David Ahern  wrote:
>
> On 12/2/20 5:54 PM, Dan Williams wrote:
> > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> > index 8d7001712062..040be48ce046 100644
> > --- a/drivers/base/Kconfig
> > +++ b/drivers/base/Kconfig
> > @@ -1,6 +1,9 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  menu "Generic Driver Options"
> >
> > +config AUXILIARY_BUS
> > + bool
> > +
> >  config UEVENT_HELPER
> >   bool "Support for uevent helper"
> >   help
>
> Missing a description and without it does not appear in menuconfig or in
> the config file.
>
> Could use a blurb in the help as well.

It doesn't have a description or help because it is a select-only
symbol, but a comment to that effect and a pointer to the
documentation would help.


Re: [RFC PATCH 0/9] CXL 2.0 Support

2020-12-04 Thread Dan Williams
On Fri, Dec 4, 2020 at 11:26 AM Verma, Vishal L
 wrote:
>
> On Fri, 2020-12-04 at 10:12 -0800, Ben Widawsky wrote:
> > Hi Chris.
> >
> > On 20-12-04 12:40:03, Chris Browy wrote:
> [..]
> >
> > >acpidump indicates the CXL0 and CXLM devices but no SRAT or HMAT 
> > > tables are
> > >in the dump which is curious.
> >
> > I don't typically use HMAT, but I do have an SRAT in mine, so that's 
> > strange.
> > You should also have a CEDT.
> >
> I suspect an SRAT is only added if you have distinct numa nodes. Adding
> a few '-numa node' bits to the qemu command line should be enough to
> make that happen.

For CXL-2.0-Type-3, BIOS is responsible for retrieving CDATs and
synthesizing SRAT/SLIT/HMAT tables for the CXL.mem that is mapped by
platform firmware. For CXL.mem that is mapped by the OS, there is no
requirement to publish updated ACPI tables. CXL.mem mapped by the OS
need only support native CXL memory enumeration and leave ACPI only
for static platform resources.


Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-04 Thread Dan Williams
On Fri, Dec 4, 2020 at 3:41 AM Greg KH  wrote:
>
> On Wed, Dec 02, 2020 at 04:54:24PM -0800, Dan Williams wrote:
> > From: Dave Ertman 
> >
> > Add support for the Auxiliary Bus, auxiliary_device and auxiliary_driver.
> > It enables drivers to create an auxiliary_device and bind an
> > auxiliary_driver to it.
> >
> > The bus supports probe/remove shutdown and suspend/resume callbacks.
> > Each auxiliary_device has a unique string based id; driver binds to
> > an auxiliary_device based on this id through the bus.
> >
> > Co-developed-by: Kiran Patil 
> > Co-developed-by: Ranjani Sridharan 
> > Co-developed-by: Fred Oh 
> > Co-developed-by: Leon Romanovsky 
> > Signed-off-by: Kiran Patil 
> > Signed-off-by: Ranjani Sridharan 
> > Signed-off-by: Fred Oh 
> > Signed-off-by: Leon Romanovsky 
> > Signed-off-by: Dave Ertman 
> > Reviewed-by: Pierre-Louis Bossart 
> > Reviewed-by: Shiraz Saleem 
> > Reviewed-by: Parav Pandit 
> > Reviewed-by: Dan Williams 
> > Reviewed-by: Martin Habets 
> > Link: 
> > https://lore.kernel.org/r/20201113161859.1775473-2-david.m.ert...@intel.com
> > Signed-off-by: Dan Williams 
> > ---
> > This patch is "To:" the maintainers that have a pending backlog of
> > driver updates dependent on this facility, and "Cc:" Greg. Greg, I
> > understand you have asked for more time to fully review this and apply
> > it to driver-core.git, likely for v5.12, but please consider Acking it
> > for v5.11 instead. It looks good to me and several other stakeholders.
> > Namely, stakeholders that have pressure building up behind this facility
> > in particular Mellanox RDMA, but also SOF, Intel Ethernet, and later on
> > Compute Express Link.
> >
> > I will take the blame for the 2 months of silence that made this awkward
> > to take through driver-core.git, but at the same time I do not want to
> > see that communication mistake inconvenience other parties that
> > reasonably thought this was shaping up to land in v5.11.
> >
> > I am willing to host this version at:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux 
> > tags/auxiliary-bus-for-5.11
> >
> > ...for all the independent drivers to have a common commit baseline. It
> > is not there yet pending Greg's Ack.
> >
> > For example implementations incorporating this patch, see Dave Ertman's
> > SOF series:
> >
> > https://lore.kernel.org/r/20201113161859.1775473-2-david.m.ert...@intel.com
> >
> > ...and Leon's mlx5 series:
> >
> > http://lore.kernel.org/r/20201026111849.1035786-1-l...@kernel.org
> >
> > PS: Greg I know I promised some review on newcomer patches to help with
> > your queue, unfortunately Intel-internal review is keeping my plate
> > full. Again, I do not want other stakeholder to be waiting on me to
> > resolve that backlog.
>
> Ok, I spent some hours today playing around with this.  I wrote up a
> small test-patch for this (how did anyone test this thing???) and while
> it feels awkward in places, and it feels like there is still way too
> much "boilerplate" code that a user has to write and manage, I don't
> have the time myself to fix it up right now.
>
> So I'll go apply this to my tree, and provide a tag for everyone else to
> be able to pull from for their different development trees so they can
> work on.
>
> I do have 3 follow-on patches that I will send to the list in response
> to this message that I will be applying on top of this patch.  They do
> some minor code formatting changes, as well as change the return type of
> the remove function to make it more future-proof.  That last change will
> require users of this code to change their implementations, but it will
> be obvious what to do as you will get a build warning.
>
> Note, I'm still not comfortable with a few things here.  The
> documentation feels odd, and didn't really help me out in writing any
> test code, which doesn't seem right.  Also the use of strings and '.' as
> part of the api feels awkward, and messy, and of course, totally
> undocumented.
>
> But, as the use of '.' is undocumented, that means we can change it in
> the future!  Because no driver or device name should ever be a user api
> reliant thing, if we come up with a better way to do all of this in the
> future, that shouldn't be a problem to change existing users over to
> this.  So this is a warning to everyone, you CAN NOT depend on the sysfs
> name of a device or bus name for any tool.  If so, your userspace tool
> is broken.
>
> Thanks for everyone in sticking with th

Re: [RFC PATCH 5/9] cxl/mem: Find device capabilities

2020-12-03 Thread Dan Williams
On Tue, Nov 10, 2020 at 9:44 PM Ben Widawsky  wrote:
>
> CXL devices contain an array of capabilities that describe the
> interactions software can interact with the device, or firmware running
> on the device. A CXL compliant device must implement the device status
> and the mailbox capability. A CXL compliant memory device must implement
> the memory device capability.
>
> Each of the capabilities can [will] provide an offset within the MMIO
> region for interacting with the CXL device.
>
> Signed-off-by: Ben Widawsky 
> ---
>  drivers/cxl/cxl.h | 89 +++
>  drivers/cxl/mem.c | 58 +++---
>  2 files changed, 143 insertions(+), 4 deletions(-)
>  create mode 100644 drivers/cxl/cxl.h
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> new file mode 100644
> index ..02858ae63d6d
> --- /dev/null
> +++ b/drivers/cxl/cxl.h
[..]
> +static inline u32 __cxl_raw_read_reg32(struct cxl_mem *cxlm, u32 reg)

Going through my reworks and the "raw" jumped out at me. My typical
interpretation of "raw" in respect to register access macros is the
difference between readl() and __raw_readl()  which means "don't do
bus endian swizzling, and don't do a memory clobber barrier". Any
heartburn to drop the "raw"?

...is it only me that reacts that way?


Re: [RFC PATCH 5/9] cxl/mem: Find device capabilities

2020-12-03 Thread Dan Williams
On Wed, Nov 25, 2020 at 10:06 PM Jon Masters  wrote:
>
> On 11/11/20 12:43 AM, Ben Widawsky wrote:
>
> > + case CXL_CAPABILITIES_CAP_ID_SECONDARY_MAILBOX:
> > + dev_dbg(>pdev->dev,
> > +"found UNSUPPORTED Secondary Mailbox 
> > capability\n");
>
> Per spec, the secondary mailbox is intended for use by platform
> firmware, so Linux should never be using it anyway. Maybe that message
> is slightly misleading?
>
> Jon.
>
> P.S. Related - I've severe doubts about the mailbox approach being
> proposed by CXL and have begun to push back through the spec org.

The more Linux software voices the better. At the same time the spec
is released so we're into xkcd territory [1] of what the driver will
be expected to support for any future improvements.

[1]: https://xkcd.com/927/


Re: [RFC PATCH 3/9] cxl/mem: Add a driver for the type-3 mailbox

2020-12-03 Thread Dan Williams
On Thu, Dec 3, 2020 at 11:22 PM Dan Williams  wrote:
>
> On Tue, Nov 17, 2020 at 6:50 AM Jonathan Cameron
>  wrote:
> >
> > On Tue, 10 Nov 2020 21:43:50 -0800
> > Ben Widawsky  wrote:
> >
> > > From: Dan Williams 
> > >
> > > The CXL.mem protocol allows a device to act as a provider of "System
> > > RAM" and/or "Persistent Memory" that is fully coherent as if the memory
> > > was attached to the typical CPU memory controller.
> > >
> > > The memory range exported by the device may optionally be described by
> > > the platform firmware memory map, or by infrastructure like LIBNVDIMM to
> > > provision persistent memory capacity from one, or more, CXL.mem devices.
> > >
> > > A pre-requisite for Linux-managed memory-capacity provisioning is this
> > > cxl_mem driver that can speak the "type-3 mailbox" protocol.
> > >
> > > For now just land the driver boiler-plate and fill it in with
> > > functionality in subsequent commits.
> > >
> > > Signed-off-by: Dan Williams 
> > > Signed-off-by: Ben Widawsky 
> >
> > I've tried to avoid repeats, so mostly this is me moaning about naming!
> >
> > Jonathan
> >
> > > ---
> > >  drivers/cxl/Kconfig  | 20 +++
> > >  drivers/cxl/Makefile |  2 ++
> > >  drivers/cxl/mem.c| 82 
> > >  drivers/cxl/pci.h| 15 
> > >  4 files changed, 119 insertions(+)
> > >  create mode 100644 drivers/cxl/mem.c
> > >  create mode 100644 drivers/cxl/pci.h
> > >
> > > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > > index dd724bd364df..15548f5c77ff 100644
> > > --- a/drivers/cxl/Kconfig
> > > +++ b/drivers/cxl/Kconfig
> > > @@ -27,4 +27,24 @@ config CXL_ACPI
> > > resources described by the CEDT (CXL Early Discovery Table)
> > >
> > > Say 'y' to enable CXL (Compute Express Link) drivers.
> > > +
> > > +config CXL_MEM
> > > +tristate "CXL.mem Device Support"
> > > +depends on PCI && CXL_BUS_PROVIDER != n
> > > +default m if CXL_BUS_PROVIDER
> > > +help
> > > +  The CXL.mem protocol allows a device to act as a provider of
> > > +  "System RAM" and/or "Persistent Memory" that is fully coherent
> > > +  as if the memory was attached to the typical CPU memory
> > > +  controller.
> > > +
> > > +  Say 'y/m' to enable a driver named "cxl_mem.ko" that will 
> > > attach
> > > +  to CXL.mem devices for configuration, provisioning, and health
> > > +  monitoring, the so called "type-3 mailbox". Note, this driver
> > > +  is required for dynamic provisioning of CXL.mem attached
> > > +  memory, a pre-requisite for persistent memory support, but
> > > +  devices that provide volatile memory may be fully described by
> > > +  existing platform firmware memory enumeration.
> > > +
> > > +  If unsure say 'n'.
> > >  endif
> > > diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
> > > index d38cd34a2582..97fdffb00f2d 100644
> > > --- a/drivers/cxl/Makefile
> > > +++ b/drivers/cxl/Makefile
> > > @@ -1,5 +1,7 @@
> > >  # SPDX-License-Identifier: GPL-2.0
> > >  obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
> > > +obj-$(CONFIG_CXL_MEM) += cxl_mem.o
> > >
> > >  ccflags-y += -DDEFAULT_SYMBOL_NAMESPACE=CXL
> > >  cxl_acpi-y := acpi.o
> > > +cxl_mem-y := mem.o
> > > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > > new file mode 100644
> > > index ..aa7d881fa47b
> > > --- /dev/null
> > > +++ b/drivers/cxl/mem.c
> > > @@ -0,0 +1,82 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +// Copyright(c) 2020 Intel Corporation. All rights reserved.
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include "acpi.h"
> > > +#include "pci.h"
> > > +
> > > +struct cxl_mem {
> > > + void __iomem *regs;
> > > +};
> > > +
> > > +static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
> > > +{
> > > + int pos;
> > > +
> > > + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DVSEC);
> 

Re: [RFC PATCH 3/9] cxl/mem: Add a driver for the type-3 mailbox

2020-12-03 Thread Dan Williams
On Tue, Nov 17, 2020 at 6:50 AM Jonathan Cameron
 wrote:
>
> On Tue, 10 Nov 2020 21:43:50 -0800
> Ben Widawsky  wrote:
>
> > From: Dan Williams 
> >
> > The CXL.mem protocol allows a device to act as a provider of "System
> > RAM" and/or "Persistent Memory" that is fully coherent as if the memory
> > was attached to the typical CPU memory controller.
> >
> > The memory range exported by the device may optionally be described by
> > the platform firmware memory map, or by infrastructure like LIBNVDIMM to
> > provision persistent memory capacity from one, or more, CXL.mem devices.
> >
> > A pre-requisite for Linux-managed memory-capacity provisioning is this
> > cxl_mem driver that can speak the "type-3 mailbox" protocol.
> >
> > For now just land the driver boiler-plate and fill it in with
> > functionality in subsequent commits.
> >
> > Signed-off-by: Dan Williams 
> > Signed-off-by: Ben Widawsky 
>
> I've tried to avoid repeats, so mostly this is me moaning about naming!
>
> Jonathan
>
> > ---
> >  drivers/cxl/Kconfig  | 20 +++
> >  drivers/cxl/Makefile |  2 ++
> >  drivers/cxl/mem.c| 82 
> >  drivers/cxl/pci.h| 15 
> >  4 files changed, 119 insertions(+)
> >  create mode 100644 drivers/cxl/mem.c
> >  create mode 100644 drivers/cxl/pci.h
> >
> > diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> > index dd724bd364df..15548f5c77ff 100644
> > --- a/drivers/cxl/Kconfig
> > +++ b/drivers/cxl/Kconfig
> > @@ -27,4 +27,24 @@ config CXL_ACPI
> > resources described by the CEDT (CXL Early Discovery Table)
> >
> > Say 'y' to enable CXL (Compute Express Link) drivers.
> > +
> > +config CXL_MEM
> > +tristate "CXL.mem Device Support"
> > +depends on PCI && CXL_BUS_PROVIDER != n
> > +default m if CXL_BUS_PROVIDER
> > +help
> > +  The CXL.mem protocol allows a device to act as a provider of
> > +  "System RAM" and/or "Persistent Memory" that is fully coherent
> > +  as if the memory was attached to the typical CPU memory
> > +  controller.
> > +
> > +  Say 'y/m' to enable a driver named "cxl_mem.ko" that will attach
> > +  to CXL.mem devices for configuration, provisioning, and health
> > +  monitoring, the so called "type-3 mailbox". Note, this driver
> > +  is required for dynamic provisioning of CXL.mem attached
> > +  memory, a pre-requisite for persistent memory support, but
> > +  devices that provide volatile memory may be fully described by
> > +  existing platform firmware memory enumeration.
> > +
> > +  If unsure say 'n'.
> >  endif
> > diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
> > index d38cd34a2582..97fdffb00f2d 100644
> > --- a/drivers/cxl/Makefile
> > +++ b/drivers/cxl/Makefile
> > @@ -1,5 +1,7 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
> > +obj-$(CONFIG_CXL_MEM) += cxl_mem.o
> >
> >  ccflags-y += -DDEFAULT_SYMBOL_NAMESPACE=CXL
> >  cxl_acpi-y := acpi.o
> > +cxl_mem-y := mem.o
> > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > new file mode 100644
> > index ..aa7d881fa47b
> > --- /dev/null
> > +++ b/drivers/cxl/mem.c
> > @@ -0,0 +1,82 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +// Copyright(c) 2020 Intel Corporation. All rights reserved.
> > +#include 
> > +#include 
> > +#include 
> > +#include "acpi.h"
> > +#include "pci.h"
> > +
> > +struct cxl_mem {
> > + void __iomem *regs;
> > +};
> > +
> > +static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
> > +{
> > + int pos;
> > +
> > + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DVSEC);
> > + if (!pos)
> > + return 0;
> > +
> > + while (pos) {
> > + u16 vendor, id;
> > +
> > + pci_read_config_word(pdev, pos + PCI_DVSEC_VENDOR_OFFSET, 
> > );
> > + pci_read_config_word(pdev, pos + PCI_DVSEC_ID_OFFSET, );
> > + if (vendor == PCI_DVSEC_VENDOR_CXL && dvsec == id)
> > + return pos;
> > +
> > + pos = pci_find_next_ext_capability(pdev, pos, 
> > PCI_EXT_CAP_ID_DVSEC);
>
> Thi

Re: [resend/standalone PATCH v4] Add auxiliary bus support

2020-12-03 Thread Dan Williams
On Thu, Dec 3, 2020 at 6:34 PM Jason Gunthorpe  wrote:
>
> On Thu, Dec 03, 2020 at 04:06:24PM +0100, Greg KH wrote:
>
> > > ...for all the independent drivers to have a common commit baseline. It
> > > is not there yet pending Greg's Ack.
> >
> > I have been trying to carve out some time to review this.  At my initial
> > glance, I still have objections, so please, give me a few more days to
> > get this done...
>
> There are still several more days till the merge window, but I am
> going to ask Leon to get the mlx5 series, and this version of the
> auxbus patch it depends on, into linux-next with the intention to
> forward it to Linus if there are no substantive comments.
>
> Regardless of fault or reason this whole 1.5 year odyssey seems to have
> brought misery to everyone involved and it really is time to move on.
>
> Leon and his team did a good deed 6 weeks ago to quickly turn around
> and build another user example. For their efforts they have been
> rewarded with major merge conflicts and alot of delayed work due to
> the invasive nature of the mlx5 changes. To continue to push this out
> is disrespectful to him and his team's efforts.
>
> A major part of my time as RDMA maintainer has been to bring things
> away from vendor trees and into a common opensource community.  Intel
> shipping a large out of tree RDMA driver and abandoning their intree
> driver is really harmful. This auxbus is a substantial blocker to them
> normalizing their operations, thus I view it as important to
> resolve. Even after this it is going to take a long time and alot of
> effort to review their new RDMA driver.

When you have 3 independent driver teams (mlx5, i40e, sof) across 2
companies (NVIDIA and Intel), and multiple subsystem maintainers with
a positive track record of upstream engagement all agreeing on a piece
of infrastructure, I struggle to imagine a stronger case for merging.
I did get word of a fixup needed in the shutdown code, I'll get that
folded. Then, barring a concrete objection, I'll look to publish a
commit that others can pull in to start building soak time in -next
this time tomorrow.


[PATCH] x86/mm: Fix leak of pmd ptlock

2020-12-02 Thread Dan Williams
Commit 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
introduced a new location where a pmd was released, but neglected to run
the pmd page destructor. In fact, this happened previously for a
different pmd release path and was fixed by commit:

c283610e44ec ("x86, mm: do not leak page->ptl for pmd page tables").

This issue was hidden until recently because the failure mode is silent,
but commit:

b2b29d6d0119 ("mm: account PMD tables like PTE tables")

...turns the failure mode into this signature:

 BUG: Bad page state in process lt-pmem-ns  pfn:15943d
 page:7262ed7b refcount:0 mapcount:-1024 mapping: 
index:0x0 pfn:0x15943d
 flags: 0xa8()
 raw: 00a8 dead0100  
 raw:  913a029bcc08 fbff 
 page dumped because: nonzero mapcount
 [..]
  dump_stack+0x8b/0xb0
  bad_page.cold+0x63/0x94
  free_pcp_prepare+0x224/0x270
  free_unref_page+0x18/0xd0
  pud_free_pmd_page+0x146/0x160
  ioremap_pud_range+0xe3/0x350
  ioremap_page_range+0x108/0x160
  __ioremap_caller.constprop.0+0x174/0x2b0
  ? memremap+0x7a/0x110
  memremap+0x7a/0x110
  devm_memremap+0x53/0xa0
  pmem_attach_disk+0x4ed/0x530 [nd_pmem]
  ? __devm_release_region+0x52/0x80
  nvdimm_bus_probe+0x85/0x210 [libnvdimm]

Given this is a repeat occurrence it seemed prudent to look for other
places where this destructor might be missing and whether a better
helper is needed. try_to_free_pmd_page() looks like a candidate, but
testing with setting up and tearing down pmd mappings via the dax unit
tests is thus far not triggering the failure. As for a better helper
pmd_free() is close, but it is a messy fit due to requiring an @mm arg.
Also, ___pmd_free_tlb() wants to call paravirt_tlb_remove_table()
instead of free_page(), so open-coded pgtable_pmd_page_dtor() seems the
best way forward for now.

Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Cc: 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Co-debugged-by: Matthew Wilcox 
Tested-by: Yi Zhang 
Signed-off-by: Dan Williams 
---
 arch/x86/mm/pgtable.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index dfd82f51ba66..f6a9e2e36642 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -829,6 +829,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
}
 
free_page((unsigned long)pmd_sv);
+
+   pgtable_pmd_page_dtor(virt_to_page(pmd));
free_page((unsigned long)pmd);
 
return 1;



Re: [Ksummit-discuss] crediting bug reports and fixes folded into original patch

2020-12-02 Thread Dan Williams
On Wed, Dec 2, 2020 at 3:44 PM Vlastimil Babka  wrote:
>
> Hi,
>
> there was a bit of debate on Twitter about this, so I thought I would bring it
> here. Imagine a scenario where patch sits as a commit in -next and there's a 
> bug
> report or fix, possibly by a bot or with some static analysis. The maintainer
> decides to fold it into the original patch, which makes sense for e.g.
> bisectability. But there seem to be no clear rules about attribution in this
> case, which looks like there should be, probably in
> Documentation/maintainer/modifying-patches.rst
>
> The original bug fix might include a From: $author, a Reported-by: (e.g.
> syzbot), Fixes: $next-commit, some tag such as Addresses-Coverity: to credit 
> the
> static analysis tool, and an SoB. After folding, all that's left might be a 
> line
> as "include fix from $author" in the SoB area. This is a loss of
> metadata/attribution just due to folding, and might make contributors unhappy.
> Had they sent the fix after the original commit was mainline and immutable, 
> all
> the info above would "survive" in the form of new commit.
>
> So I think we could decide what the proper format would be, and document it
> properly. I personally wouldn't mind just copy/pasting the whole commit 
> message
> of the fix (with just a short issue description, no need to include 
> stacktraces
> etc if the fix is folded), we could just standardize where, and how to delimit
> it from the main commit message. If it's a report (person or bot) of a bug 
> that
> the main author then fixed, preserve the Reported-by in the same way (making
> clear it's not a Reported-By for the "main thing" addressed by the commit).
>
> In the debate one less verbose alternatve proposed was a SoB with comment
> describing it's for a fix and not whole patch, as some see SoB as the main 
> mark
> of contribution, that can be easily found and counted etc. I'm not so sure 
> about
> it myself, as AFAIK SoB is mainly a DCO thing, and for a maintainer it means
> something else ("passed through my tree") than for a patch author. And this
> approach would still lose the other tags.
>
> Thoughts?

How about a convention to add a Reported-by: and a Link: to the
incremental fixup discussion? It's just polite to credit helpful
feedback, not sure it needs a more formal process.

Along those lines, how is this situation different than the feedback
that helps improve a patch that does not necessarily get credited by
Reviewed-by:? Links to thank you notes in cover letters seems more
appealing than moving more review / fix logs into the main history.


[resend/standalone PATCH v4] Add auxiliary bus support

2020-12-02 Thread Dan Williams
From: Dave Ertman 

Add support for the Auxiliary Bus, auxiliary_device and auxiliary_driver.
It enables drivers to create an auxiliary_device and bind an
auxiliary_driver to it.

The bus supports probe/remove shutdown and suspend/resume callbacks.
Each auxiliary_device has a unique string based id; driver binds to
an auxiliary_device based on this id through the bus.

Co-developed-by: Kiran Patil 
Co-developed-by: Ranjani Sridharan 
Co-developed-by: Fred Oh 
Co-developed-by: Leon Romanovsky 
Signed-off-by: Kiran Patil 
Signed-off-by: Ranjani Sridharan 
Signed-off-by: Fred Oh 
Signed-off-by: Leon Romanovsky 
Signed-off-by: Dave Ertman 
Reviewed-by: Pierre-Louis Bossart 
Reviewed-by: Shiraz Saleem 
Reviewed-by: Parav Pandit 
Reviewed-by: Dan Williams 
Reviewed-by: Martin Habets 
Link: 
https://lore.kernel.org/r/20201113161859.1775473-2-david.m.ert...@intel.com
Signed-off-by: Dan Williams 
---
This patch is "To:" the maintainers that have a pending backlog of
driver updates dependent on this facility, and "Cc:" Greg. Greg, I
understand you have asked for more time to fully review this and apply
it to driver-core.git, likely for v5.12, but please consider Acking it
for v5.11 instead. It looks good to me and several other stakeholders.
Namely, stakeholders that have pressure building up behind this facility
in particular Mellanox RDMA, but also SOF, Intel Ethernet, and later on
Compute Express Link.

I will take the blame for the 2 months of silence that made this awkward
to take through driver-core.git, but at the same time I do not want to
see that communication mistake inconvenience other parties that
reasonably thought this was shaping up to land in v5.11.

I am willing to host this version at:

git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux 
tags/auxiliary-bus-for-5.11

...for all the independent drivers to have a common commit baseline. It
is not there yet pending Greg's Ack.

For example implementations incorporating this patch, see Dave Ertman's
SOF series:

https://lore.kernel.org/r/20201113161859.1775473-2-david.m.ert...@intel.com

...and Leon's mlx5 series:

http://lore.kernel.org/r/20201026111849.1035786-1-l...@kernel.org

PS: Greg I know I promised some review on newcomer patches to help with
your queue, unfortunately Intel-internal review is keeping my plate
full. Again, I do not want other stakeholder to be waiting on me to
resolve that backlog.

 Documentation/driver-api/auxiliary_bus.rst |  234 
 Documentation/driver-api/index.rst |1 
 drivers/base/Kconfig   |3 
 drivers/base/Makefile  |1 
 drivers/base/auxiliary.c   |  268 
 include/linux/auxiliary_bus.h  |   78 
 include/linux/mod_devicetable.h|8 +
 scripts/mod/devicetable-offsets.c  |3 
 scripts/mod/file2alias.c   |8 +
 9 files changed, 604 insertions(+)
 create mode 100644 Documentation/driver-api/auxiliary_bus.rst
 create mode 100644 drivers/base/auxiliary.c
 create mode 100644 include/linux/auxiliary_bus.h

diff --git a/Documentation/driver-api/auxiliary_bus.rst 
b/Documentation/driver-api/auxiliary_bus.rst
new file mode 100644
index ..5dd7804631ef
--- /dev/null
+++ b/Documentation/driver-api/auxiliary_bus.rst
@@ -0,0 +1,234 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+=
+Auxiliary Bus
+=
+
+In some subsystems, the functionality of the core device (PCI/ACPI/other) is
+too complex for a single device to be managed by a monolithic driver
+(e.g. Sound Open Firmware), multiple devices might implement a common
+intersection of functionality (e.g. NICs + RDMA), or a driver may want to
+export an interface for another subsystem to drive (e.g. SIOV Physical Function
+export Virtual Function management).  A split of the functinoality into child-
+devices representing sub-domains of functionality makes it possible to
+compartmentalize, layer, and distribute domain-specific concerns via a Linux
+device-driver model.
+
+An example for this kind of requirement is the audio subsystem where a single
+IP is handling multiple entities such as HDMI, Soundwire, local devices such as
+mics/speakers etc. The split for the core's functionality can be arbitrary or
+be defined by the DSP firmware topology and include hooks for test/debug. This
+allows for the audio core device to be minimal and focused on hardware-specific
+control and communication.
+
+Each auxiliary_device represents a part of its parent functionality. The
+generic behavior can be extended and specialized as needed by encapsulating an
+auxiliary_device within other domain-specific structures and the use of .ops
+callbacks. Devices on the auxiliary bus do not share any structures and the use
+of a communication channel with the parent is domain-specific.
+
+Note that ops are intended as a way to augment instance behavior within a

Re: mapcount corruption regression

2020-12-02 Thread Dan Williams
On Tue, Dec 1, 2020 at 9:07 PM Dan Williams  wrote:
>
> On Tue, Dec 1, 2020 at 7:43 PM Matthew Wilcox  wrote:
> >
> > On Tue, Dec 01, 2020 at 06:28:45PM -0800, Dan Williams wrote:
> > > On Tue, Dec 1, 2020 at 12:49 PM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Tue, Dec 01, 2020 at 12:42:39PM -0800, Dan Williams wrote:
> > > > > On Mon, Nov 30, 2020 at 6:24 PM Matthew Wilcox  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Nov 30, 2020 at 05:20:25PM -0800, Dan Williams wrote:
> > > > > > > Kirill, Willy, compound page experts,
> > > > > > >
> > > > > > > I am seeking some debug ideas about the following splat:
> > > > > > >
> > > > > > > BUG: Bad page state in process lt-pmem-ns  pfn:121a12
> > > > > > > page:51ef73f7 refcount:0 mapcount:-1024
> > > > > > > mapping: index:0x0 pfn:0x121a12
> > > > > >
> > > > > > Mapcount of -1024 is the signature of:
> > > > > >
> > > > > > #define PG_guard0x0400
> > > > >
> > > > > Oh, thanks for that. I overlooked how mapcount is overloaded. Although
> > > > > in v5.10-rc4 that value is:
> > > > >
> > > > > #define PG_table0x0400
> > > >
> > > > Ah, I was looking at -next, where Roman renumbered it.
> > > >
> > > > I know UML had a problem where it was not clearing PG_table, but you
> > > > seem to be running on bare metal.  SuperH did too, but again, you're
> > > > not using SuperH.
> > > >
> > > > > >
> > > > > > (the bits are inverted, so this turns into 0xfbff which is 
> > > > > > reported
> > > > > > as -1024)
> > > > > >
> > > > > > I assume you have debug_pagealloc enabled?
> > > > >
> > > > > Added it, but no extra spew. I'll dig a bit more on how PG_table is
> > > > > not being cleared in this case.
> > > >
> > > > I only asked about debug_pagealloc because that sets PG_guard.  Since
> > > > the problem is actually PG_table, it's not relevant.
> > >
> > > As a shot in the dark I reverted:
> > >
> > > b2b29d6d0119 mm: account PMD tables like PTE tables
> > >
> > > ...and the test passed.
> >
> > That's not really surprising ... you're still freeing PMD tables without
> > calling the destructor, which means that you're leaking ptlocks on
> > configs that can't embed the ptlock in the struct page.
>
> Ok, so potentially this new tracking is highlighting a long standing
> bug that was previously silent. That would explain the ambiguous
> bisect results.
>
> > I suppose it shows that you're leaking a PMD table rather than a PTE
> > table, so that might help track it down.  Checking for PG_table in
> > free_unref_page() and calling show_stack() will probably help more.
>
> Will do.

Thanks for the pointers Willy this fix below tests ok and looks
correct to me given the history:

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index dfd82f51ba66..7ed99314dcdf 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -829,6 +829,7 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
}

free_page((unsigned long)pmd_sv);
+   pgtable_pmd_page_dtor(virt_to_page(pmd));
free_page((unsigned long)pmd);

return 1;

In 2013 Kirill noticed that he missed a pmd page table free site:

c283610e44ec x86, mm: do not leak page->ptl for pmd page tables

In 2018 Toshi added a new pmd page table free site without the destructor:

28ee90fe6048 x86/mm: implement free pmd/pte page interfaces

In 2020 Willy adds PG_table accounting that flags the missing
pgtable_pmd_page_dtor()

Yi, I would appreciate a confirmation that the fix works for you.


Re: mapcount corruption regression

2020-12-01 Thread Dan Williams
On Tue, Dec 1, 2020 at 7:43 PM Matthew Wilcox  wrote:
>
> On Tue, Dec 01, 2020 at 06:28:45PM -0800, Dan Williams wrote:
> > On Tue, Dec 1, 2020 at 12:49 PM Matthew Wilcox  wrote:
> > >
> > > On Tue, Dec 01, 2020 at 12:42:39PM -0800, Dan Williams wrote:
> > > > On Mon, Nov 30, 2020 at 6:24 PM Matthew Wilcox  
> > > > wrote:
> > > > >
> > > > > On Mon, Nov 30, 2020 at 05:20:25PM -0800, Dan Williams wrote:
> > > > > > Kirill, Willy, compound page experts,
> > > > > >
> > > > > > I am seeking some debug ideas about the following splat:
> > > > > >
> > > > > > BUG: Bad page state in process lt-pmem-ns  pfn:121a12
> > > > > > page:51ef73f7 refcount:0 mapcount:-1024
> > > > > > mapping: index:0x0 pfn:0x121a12
> > > > >
> > > > > Mapcount of -1024 is the signature of:
> > > > >
> > > > > #define PG_guard0x0400
> > > >
> > > > Oh, thanks for that. I overlooked how mapcount is overloaded. Although
> > > > in v5.10-rc4 that value is:
> > > >
> > > > #define PG_table0x0400
> > >
> > > Ah, I was looking at -next, where Roman renumbered it.
> > >
> > > I know UML had a problem where it was not clearing PG_table, but you
> > > seem to be running on bare metal.  SuperH did too, but again, you're
> > > not using SuperH.
> > >
> > > > >
> > > > > (the bits are inverted, so this turns into 0xfbff which is 
> > > > > reported
> > > > > as -1024)
> > > > >
> > > > > I assume you have debug_pagealloc enabled?
> > > >
> > > > Added it, but no extra spew. I'll dig a bit more on how PG_table is
> > > > not being cleared in this case.
> > >
> > > I only asked about debug_pagealloc because that sets PG_guard.  Since
> > > the problem is actually PG_table, it's not relevant.
> >
> > As a shot in the dark I reverted:
> >
> > b2b29d6d0119 mm: account PMD tables like PTE tables
> >
> > ...and the test passed.
>
> That's not really surprising ... you're still freeing PMD tables without
> calling the destructor, which means that you're leaking ptlocks on
> configs that can't embed the ptlock in the struct page.

Ok, so potentially this new tracking is highlighting a long standing
bug that was previously silent. That would explain the ambiguous
bisect results.

> I suppose it shows that you're leaking a PMD table rather than a PTE
> table, so that might help track it down.  Checking for PG_table in
> free_unref_page() and calling show_stack() will probably help more.

Will do.


Re: mapcount corruption regression

2020-12-01 Thread Dan Williams
On Tue, Dec 1, 2020 at 12:49 PM Matthew Wilcox  wrote:
>
> On Tue, Dec 01, 2020 at 12:42:39PM -0800, Dan Williams wrote:
> > On Mon, Nov 30, 2020 at 6:24 PM Matthew Wilcox  wrote:
> > >
> > > On Mon, Nov 30, 2020 at 05:20:25PM -0800, Dan Williams wrote:
> > > > Kirill, Willy, compound page experts,
> > > >
> > > > I am seeking some debug ideas about the following splat:
> > > >
> > > > BUG: Bad page state in process lt-pmem-ns  pfn:121a12
> > > > page:51ef73f7 refcount:0 mapcount:-1024
> > > > mapping: index:0x0 pfn:0x121a12
> > >
> > > Mapcount of -1024 is the signature of:
> > >
> > > #define PG_guard0x0400
> >
> > Oh, thanks for that. I overlooked how mapcount is overloaded. Although
> > in v5.10-rc4 that value is:
> >
> > #define PG_table0x0400
>
> Ah, I was looking at -next, where Roman renumbered it.
>
> I know UML had a problem where it was not clearing PG_table, but you
> seem to be running on bare metal.  SuperH did too, but again, you're
> not using SuperH.
>
> > >
> > > (the bits are inverted, so this turns into 0xfbff which is reported
> > > as -1024)
> > >
> > > I assume you have debug_pagealloc enabled?
> >
> > Added it, but no extra spew. I'll dig a bit more on how PG_table is
> > not being cleared in this case.
>
> I only asked about debug_pagealloc because that sets PG_guard.  Since
> the problem is actually PG_table, it's not relevant.

As a shot in the dark I reverted:

b2b29d6d0119 mm: account PMD tables like PTE tables

...and the test passed.

Yi, do you see the same?


Re: mapcount corruption regression

2020-12-01 Thread Dan Williams
On Mon, Nov 30, 2020 at 6:24 PM Matthew Wilcox  wrote:
>
> On Mon, Nov 30, 2020 at 05:20:25PM -0800, Dan Williams wrote:
> > Kirill, Willy, compound page experts,
> >
> > I am seeking some debug ideas about the following splat:
> >
> > BUG: Bad page state in process lt-pmem-ns  pfn:121a12
> > page:51ef73f7 refcount:0 mapcount:-1024
> > mapping: index:0x0 pfn:0x121a12
>
> Mapcount of -1024 is the signature of:
>
> #define PG_guard0x0400

Oh, thanks for that. I overlooked how mapcount is overloaded. Although
in v5.10-rc4 that value is:

#define PG_table0x0400

>
> (the bits are inverted, so this turns into 0xfbff which is reported
> as -1024)
>
> I assume you have debug_pagealloc enabled?

Added it, but no extra spew. I'll dig a bit more on how PG_table is
not being cleared in this case.


Re: mapcount corruption regression

2020-11-30 Thread Dan Williams
On Mon, Nov 30, 2020 at 5:20 PM Dan Williams  wrote:
>
> Kirill, Willy, compound page experts,
>
> I am seeking some debug ideas about the following splat:
>
> BUG: Bad page state in process lt-pmem-ns  pfn:121a12

Looks to be a similar signature that Yi Zhang is seeing:

http://lore.kernel.org/r/51e938d1-aff7-0fa4-1a79-f77ac8bb2...@redhat.com


mapcount corruption regression

2020-11-30 Thread Dan Williams
Kirill, Willy, compound page experts,

I am seeking some debug ideas about the following splat:

BUG: Bad page state in process lt-pmem-ns  pfn:121a12
page:51ef73f7 refcount:0 mapcount:-1024
mapping: index:0x0 pfn:0x121a12
flags: 0x28()
raw: 0028 dead0100  
raw:  8a6914886b48 fbff 
page dumped because: nonzero mapcount
[..]
CPU: 26 PID: 6127 Comm: lt-pmem-ns Tainted: G   OE 5.10.0-rc4+ #450
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
 dump_stack+0x8b/0xb0
 bad_page.cold+0x63/0x94
 free_pcp_prepare+0x224/0x270
 free_unref_page+0x18/0xd0
 pud_free_pmd_page+0x146/0x160
 ioremap_pud_range+0xe3/0x350
 ioremap_page_range+0x108/0x160
 __ioremap_caller.constprop.0+0x174/0x2b0
 ? memremap+0x7a/0x110
 memremap+0x7a/0x110
 devm_memremap+0x53/0xa0
 pmem_attach_disk+0x4ed/0x530 [nd_pmem]

It triggers on v5.10-rc4 not on v5.9, but the bisect comes up with an
ambiguous result. I've run the bisect 3 times and landed on:

032c7ed95817 Merge tag 'arm64-upstream' of
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

...which does not touch anything near _mapcount. I suspect there is
something unique about the build that lines up the corruption to
happen or not happen.

The test is a simple namespace creation test that results in an
memremap() / ioremap() over several gigabytes of memory capacity. The
-1024 was interesting because that's the GUP_PIN_COUNTING_BIAS, but
that's the _refcount, I did not see any questionable changes to how
_mapcount is manipulated post v5.9. Problem should be reproducible by
running:

make -j TESTS="pmem-ns" check

...in qemu-kvm with some virtual pmem defined:

-object 
memory-backend-file,id=mem1,share,mem-path=${mem}1,size=$((mem_size+label_size))
-device nvdimm,memdev=mem1,id=nv1,label-size=${label_size}

...where ${mem}1 is a 128GB sparse file $mem_size is 127GB and
$label_size is 128KB.


[PATCH] ACPI: NFIT: Fix input validation of bus-family

2020-11-23 Thread Dan Williams
Dan reports that smatch thinks userspace can craft an out-of-bound bus
family number. However, nd_cmd_clear_to_send() blocks all non-zero
values of bus-family since only the kernel can initiate these commands.
However, in the speculation path, family is a user controlled array
index value so mask it for speculation safety. Also, since the
nd_cmd_clear_to_send() safety is non-obvious and possibly may change in
the future include input validation is if userspace could get past the
nd_cmd_clear_to_send() gatekeeper.

Link: http://lore.kernel.org/r/2020113000.GA1237157@mwanda
Reported-by: Dan Carpenter 
Fixes: 6450ddbd5d8e ("ACPI: NFIT: Define runtime firmware activation commands")
Cc: 
Signed-off-by: Dan Williams 
---
 drivers/acpi/nfit/core.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index cda7b6c52504..b11b08a60684 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -479,8 +480,11 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, 
struct nvdimm *nvdimm,
cmd_mask = nd_desc->cmd_mask;
if (cmd == ND_CMD_CALL && call_pkg->nd_family) {
family = call_pkg->nd_family;
-   if (!test_bit(family, _desc->bus_family_mask))
+   if (family > NVDIMM_BUS_FAMILY_MAX ||
+   !test_bit(family, _desc->bus_family_mask))
return -EINVAL;
+   family = array_index_nospec(family,
+   NVDIMM_BUS_FAMILY_MAX + 1);
dsm_mask = acpi_desc->family_dsm_mask[family];
guid = to_nfit_bus_uuid(family);
} else {



Re: arch/powerpc/mm/book3s64/pgtable.c:174:15: error: no previous prototype for 'create_section_mapping'

2020-11-23 Thread Dan Williams
On Mon, Nov 23, 2020 at 6:13 PM kernel test robot  wrote:
>
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> master
> head:   d5beb3140f91b1c8a3d41b14d729aefa4dcc58bc
> commit: a927bd6ba952d13c52b8b385030943032f659a3e mm: fix 
> phys_to_target_node() and memory_add_physaddr_to_nid() exports
> date:   31 hours ago
> config: powerpc-randconfig-r024-20201123 (attached as .config)
> compiler: powerpc64le-linux-gcc (GCC) 9.3.0
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a927bd6ba952d13c52b8b385030943032f659a3e
> git remote add linus 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git fetch --no-tags linus master
> git checkout a927bd6ba952d13c52b8b385030943032f659a3e
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
> ARCH=powerpc
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
>
> All errors (new ones prefixed by >>):
>
> >> arch/powerpc/mm/book3s64/pgtable.c:174:15: error: no previous prototype 
> >> for 'create_section_mapping' [-Werror=missing-prototypes]
>  174 | int __meminit create_section_mapping(unsigned long start, unsigned 
> long end,
>  |   ^~
>arch/powerpc/mm/book3s64/pgtable.c:405:6: error: no previous prototype for 
> 'arch_report_meminfo' [-Werror=missing-prototypes]
>  405 | void arch_report_meminfo(struct seq_file *m)
>  |  ^~~
>arch/powerpc/mm/book3s64/pgtable.c:461:5: error: no previous prototype for 
> 'pmd_move_must_withdraw' [-Werror=missing-prototypes]
>  461 | int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
>  | ^~
>cc1: all warnings being treated as errors
>
> vim +/create_section_mapping +174 arch/powerpc/mm/book3s64/pgtable.c

The whack-a-mole continues... the kbuild-robot success report I
received gave me too much confidence.


Re: [RFC PATCH 4/9] cxl/mem: Map memory device registers

2020-11-23 Thread Dan Williams
On Mon, Nov 23, 2020 at 11:20 AM Ben Widawsky  wrote:
[..]
> > -ENXIO is fine with me.  I just don't see it as often so I don't
> > really know what it is.
> >
> > Bjorn
>
> Dan, Bjorn, I did a fairly randomized look at various probe functions and 
> ENODEV
> seems to be more common. My sort of historical use has been
> - ENODEV: General, couldn't establish device presence
> - ENXIO: Device was there but something is totally misconfigured
> - E*: A matching errno for exactly what went wrong
>
> My question though is, would it be useful to propagate the error up through
> probe?

The error from probe becomes the modprobe exit code, or the write to
the 'bind' attribute errno. So, it's a choice between "No such device
or address", or "No such device". The "or address" mention makes a
small bit more sense to me since the device is obviously present as it
is visible in lspci, but either error code clearly indicates a driver
problem so ENODEV is fine.

For the other error codes I think it would be confusing to return
something like EINVAL from probe as that would be mistaken as an
invalid argument to the modprobe without stracing to see that it came
from the result of a sysfs write


[PATCH] libnvdimm/namespace: Fix reaping of invalidated block-window-namespace labels

2020-11-20 Thread Dan Williams
A recent change to ndctl to attempt to reconfigure namespaces in place
uncovered a label accounting problem in block-window-type namespaces.
The ndctl "create.sh" test is able to trigger this signature:

 WARNING: CPU: 34 PID: 9167 at drivers/nvdimm/label.c:1100 
__blk_label_update+0x9a3/0xbc0 [libnvdimm]
 [..]
 RIP: 0010:__blk_label_update+0x9a3/0xbc0 [libnvdimm]
 [..]
 Call Trace:
  uuid_store+0x21b/0x2f0 [libnvdimm]
  kernfs_fop_write+0xcf/0x1c0
  vfs_write+0xcc/0x380
  ksys_write+0x68/0xe0

When allocated capacity for a namespace is renamed (new UUID) the labels
with the old UUID need to be deleted. The ndctl behavior to always
destroy namespaces on reconfiguration hid this problem.

The immediate impact of this bug is limited since block-window-type
namespaces only seem to exist in the specification and not in any
shipping products. However, the label handling code is being reused for
other technologies like CXL region labels, so there is a benefit to
making sure both vertical labels sets (block-window) and horizontal
label sets (pmem) have a functional reference implementation in
libnvdimm.

Fixes: c4703ce11c23 ("libnvdimm/namespace: Fix label tracking error")
Cc: 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Ira Weiny 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/label.c |9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index 47a4828b8b31..6f2be7a34598 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -980,6 +980,15 @@ static int __blk_label_update(struct nd_region *nd_region,
}
}
 
+   /* release slots associated with any invalidated UUIDs */
+   mutex_lock(_mapping->lock);
+   list_for_each_entry_safe(label_ent, e, _mapping->labels, list)
+   if (test_and_clear_bit(ND_LABEL_REAP, _ent->flags)) {
+   reap_victim(nd_mapping, label_ent);
+   list_move(_ent->list, );
+   }
+   mutex_unlock(_mapping->lock);
+
/*
 * Find the resource associated with the first label in the set
 * per the v1.2 namespace specification.



Re: [RFC PATCH 8/9] cxl/mem: Register CXL memX devices

2020-11-19 Thread Dan Williams
On Tue, Nov 17, 2020 at 7:57 AM Jonathan Cameron
 wrote:
>
> On Tue, 10 Nov 2020 21:43:55 -0800
> Ben Widawsky  wrote:
>
> > From: Dan Williams 
> >
> > Create the /sys/bus/cxl hierarchy to enumerate memory devices
> > (per-endpoint control devices), memory address space devices (platform
> > address ranges with interleaving, performance, and persistence
> > attributes), and memory regions (active provisioned memory from an
> > address space device that is in use as System RAM or delegated to
> > libnvdimm as Persistent Memory regions).
> >
> > For now, only the per-endpoint control devices are registered on the
> > 'cxl' bus.
>
> Reviewing ABI without documentation is challenging even when it's simple
> so please add that for v2.
>
> This patch feels somewhat unpolished, but I guess it is mainly here to
> give an illustration of how stuff might fit together rather than
> any expectation of detailed review.

Yeah, this is definitely an early look in the spirit of "Release early
/ release often".

>
> So in that spirit I've just pointed out stuff that jumped out at me
> during a quick read through.
>
> Thanks,
>
> Jonathan
>
>
> >
> > Signed-off-by: Dan Williams 
> > Signed-off-by: Ben Widawsky 
> > ---
> >  drivers/cxl/Makefile |   2 +
> >  drivers/cxl/bus.c|  35 ++
> >  drivers/cxl/bus.h|   8 ++
> >  drivers/cxl/cxl.h|  33 +
> >  drivers/cxl/mem.c| 287 ++-
> >  5 files changed, 359 insertions(+), 6 deletions(-)
> >  create mode 100644 drivers/cxl/bus.c
> >  create mode 100644 drivers/cxl/bus.h
> >
> > diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
> > index 97fdffb00f2d..1cc032092852 100644
> > --- a/drivers/cxl/Makefile
> > +++ b/drivers/cxl/Makefile
> > @@ -1,7 +1,9 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
> >  obj-$(CONFIG_CXL_MEM) += cxl_mem.o
> > +obj-$(CONFIG_CXL_BUS_PROVIDER) += cxl_bus.o
> >
> >  ccflags-y += -DDEFAULT_SYMBOL_NAMESPACE=CXL
> >  cxl_acpi-y := acpi.o
> >  cxl_mem-y := mem.o
> > +cxl_bus-y := bus.o
> > diff --git a/drivers/cxl/bus.c b/drivers/cxl/bus.c
> > new file mode 100644
> > index ..8594366955f7
> > --- /dev/null
> > +++ b/drivers/cxl/bus.c
> > @@ -0,0 +1,35 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +// Copyright(c) 2020 Intel Corporation. All rights reserved.
> > +#include 
> > +#include 
> > +
> > +static struct bus_type cxl_bus_type = {
> > + .name = "cxl",
> > +};
> > +
> > +int cxl_register(struct device *dev)
> > +{
> > + int rc;
> > +
> > + dev->bus = _bus_type;
> > + rc = device_add(dev);
> > + if (rc)
> > + put_device(dev);
> > + return rc;
> > +}
> > +EXPORT_SYMBOL_GPL(cxl_register);
> > +
> > +static __init int cxl_bus_init(void)
> > +{
> > + return bus_register(_bus_type);
> > +}
> > +
> > +static void cxl_bus_exit(void)
> > +{
> > + bus_unregister(_bus_type);
> > +}
> > +
> > +module_init(cxl_bus_init);
> > +module_exit(cxl_bus_exit);
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_AUTHOR("Intel Corporation");
> > diff --git a/drivers/cxl/bus.h b/drivers/cxl/bus.h
> > new file mode 100644
> > index ..fe2bea2bbc3c
> > --- /dev/null
> > +++ b/drivers/cxl/bus.h
> > @@ -0,0 +1,8 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +// Copyright(c) 2020 Intel Corporation. All rights reserved.
> > +#ifndef __CXL_BUS_H__
> > +#define __CXL_BUS_H__
> > +
> > +int cxl_register(struct device *dev);
> > +
> > +#endif /* __CXL_BUS_H__ */
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index f49ab80f68bd..cef5fd9ea68b 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -3,6 +3,7 @@
> >
> >  #ifndef __CXL_H__
> >  #define __CXL_H__
> > +#include 
> >
> >  /* Device */
> >  #define CXLDEV_CAP_ARRAY_REG 0x0
> > @@ -52,12 +53,24 @@
> >  #define CXLMDEV_RESET_NEEDED_HOT 3
> >  #define CXLMDEV_RESET_NEEDED_CXL 4
> >
> > +struct cxl_memdev;
> >  struct cxl_mem {
> >   struct pci_dev *pdev;
> >   void __iomem *regs;
> > + struct cxl_memdev *cxlmd;
> >
> >   spinlock_t mbox_lock; /* Protects device mailbox and firmware */
> >
> > + struct

Re: [PATCH 1/1] ACPI/nfit: correct the badrange to be reported in nfit_handle_mce()

2020-11-18 Thread Dan Williams
On Wed, Nov 18, 2020 at 5:53 PM Leizhen (ThunderTown)
 wrote:
>
>
>
> On 2020/11/19 3:16, Dan Williams wrote:
> > On Wed, Nov 18, 2020 at 12:55 AM Leizhen (ThunderTown)
> >  wrote:
> >>
> >>
> >>
> >> On 2020/11/18 16:41, Zhen Lei wrote:
> >>> The badrange to be reported should always cover mce->addr.
> >> Maybe I should change this description to:
> >> Make sure the badrange to be reported can always cover mce->addr.
> >
> > Yes, I like that better. Can you also say a bit more about how you
> > found this bug? As far as I can see this looks like -stable material.
>
> I found it when I was learning the code. I'm looking at it carefully.

Ok, good eye.

The impact of this one is somewhat mitigated by the fact that errors
are expanded to 512 byte blast radius, and error consumption maps 4k
around errors. I suspect few are trying to use the badblock list to do
fine grained recovery so this bug went unnoticed.


Re: [PATCH 1/1] ACPI/nfit: correct the badrange to be reported in nfit_handle_mce()

2020-11-18 Thread Dan Williams
On Wed, Nov 18, 2020 at 12:55 AM Leizhen (ThunderTown)
 wrote:
>
>
>
> On 2020/11/18 16:41, Zhen Lei wrote:
> > The badrange to be reported should always cover mce->addr.
> Maybe I should change this description to:
> Make sure the badrange to be reported can always cover mce->addr.

Yes, I like that better. Can you also say a bit more about how you
found this bug? As far as I can see this looks like -stable material.


Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression

2020-11-18 Thread Dan Williams
On Wed, Nov 18, 2020 at 5:51 AM Jan Kara  wrote:
>
> On Mon 16-11-20 19:35:31, John Hubbard wrote:
> >
> > On 11/16/20 6:48 PM, kernel test robot wrote:
> > >
> > > Greeting,
> > >
> > > FYI, we noticed a -45.0% regression of 
> > > phoronix-test-suite.npb.FT.A.total_mop_s due to commit:
> > >
> >
> > That's a huge slowdown...
> >
> > >
> > > commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: 
> > > page->hpage_pinned_refcount: exact pin counts for huge pages")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > ...but that commit happened in April, 2020. Surely if this were a serious
> > issue we would have some other indication...is this worth following up
> > on?? I'm inclined to ignore it, honestly.
>
> Why this was detected so late is a fair question although it doesn't quite
> invalidate the report...

I don't know what specifically happened in this case, perhaps someone
from the lkp team can comment? However, the myth / contention that
"surely someone else would have noticed by now" is why the lkp project
was launched. Kernels regressed without much complaint and it wasn't
until much later in the process, around the time enterprise distros
rebased to new kernels, did end users start filing performance loss
regression reports. Given -stable kernel releases, 6-7 months is still
faster than many end user upgrade cycles to new kernel baselines.


Re: [RFC PATCH 2/9] cxl/acpi: add OSC support

2020-11-18 Thread Dan Williams
On Wed, Nov 18, 2020 at 4:26 AM Rafael J. Wysocki  wrote:
>
> On Tue, Nov 17, 2020 at 12:26 AM Dan Williams  
> wrote:
> >
> > On Mon, Nov 16, 2020 at 10:00 AM Jonathan Cameron
> >  wrote:
> > >
> > > On Tue, 10 Nov 2020 21:43:49 -0800
> > > Ben Widawsky  wrote:
> > >
> > > > From: Vishal Verma 
> > > >
> > > > Add support to advertise OS capabilities, and request OS control for CXL
> > > > features using the ACPI _OSC mechanism. Advertise support for all
> > > > possible CXL features, and attempt to request control too for all
> > > > possible features.
> > > >
> > > > Based on a patch by Sean Kelley.
> > > >
> > > > Signed-off-by: Vishal Verma 
> > > > Signed-off-by: Ben Widawsky 
> > >
> > > I guess unsurprisingly a lot of this is cut and paste from PCIe
> > > so can we share some of the code?
> > >
> >
> > I do not see a refactoring effort for these bit being all that
> > fruitful.
>
> Well, that depends on how much code duplication could be avoided this way.
>
> > The backport pressure for this driver stack I expect will be
> > higher than most, so I'm sensitive to avoiding unnecessary core
> > entanglements.
>
> If two pieces of code are based on the same underlying common code, it
> is immediately clear to the reader how similar to each other they are.
> Otherwise, they need to be carefully compared with each other to find
> out what the differences are and whether or not they are arbitrary or
> vitally important.  That is essential both from the revirem
> perspective today and to anyone wanting to understand the given code
> in the future (possibly in order to modify it without breaking it).
> It outweighs the convenience by far IMV, with all due respect.
>
> Recall how much effort it took to combine x86 with x86_64 and why it
> turned out to be necessary to do that work, for one example.

I agree with above, but the degree of potential code sharing and
refactoring for CXL is nowhere near approaching the x86_64 situation.
There's also the counter example in ext3 and ext4 where a split is
maintained for good reason. All I'm saying is that let's judge patches
and not theory when it comes to refactoring CXL, my expectation is
that those opportunities will be few and far between. CXL is a
superset of PCIE functionality so it should not put much pressure on
common core PCIE code to change vs incremental CXL extensions.


Re: [PATCH 1/1] ACPI/nfit: avoid accessing uninitialized memory in acpi_nfit_ctl()

2020-11-18 Thread Dan Williams
On Tue, Nov 17, 2020 at 11:36 PM Zhen Lei  wrote:
>
> The ACPI_ALLOCATE() does not zero the "buf", so when the condition
> "integer->type != ACPI_TYPE_INTEGER" in int_to_buf() is met, the result
> is unpredictable in acpi_nfit_ctl().
>
> Signed-off-by: Zhen Lei 

Looks good to me.

Reviewed-by: Dan Williams 

I'll pick this up.


Re: [RFC PATCH 1/9] cxl/acpi: Add an acpi_cxl module for the CXL interconnect

2020-11-17 Thread Dan Williams
On Tue, Nov 17, 2020 at 6:33 AM Rafael J. Wysocki  wrote:
[..]
> > +static struct acpi_driver acpi_cxl_driver = {
>
> First of all, no new acpi_driver instances, pretty please!
>
> acpi_default_enumeration() should create a platform device with the
> ACPI0017 ID for you.  Can't you provide a driver for this one?
>

Ah, yes, I recall we had this discussion around the time the ACPI0012
NFIT driver was developed. IIRC the reason ACPI0012 remaining an
acpi_driver was due to a need to handle ACPI notifications, is that
the deciding factor? ACPI0017 does not have any notifications so it
seems like platform driver is the way to go.


Re: [PATCH v4 01/10] Add auxiliary bus support

2020-11-17 Thread Dan Williams
On Mon, Nov 16, 2020 at 11:02 PM Greg KH  wrote:
>
> On Tue, Nov 17, 2020 at 07:30:00AM +0200, Leon Romanovsky wrote:
> > On Fri, Nov 13, 2020 at 08:18:50AM -0800, Dave Ertman wrote:
> > > Add support for the Auxiliary Bus, auxiliary_device and auxiliary_driver.
> > > It enables drivers to create an auxiliary_device and bind an
> > > auxiliary_driver to it.
> > >
> > > The bus supports probe/remove shutdown and suspend/resume callbacks.
> > > Each auxiliary_device has a unique string based id; driver binds to
> > > an auxiliary_device based on this id through the bus.
> > >
> > > Co-developed-by: Kiran Patil 
> > > Signed-off-by: Kiran Patil 
> > > Co-developed-by: Ranjani Sridharan 
> > > Signed-off-by: Ranjani Sridharan 
> > > Co-developed-by: Fred Oh 
> > > Signed-off-by: Fred Oh 
> > > Co-developed-by: Leon Romanovsky 
> > > Signed-off-by: Leon Romanovsky 
> > > Reviewed-by: Pierre-Louis Bossart 
> > > Reviewed-by: Shiraz Saleem 
> > > Reviewed-by: Parav Pandit 
> > > Reviewed-by: Dan Williams 
> > > Signed-off-by: Dave Ertman 
> > > ---
> >
> > Greg,
> >
> > This horse was beaten to death, can we please progress with this patch?
> > Create special topic branch or ack so I'll prepare this branch.
> >
> > We are in -rc4 now and we (Mellanox) can't hold our submissions anymore.
> > My mlx5_core probe patches [1] were too intrusive and they are ready to
> > be merged, Parav's patches got positive review as well [2] and will be
> > taken next.
> >
> > We delayed and have in our internal queues the patches for VDPA, eswitch
> > and followup for mlx5_core probe rework, but trapped due to this AUX bus
> > patch.
>
> There are no deadlines for kernel patches here, sorry.  Give me some
> time to properly review this, core kernel changes should not be rushed.
>
> Also, if you really want to blame someone for the delay, look at the
> patch submitters not the reviewers, as they are the ones that took a
> very long time with this over the lifecycle of this patchset, not me.  I
> have provided many "instant" reviews of this patchset, and then months
> went by between updates from them.

Please stop this finger pointing. It was already noted that the team,
out of abundance of caution / deference to the process, decided not to
push the patches while I was out on family leave. It's cruel to hold
that against them, and if anyone is to blame it's me for not
clarifying it was ok to proceed while I was out.


Re: [RFC PATCH 7/9] cxl/mem: Implement polled mode mailbox

2020-11-17 Thread Dan Williams
On Tue, Nov 17, 2020 at 10:07 AM Jonathan Cameron
 wrote:
>
> On Tue, 17 Nov 2020 08:34:38 -0800
> Ben Widawsky  wrote:
>
> > On 20-11-17 15:31:22, Jonathan Cameron wrote:
> > > On Tue, 10 Nov 2020 21:43:54 -0800
> > > Ben Widawsky  wrote:
> > >
> > > > Create a function to handle sending a command, optionally with a
> > > > payload, to the memory device, polling on a result, and then optionally
> > > > copying out the payload. The algorithm for doing this come straight out
> > > > of the CXL 2.0 specification.
> > > >
> > > > Primary mailboxes are capable of generating an interrupt when submitting
> > > > a command in the background. That implementation is saved for a later
> > > > time.
> > > >
> > > > Secondary mailboxes aren't implemented at this time.
> > > >
> > > > WARNING: This is untested with actual timeouts occurring.
> > > >
> > > > Signed-off-by: Ben Widawsky 
> > >
> > > Question inline for why the preempt / local timer dance is worth 
> > > bothering with.
> > > What am I missing?
> > >
> > > Thanks,
> > >
> > > Jonathan
> > >
> > > > ---
> > > >  drivers/cxl/cxl.h |  16 +++
> > > >  drivers/cxl/mem.c | 107 ++
> > > >  2 files changed, 123 insertions(+)
> > > >
> > > > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > > > index 482fc9cdc890..f49ab80f68bd 100644
> > > > --- a/drivers/cxl/cxl.h
> > > > +++ b/drivers/cxl/cxl.h
> > > > @@ -21,8 +21,12 @@
> > > >  #define CXLDEV_MB_CTRL 0x04
> > > >  #define   CXLDEV_MB_CTRL_DOORBELL BIT(0)
> > > >  #define CXLDEV_MB_CMD 0x08
> > > > +#define   CXLDEV_MB_CMD_PAYLOAD_LENGTH_SHIFT 16
> > > >  #define CXLDEV_MB_STATUS 0x10
> > > > +#define   CXLDEV_MB_STATUS_RET_CODE_SHIFT 32
> > > > +#define   CXLDEV_MB_STATUS_RET_CODE_MASK 0x
> > > >  #define CXLDEV_MB_BG_CMD_STATUS 0x18
> > > > +#define CXLDEV_MB_PAYLOAD 0x20
> > > >
> > > >  /* Memory Device */
> > > >  #define CXLMDEV_STATUS 0
> > > > @@ -114,4 +118,16 @@ static inline u64 __cxl_raw_read_reg64(struct 
> > > > cxl_mem *cxlm, u32 reg)
> > > >
> > > >   return readq(reg_addr + reg);
> > > >  }
> > > > +
> > > > +static inline void cxl_mbox_payload_fill(struct cxl_mem *cxlm, u8 
> > > > *input,
> > > > + unsigned int length)
> > > > +{
> > > > + memcpy_toio(cxlm->mbox.regs + CXLDEV_MB_PAYLOAD, input, length);
> > > > +}
> > > > +
> > > > +static inline void cxl_mbox_payload_drain(struct cxl_mem *cxlm,
> > > > +  u8 *output, unsigned int length)
> > > > +{
> > > > + memcpy_fromio(output, cxlm->mbox.regs + CXLDEV_MB_PAYLOAD, length);
> > > > +}
> > > >  #endif /* __CXL_H__ */
> > > > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > > > index 9fd2d1daa534..08913360d500 100644
> > > > --- a/drivers/cxl/mem.c
> > > > +++ b/drivers/cxl/mem.c
> > > > @@ -1,5 +1,6 @@
> > > >  // SPDX-License-Identifier: GPL-2.0-only
> > > >  // Copyright(c) 2020 Intel Corporation. All rights reserved.
> > > > +#include 
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > @@ -7,6 +8,112 @@
> > > >  #include "pci.h"
> > > >  #include "cxl.h"
> > > >
> > > > +struct mbox_cmd {
> > > > + u16 cmd;
> > > > + u8 *payload;
> > > > + size_t payload_size;
> > > > + u16 return_code;
> > > > +};
> > > > +
> > > > +static int cxldev_wait_for_doorbell(struct cxl_mem *cxlm)
> > > > +{
> > > > + u64 start, now;
> > > > + int cpu, ret, timeout = 20;
> > > > +
> > > > + start = local_clock();
> > > > + preempt_disable();
> > > > + cpu = smp_processor_id();
> > > > + for (;;) {
> > > > + now = local_clock();
> > > > + preempt_enable();
> > >
> > > What do we ever do with this mailbox that is particularly
> > > performance critical? I'd like to understand why we care enough
> > > to mess around with the preemption changes and local clock etc.
> > >
> >
> > It is quite obviously a premature optimization at this point (since we only
> > support a single command in QEMU). However, the polling can be anywhere from
> > instant to 2 seconds. QEMU implementation aside again, some devices may 
> > never
> > support interrupts on completion, and so I thought providing a poll 
> > function now
> > that is capable of working for most [all?] cases was wise.
>
> Definitely seems premature.  I'd want to see real numbers on hardware
> to justify this sort of complexity.  Maybe others disagree though.

The polling is definitely needed, but I think it can be a simple
jiffies based loop and avoid this sched_clock() complexity.


Re: [RFC PATCH 2/9] cxl/acpi: add OSC support

2020-11-16 Thread Dan Williams
On Mon, Nov 16, 2020 at 10:00 AM Jonathan Cameron
 wrote:
>
> On Tue, 10 Nov 2020 21:43:49 -0800
> Ben Widawsky  wrote:
>
> > From: Vishal Verma 
> >
> > Add support to advertise OS capabilities, and request OS control for CXL
> > features using the ACPI _OSC mechanism. Advertise support for all
> > possible CXL features, and attempt to request control too for all
> > possible features.
> >
> > Based on a patch by Sean Kelley.
> >
> > Signed-off-by: Vishal Verma 
> > Signed-off-by: Ben Widawsky 
>
> I guess unsurprisingly a lot of this is cut and paste from PCIe
> so can we share some of the code?
>

I do not see a refactoring effort for these bit being all that
fruitful. The backport pressure for this driver stack I expect will be
higher than most, so I'm sensitive to avoiding unnecessary core
entanglements.


Re: [RFC PATCH 4/9] cxl/mem: Map memory device registers

2020-11-16 Thread Dan Williams
On Fri, Nov 13, 2020 at 5:12 PM Ben Widawsky  wrote:
>
> On 20-11-13 12:17:32, Bjorn Helgaas wrote:
> > On Tue, Nov 10, 2020 at 09:43:51PM -0800, Ben Widawsky wrote:
> > > All the necessary bits are initialized in order to find and map the
> > > register space for CXL Memory Devices. This is accomplished by using the
> > > Register Locator DVSEC (CXL 2.0 - 8.1.9.1) to determine which PCI BAR to
> > > use, and how much of an offset from that BAR should be added.
> >
> > "Initialize the necessary bits ..." to use the usual imperative
> > sentence structure, as you did in the subject.
> >
> > > If the memory device registers are found and mapped a new internal data
> > > structure tracking device state is allocated.
> >
> > "Allocate device state if we find device registers" or similar.
> >
> > > Signed-off-by: Ben Widawsky 
> > > ---
> > >  drivers/cxl/mem.c | 68 +++
> > >  drivers/cxl/pci.h |  6 +
> > >  2 files changed, 69 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > > index aa7d881fa47b..8d9b9ab6c5ea 100644
> > > --- a/drivers/cxl/mem.c
> > > +++ b/drivers/cxl/mem.c
> > > @@ -7,9 +7,49 @@
> > >  #include "pci.h"
> > >
> > >  struct cxl_mem {
> > > +   struct pci_dev *pdev;
> > > void __iomem *regs;
> > >  };
> > >
> > > +static struct cxl_mem *cxl_mem_create(struct pci_dev *pdev, u32 reg_lo, 
> > > u32 reg_hi)
> > > +{
> > > +   struct device *dev = >dev;
> > > +   struct cxl_mem *cxlm;
> > > +   void __iomem *regs;
> > > +   u64 offset;
> > > +   u8 bar;
> > > +   int rc;
> > > +
> > > +   offset = ((u64)reg_hi << 32) | (reg_lo & 0x);
> > > +   bar = reg_lo & 0x7;
> > > +
> > > +   /* Basic sanity check that BAR is big enough */
> > > +   if (pci_resource_len(pdev, bar) < offset) {
> > > +   dev_err(dev, "bar%d: %pr: too small (offset: %#llx)\n",
> > > +   bar, >resource[bar], (unsigned long 
> > > long) offset);
> >
> > s/bar/BAR/
> >
> > > +   return ERR_PTR(-ENXIO);
> > > +   }
> > > +
> > > +   rc = pcim_iomap_regions(pdev, 1 << bar, pci_name(pdev));
> > > +   if (rc != 0) {
> > > +   dev_err(dev, "failed to map registers\n");
> > > +   return ERR_PTR(-ENXIO);
> > > +   }
> > > +
> > > +   cxlm = devm_kzalloc(>dev, sizeof(*cxlm), GFP_KERNEL);
> > > +   if (!cxlm) {
> > > +   dev_err(dev, "No memory available\n");
> > > +   return ERR_PTR(-ENOMEM);
> > > +   }
> > > +
> > > +   regs = pcim_iomap_table(pdev)[bar];
> > > +   cxlm->pdev = pdev;
> > > +   cxlm->regs = regs + offset;
> > > +
> > > +   dev_dbg(dev, "Mapped CXL Memory Device resource\n");
> > > +   return cxlm;
> > > +}
> > > +
> > >  static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
> > >  {
> > > int pos;
> > > @@ -34,9 +74,9 @@ static int cxl_mem_dvsec(struct pci_dev *pdev, int 
> > > dvsec)
> > >
> > >  static int cxl_mem_probe(struct pci_dev *pdev, const struct 
> > > pci_device_id *id)
> > >  {
> > > +   struct cxl_mem *cxlm = ERR_PTR(-ENXIO);
> > > struct device *dev = >dev;
> > > -   struct cxl_mem *cxlm;
> >
> > The order was better before ("dev", then "clxm").  Oh, I suppose this
> > is a "reverse Christmas tree" thing.
> >
>
> I don't actually care either way as long as it's consistent. I tend to do
> reverse Christmas tree for no particular reason.

Yeah, reverse Christmas tree for no particular reason.

>
> > > -   int rc, regloc;
> > > +   int rc, regloc, i;
> > >
> > > rc = cxl_bus_prepared(pdev);
> > > if (rc != 0) {
> > > @@ -44,15 +84,33 @@ static int cxl_mem_probe(struct pci_dev *pdev, const 
> > > struct pci_device_id *id)
> > > return rc;
> > > }
> > >
> > > +   rc = pcim_enable_device(pdev);
> > > +   if (rc)
> > > +   return rc;
> > > +
> > > regloc = cxl_mem_dvsec(pdev, PCI_DVSEC_ID_CXL_REGLOC);
> > > if (!regloc) {
> > > dev_err(dev, "register location dvsec not found\n");
> > > return -ENXIO;
> > > }
> > > +   regloc += 0xc; /* Skip DVSEC + reserved fields */
> > > +
> > > +   for (i = regloc; i < regloc + 0x24; i += 8) {
> > > +   u32 reg_lo, reg_hi;
> > > +
> > > +   pci_read_config_dword(pdev, i, _lo);
> > > +   pci_read_config_dword(pdev, i + 4, _hi);
> > > +
> > > +   if (CXL_REGLOG_IS_MEMDEV(reg_lo)) {
> > > +   cxlm = cxl_mem_create(pdev, reg_lo, reg_hi);
> > > +   break;
> > > +   }
> > > +   }
> > > +
> > > +   if (IS_ERR(cxlm))
> > > +   return -ENXIO;
> >
> > I think this would be easier to read if cxl_mem_create() returned NULL
> > on failure (it prints error messages and we throw away
> > -ENXIO/-ENOMEM distinction here anyway) so you could do:
> >
> >   struct cxl_mem *cxlm = NULL;
> >
> >   for (...) {
> > if (...) {
> >   cxlm = cxl_mem_create(pdev, reg_lo, reg_hi);
> >   break;
> > }
> >   }
> >
> >   if (!cxlm)
> > return -ENXIO;  /* 

Re: [RFC PATCH 3/9] cxl/mem: Add a driver for the type-3 mailbox

2020-11-14 Thread Dan Williams
On Fri, Nov 13, 2020 at 5:09 PM Ben Widawsky  wrote:
[..]
> > Unused, maybe move it to the patch that adds the use?
> >
>
> This is a remnant from when Dan gave me the basis to do the mmio work. I agree
> it can be removed now.

Yes.

> > > +static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
> > > +{
> > > +   int pos;
> > > +
> > > +   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DVSEC);
> > > +   if (!pos)
> > > +   return 0;
> > > +
> > > +   while (pos) {
> > > +   u16 vendor, id;
> > > +
> > > +   pci_read_config_word(pdev, pos + PCI_DVSEC_VENDOR_OFFSET, 
> > > );
> > > +   pci_read_config_word(pdev, pos + PCI_DVSEC_ID_OFFSET, );
> > > +   if (vendor == PCI_DVSEC_VENDOR_CXL && dvsec == id)
> > > +   return pos;
> > > +
> > > +   pos = pci_find_next_ext_capability(pdev, pos, 
> > > PCI_EXT_CAP_ID_DVSEC);
> > > +   }
> > > +
> > > +   return 0;
> > > +}
> >
> > I assume we'll refactor and move this into the PCI core after we
> > resolve the several places this is needed.  When we do that, the
> > vendor would be passed in, so maybe we should do that here to make it
> > simpler to move this to the PCI core.
> >
>
> I think we'll need to keep this in order to try to keep the dream alive of
> loading a CXL kernel module on an older kernel. However, PCI code would 
> benefit
> from having it (in an ideal world, it'd only be there).

So I think this is fine / expected to move standalone common code like
this to the PCI core. What I'm aiming to avoid with "the dream" Ben
references is unnecessary dependencies on core changes. CXL is large
enough that it will generate more backport pressure than ACPI NFIT /
LIBNVDIMM ever did. From a self interest perspective maximizing how
much of CXL can be enabled without core dependencies is a goal just to
lighten my own backport load. The internals of cxl_mem_dvsec() are
simple enough to backport.

>
> > > +static int cxl_mem_probe(struct pci_dev *pdev, const struct 
> > > pci_device_id *id)
> > > +{
> > > +   struct device *dev = >dev;
> > > +   struct cxl_mem *cxlm;
> > > +   int rc, regloc;
> > > +
> > > +   rc = cxl_bus_prepared(pdev);
> > > +   if (rc != 0) {
> > > +   dev_err(dev, "failed to acquire interface\n");
> >
> > Interesting naming: apparently when cxl_bus_prepared() returns a
> > non-zero ("true") value, it is actually *not* prepared?
> >
>
> This looks like a rebase fail to me, but I'll let Dan answer.

Yeah, I originally envisioned this as a ternary result with
-EPROBE_DEFER as a possible return value, but now that we've found a
way to handle CXL _OSC without colliding with legacy PCIE _OSC this
can indeed move to a boolean result.

Will fix up.

>
> > > +   return rc;
> > > +   }
> > > +
> > > +   regloc = cxl_mem_dvsec(pdev, PCI_DVSEC_ID_CXL_REGLOC);
> > > +   if (!regloc) {
> > > +   dev_err(dev, "register location dvsec not found\n");
> > > +   return -ENXIO;
> > > +   }
> > > +
> > > +   cxlm = devm_kzalloc(dev, sizeof(*cxlm), GFP_KERNEL);
> > > +   if (!cxlm)
> > > +   return -ENOMEM;
> >
> > Unused.  And [4/9] removes it before it's *ever* used :)
> >
>
> Same as a few above, I think Dan was providing this for me to implement the
> reset. It could go away...

Yes, a collaboration artifact that we can clean up.

>
> > > +   return 0;
> > > +}
> > > +
> > > +static void cxl_mem_remove(struct pci_dev *pdev)
> > > +{
> > > +}
> > > +
> > > +static const struct pci_device_id cxl_mem_pci_tbl[] = {
> > > +   /* PCI class code for CXL.mem Type-3 Devices */
> > > +   { PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID,
> > > + PCI_CLASS_MEMORY_CXL, 0xff, 0 },
> > > +   { /* terminate list */ },
> > > +};
> > > +MODULE_DEVICE_TABLE(pci, cxl_mem_pci_tbl);
> > > +
> > > +static struct pci_driver cxl_mem_driver = {
> > > +   .name   = KBUILD_MODNAME,
> > > +   .id_table   = cxl_mem_pci_tbl,
> > > +   .probe  = cxl_mem_probe,
> > > +   .remove = cxl_mem_remove,
> > > +};
> > > +
> > > +MODULE_LICENSE("GPL v2");
> > > +MODULE_AUTHOR("Intel Corporation");
> > > +module_pci_driver(cxl_mem_driver);
> > > +MODULE_IMPORT_NS(CXL);
> > > diff --git a/drivers/cxl/pci.h b/drivers/cxl/pci.h
> > > new file mode 100644
> > > index ..beb03921e6da
> > > --- /dev/null
> >
> > > +++ b/drivers/cxl/pci.h
> > > @@ -0,0 +1,15 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +// Copyright(c) 2020 Intel Corporation. All rights reserved.
> >
> > /* SPDX-... */
> > /* Copyright ...*/
> >
> > The SPDX rules are a bit arcane and annoyingly hard to grep for, but
> > I found them in Documentation/process/license-rules.rst

Yes, I did not realize the header vs source /* */ vs // SPDX style.

> >
> > > +#ifndef __CXL_PCI_H__
> > > +#define __CXL_PCI_H__
> > > +
> > > +#define PCI_CLASS_MEMORY_CXL   0x050210
> > > +
> > > +#define PCI_EXT_CAP_ID_DVSEC   0x23
> > > +#define 

Re: [RFC PATCH 3/9] cxl/mem: Add a driver for the type-3 mailbox

2020-11-11 Thread Dan Williams
On Wed, Nov 11, 2020 at 1:42 PM Randy Dunlap  wrote:
>
> On 11/11/20 9:17 AM, Dan Williams wrote:
> > On Tue, Nov 10, 2020 at 11:12 PM Christoph Hellwig  
> > wrote:
> >>
> >> On Tue, Nov 10, 2020 at 09:43:50PM -0800, Ben Widawsky wrote:
> >>> +config CXL_MEM
> >>> +tristate "CXL.mem Device Support"
> >>> +depends on PCI && CXL_BUS_PROVIDER != n
> >>
> >> depend on PCI && CXL_BUS_PROVIDER
> >>
> >>> +default m if CXL_BUS_PROVIDER
> >>
> >> Please don't set weird defaults for new code.  Especially not default
> >> to module crap like this.
> >
> > This goes back to what people like Dave C. asked for LIBNVDIMM / DAX,
> > a way to blanket turn on a subsystem without needing to go hunt down
> > individual configs. All of CXL is "default n", but if someone turns on
> > a piece of it they get all of it by default. The user can then opt-out
> > on pieces after that first opt-in. If there's a better way to turn on
> > suggested configs I'm open to switch to that style. As for the
> > "default m" I was worried that it would be "default y" without the
> > specificity, but I did not test that... will check. There have been
> > times when I wished that distros defaulted bleeding edge new enabling
> > to 'm' and putting that default in the Kconfig maybe saves me from
> > needing to file individual config changes to distros after the fact.
>
> What we as developers put into mainline kernel Kconfig files has nothing
> to do with what distros use in their distro config files.
> Or at least it shouldn't.  Maybe your experience has been different.

I agree with that sentiment, but walk it back through the requirement
I mentioned above... *if* we want a top-level CXL option (default n)
that goes and enables many CXL sub-options the default for those
sub-options is something that needs to be listed in the Kconfig. 'm'
is more flexible than 'y', so if a user wants CXL at all, and doesn't
care about how, I'd prefer it's 'm' rather than 'y'.

I have had to go submit distro config fixes when Kconfig defaulted to
'y' when 'm' was available, and the reasoning for why it was 'y' was
"oh, that was the Kconfig default when I flipped this other option".


> >>> +// Copyright(c) 2020 Intel Corporation. All rights reserved.
> >>
> >> Please don't use '//' for anything but the SPDX header.
> >
> > Ok, I find // following by /* */ a bit ugly, but I don't care enough to 
> > fight.
> >
>
> Hm, it's not in coding-style AFAICT but Linus has OK-ed C99 style comments:
> http://lkml.iu.edu/hypermail/linux/kernel/1607.1/00627.html
>
>
> >>> +MODULE_AUTHOR("Intel Corporation");
> >>
> >> A module author is not a company.
> >
> > At least I don't have a copyright assignment clause, I don't agree
> > with the vanity of listing multiple people here especially when
> > MAINTAINERS has the contact info, and I don't want to maintain a list
> > as people do drive-by contributions and we need to figure out at what
> > level of contribution mandates a new MODULE_AUTHOR line. Now, that
> > said I would be ok to duplicate the MAINTAINERS as MODULE_AUTHOR
> > lines, but I otherwise expect MAINTAINERS is the central source for
> > module contact info.
>
> Sure, MAINTAINERS is fine, but the MODULE_AUTHOR() above provides
> no useful information.
> Even saying (made up) linux-de...@linux.intel.com would be slightly better,
> but some kind of contact info would be great. Otherwise just delete that line.

True, if the goal is to allow random end users to email support
questions about this module I'd rather not put my email there.
Instead, if it's someone that has kernel development questions then
they should be able to use MAINTAINERS for that contact.


Re: [RFC PATCH 3/9] cxl/mem: Add a driver for the type-3 mailbox

2020-11-11 Thread Dan Williams
On Wed, Nov 11, 2020 at 9:17 AM Dan Williams  wrote:
[..]
> > > +
> > > + pci_read_config_word(pdev, pos + PCI_DVSEC_VENDOR_OFFSET, 
> > > );
> > > + pci_read_config_word(pdev, pos + PCI_DVSEC_ID_OFFSET, );
> > > + if (vendor == PCI_DVSEC_VENDOR_CXL && dvsec == id)
> > > + return pos;
> > > +
> > > + pos = pci_find_next_ext_capability(pdev, pos, 
> > > PCI_EXT_CAP_ID_DVSEC);
> >
> > Overly long lines again.
>
> I thought 100 is the new 80 these days?

Saw your clarification to Vishal, I had missed that. Will trim.


Re: [RFC PATCH 3/9] cxl/mem: Add a driver for the type-3 mailbox

2020-11-11 Thread Dan Williams
On Tue, Nov 10, 2020 at 11:12 PM Christoph Hellwig  wrote:
>
> On Tue, Nov 10, 2020 at 09:43:50PM -0800, Ben Widawsky wrote:
> > +config CXL_MEM
> > +tristate "CXL.mem Device Support"
> > +depends on PCI && CXL_BUS_PROVIDER != n
>
> depend on PCI && CXL_BUS_PROVIDER
>
> > +default m if CXL_BUS_PROVIDER
>
> Please don't set weird defaults for new code.  Especially not default
> to module crap like this.

This goes back to what people like Dave C. asked for LIBNVDIMM / DAX,
a way to blanket turn on a subsystem without needing to go hunt down
individual configs. All of CXL is "default n", but if someone turns on
a piece of it they get all of it by default. The user can then opt-out
on pieces after that first opt-in. If there's a better way to turn on
suggested configs I'm open to switch to that style. As for the
"default m" I was worried that it would be "default y" without the
specificity, but I did not test that... will check. There have been
times when I wished that distros defaulted bleeding edge new enabling
to 'm' and putting that default in the Kconfig maybe saves me from
needing to file individual config changes to distros after the fact.

>
> > +// Copyright(c) 2020 Intel Corporation. All rights reserved.
>
> Please don't use '//' for anything but the SPDX header.

Ok, I find // following by /* */ a bit ugly, but I don't care enough to fight.

>
> > +
> > + pci_read_config_word(pdev, pos + PCI_DVSEC_VENDOR_OFFSET, 
> > );
> > + pci_read_config_word(pdev, pos + PCI_DVSEC_ID_OFFSET, );
> > + if (vendor == PCI_DVSEC_VENDOR_CXL && dvsec == id)
> > + return pos;
> > +
> > + pos = pci_find_next_ext_capability(pdev, pos, 
> > PCI_EXT_CAP_ID_DVSEC);
>
> Overly long lines again.

I thought 100 is the new 80 these days?

> > +static void cxl_mem_remove(struct pci_dev *pdev)
> > +{
> > +}
>
> No need for the empty remove callback.

True, will fix.

>
> > +MODULE_AUTHOR("Intel Corporation");
>
> A module author is not a company.

At least I don't have a copyright assignment clause, I don't agree
with the vanity of listing multiple people here especially when
MAINTAINERS has the contact info, and I don't want to maintain a list
as people do drive-by contributions and we need to figure out at what
level of contribution mandates a new MODULE_AUTHOR line. Now, that
said I would be ok to duplicate the MAINTAINERS as MODULE_AUTHOR
lines, but I otherwise expect MAINTAINERS is the central source for
module contact info.


Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-06 Thread Dan Williams
On Fri, Nov 6, 2020 at 4:12 PM Jason Gunthorpe  wrote:
>
> On Fri, Nov 06, 2020 at 03:47:00PM -0800, Dan Williams wrote:
[..]
> The only sane way to implement this generically is for the VMM to
> provide a hypercall to obtain a real *working* addr/data pair(s) and
> then have the platform hand those out from
> pci_subdevice_msi_create_irq_domain().

Yeah, that seems a logical attach point for this magic. Appreciate you
taking the time to lay it out.


Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

2020-11-06 Thread Dan Williams
On Fri, Nov 6, 2020 at 9:51 AM Jason Gunthorpe  wrote:
[..]
> > This is true for IMS as well. But probably not implemented in the kernel as
> > such. From a HW point of view (take idxd for instance) the facility is
> > available to native OS as well. The early RFC supported this for native.
>
> I can't follow what you are trying to say here.

I'm having a hard time following the technical cruxes of this debate.
I grokked your feedback on the original IMS proposal way back at the
beginning of this effort (pre-COVID even!), so maybe I can mediate
here as well. Although, SIOV is that much harder for me to spell than
IMS, so bear with me.

> Dave said the IMS cap was to indicate that the VMM supported emulation
> of IMS so that the VMM can do the MSI addr/data translation as part of
> the emulation.
>
> I'm saying emulation will be too horrible for our devices that don't
> require *any* emulation.

This part I think I understand, i.e. why spend any logic emulating IMS
as MSI since the IMS capability can be a paravirtualized interface
from guest to VMM with none of the compromises that MSI would enforce.
Did I get that right?

> It is a bad architecture. The platform needs to handle this globally
> for all devices, not special hacky emulations things custom made for
> every device out there.

I confess I don't quite understand the shape of what "platform needs
to handle this globally" means, but I understand the desired end
result of "no emulation added where not needed". However, would this
mean that the bare-metal idxd driver can not be used directly in the
guest without modification? For example, as I understand from talking
to Ashok, idxd has some device events like error notification hard
wired to MSI while data patch interrupts are IMS. So even if the IMS
side does not hook up MSI emulation doesn't idxd still need MSI
emulation to reuse the bare metal driver directly?

> > Native devices can have both MSIx and IMS capability. But as I
> > understand this isn't how we have partitioned things in SW today. We
> > left IMS only for mdev's. And I agree this would be very useful.
>
> That split is just some decision idxd did, we are thinking about doing
> other things in our devices.

Where does the collision happen between what you need for a clean
implementation of an IMS-like capability (/me misses his "dev-msi"
name that got thrown out in the Thomas rewrite), and emulation needed
to not have VF special casing in the idxd driver.

Also feel free to straighten me out (Jason or Ashok) if I've botched
the understanding of this.


Re: [PATCH v3 01/10] Add auxiliary bus support

2020-11-05 Thread Dan Williams
On Thu, Nov 5, 2020 at 11:28 AM Ertman, David M
 wrote:
[..]
> > > Each auxiliary_device represents a part of its parent
> > > +functionality. The generic behavior can be extended and specialized as
> > needed
> > > +by encapsulating an auxiliary_device within other domain-specific
> > structures and
> > > +the use of .ops callbacks. Devices on the auxiliary bus do not share any
> > > +structures and the use of a communication channel with the parent is
> > > +domain-specific.
> >
> > Should there be any guidance here on when to use ops and when to just
> > export functions from parent driver to child. EXPORT_SYMBOL_NS() seems
> > a perfect fit for publishing shared routines between parent and child.
> >
>
> I would leave this up the driver writers to determine what is best for them.

I think there is a pathological case that can be avoided with a
statement like the following:

"Note that ops are intended as a way to augment instance behavior
within a class of auxiliary devices, it is not the mechanism for
exporting common infrastructure from the parent. Consider
EXPORT_SYMBOL_NS() to convey infrastructure from the parent module to
the auxiliary module(s)."

As for your other dispositions of the feedback, looks good to me.


Re: [PATCH v3 01/10] Add auxiliary bus support

2020-11-05 Thread Dan Williams
On Thu, Nov 5, 2020 at 11:30 AM Leon Romanovsky  wrote:
>
> On Thu, Nov 05, 2020 at 09:12:51AM -0800, Dan Williams wrote:
> > On Thu, Nov 5, 2020 at 1:47 AM Leon Romanovsky  wrote:
> > >
> > > On Thu, Nov 05, 2020 at 01:19:09AM -0800, Dan Williams wrote:
> > > > Some doc fixups, and minor code feedback. Otherwise looks good to me.
> > > >
> > > > On Thu, Oct 22, 2020 at 5:35 PM Dave Ertman  
> > > > wrote:
> > > > >
> > >
> > > <...>
> > >
> > > > >
> > > > > +config AUXILIARY_BUS
> > > > > +   bool
> > > >
> > > > tristate? Unless you need non-exported symbols, might as well let this
> > > > be a module.
> > >
> > > I asked it to be "bool", because bus as a module is an invitation for
> > > a disaster. For example if I compile-in mlx5 which is based on this bus,
> > > and won't add auxiliary_bus as a module to initramfs, the system won't 
> > > boot.
> >
> > Something is broken if module dependencies don't arrange for
> > auxiliary_bus.ko to be added to the initramfs automatically, but yes,
> > it is another degree of freedom for something to go wrong if you build
> > the initramfs by hand.
>
> And this is something that I would like to avoid for now.

Fair enough.

>
> >
> > >
> > > <...>
> > >
> > > >
> > > > Per above SPDX is v2 only, so...
> > >
> > > Isn't it default for the Linux kernel?
> >
> > SPDX eliminated the need to guess a default, and MODULE_LICENSE("GPL")
> > implies the "or later" language. The only default assumption is that
> > the license is GPL v2 compatible, those possibilities are myriad, but
> > v2-only is the first preference.
>
> I mean that plain GPL == GPL v2 in the kernel.

You are right, I was wrong.


Re: [PATCH v3 01/10] Add auxiliary bus support

2020-11-05 Thread Dan Williams
On Thu, Nov 5, 2020 at 11:40 AM Parav Pandit  wrote:
>
>
>
> > From: Ertman, David M 
> > Sent: Friday, November 6, 2020 12:58 AM
> > Subject: RE: [PATCH v3 01/10] Add auxiliary bus support
> >
> > > -Original Message-
> > > From: Dan Williams 
> > > Sent: Thursday, November 5, 2020 1:19 AM
> > >
>
> [..]
> > > > +
> > > > +Another use case is for the PCI device to be split out into
> > > > +multiple sub functions.  For each sub function an auxiliary_device
> > > > +will be created.  A PCI sub function driver will bind to such
> > > > +devices that will create its own one or more class devices.  A PCI
> > > > +sub function auxiliary device will likely be contained in a struct
> > > > +with additional attributes such as user defined sub function number
> > > > +and optional attributes such as resources and a link to
> > > the
> > > > +parent device.  These attributes could be used by systemd/udev; and
> > > hence should
> > > > +be initialized before a driver binds to an auxiliary_device.
> > >
> > > This does not read like an explicit example like the previous 2. Did
> > > you have something specific in mind?
> > >
> >
> > This was added by request of Parav.
> >
> This example describes the mlx5 PCI subfunction use case.
> I didn't follow your question about 'explicit example'.
> What part is missing to identify it as explicit example?

Specifically listing "mlx5" so if someone reading this document thinks
to themselves "hey mlx5 sounds like my use case" they can go grep for
that.


Re: [PATCH v3 01/10] Add auxiliary bus support

2020-11-05 Thread Dan Williams
On Thu, Nov 5, 2020 at 12:21 PM Greg KH  wrote:
>
> On Thu, Nov 05, 2020 at 09:12:51AM -0800, Dan Williams wrote:
> > > >
> > > > Per above SPDX is v2 only, so...
> > >
> > > Isn't it default for the Linux kernel?
> >
> > SPDX eliminated the need to guess a default, and MODULE_LICENSE("GPL")
> > implies the "or later" language. The only default assumption is that
> > the license is GPL v2 compatible, those possibilities are myriad, but
> > v2-only is the first preference.
>
> No, MODULE_LICENSE("GPL") does not imply "or later" at all.  Please see
> include/linux/module.h, it means "GPL version 2".
>

Oh, I did, and stopped before getting to:

  "only/or later" distinction is completely irrelevant and does neither
 *replace the proper license identifiers in the corresponding source file

...sorry for the noise.


Re: [PATCH v3 01/10] Add auxiliary bus support

2020-11-05 Thread Dan Williams
On Thu, Nov 5, 2020 at 1:47 AM Leon Romanovsky  wrote:
>
> On Thu, Nov 05, 2020 at 01:19:09AM -0800, Dan Williams wrote:
> > Some doc fixups, and minor code feedback. Otherwise looks good to me.
> >
> > On Thu, Oct 22, 2020 at 5:35 PM Dave Ertman  
> > wrote:
> > >
>
> <...>
>
> > >
> > > +config AUXILIARY_BUS
> > > +   bool
> >
> > tristate? Unless you need non-exported symbols, might as well let this
> > be a module.
>
> I asked it to be "bool", because bus as a module is an invitation for
> a disaster. For example if I compile-in mlx5 which is based on this bus,
> and won't add auxiliary_bus as a module to initramfs, the system won't boot.

Something is broken if module dependencies don't arrange for
auxiliary_bus.ko to be added to the initramfs automatically, but yes,
it is another degree of freedom for something to go wrong if you build
the initramfs by hand.

>
> <...>
>
> >
> > Per above SPDX is v2 only, so...
>
> Isn't it default for the Linux kernel?

SPDX eliminated the need to guess a default, and MODULE_LICENSE("GPL")
implies the "or later" language. The only default assumption is that
the license is GPL v2 compatible, those possibilities are myriad, but
v2-only is the first preference.


Re: [PATCH v3 01/10] Add auxiliary bus support

2020-11-05 Thread Dan Williams
Some doc fixups, and minor code feedback. Otherwise looks good to me.

On Thu, Oct 22, 2020 at 5:35 PM Dave Ertman  wrote:
>
> Add support for the Auxiliary Bus, auxiliary_device and auxiliary_driver.
> It enables drivers to create an auxiliary_device and bind an
> auxiliary_driver to it.
>
> The bus supports probe/remove shutdown and suspend/resume callbacks.
> Each auxiliary_device has a unique string based id; driver binds to
> an auxiliary_device based on this id through the bus.
>
> Co-developed-by: Kiran Patil 
> Signed-off-by: Kiran Patil 
> Co-developed-by: Ranjani Sridharan 
> Signed-off-by: Ranjani Sridharan 
> Co-developed-by: Fred Oh 
> Signed-off-by: Fred Oh 
> Co-developed-by: Leon Romanovsky 
> Signed-off-by: Leon Romanovsky 
> Reviewed-by: Pierre-Louis Bossart 
> Reviewed-by: Shiraz Saleem 
> Reviewed-by: Parav Pandit 
> Reviewed-by: Dan Williams 
> Signed-off-by: Dave Ertman 
> ---
>  Documentation/driver-api/auxiliary_bus.rst | 228 ++
>  Documentation/driver-api/index.rst |   1 +
>  drivers/base/Kconfig   |   3 +
>  drivers/base/Makefile  |   1 +
>  drivers/base/auxiliary.c   | 267 +
>  include/linux/auxiliary_bus.h  |  78 ++
>  include/linux/mod_devicetable.h|   8 +
>  scripts/mod/devicetable-offsets.c  |   3 +
>  scripts/mod/file2alias.c   |   8 +
>  9 files changed, 597 insertions(+)
>  create mode 100644 Documentation/driver-api/auxiliary_bus.rst
>  create mode 100644 drivers/base/auxiliary.c
>  create mode 100644 include/linux/auxiliary_bus.h
>
> diff --git a/Documentation/driver-api/auxiliary_bus.rst 
> b/Documentation/driver-api/auxiliary_bus.rst
> new file mode 100644
> index ..500f29692c81
> --- /dev/null
> +++ b/Documentation/driver-api/auxiliary_bus.rst
> @@ -0,0 +1,228 @@
> +.. SPDX-License-Identifier: GPL-2.0-only
> +
> +=
> +Auxiliary Bus
> +=
> +
> +In some subsystems, the functionality of the core device (PCI/ACPI/other) is
> +too complex for a single device to be managed as a monolithic block or a 
> part of
> +the functionality needs to be exposed to a different subsystem.

I think there are three identified use cases for this now, so call out
those examples for others to compare if aux bus is a fit for their use
case.

"In some subsystems, the functionality of the core device
(PCI/ACPI/other) is too complex to be handled by a monolithic driver
(e.g. Sound Open Firmware), multiple devices might implement a common
intersection of functionality (e.g. NICs + RDMA), or a driver may want
to export an interface for another subsystem to drive (e.g. SIOV
Physical Function export Virtual Function management)."

> + Splitting the
> +functionality into smaller orthogonal devices would make it easier to manage
> +data, power management and domain-specific interaction with the hardware.

Passive voice and I don't understand what is meant by the word "orthogonal"

"A split of the functionality into child-devices representing
sub-domains of functionality makes it possible to compartmentalize,
layer, and distribute domain-specific concerns via a Linux
device-driver model"

> A key
> +requirement for such a split is that there is no dependency on a physical 
> bus,
> +device, register accesses or regmap support.
> These individual devices split from
> +the core cannot live on the platform bus as they are not physical devices 
> that
> +are controlled by DT/ACPI. The same argument applies for not using MFD in 
> this
> +scenario as MFD relies on individual function devices being physical devices.

These last few sentences are confusing the justification of the
existence of auxiliary bus, not the explanation of when why and how to
use it.

The "When Should the Auxiliary Bus Be Used" should mention where
Auxiliary bus fits relative to the platform-bus and MFD, perhaps in an
explicit "When not to use Auxiliary Bus" section to direct people to
platform-bus and MFD when those are a better fit.

> +
> +An example for this kind of requirement is the audio subsystem where a single
> +IP is handling multiple entities such as HDMI, Soundwire, local devices such 
> as
> +mics/speakers etc. The split for the core's functionality can be arbitrary or
> +be defined by the DSP firmware topology and include hooks for test/debug. 
> This
> +allows for the audio core device to be minimal and focused on 
> hardware-specific
> +control and communication.
> +
> +The auxiliary bus is intended to be minimal, generic and avoid 
> domain-specific
> +assumptions.

As this file will be sitting in Documentation/ it will be too late to
"i

<    1   2   3   4   5   6   7   8   9   10   >