[PATCH] powerpc/mm/book3s64/hash: Update 4k PAGE_SIZE kernel mapping

2019-10-15 Thread Aneesh Kumar K.V
With commit 0034d395f89d ("powerpc/mm/hash64: Map all the kernel
regions in the same 0xc range"), the kernel now splits the 64TB address range
into four contexts of 16TB each. That implies we can only do a 16TB linear
mapping, which results in boot failures on some P9 systems.

Fix this by redoing the hash 4k mapping as below.

 vmalloc start = 0xd000000000000000
 IO start      = 0xd000380000000000
 vmemmap start = 0xf000000000000000

The vmalloc area is now 56TB and the IO remap area 8TB. We need to keep them
under the same top-nibble address because we map both of them in the Linux page
table and they share the init_mm page table. We need a large vmalloc space
because we use the percpu embedded first chunk allocator.

Both the linear and vmemmap ranges are 64TB each, mapped using the 0xc and 0xf
top nibbles respectively.
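
The arithmetic is easy to check in isolation. Here is a minimal standalone C
sketch (not kernel code; the constants are taken from the defines in the
patch below):

#include <assert.h>
#include <stdio.h>

#define TB (1UL << 40)

int main(void)
{
	/* Layout described above: vmalloc and IO share the 0xd region */
	unsigned long vmalloc_start = 0xd000000000000000UL;
	unsigned long vmalloc_size  = 56 * TB;
	unsigned long io_start      = vmalloc_start + vmalloc_size;
	unsigned long io_size       = 8 * TB;
	unsigned long vmemmap_start = 0xf000000000000000UL;

	assert(io_start == 0xd000380000000000UL);   /* IO start quoted above */
	assert(vmalloc_size + io_size == 64 * TB);  /* exactly fills the 0xd range */

	printf("vmalloc %#lx..%#lx, IO %#lx..%#lx, vmemmap %#lx\n",
	       vmalloc_start, io_start, io_start, io_start + io_size,
	       vmemmap_start);
	return 0;
}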

Fixes: 0034d395f89d ("powerpc/mm/hash64: Map all the kernel regions in the same 0xc range")
Reported-by: Cameron Berkenpas 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  | 54 ++--
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 73 -
 arch/powerpc/include/asm/book3s/64/hash.h | 82 ++-
 3 files changed, 123 insertions(+), 86 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 8fd8599c9395..4cbb9fe22d76 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -12,23 +12,59 @@
  * Hence also limit max EA bits to 64TB.
  */
 #define MAX_EA_BITS_PER_CONTEXT	46
-
-#define REGION_SHIFT   (MAX_EA_BITS_PER_CONTEXT - 2)
+/*
+ * For 4k hash, since we are restricted by page table sizing that
+ * limits our address range to 64TB, keep the kernel virtual
+ * mapping in the 0xd region.
+ */
+#define H_KERN_VIRT_START	ASM_CONST(0xd000000000000000)
 
 /*
- * Our page table limit us to 64TB. Hence for the kernel mapping,
- * each MAP area is limited to 16 TB.
- * The four map areas are:  linear mapping, vmap, IO and vmemmap
+ * Top 4 bits are ignored in page table walk.
  */
-#define H_KERN_MAP_SIZE	(ASM_CONST(1) << REGION_SHIFT)
+#define EA_MASK		(~(0xfUL << 60))
 
 /*
- * Define the address range of the kernel non-linear virtual area
- * 16TB
+ * Place vmalloc and IO in the same 64TB range, because we map them via the
+ * Linux page table and the table size is limited to 64TB.
+ */
+#define H_VMALLOC_START		H_KERN_VIRT_START
+/*
+ * 56TB vmalloc size. We require large vmalloc space for percpu mapping.
  */
-#define H_KERN_VIRT_START	ASM_CONST(0xc000100000000000)
+#define H_VMALLOC_SIZE (56UL << 40)
+#define H_VMALLOC_END  (H_VMALLOC_START + H_VMALLOC_SIZE)
+
+#define H_KERN_IO_START		H_VMALLOC_END
+#define H_KERN_IO_SIZE (8UL << 40)
+#define H_KERN_IO_END  (H_KERN_IO_START + H_KERN_IO_SIZE)
+
+#define H_VMEMMAP_START		ASM_CONST(0xf000000000000000)
+#define H_VMEMMAP_SIZE (1UL << MAX_EA_BITS_PER_CONTEXT)
+#define H_VMEMMAP_END  (H_VMEMMAP_START + H_VMEMMAP_SIZE)
 
 #ifndef __ASSEMBLY__
+static inline int get_region_id(unsigned long ea)
+{
+   int id = (ea >> 60UL);
+
+   switch (id) {
+   case 0x0:
+   return USER_REGION_ID;
+   case 0xc:
+   return LINEAR_MAP_REGION_ID;
+   case 0xd:
+   if (ea < H_KERN_IO_START)
+   return VMALLOC_REGION_ID;
+   else
+   return IO_REGION_ID;
+   case 0xf:
+   return VMEMMAP_REGION_ID;
+   default:
+   return INVALID_REGION_ID;
+   }
+}
+
 #define H_PTE_TABLE_SIZE   (sizeof(pte_t) << H_PTE_INDEX_SIZE)
 #define H_PMD_TABLE_SIZE   (sizeof(pmd_t) << H_PMD_INDEX_SIZE)
 #define H_PUD_TABLE_SIZE   (sizeof(pud_t) << H_PUD_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index d1d9177d9ebd..fc44bc590ac8 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -13,18 +13,61 @@
  * is handled in the hotpath.
  */
 #define MAX_EA_BITS_PER_CONTEXT	49
-#define REGION_SHIFT   MAX_EA_BITS_PER_CONTEXT
+
+/*
+ * Define the address range of the kernel non-linear virtual area
+ * 2PB
+ */
+#define H_KERN_VIRT_START	ASM_CONST(0xc008000000000000)
 
 /*
  * We use one context for each MAP area.
  */
+#define REGION_SHIFT   MAX_EA_BITS_PER_CONTEXT
 #define H_KERN_MAP_SIZE	(1UL << MAX_EA_BITS_PER_CONTEXT)
 
 /*
- * Define the address range of the kernel non-linear virtual area
- * 2PB
+ * Top 2 bits are ignored in page table walk.
  */
-#define H_KERN_VIRT_START	ASM_CONST(0xc008000000000000)
+#define EA_MASK		(~(0xcUL << 60))
+
+/*
+ * +--+
+ * | 

Re: [PATCH 03/34] powerpc: Use CONFIG_PREEMPTION

2019-10-15 Thread Christophe Leroy




On 15/10/2019 at 21:17, Sebastian Andrzej Siewior wrote:

From: Thomas Gleixner 

CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Switch the entry code over to use CONFIG_PREEMPTION. Add PREEMPT_RT
output in __die().


powerpc doesn't select ARCH_SUPPORTS_RT, so this change is useless as 
CONFIG_PREEMPT_RT cannot be selected.




Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Thomas Gleixner 
[bigeasy: +traps.c, Kconfig]
Signed-off-by: Sebastian Andrzej Siewior 
---
  arch/powerpc/Kconfig   | 2 +-
  arch/powerpc/kernel/entry_32.S | 4 ++--
  arch/powerpc/kernel/entry_64.S | 4 ++--
  arch/powerpc/kernel/traps.c| 7 ++-
  4 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 3e56c9c2f16ee..8ead8d6e1cbc8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -106,7 +106,7 @@ config LOCKDEP_SUPPORT
  config GENERIC_LOCKBREAK
bool
default y
-   depends on SMP && PREEMPT
+   depends on SMP && PREEMPTION
  
  config GENERIC_HWEIGHT

bool
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index d60908ea37fb9..e1a4c39b83b86 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -897,7 +897,7 @@ user_exc_return:	/* r10 contains MSR_KERNEL here */
bne-0b
  1:
  
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
/* check current_thread_info->preempt_count */
lwz r0,TI_PREEMPT(r2)
cmpwi   0,r0,0  /* if non-zero, just restore regs and return */
@@ -921,7 +921,7 @@ user_exc_return:	/* r10 contains MSR_KERNEL here */
 */
bl  trace_hardirqs_on
  #endif
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
  restore_kuap:
kuap_restore r1, r2, r9, r10, r0
  
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S

index 6467bdab8d405..83733376533e8 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -840,7 +840,7 @@ _GLOBAL(ret_from_except_lite)
bne-0b
  1:
  
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
/* Check if we need to preempt */
andi.   r0,r4,_TIF_NEED_RESCHED
beq+restore
@@ -871,7 +871,7 @@ _GLOBAL(ret_from_except_lite)
li  r10,MSR_RI
mtmsrd  r10,1 /* Update machine state */
  #endif /* CONFIG_PPC_BOOK3E */
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
  
  	.globl	fast_exc_return_irq

  fast_exc_return_irq:
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 82f43535e6867..23d2f20be4f2e 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -252,14 +252,19 @@ NOKPROBE_SYMBOL(oops_end);
  
  static int __die(const char *str, struct pt_regs *regs, long err)

  {
+   const char *pr = "";
+


Please follow the same approach as the existing code; don't add a local
variable for that.



printk("Oops: %s, sig: %ld [#%d]\n", str, err, ++die_counter);
  
+	if (IS_ENABLED(CONFIG_PREEMPTION))

+   pr = IS_ENABLED(CONFIG_PREEMPT_RT) ? " PREEMPT_RT" : " PREEMPT";
+


drop


printk("%s PAGE_SIZE=%luK%s%s%s%s%s%s%s %s\n",


Add one %s


   IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN) ? "LE" : "BE",
   PAGE_SIZE / 1024,
   early_radix_enabled() ? " MMU=Radix" : "",
   early_mmu_has_feature(MMU_FTR_HPTE_TABLE) ? " MMU=Hash" : "",
-  IS_ENABLED(CONFIG_PREEMPT) ? " PREEMPT" : "",


Replace by: IS_ENABLED(CONFIG_PREEMPTION) ? " PREEMPT" : ""


+  pr,


add something like: IS_ENABLED(CONFIG_PREEMPT_RT) ? "_RT" : ""


   IS_ENABLED(CONFIG_SMP) ? " SMP" : "",
   IS_ENABLED(CONFIG_SMP) ? (" NR_CPUS=" __stringify(NR_CPUS)) : "",
   debug_pagealloc_enabled() ? " DEBUG_PAGEALLOC" : "",
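

Taken together, the review comments above amount to something like the
following (a sketch of what is being asked for, not a tested patch; the
trailing NUMA/ppc_md.name arguments are kept as in the existing function):

static int __die(const char *str, struct pt_regs *regs, long err)
{
	printk("Oops: %s, sig: %ld [#%d]\n", str, err, ++die_counter);

	/* one extra %s in the format string, no local variable */
	printk("%s PAGE_SIZE=%luK%s%s%s%s%s%s%s%s %s\n",
	       IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN) ? "LE" : "BE",
	       PAGE_SIZE / 1024,
	       early_radix_enabled() ? " MMU=Radix" : "",
	       early_mmu_has_feature(MMU_FTR_HPTE_TABLE) ? " MMU=Hash" : "",
	       IS_ENABLED(CONFIG_PREEMPTION) ? " PREEMPT" : "",
	       IS_ENABLED(CONFIG_PREEMPT_RT) ? "_RT" : "",
	       IS_ENABLED(CONFIG_SMP) ? " SMP" : "",
	       IS_ENABLED(CONFIG_SMP) ? (" NR_CPUS=" __stringify(NR_CPUS)) : "",
	       debug_pagealloc_enabled() ? " DEBUG_PAGEALLOC" : "",
	       IS_ENABLED(CONFIG_NUMA) ? " NUMA" : "",
	       ppc_md.name ? ppc_md.name : "");

	/* remainder of __die() unchanged */
}

Since PREEMPT_RT selects PREEMPTION, the two IS_ENABLED() checks compose into
"", " PREEMPT", or " PREEMPT_RT" without any local variable.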



Christophe


Re: [PATCH] powerpc/eeh: Only dump stack once if an MMIO loop is detected

2019-10-15 Thread Sam Bobroff
On Wed, Oct 16, 2019 at 12:25:36PM +1100, Oliver O'Halloran wrote:
> Many drivers don't check for errors when they get a 0xFF response from an
> MMIO load. As a result, after an EEH event occurs, a driver can get stuck in
> a polling loop unless it has some kind of internal timeout logic.
> 
> Currently EEH tries to detect and report stuck drivers by dumping a stack
> trace after eeh_dev_check_failure() is called EEH_MAX_FAILS times on an
> already frozen PE. The value of EEH_MAX_FAILS was chosen so that a dump
> would occur every few seconds if the driver was spinning in a loop. This
> results in a lot of spurious stack traces in the kernel log.
> 
> Fix this by limiting it to printing one stack trace for each PE freeze. If
> the driver is truly stuck, the kernel's hung task detector is better suited
> to reporting the problem anyway.
> 
> Cc: Sam Bobroff 
> Signed-off-by: Oliver O'Halloran 

Looks good to me (especially because if it's stuck in a loop the stack
trace is going to be pretty much the same every time). I tested it by
recovering a device that uses the mlx5_core driver.

Reviewed-by: Sam Bobroff 
Tested-by: Sam Bobroff 
> ---
>  arch/powerpc/kernel/eeh.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index bc8a551013be..c35069294ecf 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -503,7 +503,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>   rc = 1;
>   if (pe->state & EEH_PE_ISOLATED) {
>   pe->check_count++;
> - if (pe->check_count % EEH_MAX_FAILS == 0) {
> + if (pe->check_count == EEH_MAX_FAILS) {
>   dn = pci_device_to_OF_node(dev);
>   if (dn)
>   location = of_get_property(dn, "ibm,loc-code",
> -- 
> 2.21.0
> 


signature.asc
Description: PGP signature


[PATCH] powerpc/eeh: Only dump stack once if an MMIO loop is detected

2019-10-15 Thread Oliver O'Halloran
Many drivers don't check for errors when they get a 0xFF response from an
MMIO load. As a result, after an EEH event occurs, a driver can get stuck in
a polling loop unless it has some kind of internal timeout logic.

Currently EEH tries to detect and report stuck drivers by dumping a stack
trace after eeh_dev_check_failure() is called EEH_MAX_FAILS times on an
already frozen PE. The value of EEH_MAX_FAILS was chosen so that a dump
would occur every few seconds if the driver was spinning in a loop. This
results in a lot of spurious stack traces in the kernel log.

Fix this by limiting it to printing one stack trace for each PE freeze. If
the driver is truly stuck, the kernel's hung task detector is better suited
to reporting the problem anyway.
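
A quick userspace sketch illustrates the behavioural change (EEH_MAX_FAILS is
set to 5 here purely for illustration; the real value differs):

#include <stdio.h>

#define EEH_MAX_FAILS 5

int main(void)
{
	int old_dumps = 0, new_dumps = 0;

	/* simulate a stuck driver polling a frozen PE 20 times */
	for (int check_count = 1; check_count <= 20; check_count++) {
		if (check_count % EEH_MAX_FAILS == 0)
			old_dumps++;	/* fires every EEH_MAX_FAILS checks */
		if (check_count == EEH_MAX_FAILS)
			new_dumps++;	/* fires exactly once per PE freeze */
	}
	printf("old condition: %d dumps, new condition: %d dump\n",
	       old_dumps, new_dumps);	/* old: 4 dumps, new: 1 dump */
	return 0;
}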

Cc: Sam Bobroff 
Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/kernel/eeh.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index bc8a551013be..c35069294ecf 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -503,7 +503,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
rc = 1;
if (pe->state & EEH_PE_ISOLATED) {
pe->check_count++;
-   if (pe->check_count % EEH_MAX_FAILS == 0) {
+   if (pe->check_count == EEH_MAX_FAILS) {
dn = pci_device_to_OF_node(dev);
if (dn)
location = of_get_property(dn, "ibm,loc-code",
-- 
2.21.0



Re: [PATCH v5 03/23] PCI: hotplug: Add a flag for the movable BARs feature

2019-10-15 Thread Bjorn Helgaas
On Mon, Sep 30, 2019 at 03:59:25PM +0300, Sergey Miroshnichenko wrote:
> Hello Bjorn,
> 
> On 9/28/19 1:02 AM, Bjorn Helgaas wrote:
> > On Fri, Aug 16, 2019 at 07:50:41PM +0300, Sergey Miroshnichenko wrote:
> > > When hot-adding a device, the bridge may have windows not big enough (or
> > > fragmented too much) for newly requested BARs to fit in. And expanding
> > > these bridge windows may be impossible because blocked by "neighboring"
> > > BARs and bridge windows.
> > > 
> > > Still, it may be possible to allocate a memory region for new BARs with
> > > the following procedure:
> > > 
> > > 1) notify all the drivers which support movable BARs to pause and release
> > > the BARs; the rest of the drivers are guaranteed that their devices
> > > will not get BARs moved;
> > > 
> > > 2) release all the bridge windows except of root bridges;
> > > 
> > > 3) try to recalculate new bridge windows that will fit all the BAR types:
> > > - fixed;
> > > - immovable;
> > > - movable;
> > > - newly requested by hot-added devices;
> > > 
> > > 4) if the previous step fails, disable BARs for one of the hot-added
> > > devices and retry from step 3;
> > > 
> > > 5) notify the drivers, so they remap BARs and resume.
> > 
> > You don't do the actual recalculation in *this* patch, but since you
> > mention the procedure here, are we confident that we never make things
> > worse?
> > 
> > It's possible that a hot-add will trigger this attempt to move things
> > around, and it's possible that we won't find space for the new device
> > even if we move things around.  But are we certain that every device
> > that worked *before* the hot-add will still work *afterwards*?
> > 
> > Much of the assignment was probably done by the BIOS using different
> > algorithms than Linux has, so I think there's some chance that the
> > BIOS did a better job and if we lose that BIOS assignment, we might
> > not be able to recreate it.
> 
> If the hardware has some special constraints on BAR assignment that the
> kernel is not yet aware of, movable BARs may break things after a
> hotplug event. So the feature must be disabled there (manually) until
> the kernel gets support for those special needs.

I'm not talking about special constraints on BAR assignment.  (I'm not
sure what those constraints would be -- AFAIK the constraints for a
spec-compliant device are all discoverable via the BAR size and type
(or the Enhanced Allocation capability)).

What I'm concerned about is the case where we boot with a working
assignment, we hot-add a device, we move things around to try to
accommodate the new device, and not only do we fail to find resources
for the new device, we also fail to find a working assignment for the
devices that were present at boot.  We've moved things around from
what BIOS did, and since we use a different algorithm than the BIOS,
there's no guarantee that we'll be able to find the assignment BIOS
did.

> > I'm not sure why the PCI_CLASS_DISPLAY_VGA special case is there; can
> > you add a comment about why that's needed?  Obviously we can't move
> > the 0xa legacy frame buffer because I think devices are allowed to
> > claim that region even if no BAR describes it.  But I would think
> > *other* BARs of VGA devices could be movable.
> 
> Sure, I'll add a comment to the code.
> 
> The issue that we are avoiding by that is the "nomodeset" command line
> argument, which prevents a video driver from being bound, so the BARs
> seem to be in use but can't be moved; otherwise machines just hang
> after hotplug events. That was the only special ugly case we've
> spotted during testing. I'll check if it will be enough to just work
> around the 0xa range.

"nomodeset" is not really documented and is a funny way to say "don't
bind video drivers that know about it", but OK.  Thanks for checking
on the other BARs.

> > > +bool pci_movable_bars_enabled(void);
> > 
> > I would really like it if this were simply
> > 
> >extern bool pci_no_movable_bars;
> > 
> > in drivers/pci/pci.h.  It would default to false since it's
> > uninitialized, and "pci=no_movable_bars" would set it to true.
> 
> I have a premonition of platforms that will not support the feature.
> Wouldn't be better to put this variable-flag to include/linux/pci.h ,
> so code in arch/* can set it, so they could work by default, without
> the command line argument?

In general I don't see why a platform wouldn't support this since
there really isn't anything platform-specific here.  But if a platform
does need to disable it, having arch code set this flag sounds
reasonable.  We shouldn't make it globally visible until we actually
need that, though.

> > We have similar "=off" and "=force" parameters for ASPM and other
> > things, and it makes the code really hard to analyze.

The "=off" and "=force" things are the biggest things I'd like to
avoid.

Bjorn
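
For context, the five-step procedure quoted at the top of this message can be
summarized as a pseudo-C sketch (every identifier below is illustrative only;
the series' real API differs):

/*
 * Rebalance bridge windows and BARs after a hot-add, per steps 1-5
 * of the patch description.
 */
static int rebalance_bars_for_hotplug(struct pci_bus *root)
{
	pause_drivers_and_release_movable_bars(root);	/* step 1 */
	release_non_root_bridge_windows(root);		/* step 2 */

	while (recalculate_bridge_windows(root) < 0) {	/* step 3 */
		/* step 4: sacrifice one hot-added device, then retry */
		if (!disable_one_hotadded_device(root))
			return -ENOSPC;
	}

	notify_drivers_remap_and_resume(root);		/* step 5 */
	return 0;
}

The retry loop is the part Bjorn's concern centers on: nothing in steps 3-4
guarantees the recalculated assignment is at least as good as the one BIOS
produced at boot.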


[PATCH 03/34] powerpc: Use CONFIG_PREEMPTION

2019-10-15 Thread Sebastian Andrzej Siewior
From: Thomas Gleixner 

CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Switch the entry code over to use CONFIG_PREEMPTION. Add PREEMPT_RT
output in __die().

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Thomas Gleixner 
[bigeasy: +traps.c, Kconfig]
Signed-off-by: Sebastian Andrzej Siewior 
---
 arch/powerpc/Kconfig   | 2 +-
 arch/powerpc/kernel/entry_32.S | 4 ++--
 arch/powerpc/kernel/entry_64.S | 4 ++--
 arch/powerpc/kernel/traps.c| 7 ++-
 4 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 3e56c9c2f16ee..8ead8d6e1cbc8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -106,7 +106,7 @@ config LOCKDEP_SUPPORT
 config GENERIC_LOCKBREAK
bool
default y
-   depends on SMP && PREEMPT
+   depends on SMP && PREEMPTION
 
 config GENERIC_HWEIGHT
bool
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index d60908ea37fb9..e1a4c39b83b86 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -897,7 +897,7 @@ user_exc_return:	/* r10 contains MSR_KERNEL here */
bne-0b
 1:
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
/* check current_thread_info->preempt_count */
lwz r0,TI_PREEMPT(r2)
cmpwi   0,r0,0  /* if non-zero, just restore regs and return */
@@ -921,7 +921,7 @@ user_exc_return:	/* r10 contains MSR_KERNEL here */
 */
bl  trace_hardirqs_on
 #endif
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
 restore_kuap:
kuap_restore r1, r2, r9, r10, r0
 
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 6467bdab8d405..83733376533e8 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -840,7 +840,7 @@ _GLOBAL(ret_from_except_lite)
bne-0b
 1:
 
-#ifdef CONFIG_PREEMPT
+#ifdef CONFIG_PREEMPTION
/* Check if we need to preempt */
andi.   r0,r4,_TIF_NEED_RESCHED
beq+restore
@@ -871,7 +871,7 @@ _GLOBAL(ret_from_except_lite)
li  r10,MSR_RI
mtmsrd  r10,1 /* Update machine state */
 #endif /* CONFIG_PPC_BOOK3E */
-#endif /* CONFIG_PREEMPT */
+#endif /* CONFIG_PREEMPTION */
 
.globl  fast_exc_return_irq
 fast_exc_return_irq:
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 82f43535e6867..23d2f20be4f2e 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -252,14 +252,19 @@ NOKPROBE_SYMBOL(oops_end);
 
 static int __die(const char *str, struct pt_regs *regs, long err)
 {
+   const char *pr = "";
+
printk("Oops: %s, sig: %ld [#%d]\n", str, err, ++die_counter);
 
+   if (IS_ENABLED(CONFIG_PREEMPTION))
+   pr = IS_ENABLED(CONFIG_PREEMPT_RT) ? " PREEMPT_RT" : " PREEMPT";
+
printk("%s PAGE_SIZE=%luK%s%s%s%s%s%s%s %s\n",
   IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN) ? "LE" : "BE",
   PAGE_SIZE / 1024,
   early_radix_enabled() ? " MMU=Radix" : "",
   early_mmu_has_feature(MMU_FTR_HPTE_TABLE) ? " MMU=Hash" : "",
-  IS_ENABLED(CONFIG_PREEMPT) ? " PREEMPT" : "",
+  pr,
   IS_ENABLED(CONFIG_SMP) ? " SMP" : "",
   IS_ENABLED(CONFIG_SMP) ? (" NR_CPUS=" __stringify(NR_CPUS)) : "",
   debug_pagealloc_enabled() ? " DEBUG_PAGEALLOC" : "",
-- 
2.23.0



Re: [EXTERNAL] [RFC PATCH] powernv/eeh: Fix oops when probing cxl devices

2019-10-15 Thread Frederic Barrat




On 15/10/2019 at 07:42, Sam Bobroff wrote:

On Fri, Sep 27, 2019 at 02:45:10PM +0200, Frederic Barrat wrote:

Recent cleanup in the way EEH support is added to a device causes a
kernel oops when the cxl driver probes a device and creates virtual
devices discovered on the FPGA:

 BUG: Kernel NULL pointer dereference at 0x00a0
 Faulting instruction address: 0xc0048070
 Oops: Kernel access of bad area, sig: 7 [#1]
 ...
 NIP [c0048070] eeh_add_device_late.part.9+0x50/0x1e0
 LR [c004805c] eeh_add_device_late.part.9+0x3c/0x1e0
 Call Trace:
 [c000200e43983900] [c079e250] _dev_info+0x5c/0x6c (unreliable)
 [c000200e43983980] [c00d1ad0] pnv_pcibios_bus_add_device+0x60/0xb0
 [c000200e439839f0] [c00606d0] pcibios_bus_add_device+0x40/0x60
 [c000200e43983a10] [c06aa3a0] pci_bus_add_device+0x30/0x100
 [c000200e43983a80] [c06aa4d4] pci_bus_add_devices+0x64/0xd0
 [c000200e43983ac0] [c0081c429118] cxl_pci_vphb_add+0xe0/0x130 [cxl]
 [c000200e43983b00] [c0081c4242ac] cxl_probe+0x504/0x5b0 [cxl]
 [c000200e43983bb0] [c06bba1c] local_pci_probe+0x6c/0x110
 [c000200e43983c30] [c0159278] work_for_cpu_fn+0x38/0x60

The root cause is that those cxl virtual devices don't have a
representation in the device tree and therefore no associated pci_dn
structure. In eeh_add_device_late(), pdn is NULL, so edev is NULL and
we oops.

We never had explicit support for EEH for those virtual
devices. Instead, EEH events are reported to the (real) pci device and
handled by the cxl driver. Which can then forward to the virtual
devices and handle dependencies. The fact that we try adding EEH
support for the virtual devices is new and a side-effect of the recent
cleanup.

This patch fixes it by skipping adding EEH support on powernv for
devices which don't have a pci_dn structure.

Fixes: b905f8cdca77 ("powerpc/eeh: EEH for pSeries hot plug")
Signed-off-by: Frederic Barrat 
---

Sending as an RFC, as I'm afraid of hiding potential issues and would
be interested in comments. The powernv eeh code expects a struct
pci_dn, so the fix seems safe. I'm wondering if there could be cases
(other than capi virtual devices) where we'd want to blow up and fix
instead of going undetected with this patch.


Looks good to me.

I do think it would be good to detect a missing pci_dn (WARN_ONCE()
might be appropriate).  However to implement it,
pnv_pcibios_bus_add_device() would need a way to detect that a struct
pci_dev is a cxl virtual device. I don't see an easy way to do that; do
you know if it's possible?



I think I found a solution. There's a cxl_pci_is_vphb_device() which is 
fairly cheap and would do the job. Sorry, I didn't think about it at first.




One last thing: pseries_pcibios_bus_add_device() also requires a pci_dn.
Do you know if it's possible to trigger a similar issue there, or is it
not possible for some reason?



I don't think anybody is using capi in an LPAR, but it should theoretically
be possible to hit it. I'll dig some more tomorrow and adjust the patch
when resubmitting.


Thanks!

  Fred




Cheers,
Sam.


  arch/powerpc/platforms/powernv/eeh-powernv.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 6bc24a47e9ef..6f300ab7f0e9 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -42,7 +42,7 @@ void pnv_pcibios_bus_add_device(struct pci_dev *pdev)
  {
struct pci_dn *pdn = pci_get_pdn(pdev);
  
-	if (eeh_has_flag(EEH_FORCE_DISABLED))
+	if (!pdn || eeh_has_flag(EEH_FORCE_DISABLED))
return;
  
	dev_dbg(&pdev->dev, "EEH: Setting up device\n");

--
2.21.0
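
Combining the fix above with the WARN_ONCE idea discussed in this thread, a
resubmission might look roughly like this (a sketch only, using the
cxl_pci_is_vphb_device() helper Frederic mentions):

void pnv_pcibios_bus_add_device(struct pci_dev *pdev)
{
	struct pci_dn *pdn = pci_get_pdn(pdev);

	if (!pdn) {
		/* cxl virtual devices legitimately have no pci_dn */
		WARN_ONCE(!cxl_pci_is_vphb_device(pdev),
			  "EEH: no pci_dn for device %s\n", pci_name(pdev));
		return;
	}
	if (eeh_has_flag(EEH_FORCE_DISABLED))
		return;

	dev_dbg(&pdev->dev, "EEH: Setting up device\n");
	/* ... rest of the function unchanged ... */
}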





Re: [PATCH V6 0/2] mm/debug: Add tests validating architecture page table helpers

2019-10-15 Thread Qian Cai
On Tue, 2019-10-15 at 20:51 +0530, Anshuman Khandual wrote:
> 
> On 10/15/2019 08:11 PM, Qian Cai wrote:
> > The x86 will crash with linux-next during boot due to this series (v5) with
> > the config below plus CONFIG_DEBUG_VM_PGTABLE=y. I am not sure if v6 would
> > address it.
> > 
> > https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
> > 
> > [   33.862600][T1] page:ea000900 is uninitialized and poisoned
> > [   33.862608][T1] raw:   
> > 
> > ff871140][T1]  ? _raw_spin_unlock_irq+0x27/0x40
> > [   33.871140][T1]  ? rest_init+0x307/0x307
> > [   33.871140][T1]  kernel_init+0x11/0x139
> > [   33.871140][T1]  ? rest_init+0x307/0x307
> > [   33.871140][T1]  ret_from_fork+0x27/0x50
> > [   33.871140][T1] Modules linked in:
> > [   33.871140][T1] ---[ end trace e99d392b0f7befbd ]---
> > [   33.871140][T1] RIP: 0010:alloc_gigantic_page_order+0x3fe/0x490
> 
> Hmm, with defconfig (DEBUG_VM=y and DEBUG_VM_PGTABLE=y) it does not crash but
> with the config above, it does. Just wondering if it is possible that these
> pages might not have been initialized yet because DEFERRED_STRUCT_PAGE_INIT=y?

Yes, this patch works fine.

diff --git a/init/main.c b/init/main.c
index 676d8020dd29..591be8f9e8e0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1177,7 +1177,6 @@ static noinline void __init kernel_init_freeable(void)
workqueue_init();
 
init_mm_internals();
-   debug_vm_pgtable();
 
do_pre_smp_initcalls();
lockup_detector_init();
@@ -1186,6 +1185,8 @@ static noinline void __init kernel_init_freeable(void)
sched_init_smp();
 
page_alloc_init_late();
+   debug_vm_pgtable();
+
/* Initialize page ext after all struct pages are initialized. */
page_ext_init();

> 
> [   13.898549][T1] page:ea000500 is uninitialized and poisoned
> [   13.898549][T1] raw:   
>  
> [   13.898549][T1] raw:   
>  
> [   13.898549][T1] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> [   13.898549][T1] [ cut here ]
> [   13.898549][T1] kernel BUG at ./include/linux/mm.h:1107!
> [   13.898549][T1] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
> [   13.898549][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 5.4.0-rc3-next-20191015+ #


Re: [PATCH V6 2/2] mm/debug: Add tests validating architecture page table helpers

2019-10-15 Thread Qian Cai
On Tue, 2019-10-15 at 14:51 +0530, Anshuman Khandual wrote:
> +static unsigned long __init get_random_vaddr(void)
> +{
> + unsigned long random_vaddr, random_pages, total_user_pages;
> +
> + total_user_pages = (TASK_SIZE - FIRST_USER_ADDRESS) / PAGE_SIZE;
> +
> + random_pages = get_random_long() % total_user_pages;
> + random_vaddr = FIRST_USER_ADDRESS + random_pages * PAGE_SIZE;
> +
> + WARN_ON(random_vaddr > TASK_SIZE);
> + WARN_ON(random_vaddr < FIRST_USER_ADDRESS);

It would be nice if this patch did not introduce a new W=1 GCC warning here on
x86: because FIRST_USER_ADDRESS is 0, GCC thinks the code is dumb, since
"random_vaddr" is unsigned:

In file included from ./arch/x86/include/asm/bug.h:83,
 from ./include/linux/bug.h:5,
 from ./include/linux/mmdebug.h:5,
 from ./include/linux/gfp.h:5,
 from mm/debug_vm_pgtable.c:13:
mm/debug_vm_pgtable.c: In function ‘get_random_vaddr’:
mm/debug_vm_pgtable.c:359:23: warning: comparison of unsigned expression < 0 is
always false [-Wtype-limits]
  WARN_ON(random_vaddr < FIRST_USER_ADDRESS);
   ^
./include/asm-generic/bug.h:113:25: note: in definition of macro ‘WARN_ON’
  int __ret_warn_on = !!(condition);\
 ^
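
One way to silence -Wtype-limits without losing the sanity check is to fold
both bounds into a single unsigned comparison; a sketch of such a rework of
the same function (an assumption on my part, not what the series does):

static unsigned long __init get_random_vaddr(void)
{
	unsigned long random_vaddr, random_pages, total_user_pages;

	total_user_pages = (TASK_SIZE - FIRST_USER_ADDRESS) / PAGE_SIZE;

	random_pages = get_random_long() % total_user_pages;
	random_vaddr = FIRST_USER_ADDRESS + random_pages * PAGE_SIZE;

	/*
	 * A single unsigned comparison covers both bounds: if random_vaddr
	 * were below FIRST_USER_ADDRESS, the subtraction would wrap and
	 * exceed the range. No "< 0" comparison is needed, so the
	 * -Wtype-limits warning goes away even when FIRST_USER_ADDRESS
	 * is 0.
	 */
	WARN_ON(random_vaddr - FIRST_USER_ADDRESS >=
		TASK_SIZE - FIRST_USER_ADDRESS);
	return random_vaddr;
}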


Re: [PATCH RFC 0/5] ARM: Raspberry Pi 4 DMA support

2019-10-15 Thread Catalin Marinas
On Tue, Oct 15, 2019 at 09:48:22AM +0200, Nicolas Saenz Julienne wrote:
> A little off topic but I was wondering if you have a preferred way to refer to
> the arm architecture in a way that unambiguously excludes arm64 (for
> example, arm32 would work).

arm32 should be fine. Neither arm64 nor arm32 are officially endorsed
ARM Ltd names (officially the exception model is AArch32 while the
instruction set is one of A32/T32/T16).

-- 
Catalin


Re: [PATCH v2 01/29] powerpc: Rename "notes" PT_NOTE to "note"

2019-10-15 Thread Kees Cook
On Tue, Oct 15, 2019 at 06:54:13PM +0200, Borislav Petkov wrote:
> On Fri, Oct 11, 2019 at 11:25:52AM -0500, Segher Boessenkool wrote:
> > Names *matter*, internal names doubly so.  So why replace a good name with
> > a worse name?  Because it is slightly less work for you?
> 
> So if we agree on the name "notes" and we decide to rename the other
> arches, this should all be done in a separate patchset anyway, and ontop
> of this one. And I believe Kees wouldn't mind doing it ontop since he's
> gotten his hands dirty already. :-P

Yeah, I'm fine with that. I would prefer to do it as a separate step,
just to minimize the logical steps each patch takes. Shall I spin a v3
with the Acks added and a final rename for this?

-- 
Kees Cook


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-15 Thread Greg KH
On Tue, Oct 15, 2019 at 06:40:29PM +0800, Yunsheng Lin wrote:
> On 2019/10/14 17:25, Greg KH wrote:
> > On Mon, Oct 14, 2019 at 04:00:46PM +0800, Yunsheng Lin wrote:
> >> On 2019/10/12 18:47, Greg KH wrote:
> >>> On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
>  On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
> > On 2019/10/12 15:40, Greg KH wrote:
> >> On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> >>> add pci and acpi maintainer
> >>> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
> >>>
> >>> On 2019/10/11 19:15, Peter Zijlstra wrote:
>  On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> > But I failed to see why the above is related to making
> > node_to_cpumask_map() NUMA_NO_NODE aware?
> 
>  Your initial bug is for hns3, which is a PCI device, which really _MUST_
>  have a node assigned.
> 
>  It not having one, is a straight up bug. We must not silently accept
>  NO_NODE there, ever.
> 
> >>>
> >>> I suppose you mean reporting a lack of affinity when the node of a pcie
> >>> device is not set by "not silently accept NO_NODE".
> >>
> >> If the firmware of a pci device does not provide the node information,
> >> then yes, warn about that.
> >>
> >>> As Greg has asked about in [1]:
> >>> what is a user to do when the user sees the kernel reporting that?
> >>>
> >>> We may tell user to contact their vendor for info or updates about
> >>> that when they do not know about their system well enough, but their
> >>> vendor may get away with this by quoting ACPI spec as the spec
> >>> considering this optional. Should the user believe this is indeed a
> >>> fw bug or a misreport from the kernel?
> >>
> >> Say it is a firmware bug, if it is a firmware bug, that's simple.
> >>
> >>> If this kind of reporting is common pratice and will not cause any
> >>> misunderstanding, then maybe we can report that.
> >>
> >> Yes, please do so, that's the only way those boxes are ever going to get
> >> fixed.  And go add the test to the "firmware testing" tool that is based
> >> on Linux that Intel has somewhere, to give vendors a chance to fix this
> >> before they ship hardware.
> >>
> >> This shouldn't be a big deal, we warn of other hardware bugs all the
> >> time.
> >
> > Ok, thanks for clarifying.
> >
> > Will send a patch to catch the case when a pcie device without numa node
> > being set and warn about it.
> >
> > Maybe use dev->bus to verify if it is a pci device?
> 
>  No, do that in the pci bus core code itself, when creating the devices
>  as that is when you know, or do not know, the numa node, right?
> 
>  This can't be in the driver core only, as each bus type will have a
>  different way of determining what the node the device is on.  For some
>  reason, I thought the PCI core code already does this, right?
> >>>
> >>> Yes, pci_irq_get_node(), which NO ONE CALLS!  I should go delete that
> >>> thing...
> >>>
> >>> Anyway, it looks like the pci core code does call set_dev_node() based
> >>> on the PCI bridge, so if that is set up properly, all should be fine.
> >>>
> >>> If not, well, you have buggy firmware and you need to warn about that at
> >>> the time you are creating the bridge.  Look at the call to
> >>> pcibus_to_node() in pci_register_host_bridge().
> >>
> >> Thanks for pointing out the specific function.
> >> Maybe we do not need to warn about the case when the device has a parent,
> >> because we must have warned about the parent if the device has a parent
> >> and the parent also has a node of NO_NODE, so do not need to warn the child
> >> device anymore? like blew:
> >>
> >> @@ -932,6 +932,10 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
> >> 	list_add_tail(&bus->node, &pci_root_buses);
> >> 	up_write(&pci_bus_sem);
> >>
> >> +   if (nr_node_ids > 1 && !parent &&
> > 
> > Why do you need to check this?  If you have a parent, it's your node
> > should be set, if not, that's an error, right?
> 
> If the device has parent and the parent device also has a node of
> NUMA_NO_NODE, then maybe we have warned about the parent device, so
> we do not have to warn about the child device?

But it's a PCI bridge, if it is not set properly, that needs to be fixed
otherwise the PCI devices attached to it have no hope of working
properly.

> In pci_register_host_bridge():
> 
>   if (!parent)
>   set_dev_node(bus->bridge, pcibus_to_node(bus));
> 
> The above only set the node of the bridge device to the node of bus if
> the bridge device does not have a parent.

Odd, what happens to devices behind another bridge today?  Are their
nodes set properly today?  Is the node 
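
For reference, a hypothetical completion of the truncated hunk quoted earlier
in this thread (the condition and message are illustrative only, not a
submitted patch):

	/* In pci_register_host_bridge(): warn once per root bridge whose
	 * NUMA node is unknown on a multi-node system.
	 */
	if (nr_node_ids > 1 && !parent &&
	    dev_to_node(bus->bridge) == NUMA_NO_NODE)
		dev_warn(bus->bridge,
			 "No NUMA node set for this PCI host bridge; please contact your vendor\n");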

Re: [PATCH v2 01/29] powerpc: Rename "notes" PT_NOTE to "note"

2019-10-15 Thread Borislav Petkov
On Fri, Oct 11, 2019 at 11:25:52AM -0500, Segher Boessenkool wrote:
> Names *matter*, internal names doubly so.  So why replace a good name with
> a worse name?  Because it is slightly less work for you?

So if we agree on the name "notes" and we decide to rename the other
arches, this should all be done in a separate patchset anyway, and ontop
of this one. And I believe Kees wouldn't mind doing it ontop since he's
gotten his hands dirty already. :-P

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-15 Thread Darrick J. Wong
On Tue, Oct 15, 2019 at 09:10:04AM -0400, Theodore Y. Ts'o wrote:
> On Tue, Oct 15, 2019 at 01:01:02AM -0700, Christoph Hellwig wrote:
> > On Mon, Oct 14, 2019 at 03:09:48PM -0500, Eric Sandeen wrote:
> > > We're in agreement here.  ;)  I only worry about implementing things like
> > > this which sound like guarantees, but aren't, and end up encouraging bad
> > > behavior or promoting misconceptions.
> > > 
> > > More and more, I think we should reconsider Darrick's "bootfs" (ext2 by
> > > another name, but with extra-sync-iness) proposal...
> > 
> > Having a separate simple file system for the boot loader makes a lot of
> > sense.  Note that vfat of EFI is the best choice, but at least it is
> > something.  SysV Unix from the 90s actually had a special file system just
> > for that, and fs/bfs/ in Linux supports that.  So this isn't really a new
> > thing either.
> 
> Did you mean to say "vfat of EFI isn't the best choice"?
> 
> If we were going to be doing something like "bootfs", what sort of
> semantics would be sufficient?  Is doing an implied fsync() on every
> close(2) enough, or do we need to do something even more conservative?

I'm assuming you'd also want to make sure the journal checkpoints as
part of fsync, right?  Aside from being an April Fools joke, bootfs[1]
does implement the semantics I needed to fix all the complaining about
grub being broken. 8-)

Granted there's also the systemd bootloader spec[2] which says
FAT{16,32}...

[1] https://lore.kernel.org/linux-fsdevel/20190401070001.GJ1173@magnolia/
[2] https://systemd.io/BOOT_LOADER_SPECIFICATION.html

--D

>- Ted


Re: [PATCH V6 0/2] mm/debug: Add tests validating architecture page table helpers

2019-10-15 Thread Anshuman Khandual



On 10/15/2019 08:11 PM, Qian Cai wrote:
> The x86 will crash with linux-next during boot due to this series (v5) with
> the config below plus CONFIG_DEBUG_VM_PGTABLE=y. I am not sure if v6 would
> address it.
> 
> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
> 
> [   33.862600][T1] page:ea000900 is uninitialized and poisoned
> [   33.862608][T1] raw:   
> ff871140][T1]  ? _raw_spin_unlock_irq+0x27/0x40
> [   33.871140][T1]  ? rest_init+0x307/0x307
> [   33.871140][T1]  kernel_init+0x11/0x139
> [   33.871140][T1]  ? rest_init+0x307/0x307
> [   33.871140][T1]  ret_from_fork+0x27/0x50
> [   33.871140][T1] Modules linked in:
> [   33.871140][T1] ---[ end trace e99d392b0f7befbd ]---
> [   33.871140][T1] RIP: 0010:alloc_gigantic_page_order+0x3fe/0x490

Hmm, with defconfig (DEBUG_VM=y and DEBUG_VM_PGTABLE=y) it does not crash but
with the config above, it does. Just wondering if it is possible that these
pages might not have been initialized yet because DEFERRED_STRUCT_PAGE_INIT=y?

[   13.898549][T1] page:ea000500 is uninitialized and poisoned
[   13.898549][T1] raw:    

[   13.898549][T1] raw:    

[   13.898549][T1] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[   13.898549][T1] [ cut here ]
[   13.898549][T1] kernel BUG at ./include/linux/mm.h:1107!
[   13.898549][T1] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[   13.898549][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
5.4.0-rc3-next-20191015+ #


Re: [PATCH V6 0/2] mm/debug: Add tests validating architecture page table helpers

2019-10-15 Thread Qian Cai
The x86 will crash with linux-next during boot due to this series (v5) with the
config below plus CONFIG_DEBUG_VM_PGTABLE=y. I am not sure if v6 would address
it.

https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config

[   33.862600][T1] page:ea000900 is uninitialized and poisoned
[   33.862608][T1] raw:   
ff871140][T1]  ? _raw_spin_unlock_irq+0x27/0x40
[   33.871140][T1]  ? rest_init+0x307/0x307
[   33.871140][T1]  kernel_init+0x11/0x139
[   33.871140][T1]  ? rest_init+0x307/0x307
[   33.871140][T1]  ret_from_fork+0x27/0x50
[   33.871140][T1] Modules linked in:
[   33.871140][T1] ---[ end trace e99d392b0f7befbd ]---
[   33.871140][T1] RIP: 0010:alloc_gigantic_page_order+0x3fe/0x490

On Tue, 2019-10-15 at 14:51 +0530, Anshuman Khandual wrote:
> This series adds test validation for architecture-exported page table
> helpers. The patch in the series adds basic transformation tests at various
> levels of the page table. Before that, it exports the gigantic page
> allocation function from HugeTLB.
> 
> This test was originally suggested by Catalin during arm64 THP migration
> RFC discussion earlier. Going forward it can include more specific tests
> with respect to various generic MM functions like THP, HugeTLB etc and
> platform specific tests.
> 
> https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/
> 
> Changes in V6:
> 
> - Moved alloc_gigantic_page_order() into mm/page_alloc.c per Michal
> - Moved alloc_gigantic_page_order() within CONFIG_CONTIG_ALLOC in the test
> - Folded Andrew's include/asm-generic/pgtable.h fix into the test patch 2/2
> 
> Changes in V5: 
> (https://patchwork.kernel.org/project/linux-mm/list/?series=185991)
> 
> - Redefined and moved X86 mm_p4d_folded() into a different header per 
> Kirill/Ingo
> - Updated the config option comment per Ingo and dropped 'kernel module' 
> reference
> - Updated the commit message and dropped 'kernel module' reference
> - Changed DEBUG_ARCH_PGTABLE_TEST into DEBUG_VM_PGTABLE per Ingo
> - Moved config option from mm/Kconfig.debug into lib/Kconfig.debug
> - Renamed core test function arch_pgtable_tests() as debug_vm_pgtable()
> - Renamed mm/arch_pgtable_test.c as mm/debug_vm_pgtable.c
> - debug_vm_pgtable() gets called from kernel_init_freeable() after 
> init_mm_internals()
> - Added an entry in Documentation/features/debug/ per Ingo
> - Enabled the test on arm64 and x86 platforms for now
> 
> Changes in V4: 
> (https://patchwork.kernel.org/project/linux-mm/list/?series=183465)
> 
> - Disable DEBUG_ARCH_PGTABLE_TEST for ARM and IA64 platforms
> 
> Changes in V3: 
> (https://lore.kernel.org/patchwork/project/lkml/list/?series=411216)
> 
> - Changed test trigger from module format into late_initcall()
> - Marked all functions with __init to be freed after completion
> - Changed all __PGTABLE_PXX_FOLDED checks as mm_pxx_folded()
> - Folded in PPC32 fixes from Christophe
> 
> Changes in V2:
> 
> https://lore.kernel.org/linux-mm/1568268173-31302-1-git-send-email-anshuman.khand...@arm.com/T/#t
> 
> - Fixed small typo error in MODULE_DESCRIPTION()
> - Fixed m64k build problems for lvalue concerns in pmd_xxx_tests()
> - Fixed dynamic page table level folding problems on x86 as per Kirril
> - Fixed second pointers during pxx_populate_tests() per Kirill and Gerald
> - Allocate and free pte table with pte_alloc_one/pte_free per Kirill
> - Modified pxx_clear_tests() to accommodate s390 lower 12 bits situation
> - Changed RANDOM_NZVALUE value from 0xbe to 0xff
> - Changed allocation, usage, free sequence for saved_ptep
> - Renamed VMA_FLAGS as VMFLAGS
> - Implemented a new method for random vaddr generation
> - Implemented some other cleanups
> - Dropped extern reference to mm_alloc()
> - Created and exported new alloc_gigantic_page_order()
> - Dropped the custom allocator and used new alloc_gigantic_page_order()
> 
> Changes in V1:
> 
> https://lore.kernel.org/linux-mm/1567497706-8649-1-git-send-email-anshuman.khand...@arm.com/
> 
> - Added fallback mechanism for PMD aligned memory allocation failure
> 
> Changes in RFC V2:
> 
> https://lore.kernel.org/linux-mm/1565335998-22553-1-git-send-email-anshuman.khand...@arm.com/T/#u
> 
> - Moved test module and it's config from lib/ to mm/
> - Renamed config TEST_ARCH_PGTABLE as DEBUG_ARCH_PGTABLE_TEST
> - Renamed file from test_arch_pgtable.c to arch_pgtable_test.c
> - Added relevant MODULE_DESCRIPTION() and MODULE_AUTHOR() details
> - Dropped loadable module config option
> - Basic tests now use memory blocks with required size and alignment
> - PUD aligned memory block gets allocated with alloc_contig_range()
> - If PUD aligned memory could not be allocated it falls back on PMD aligned
>   memory block from page allocator and pud_* tests are skipped
> - Clear and populate tests now operate on real in memory page table entries
> - Dummy mm_struct gets allocated with mm_alloc()
> - Dummy 

[Bug 205201] New: Booting halts if Dawicontrol DC-2976 UW SCSI board installed, unless RAM size limited to 3500M

2019-10-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205201

Bug ID: 205201
   Summary: Booting halts if Dawicontrol DC-2976 UW SCSI board
installed, unless RAM size limited to 3500M
   Product: Platform Specific/Hardware
   Version: 2.5
Kernel Version: 5.3
  Hardware: PPC-64
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: PPC-64
  Assignee: platform_ppc...@kernel-bugs.osdl.org
  Reporter: rj.ron...@gmail.com
Regression: No

Hardware environment:
AmigaOne X5000/20 (2 GHz), 8 GB RAM
Dawicontrol DC-2976 UW SCSI controller board (PCI) 

Linux distro: Ubuntu 16.04.6, Debian 8, Fienix (not relevant, the problem
appears before loading starts).

Description: Booting halts immediately after loading the kernel if the
Dawicontrol DC-2976 UW SCSI board is installed in the machine. If the RAM size
is limited to 3500MB (with the U-Boot variable 'mem=3500M'), booting continues
normally.

Steps to reproduce: Turn on the machine.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-15 Thread Theodore Y. Ts'o
On Tue, Oct 15, 2019 at 01:01:02AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 14, 2019 at 03:09:48PM -0500, Eric Sandeen wrote:
> > We're in agreement here.  ;)  I only worry about implementing things like
> > this which sound like guarantees, but aren't, and end up encouraging bad
> > behavior or promoting misconceptions.
> > 
> > More and more, I think we should reconsider Darrick's "bootfs" (ext2 by
> > another name, but with extra-sync-iness) proposal...
> 
> Having a separate simple file system for the boot loader makes a lot of
> sense.  Note that vfat of EFI is the best choice, but at least it is
> something.  SysV Unix from the 90s actually had a special file system just
> for that, and fs/bfs/ in Linux supports that.  So this isn't really a new
> thing either.

Did you mean to say "vfat of EFI isn't the best choice"?

If we were going to be doing something like "bootfs", what sort of
semantics would be sufficient?  Is doing an implied fsync() on every
close(2) enough, or do we need to do something even more conservative?

 - Ted


Re: [PATCH V6 2/2] mm/debug: Add tests validating architecture page table helpers

2019-10-15 Thread Anshuman Khandual



On 10/15/2019 05:16 PM, Michal Hocko wrote:
> On Tue 15-10-19 14:51:42, Anshuman Khandual wrote:
>> This adds tests which will validate architecture page table helpers and
>> other accessors in their compliance with expected generic MM semantics.
>> This will help various architectures in validating changes to existing
>> page table helpers or addition of new ones.
>>
>> The test page table, and the memory pages used to create its entries at
>> various levels, are all allocated from system memory with the required sizes
>> and alignments. If memory pages with the required size and alignment cannot
>> be allocated, all depending individual tests are just skipped afterwards.
>> This test gets called right after init_mm_internals(), which is required for
>> alloc_contig_range() to work correctly.
>>
>> This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
>> CONFIG_DEBUG_VM. Architectures willing to subscribe to this test also need to
>> select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE, which for now is limited to x86 and
>> arm64. Going forward, other architectures too can enable this after fixing
>> build or runtime problems (if any) with their page table helpers.
> 
> A highlevel description of tests and what they are testing for would be
> really appreciated. Who wants to run these tests and why/when? What kind
> of bugs would get detected? In short why do we really need/want this
> code in the tree?

Sure, will do.


Re: [PATCH V6 1/2] mm/page_alloc: Make alloc_gigantic_page() available for general use

2019-10-15 Thread Michal Hocko
On Tue 15-10-19 14:09:56, Michal Hocko wrote:
> On Tue 15-10-19 13:50:02, David Hildenbrand wrote:
> > On 15.10.19 13:47, Michal Hocko wrote:
> > > On Tue 15-10-19 13:42:03, David Hildenbrand wrote:
> > > [...]
> > > > > -static bool pfn_range_valid_gigantic(struct zone *z,
> > > > > - unsigned long start_pfn, unsigned long nr_pages)
> > > > > -{
> > > > > - unsigned long i, end_pfn = start_pfn + nr_pages;
> > > > > - struct page *page;
> > > > > -
> > > > > - for (i = start_pfn; i < end_pfn; i++) {
> > > > > - if (!pfn_valid(i))
> > > > > - return false;
> > > > > -
> > > > > - page = pfn_to_page(i);
> > > > 
> > > > Am I missing something, or should there really be a pfn_to_online_page()
> > > > here instead of a pfn_valid()?
> > > 
> > > http://lkml.kernel.org/r/20180423000943.go17...@dhcp22.suse.cz
> > > 
> > 
> > So we managed to add PageReserved(page) but not pfn_to_online_page(). But
> > is it the right thing to do? (or am I missing something?)
> 
> Yeah, pfn_to_online_page is better. But please note that this is an
> optimistic check. The real check has to be done when isolating the
> pageblock because things might change in the meantime.

Except I have missed that we do get zone from the page and other
undefined state. Scratch my above comment.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH V6 1/2] mm/page_alloc: Make alloc_gigantic_page() available for general use

2019-10-15 Thread Michal Hocko
On Tue 15-10-19 13:50:02, David Hildenbrand wrote:
> On 15.10.19 13:47, Michal Hocko wrote:
> > On Tue 15-10-19 13:42:03, David Hildenbrand wrote:
> > [...]
> > > > -static bool pfn_range_valid_gigantic(struct zone *z,
> > > > -   unsigned long start_pfn, unsigned long nr_pages)
> > > > -{
> > > > -   unsigned long i, end_pfn = start_pfn + nr_pages;
> > > > -   struct page *page;
> > > > -
> > > > -   for (i = start_pfn; i < end_pfn; i++) {
> > > > -   if (!pfn_valid(i))
> > > > -   return false;
> > > > -
> > > > -   page = pfn_to_page(i);
> > > 
> > > Am I missing something, or should there really be a pfn_to_online_page()
> > > here instead of a pfn_valid()?
> > 
> > http://lkml.kernel.org/r/20180423000943.go17...@dhcp22.suse.cz
> > 
> 
> So we managed to add PageReserved(page) but not pfn_to_online_page(). But
> is it the right thing to do? (or am I missing something?)

Yeah, pfn_to_online_page is better. But please note that this is an
optimistic check. The real check has to be done when isolating the
pageblock because things might change in the meantime.
-- 
Michal Hocko
SUSE Labs
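
For reference, the fix David offers to send would look roughly like this
(a sketch; pfn_to_online_page() returns NULL for offline or invalid pfns,
folding in the pfn_valid() check):

static bool pfn_range_valid_gigantic(struct zone *z,
		unsigned long start_pfn, unsigned long nr_pages)
{
	unsigned long i, end_pfn = start_pfn + nr_pages;
	struct page *page;

	for (i = start_pfn; i < end_pfn; i++) {
		page = pfn_to_online_page(i);
		if (!page)
			return false;
		/* ... remaining per-page checks (zone, reserved, ...)
		 * unchanged ...
		 */
	}
	return true;
}

As Michal notes, this stays an optimistic pre-check; the authoritative check
happens when the pageblock is isolated.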


Re: [PATCH V6 1/2] mm/page_alloc: Make alloc_gigantic_page() available for general use

2019-10-15 Thread David Hildenbrand

On 15.10.19 13:50, David Hildenbrand wrote:

On 15.10.19 13:47, Michal Hocko wrote:

On Tue 15-10-19 13:42:03, David Hildenbrand wrote:
[...]

-static bool pfn_range_valid_gigantic(struct zone *z,
-   unsigned long start_pfn, unsigned long nr_pages)
-{
-   unsigned long i, end_pfn = start_pfn + nr_pages;
-   struct page *page;
-
-   for (i = start_pfn; i < end_pfn; i++) {
-   if (!pfn_valid(i))
-   return false;
-
-   page = pfn_to_page(i);


Am I missing something, or should there really be a pfn_to_online_page() here
instead of a pfn_valid()?


http://lkml.kernel.org/r/20180423000943.go17...@dhcp22.suse.cz



So we managed to add PageReserved(page) but not pfn_to_online_page().
But is it the right thing to do? (or am I missing something?)



Will send a patch.

--

Thanks,

David / dhildenb


Re: [PATCH V6 1/2] mm/page_alloc: Make alloc_gigantic_page() available for general use

2019-10-15 Thread David Hildenbrand

On 15.10.19 13:47, Michal Hocko wrote:

On Tue 15-10-19 13:42:03, David Hildenbrand wrote:
[...]

-static bool pfn_range_valid_gigantic(struct zone *z,
-   unsigned long start_pfn, unsigned long nr_pages)
-{
-   unsigned long i, end_pfn = start_pfn + nr_pages;
-   struct page *page;
-
-   for (i = start_pfn; i < end_pfn; i++) {
-   if (!pfn_valid(i))
-   return false;
-
-   page = pfn_to_page(i);


Am I missing something, or should there really be a pfn_to_online_page() here
instead of a pfn_valid()?


http://lkml.kernel.org/r/20180423000943.go17...@dhcp22.suse.cz



So we managed to add PageReserved(page) but not pfn_to_online_page().
But is it the right thing to do? (or am I missing something?)


--

Thanks,

David / dhildenb


Re: [PATCH V6 1/2] mm/page_alloc: Make alloc_gigantic_page() available for general use

2019-10-15 Thread Michal Hocko
On Tue 15-10-19 13:42:03, David Hildenbrand wrote:
[...]
> > -static bool pfn_range_valid_gigantic(struct zone *z,
> > -   unsigned long start_pfn, unsigned long nr_pages)
> > -{
> > -   unsigned long i, end_pfn = start_pfn + nr_pages;
> > -   struct page *page;
> > -
> > -   for (i = start_pfn; i < end_pfn; i++) {
> > -   if (!pfn_valid(i))
> > -   return false;
> > -
> > -   page = pfn_to_page(i);
> 
> Am I missing something, or should there really be a pfn_to_online_page() here
> instead of a pfn_valid()?

http://lkml.kernel.org/r/20180423000943.go17...@dhcp22.suse.cz

-- 
Michal Hocko
SUSE Labs


Re: [PATCH V6 2/2] mm/debug: Add tests validating architecture page table helpers

2019-10-15 Thread Michal Hocko
On Tue 15-10-19 14:51:42, Anshuman Khandual wrote:
> This adds tests which will validate architecture page table helpers and
> other accessors in their compliance with expected generic MM semantics.
> This will help various architectures in validating changes to existing
> page table helpers or addition of new ones.
> 
> The test page table, and the memory pages used to create its entries at
> various levels, are all allocated from system memory with the required sizes
> and alignments. If memory pages with the required size and alignment cannot
> be allocated, all depending individual tests are just skipped afterwards.
> This test gets called right after init_mm_internals(), which is required for
> alloc_contig_range() to work correctly.
> 
> This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
> CONFIG_DEBUG_VM. Architectures willing to subscribe to this test also need to
> select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE, which for now is limited to x86 and
> arm64. Going forward, other architectures too can enable this after fixing
> build or runtime problems (if any) with their page table helpers.

A highlevel description of tests and what they are testing for would be
really appreciated. Who wants to run these tests and why/when? What kind
of bugs would get detected? In short why do we really need/want this
code in the tree?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH V6 1/2] mm/page_alloc: Make alloc_gigantic_page() available for general use

2019-10-15 Thread David Hildenbrand

On 15.10.19 11:21, Anshuman Khandual wrote:

alloc_gigantic_page() implements an allocation method where it scans over
various zones looking for a large contiguous memory block which could not
have been allocated through the buddy allocator. A subsequent patch which
tests arch page table helpers needs such a method to allocate a PUD_SIZE
sized memory block. In the future such methods might have other use cases
as well. So alloc_gigantic_page() has been split, carving out the actual
memory allocation method, which is made available via the new
alloc_gigantic_page_order(), wrapped under CONFIG_CONTIG_ALLOC.

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Mike Kravetz 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: David Rientjes 
Cc: Andrea Arcangeli 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
  include/linux/gfp.h |  3 ++
  mm/hugetlb.c| 76 +--
  mm/page_alloc.c | 98 +
  3 files changed, 102 insertions(+), 75 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb07b503dc45..379ad23437d1 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -589,6 +589,9 @@ static inline bool pm_suspended_storage(void)
  /* The below functions must be run on a range from a single zone. */
  extern int alloc_contig_range(unsigned long start, unsigned long end,
  unsigned migratetype, gfp_t gfp_mask);
+extern struct page *alloc_gigantic_page_order(unsigned int order,
+ gfp_t gfp_mask, int nid,
+ nodemask_t *nodemask);
  #endif
  void free_contig_range(unsigned long pfn, unsigned int nr_pages);
  
diff --git a/mm/hugetlb.c b/mm/hugetlb.c

index 977f9a323a7a..d199556a4a2c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1023,86 +1023,12 @@ static void free_gigantic_page(struct page *page, unsigned int order)
  }
  
  #ifdef CONFIG_CONTIG_ALLOC

-static int __alloc_gigantic_page(unsigned long start_pfn,
-   unsigned long nr_pages, gfp_t gfp_mask)
-{
-   unsigned long end_pfn = start_pfn + nr_pages;
-   return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
- gfp_mask);
-}
-
-static bool pfn_range_valid_gigantic(struct zone *z,
-   unsigned long start_pfn, unsigned long nr_pages)
-{
-   unsigned long i, end_pfn = start_pfn + nr_pages;
-   struct page *page;
-
-   for (i = start_pfn; i < end_pfn; i++) {
-   if (!pfn_valid(i))
-   return false;
-
-   page = pfn_to_page(i);


Am I missing something, or should there really be a pfn_to_online_page()
here instead of a pfn_valid()?



--

Thanks,

David / dhildenb


Re: [PATCH v7 1/8] powerpc: detect the secure boot mode of the system

2019-10-15 Thread Michael Ellerman
Hi Nayna,

Just a few comments.

Nayna Jain  writes:
> Secure boot on PowerNV defines different IMA policies based on the secure
> boot state of the system.

This description has got out of sync with what the patch does I think.
There's no IMA in here. I think you can just drop that sentence.

> This patch defines a function to detect the secure boot state of the
> system.

That's what the patch really does ^ - just make it clear that it's only
on powernv.

>
> The PPC_SECURE_BOOT config represents the base enablement of secureboot
> on POWER.

s/POWER/powerpc/.

>
> Signed-off-by: Nayna Jain 
> ---
>  arch/powerpc/Kconfig   | 10 ++
>  arch/powerpc/include/asm/secure_boot.h | 29 ++
>  arch/powerpc/kernel/Makefile   |  2 ++
>  arch/powerpc/kernel/secure_boot.c  | 42 ++
>  4 files changed, 83 insertions(+)
>  create mode 100644 arch/powerpc/include/asm/secure_boot.h
>  create mode 100644 arch/powerpc/kernel/secure_boot.c
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 3e56c9c2f16e..b4a221886fcf 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -934,6 +934,16 @@ config PPC_MEM_KEYS
>  
> If unsure, say y.
>  
> +config PPC_SECURE_BOOT
> + prompt "Enable secure boot support"
> + bool
> + depends on PPC_POWERNV
> + help
> +   Systems with firmware secure boot enabled needs to define security
^
 need
> +   policies to extend secure boot to the OS. This config allows user
  ^
  a
> +   to enable OS secure boot on systems that have firmware support for
> +   it. If in doubt say N.
> +
>  endmenu
>  
>  config ISA_DMA_API
> diff --git a/arch/powerpc/include/asm/secure_boot.h 
> b/arch/powerpc/include/asm/secure_boot.h
> new file mode 100644
> index ..23d2ef2f1f7b
> --- /dev/null
> +++ b/arch/powerpc/include/asm/secure_boot.h
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Secure boot definitions
> + *
> + * Copyright (C) 2019 IBM Corporation
> + * Author: Nayna Jain
> + */
> +#ifndef _ASM_POWER_SECURE_BOOT_H
> +#define _ASM_POWER_SECURE_BOOT_H
> +
> +#ifdef CONFIG_PPC_SECURE_BOOT
> +
> +bool is_powerpc_os_secureboot_enabled(void);
> +struct device_node *get_powerpc_os_sb_node(void);

This function is never used outside arch/powerpc/kernel/secure_boot.c
and so doesn't need to be public.

> +#else
> +
> +static inline bool is_powerpc_os_secureboot_enabled(void)
> +{

I know there's a distinction between firmware secureboot and OS
secureboot, but I don't think we need that baked into the name. So just
is_ppc_secureboot_enabled() would be fine.

> + return false;
> +}
> +
> +static inline struct device_node *get_powerpc_os_sb_node(void)
> +{
> + return NULL;
> +}
> +
> +#endif
> +#endif
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index a7ca8fe62368..e2a54fa240ac 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -161,6 +161,8 @@ ifneq ($(CONFIG_PPC_POWERNV)$(CONFIG_PPC_SVM),)
>  obj-y+= ucall.o
>  endif
>  
> +obj-$(CONFIG_PPC_SECURE_BOOT)+= secure_boot.o
> +
>  # Disable GCOV, KCOV & sanitizers in odd or sensitive code
>  GCOV_PROFILE_prom_init.o := n
>  KCOV_INSTRUMENT_prom_init.o := n
> diff --git a/arch/powerpc/kernel/secure_boot.c 
> b/arch/powerpc/kernel/secure_boot.c
> new file mode 100644
> index ..0488dbcab6b9
> --- /dev/null
> +++ b/arch/powerpc/kernel/secure_boot.c
> @@ -0,0 +1,42 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 IBM Corporation
> + * Author: Nayna Jain
> + */
> +#include 
> +#include 
> +#include 
> +
> +struct device_node *get_powerpc_os_sb_node(void)
> +{
> + return of_find_compatible_node(NULL, NULL, "ibm,secvar-v1");
> +}

Given that's only used in this file, once, it should just be inlined
into its caller.

> +
> +bool is_powerpc_os_secureboot_enabled(void)
> +{
> + struct device_node *node;
> +
> + node = get_powerpc_os_sb_node();
> + if (!node)
> + goto disabled;
> +
> + if (!of_device_is_available(node)) {
> + pr_err("Secure variables support is in error state, fail 
> secure\n");
> + goto enabled;
> + }
> +
> + /*
> +  * secureboot is enabled if os-secure-enforcing property exists,
> +  * else disabled.
> +  */
> + if (!of_find_property(node, "os-secure-enforcing", NULL))

Using of_property_read_bool() is preferable.

> + goto disabled;
> +
> +enabled:
> + pr_info("secureboot mode enabled\n");
> + return true;
> +
> +disabled:
> + pr_info("secureboot mode disabled\n");
> + return false;
> +}

You 

Re: [PATCH v7 2/8] powerpc: add support to initialize ima policy rules

2019-10-15 Thread Michael Ellerman
Nayna Jain  writes:
> PowerNV systems use a kernel based bootloader, thus their secure boot
> implementation uses the kernel IMA security subsystem to verify the kernel
> before kexec. Since the verification policy might differ based on the
> secure boot mode of the system, the policies are defined at runtime.
>
> This patch implements the arch-specific support to define the IMA policy
> rules based on the runtime secure boot mode of the system.
>
> This patch provides arch-specific IMA policies if PPC_SECURE_BOOT
> config is enabled.
...
> diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
> new file mode 100644
> index ..c22d82965eb4
> --- /dev/null
> +++ b/arch/powerpc/kernel/ima_arch.c
> @@ -0,0 +1,33 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 IBM Corporation
> + * Author: Nayna Jain
> + */
> +
> +#include 
> +#include 
> +
> +bool arch_ima_get_secureboot(void)
> +{
> + return is_powerpc_os_secureboot_enabled();
> +}
> +
> +/* Defines IMA appraise rules for secureboot */
> +static const char *const arch_rules[] = {
> + "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
> +#if !IS_ENABLED(CONFIG_MODULE_SIG_FORCE)
> + "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
> +#endif

This confuses me.

If I spell it out we get:

#if IS_ENABLED(CONFIG_MODULE_SIG_FORCE)
// nothing
#else
"appraise func=MODULE_CHECK appraise_type=imasig|modsig",
#endif

Which is just:

#ifdef CONFIG_MODULE_SIG_FORCE
// nothing
#else
"appraise func=MODULE_CHECK appraise_type=imasig|modsig",
#endif

But CONFIG_MODULE_SIG_FORCE enabled says that we *do* require modules to
have a valid signature. Isn't that the inverse of what the rules say?

Presumably I'm misunderstanding something :)

cheers


Re: [PATCH v7 4/8] powerpc/ima: add measurement rules to ima arch specific policy

2019-10-15 Thread Michael Ellerman
Nayna Jain  writes:
> This patch adds the measurement rules to the arch specific policies on
> trusted boot enabled systems.
>
> Signed-off-by: Nayna Jain 
> Reviewed-by: Mimi Zohar 
> ---
>  arch/powerpc/kernel/ima_arch.c | 45 +++---
>  1 file changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
> index c22d82965eb4..88bfe4a1a9a5 100644
> --- a/arch/powerpc/kernel/ima_arch.c
> +++ b/arch/powerpc/kernel/ima_arch.c
> @@ -12,8 +12,19 @@ bool arch_ima_get_secureboot(void)
>   return is_powerpc_os_secureboot_enabled();
>  }
>  
> -/* Defines IMA appraise rules for secureboot */
> +/*
> + * The "arch_rules" contains both the securebot and trustedboot rules for 
> adding
> + * the kexec kernel image and kernel modules file hashes to the IMA 
> measurement
> + * list and verifying the file signatures against known good values.
> + *
> + * The "appraise_type=imasig|modsig" option allows the good signature to be
> + * stored as an xattr or as an appended signature. The "template=ima-modsig"
> + * option includes the appended signature, when available, in the IMA
> + * measurement list.
> + */
>  static const char *const arch_rules[] = {
> + "measure func=KEXEC_KERNEL_CHECK template=ima-modsig",
> + "measure func=MODULE_CHECK template=ima-modsig",
>   "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
>  #if !IS_ENABLED(CONFIG_MODULE_SIG_FORCE)
>   "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
> @@ -22,12 +33,40 @@ static const char *const arch_rules[] = {
>  };
>  
>  /*
> - * Returns the relevant IMA arch policies based on the system secureboot 
> state.
> + * The "measure_rules" are enabled only on "trustedboot" enabled systems.
> + * These rules add the kexec kernel image and kernel modules file hashes to
> + * the IMA measurement list.
> + */
> +static const char *const measure_rules[] = {
> + "measure func=KEXEC_KERNEL_CHECK",
> + "measure func=MODULE_CHECK",

Why do these ones not have "template=ima-modsig" on the end?

> + NULL
> +};
> +
> +/*
> + * Returns the relevant IMA arch policies based on the system secureboot
> + * and trustedboot state.
>   */
>  const char *const *arch_get_ima_policy(void)
>  {
> - if (is_powerpc_os_secureboot_enabled())
> + const char *const *rules;
> + int offset = 0;
> +
> + for (rules = arch_rules; *rules != NULL; rules++) {
> + if (strncmp(*rules, "appraise", 8) == 0)
> + break;
> + offset++;
> + }

This seems like kind of a hack, doesn't it? :)

What we really want is three sets of rules isn't it? But some of them
are shared between the different sets. But they just have to be flat
arrays of strings.

I think it would probably be cleaner to just use a #define for the
shared part of the rules, eg something like:

#ifdef CONFIG_MODULE_SIG_FORCE
#define APPRAISE_MODULE
#else
#define APPRAISE_MODULE \
"appraise func=MODULE_CHECK appraise_flag=check_blacklist 
appraise_type=imasig|modsig",
#endif

#define APPRAISE_KERNEL \
"appraise func=KEXEC_KERNEL_CHECK appraise_flag=check_blacklist 
appraise_type=imasig|modsig"

#define MEASURE_KERNEL \
"measure func=KEXEC_KERNEL_CHECK"

#define MEASURE_MODULE \
"measure func=MODULE_CHECK"

#define APPEND_TEMPLATE_IMA_MODSIG  \
" template=ima-modsig"

static const char *const secure_and_trusted_rules[] = {
MEASURE_KERNEL APPEND_TEMPLATE_IMA_MODSIG,
MEASURE_MODULE APPEND_TEMPLATE_IMA_MODSIG,
APPRAISE_KERNEL,
APPRAISE_MODULE
NULL
};

static const char *const secure_rules[] = {
APPRAISE_KERNEL,
APPRAISE_MODULE
NULL
};

static const char *const trusted_rules[] = {
MEASURE_KERNEL,
MEASURE_MODULE,
NULL
};
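
With those arrays in place, the selector could then collapse to something
like this (sketch; modulo the double-printing issue I mention further
down):

const char *const *arch_get_ima_policy(void)
{
	if (is_powerpc_os_secureboot_enabled() && is_powerpc_trustedboot_enabled())
		return secure_and_trusted_rules;
	if (is_powerpc_os_secureboot_enabled())
		return secure_rules;
	if (is_powerpc_trustedboot_enabled())
		return trusted_rules;
	return NULL;
}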


> +
> + if (is_powerpc_os_secureboot_enabled()
> + && is_powerpc_trustedboot_enabled())
>   return arch_rules;
>  
> + if (is_powerpc_os_secureboot_enabled())
> + return arch_rules + offset;
> +
> + if (is_powerpc_trustedboot_enabled())
> + return measure_rules;

Those is_foo() routines print each time they're called don't they? So on
a system with only trusted boot I think that will print:

secureboot mode disabled
secureboot mode disabled
trustedboot mode enabled

Which is a bit verbose :)

cheers


Re: [PATCH V6 1/2] mm/page_alloc: Make alloc_gigantic_page() available for general use

2019-10-15 Thread Anshuman Khandual



On 10/15/2019 04:15 PM, Michal Hocko wrote:
> On Tue 15-10-19 14:51:41, Anshuman Khandual wrote:
> [...]
>> +/**
>> + * alloc_gigantic_page_order() -- tries to allocate given order of pages
>> + * @order:  allocation order (greater than MAX_ORDER)
>> + * @gfp_mask:   GFP mask to use during compaction
>> + * @nid:allocation node
>> + * @nodemask:   allocation nodemask
>> + *
>> + * This routine is a wrapper around alloc_contig_range() which scans over
>> + * all zones on an applicable zonelist to find a contiguous pfn range which
>> + * can then be allocated with alloc_contig_range(). This routine is intended
>> + * to be used for allocations greater than MAX_ORDER.
>> + *
>> + * Return: page on success or NULL on failure. On success a memory block
>> + * of 'order' starting with 'page' has been allocated successfully. Memory
>> + * allocated here needs to be freed with free_contig_range().
>> + */
>> +struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
>> +   int nid, nodemask_t *nodemask)
> 
> One of the objections when Mike proposed a similar thing last year
> was that the interface shouldn't be order based:
> http://lkml.kernel.org/r/20180423000943.go17...@dhcp22.suse.cz
> 
> An order based API makes sense for the buddy allocator, but why should we
> restrict sizes like that for an allocator that is capable of satisfying
> requests for an arbitrary number of pages?

Fair enough, will change it. Anyways we calculate nr_pages from the order 
argument at the very beginning.
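
I.e. the next spin would take the size in pages directly, something like
(hypothetical signature, name included, purely for illustration):

	struct page *alloc_gigantic_pages(unsigned long nr_pages, gfp_t gfp_mask,
					  int nid, nodemask_t *nodemask);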


Re: [PATCH V6 1/2] mm/page_alloc: Make alloc_gigantic_page() available for general use

2019-10-15 Thread Michal Hocko
On Tue 15-10-19 14:51:41, Anshuman Khandual wrote:
[...]
> +/**
> + * alloc_gigantic_page_order() -- tries to allocate given order of pages
> + * @order:   allocation order (greater than MAX_ORDER)
> + * @gfp_mask:GFP mask to use during compaction
> + * @nid: allocation node
> + * @nodemask:allocation nodemask
> + *
> + * This routine is a wrapper around alloc_contig_range() which scans over
> + * all zones on an applicable zonelist to find a contiguous pfn range which
> + * can then be allocated with alloc_contig_range(). This routine is intended
> + * to be used for allocations greater than MAX_ORDER.
> + *
> + * Return: page on success or NULL on failure. On success a memory block
> + * of 'order' starting with 'page' has been allocated successfully. Memory
> + * allocated here needs to be freed with free_contig_range().
> + */
> +struct page *alloc_gigantic_page_order(unsigned int order, gfp_t gfp_mask,
> +int nid, nodemask_t *nodemask)

One of the objections when Mike proposed a similar thing last year
was that the interface shouldn't be order based:
http://lkml.kernel.org/r/20180423000943.go17...@dhcp22.suse.cz

An order based API makes sense for the buddy allocator, but why should we
restrict sizes like that for an allocator that is capable of satisfying
requests for an arbitrary number of pages?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-15 Thread Yunsheng Lin
On 2019/10/14 17:25, Greg KH wrote:
> On Mon, Oct 14, 2019 at 04:00:46PM +0800, Yunsheng Lin wrote:
>> On 2019/10/12 18:47, Greg KH wrote:
>>> On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
 On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
> On 2019/10/12 15:40, Greg KH wrote:
>> On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
>>> add pci and acpi maintainer
>>> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
>>>
>>> On 2019/10/11 19:15, Peter Zijlstra wrote:
 On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> But I failed to see why the above is related to making 
> node_to_cpumask_map()
> NUMA_NO_NODE aware?

 Your initial bug is for hns3, which is a PCI device, which really 
 _MUST_
 have a node assigned.

 It not having one, is a straight up bug. We must not silently accept
 NO_NODE there, ever.

>>>
>>> I suppose you mean reporting a lack of affinity when the node of a pcie
>>> device is not set by "not silently accept NO_NODE".
>>
>> If the firmware of a pci device does not provide the node information,
>> then yes, warn about that.
>>
>>> As Greg has asked about in [1]:
>>> what is a user to do when the user sees the kernel reporting that?
>>>
>>> We may tell user to contact their vendor for info or updates about
>>> that when they do not know about their system well enough, but their
>>> vendor may get away with this by quoting ACPI spec as the spec
>>> considering this optional. Should the user believe this is indeed a
>>> fw bug or a misreport from the kernel?
>>
>> Say it is a firmware bug, if it is a firmware bug, that's simple.
>>
>>> If this kind of reporting is common pratice and will not cause any
>>> misunderstanding, then maybe we can report that.
>>
>> Yes, please do so, that's the only way those boxes are ever going to get
>> fixed.  And go add the test to the "firmware testing" tool that is based
>> on Linux that Intel has somewhere, to give vendors a chance to fix this
>> before they ship hardware.
>>
>> This shouldn't be a big deal, we warn of other hardware bugs all the
>> time.
>
> Ok, thanks for clarifying.
>
> Will send a patch to catch the case when a pcie device without numa node
> being set and warn about it.
>
> Maybe use dev->bus to verify if it is a pci device?

 No, do that in the pci bus core code itself, when creating the devices
 as that is when you know, or do not know, the numa node, right?

 This can't be in the driver core only, as each bus type will have a
 different way of determining what the node the device is on.  For some
 reason, I thought the PCI core code already does this, right?
>>>
>>> Yes, pci_irq_get_node(), which NO ONE CALLS!  I should go delete that
>>> thing...
>>>
>>> Anyway, it looks like the pci core code does call set_dev_node() based
>>> on the PCI bridge, so if that is set up properly, all should be fine.
>>>
>>> If not, well, you have buggy firmware and you need to warn about that at
>>> the time you are creating the bridge.  Look at the call to
>>> pcibus_to_node() in pci_register_host_bridge().
>>
>> Thanks for pointing out the specific function.
>> Maybe we do not need to warn about the case when the device has a parent,
>> because we must have warned about the parent if the device has a parent
>> and the parent also has a node of NO_NODE, so do not need to warn the child
>> device anymore? Like below:
>>
>> @@ -932,6 +932,10 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
>> list_add_tail(&bus->node, &pci_root_buses);
>> up_write(&pci_bus_sem);
>>
>> +   if (nr_node_ids > 1 && !parent &&
> 
> Why do you need to check this?  If you have a parent, it's your node
> should be set, if not, that's an error, right?

If the device has a parent and the parent device also has a node of
NUMA_NO_NODE, then we will have already warned about the parent device, so
we do not have to warn about the child device?

In pci_register_host_bridge():

if (!parent)
set_dev_node(bus->bridge, pcibus_to_node(bus));

The above only sets the node of the bridge device to the node of the bus if
the bridge device does not have a parent.

bus->dev.parent = bus->bridge;

dev_set_name(&bus->dev, "%04x:%02x", pci_domain_nr(bus), bus->number);
name = dev_name(&bus->dev);

err = device_register(&bus->dev);

The above then sets the bus device's parent to the bridge device, and then
calls device_register(), which will set the bus device's node according to
the bridge device's node.

> 
>> +   dev_to_node(bus->bridge) == NUMA_NO_NODE)
>> +   dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW. Please contact your vendor for updates.\n");
>> +
>>  

Re: [PATCH v2 0/3] crypto: powerpc - convert SPE AES algorithms to skcipher API

2019-10-15 Thread Ard Biesheuvel
On Tue, 15 Oct 2019 at 04:45, Eric Biggers  wrote:
>
> This series converts the glue code for the PowerPC SPE implementations
> of AES-ECB, AES-CBC, AES-CTR, and AES-XTS from the deprecated
> "blkcipher" API to the "skcipher" API.  This is needed in order for the
> blkcipher API to be removed.
>
> Patch 1-2 are fixes.  Patch 3 is the actual conversion.
>
> Tested with:
>
> export ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu-
> make mpc85xx_defconfig
> cat >> .config << EOF
> # CONFIG_MODULES is not set
> # CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
> CONFIG_DEBUG_KERNEL=y
> CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y
> CONFIG_CRYPTO_AES=y
> CONFIG_CRYPTO_CBC=y
> CONFIG_CRYPTO_CTR=y
> CONFIG_CRYPTO_ECB=y
> CONFIG_CRYPTO_XTS=y
> CONFIG_CRYPTO_AES_PPC_SPE=y
> EOF
> make olddefconfig
> make -j32
> qemu-system-ppc -M mpc8544ds -cpu e500 -nographic \
> -kernel arch/powerpc/boot/zImage \
> -append cryptomgr.fuzz_iterations=1000
>
> Note that xts-ppc-spe still fails the comparison tests due to the lack
> of ciphertext stealing support.  This is not addressed by this series.
>
> Changed since v1:
>
> - Split fixes into separate patches.
>
> - Made ppc_aes_setkey_skcipher() call ppc_aes_setkey(), rather than
>   creating a separate expand_key() function.  This keeps the code
>   shorter.
>
> Eric Biggers (3):
>   crypto: powerpc - don't unnecessarily use atomic scatterwalk
>   crypto: powerpc - don't set ivsize for AES-ECB
>   crypto: powerpc - convert SPE AES algorithms to skcipher API
>

For the series

Reviewed-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 


>  arch/powerpc/crypto/aes-spe-glue.c | 389 -
>  crypto/Kconfig |   1 +
>  2 files changed, 166 insertions(+), 224 deletions(-)
>
> --
> 2.23.0
>


Re: [PATCH v7 3/8] powerpc: detect the trusted boot state of the system

2019-10-15 Thread Michael Ellerman
Nayna Jain  writes:
> PowerNV systems enables the IMA measurement rules only if the
> trusted boot is enabled on the system.

That confused me a lot. But the key is the distinction between appraisal
rules vs measurement rules, right?

I think it would be clearer if it was phrased as a positive statement, eg:

  On PowerNV systems when trusted boot is enabled, additional IMA rules
  are enabled to implement measurement.

Or something like that.

> This patch adds the function to detect if the system has trusted
> boot enabled.

It would probably help people to briefly explain the difference between
secure vs trusted boot.

> diff --git a/arch/powerpc/include/asm/secure_boot.h 
> b/arch/powerpc/include/asm/secure_boot.h
> index 23d2ef2f1f7b..ecd08515e301 100644
> --- a/arch/powerpc/include/asm/secure_boot.h
> +++ b/arch/powerpc/include/asm/secure_boot.h
> @@ -12,6 +12,7 @@
>  
>  bool is_powerpc_os_secureboot_enabled(void);
>  struct device_node *get_powerpc_os_sb_node(void);
> +bool is_powerpc_trustedboot_enabled(void);
>  
>  #else
>  
> @@ -25,5 +26,10 @@ static inline struct device_node 
> *get_powerpc_os_sb_node(void)
>   return NULL;
>  }
>  
> +static inline bool is_powerpc_os_trustedboot_enabled(void)

That has an extra "_os" in it.

> +{
> + return false;
> +}
> +
>  #endif
>  #endif
> diff --git a/arch/powerpc/kernel/secure_boot.c 
> b/arch/powerpc/kernel/secure_boot.c
> index 0488dbcab6b9..9d5ac1b39e46 100644
> --- a/arch/powerpc/kernel/secure_boot.c
> +++ b/arch/powerpc/kernel/secure_boot.c
> @@ -7,6 +7,27 @@
>  #include 
>  #include 
>  
> +static const char * const fwsecureboot_compat[] = {
> + "ibm,secureboot-v1",
> + "ibm,secureboot-v2",
> + NULL,
> +};
> +
> +static struct device_node *get_powerpc_fw_sb_node(void)
> +{
> + struct device_node *node;
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(fwsecureboot_compat); ++i) {
> + node = of_find_compatible_node(NULL, NULL,
> +fwsecureboot_compat[i]);
> + if (node)
> + return node;
> + }
> +
> + return NULL;
> +}

You shouldn't need to do that by hand, instead use
of_find_matching_node(), eg:

static struct device_node *get_powerpc_fw_sb_node(void)
{
static const struct of_device_id ids[] = {
{ .compatible = "ibm,secureboot-v1", },
{ .compatible = "ibm,secureboot-v2", },
{},
};

return of_find_matching_node(NULL, ids);
}


> @@ -40,3 +61,17 @@ bool is_powerpc_os_secureboot_enabled(void)
>   pr_info("secureboot mode disabled\n");
>   return false;
>  }
> +
> +bool is_powerpc_trustedboot_enabled(void)
> +{
> + struct device_node *node;
> +
> + node = get_powerpc_fw_sb_node();
> + if (node && (of_find_property(node, "trusted-enabled", NULL))) {

Again this can use of_property_read_bool(), which copes with a NULL node
also, so just:

+   if (of_property_read_bool(node, "trusted-enabled")) {

> + pr_info("trustedboot mode enabled\n");
> + return true;
> + }
> +
> + pr_info("trustedboot mode disabled\n");
> + return false;
> +}
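
Putting the two suggestions together, the whole function could collapse to
something like (sketch, untested):

bool is_powerpc_trustedboot_enabled(void)
{
	struct device_node *node = get_powerpc_fw_sb_node();
	bool enabled = of_property_read_bool(node, "trusted-enabled");

	of_node_put(node);

	pr_info("trustedboot mode %s\n", enabled ? "enabled" : "disabled");

	return enabled;
}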
> -- 
> 2.20.1


cheers


Re: [PATCH v7 2/8] powerpc: add support to initialize ima policy rules

2019-10-15 Thread Michael Ellerman
Mimi Zohar  writes:
> On Mon, 2019-10-07 at 21:14 -0400, Nayna Jain wrote:
...
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index b4a221886fcf..deb19ec6ba3d 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -938,6 +938,8 @@ config PPC_SECURE_BOOT
>>  prompt "Enable secure boot support"
>>  bool
>>  depends on PPC_POWERNV
>> +depends on IMA
>> +depends on IMA_ARCH_POLICY
>
> As IMA_ARCH_POLICY is dependent on IMA, I don't see a need for
> depending on both IMA and IMA_ARCH_POLICY.

I agree. And what we actually depend on is the arch part, so it should
depend on just IMA_ARCH_POLICY.

cheers


[PATCH V6 2/2] mm/debug: Add tests validating architecture page table helpers

2019-10-15 Thread Anshuman Khandual
This adds tests which will validate architecture page table helpers and
other accessors in their compliance with expected generic MM semantics.
This will help various architectures in validating changes to existing
page table helpers or addition of new ones.

The test page table and the memory pages used to create its entries at
various levels are all allocated from system memory with the required
sizes and alignments. If memory pages with the required size and alignment
cannot be allocated, all the individual tests depending on them are simply
skipped. The test gets called right after init_mm_internals(), which is
required for alloc_contig_range() to work correctly.
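
As an illustration, a basic PTE-level transformation test in this scheme
boils down to building a pte from a known pfn/prot pair and asserting the
helper semantics (sketch, not the exact code in this patch):

static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
{
	pte_t pte = pfn_pte(pfn, prot);

	/* a pte must compare equal to itself */
	WARN_ON(!pte_same(pte, pte));

	/* setters and their corresponding testers must agree */
	WARN_ON(!pte_young(pte_mkyoung(pte)));
	WARN_ON(!pte_dirty(pte_mkdirty(pte)));
	WARN_ON(!pte_write(pte_mkwrite(pte)));

	/* and the inverse transformations must clear the bits again */
	WARN_ON(pte_young(pte_mkold(pte)));
	WARN_ON(pte_dirty(pte_mkclean(pte)));
	WARN_ON(pte_write(pte_wrprotect(pte)));
}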

This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
CONFIG_DEBUG_VM. Architectures willing to subscribe to this test also need
to select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE, which for now is limited to x86
and arm64. Going forward, other architectures can enable this too, after
fixing any build or runtime problems with their page table helpers.

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-ker...@vger.kernel.org

Tested-by: Christophe Leroy # PPC32
Suggested-by: Catalin Marinas 
Signed-off-by: Andrew Morton 
Signed-off-by: Christophe Leroy 
Signed-off-by: Anshuman Khandual 
---
 .../debug/debug-vm-pgtable/arch-support.txt   |  34 ++
 arch/arm64/Kconfig|   1 +
 arch/x86/Kconfig  |   1 +
 arch/x86/include/asm/pgtable_64.h |   6 +
 include/asm-generic/pgtable.h |   6 +
 init/main.c   |   1 +
 lib/Kconfig.debug |  21 +
 mm/Makefile   |   1 +
 mm/debug_vm_pgtable.c | 450 ++
 9 files changed, 521 insertions(+)
 create mode 100644 Documentation/features/debug/debug-vm-pgtable/arch-support.txt
 create mode 100644 mm/debug_vm_pgtable.c

diff --git a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
new file mode 100644
index ..d6b8185dcf1e
--- /dev/null
+++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
@@ -0,0 +1,34 @@
+#
+# Feature name:  debug-vm-pgtable
+# Kconfig:   ARCH_HAS_DEBUG_VM_PGTABLE
+# description:   arch supports pgtable tests for semantics compliance
+#
+---
+| arch |status|
+---
+|   alpha: | TODO |
+| arc: | TODO |
+| arm: | TODO |
+|   arm64: |  ok  |
+| c6x: | TODO |
+|csky: | TODO |
+|   h8300: | TODO |
+| hexagon: | TODO |
+|ia64: | TODO |
+|m68k: | TODO |
+|  microblaze: | TODO |
+|mips: | TODO |
+|   nds32: | TODO |
+|   nios2: | TODO |
+|openrisc: | TODO |
+|  parisc: | TODO |
+| powerpc: | TODO |
+|   riscv: | TODO |
+|s390: | TODO |
+|  sh: | TODO |
+|   sparc: | TODO |
+|  um: | TODO |
+|   unicore32: | TODO |
+| x86: |  ok  |
+|  xtensa: | TODO |
+---
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 950a56b71ff0..8a3b3eaa49e9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -11,6 +11,7 @@ config ARM64
select ACPI_PPTT if ACPI
select ARCH_CLOCKSOURCE_DATA
select ARCH_HAS_DEBUG_VIRTUAL
+   select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_DMA_COHERENT_TO_PFN
select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index abe822d52167..13c9bd950256 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -61,6 +61,7 @@ config X86
select ARCH_CLOCKSOURCE_INIT
select ARCH_HAS_ACPI_TABLE_UPGRADE  if ACPI
select ARCH_HAS_DEBUG_VIRTUAL
+   select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DEVMEM_IS_ALLOWED

[PATCH V6 1/2] mm/page_alloc: Make alloc_gigantic_page() available for general use

2019-10-15 Thread Anshuman Khandual
alloc_gigantic_page() implements an allocation method where it scans over
various zones looking for a large contiguous memory block which could not
be allocated through the buddy allocator. A subsequent patch which tests
arch page table helpers needs such a method to allocate a PUD_SIZE sized
memory block. In the future such methods might have other use cases as
well. So alloc_gigantic_page() has been split, carving out the actual
memory allocation method, which is now available via the new
alloc_gigantic_page_order() wrapped under CONFIG_CONTIG_ALLOC.
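
For example, with the new helper a caller can grab a PUD_SIZE block
directly and hand it back with free_contig_range(), along these lines
(usage sketch based on the function's kernel-doc, not code from this
patch):

#ifdef CONFIG_CONTIG_ALLOC
	unsigned int order = PUD_SHIFT - PAGE_SHIFT;	/* larger than MAX_ORDER */
	struct page *page;

	page = alloc_gigantic_page_order(order, GFP_KERNEL, first_online_node,
					 &node_states[N_ONLINE]);
	if (page) {
		/* ... use the physically contiguous block ... */
		free_contig_range(page_to_pfn(page), 1UL << order);
	}
#endif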

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Mike Kravetz 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: David Rientjes 
Cc: Andrea Arcangeli 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 include/linux/gfp.h |  3 ++
 mm/hugetlb.c| 76 +--
 mm/page_alloc.c | 98 +
 3 files changed, 102 insertions(+), 75 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fb07b503dc45..379ad23437d1 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -589,6 +589,9 @@ static inline bool pm_suspended_storage(void)
 /* The below functions must be run on a range from a single zone. */
 extern int alloc_contig_range(unsigned long start, unsigned long end,
  unsigned migratetype, gfp_t gfp_mask);
+extern struct page *alloc_gigantic_page_order(unsigned int order,
+ gfp_t gfp_mask, int nid,
+ nodemask_t *nodemask);
 #endif
 void free_contig_range(unsigned long pfn, unsigned int nr_pages);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 977f9a323a7a..d199556a4a2c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1023,86 +1023,12 @@ static void free_gigantic_page(struct page *page, unsigned int order)
 }
 
 #ifdef CONFIG_CONTIG_ALLOC
-static int __alloc_gigantic_page(unsigned long start_pfn,
-   unsigned long nr_pages, gfp_t gfp_mask)
-{
-   unsigned long end_pfn = start_pfn + nr_pages;
-   return alloc_contig_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
- gfp_mask);
-}
-
-static bool pfn_range_valid_gigantic(struct zone *z,
-   unsigned long start_pfn, unsigned long nr_pages)
-{
-   unsigned long i, end_pfn = start_pfn + nr_pages;
-   struct page *page;
-
-   for (i = start_pfn; i < end_pfn; i++) {
-   if (!pfn_valid(i))
-   return false;
-
-   page = pfn_to_page(i);
-
-   if (page_zone(page) != z)
-   return false;
-
-   if (PageReserved(page))
-   return false;
-
-   if (page_count(page) > 0)
-   return false;
-
-   if (PageHuge(page))
-   return false;
-   }
-
-   return true;
-}
-
-static bool zone_spans_last_pfn(const struct zone *zone,
-   unsigned long start_pfn, unsigned long nr_pages)
-{
-   unsigned long last_pfn = start_pfn + nr_pages - 1;
-   return zone_spans_pfn(zone, last_pfn);
-}
-
 static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
int nid, nodemask_t *nodemask)
 {
unsigned int order = huge_page_order(h);
-   unsigned long nr_pages = 1 << order;
-   unsigned long ret, pfn, flags;
-   struct zonelist *zonelist;
-   struct zone *zone;
-   struct zoneref *z;
-
-   zonelist = node_zonelist(nid, gfp_mask);
-   for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nodemask) {
-   spin_lock_irqsave(&zone->lock, flags);
 
-   pfn = ALIGN(zone->zone_start_pfn, nr_pages);
-   while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
-   if (pfn_range_valid_gigantic(zone, pfn, nr_pages)) {
-   /*
-* We release the zone lock here because
-   

[PATCH V6 0/2] mm/debug: Add tests validating architecture page table helpers

2019-10-15 Thread Anshuman Khandual
This series adds validation tests for architecture exported page table
helpers. The patch in the series adds basic transformation tests at various
levels of the page table. Before that, it exports the gigantic page
allocation function from HugeTLB.

This test was originally suggested by Catalin during the earlier arm64 THP
migration RFC discussion. Going forward it can include more specific tests
with respect to various generic MM functions like THP, HugeTLB etc, as well
as platform specific tests.

https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/

Changes in V6:

- Moved alloc_gigantic_page_order() into mm/page_alloc.c per Michal
- Moved alloc_gigantic_page_order() within CONFIG_CONTIG_ALLOC in the test
- Folded Andrew's include/asm-generic/pgtable.h fix into the test patch 2/2

Changes in V5: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=185991)

- Redefined and moved X86 mm_p4d_folded() into a different header per 
Kirill/Ingo
- Updated the config option comment per Ingo and dropped 'kernel module' 
reference
- Updated the commit message and dropped 'kernel module' reference
- Changed DEBUG_ARCH_PGTABLE_TEST into DEBUG_VM_PGTABLE per Ingo
- Moved config option from mm/Kconfig.debug into lib/Kconfig.debug
- Renamed core test function arch_pgtable_tests() as debug_vm_pgtable()
- Renamed mm/arch_pgtable_test.c as mm/debug_vm_pgtable.c
- debug_vm_pgtable() gets called from kernel_init_freeable() after 
init_mm_internals()
- Added an entry in Documentation/features/debug/ per Ingo
- Enabled the test on arm64 and x86 platforms for now

Changes in V4: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=183465)

- Disable DEBUG_ARCH_PGTABLE_TEST for ARM and IA64 platforms

Changes in V3: 
(https://lore.kernel.org/patchwork/project/lkml/list/?series=411216)

- Changed test trigger from module format into late_initcall()
- Marked all functions with __init to be freed after completion
- Changed all __PGTABLE_PXX_FOLDED checks as mm_pxx_folded()
- Folded in PPC32 fixes from Christophe

Changes in V2:

https://lore.kernel.org/linux-mm/1568268173-31302-1-git-send-email-anshuman.khand...@arm.com/T/#t

- Fixed small typo error in MODULE_DESCRIPTION()
- Fixed m64k build problems for lvalue concerns in pmd_xxx_tests()
- Fixed dynamic page table level folding problems on x86 as per Kirril
- Fixed second pointers during pxx_populate_tests() per Kirill and Gerald
- Allocate and free pte table with pte_alloc_one/pte_free per Kirill
- Modified pxx_clear_tests() to accommodate s390 lower 12 bits situation
- Changed RANDOM_NZVALUE value from 0xbe to 0xff
- Changed allocation, usage, free sequence for saved_ptep
- Renamed VMA_FLAGS as VMFLAGS
- Implemented a new method for random vaddr generation
- Implemented some other cleanups
- Dropped extern reference to mm_alloc()
- Created and exported new alloc_gigantic_page_order()
- Dropped the custom allocator and used new alloc_gigantic_page_order()

Changes in V1:

https://lore.kernel.org/linux-mm/1567497706-8649-1-git-send-email-anshuman.khand...@arm.com/

- Added fallback mechanism for PMD aligned memory allocation failure

Changes in RFC V2:

https://lore.kernel.org/linux-mm/1565335998-22553-1-git-send-email-anshuman.khand...@arm.com/T/#u

- Moved test module and it's config from lib/ to mm/
- Renamed config TEST_ARCH_PGTABLE as DEBUG_ARCH_PGTABLE_TEST
- Renamed file from test_arch_pgtable.c to arch_pgtable_test.c
- Added relevant MODULE_DESCRIPTION() and MODULE_AUTHOR() details
- Dropped loadable module config option
- Basic tests now use memory blocks with required size and alignment
- PUD aligned memory block gets allocated with alloc_contig_range()
- If PUD aligned memory could not be allocated it falls back on PMD aligned
  memory block from page allocator and pud_* tests are skipped
- Clear and populate tests now operate on real in memory page table entries
- Dummy mm_struct gets allocated with mm_alloc()
- Dummy page table entries get allocated with [pud|pmd|pte]_alloc_[map]()
- Simplified [p4d|pgd]_basic_tests(), now has random values in the entries

Original RFC V1:

https://lore.kernel.org/linux-mm/1564037723-26676-1-git-send-email-anshuman.khand...@arm.com/

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: Mike Kravetz 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: 

Re: [PATCH v4 2/4] powerpc: expose secure variables to userspace via sysfs

2019-10-15 Thread Oliver O'Halloran
On Tue, 2019-10-01 at 19:41 -0400, Nayna Jain wrote:
> PowerNV secure variables, which store the keys used for OS kernel
> verification, are managed by the firmware. These secure variables need to
> be accessed by the userspace for addition/deletion of the certificates.
> 
> This patch adds the sysfs interface to expose secure variables for PowerNV
> secureboot. The users shall use this interface for manipulating
> the keys stored in the secure variables.
> 
> Signed-off-by: Nayna Jain 
> Reviewed-by: Greg Kroah-Hartman 
> ---
>  Documentation/ABI/testing/sysfs-secvar |  37 +
>  arch/powerpc/Kconfig   |  10 ++
>  arch/powerpc/kernel/Makefile   |   1 +
>  arch/powerpc/kernel/secvar-sysfs.c | 198 +
>  4 files changed, 246 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-secvar
>  create mode 100644 arch/powerpc/kernel/secvar-sysfs.c
> 
> diff --git a/Documentation/ABI/testing/sysfs-secvar 
> b/Documentation/ABI/testing/sysfs-secvar
> new file mode 100644
> index ..815bd8ec4d5e
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-secvar
> @@ -0,0 +1,37 @@
> +What:/sys/firmware/secvar
> +Date:August 2019
> +Contact: Nayna Jain 
> +Description: This directory is created if the POWER firmware supports OS
> + secureboot, thereby secure variables. It exposes interface
> + for reading/writing the secure variables
> +
> +What:/sys/firmware/secvar/vars
> +Date:August 2019
> +Contact: Nayna Jain 
> +Description: This directory lists all the secure variables that are supported
> + by the firmware.
> +
> +What:/sys/firmware/secvar/vars/
> +Date:August 2019
> +Contact: Nayna Jain 
> +Description: Each secure variable is represented as a directory named as
> + . The variable name is unique and is in ASCII
> + representation. The data and size can be determined by reading
> + their respective attribute files.
> +
> +What:/sys/firmware/secvar/vars//size
> +Date:August 2019
> +Contact: Nayna Jain 
> +Description: An integer representation of the size of the content of the
> + variable. In other words, it represents the size of the data.
> +
> +What:/sys/firmware/secvar/vars//data
> +Date:August 2019
> +Contact: Nayna Jain 
> +Description: A read-only file containing the value of the variable
> +
> +What:/sys/firmware/secvar/vars//update
> +Date:August 2019
> +Contact: Nayna Jain 
> +Description: A write-only file that is used to submit the new value for the
> + variable.

How are the update mechanism's weird requirements communicated to
userspace? The design you've got for the OPAL bits is that the update
requirements depend on the secvar backend, but I see nothing plumbing
through what the secvar backend actually is.
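
(For instance, a read-only sysfs attribute naming the backend would do; the
sketch below assumes a hypothetical "format" string on the ops structure,
which is not in the posted patch:)

static ssize_t format_show(struct kobject *kobj,
			   struct kobj_attribute *attr, char *buf)
{
	/* secvar_ops->format is an assumed field, for illustration only */
	return sprintf(buf, "%s\n", secvar_ops->format);
}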

> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index deb19ec6ba3d..89084e4e5054 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -946,6 +946,16 @@ config PPC_SECURE_BOOT
> to enable OS secure boot on systems that have firmware support for
> it. If in doubt say N.
>  
> +config SECVAR_SYSFS
That should probably be PPC_SECVAR_SYSFS, since it's PPC specific.

> + tristate "Enable sysfs interface for POWER secure variables"
> + depends on PPC_SECURE_BOOT
> + depends on SYSFS
> + help
> +   POWER secure variables are managed and controlled by firmware.
> +   These variables are exposed to userspace via sysfs to enable
> +   read/write operations on these variables. Say Y if you have
> +   secure boot enabled and want to expose variables to userspace.
> +
>  endmenu
>  
>  config ISA_DMA_API
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index 3cf26427334f..116a3a5c0557 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -162,6 +162,7 @@ obj-y += ucall.o
>  endif
>  
>  obj-$(CONFIG_PPC_SECURE_BOOT)+= secure_boot.o ima_arch.o secvar-ops.o
> +obj-$(CONFIG_SECVAR_SYSFS)   += secvar-sysfs.o
>  
>  # Disable GCOV, KCOV & sanitizers in odd or sensitive code
>  GCOV_PROFILE_prom_init.o := n


> diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
> b/arch/powerpc/kernel/secvar-sysfs.c
> new file mode 100644
> index ..87a7cea41523
> --- /dev/null
> +++ b/arch/powerpc/kernel/secvar-sysfs.c
> @@ -0,0 +1,198 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (C) 2019 IBM Corporation 
> + *
> + * This code exposes secure variables to user via sysfs
> + */

Adding a pr_fmt for this file would be a good idea. There are a few
pr_err()s in here that would benefit from some context.
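
E.g. something like this at the top of the file, before the #includes:

#define pr_fmt(fmt) "secvar-sysfs: " fmt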

> +#include 
> +#include 
> +#include 
> +#include 
> 

Re: [PATCH v4 1/4] powerpc/powernv: Add OPAL API interface to access secure variable

2019-10-15 Thread Oliver O'Halloran
On Tue, 2019-10-01 at 19:41 -0400, Nayna Jain wrote:
> The X.509 certificates trusted by the platform and required to secure boot
> the OS kernel are wrapped in secure variables, which are controlled by
> OPAL.
> 
> This patch adds firmware/kernel interface to read and write OPAL secure
> variables based on the unique key.
> 
> This support can be enabled using CONFIG_OPAL_SECVAR.
> 
> Signed-off-by: Claudio Carvalho 
> Signed-off-by: Nayna Jain 
> ---
>  arch/powerpc/include/asm/opal-api.h  |   5 +-
>  arch/powerpc/include/asm/opal.h  |   8 ++
>  arch/powerpc/include/asm/powernv.h   |   2 +
>  arch/powerpc/include/asm/secvar.h|  35 +
>  arch/powerpc/kernel/Makefile |   2 +-
>  arch/powerpc/kernel/secvar-ops.c |  19 +++
>  arch/powerpc/platforms/powernv/Kconfig   |   6 +
>  arch/powerpc/platforms/powernv/Makefile  |   1 +
>  arch/powerpc/platforms/powernv/opal-call.c   |   3 +
>  arch/powerpc/platforms/powernv/opal-secvar.c | 137 +++
>  arch/powerpc/platforms/powernv/opal.c|   5 +
>  11 files changed, 221 insertions(+), 2 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/secvar.h
>  create mode 100644 arch/powerpc/kernel/secvar-ops.c
>  create mode 100644 arch/powerpc/platforms/powernv/opal-secvar.c
> 
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index 378e3997845a..c1f25a760eb1 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -211,7 +211,10 @@
>  #define OPAL_MPIPL_UPDATE173
>  #define OPAL_MPIPL_REGISTER_TAG  174
>  #define OPAL_MPIPL_QUERY_TAG 175
> -#define OPAL_LAST175
> +#define OPAL_SECVAR_GET  176
> +#define OPAL_SECVAR_GET_NEXT 177
> +#define OPAL_SECVAR_ENQUEUE_UPDATE   178
> +#define OPAL_LAST178
>  
>  #define QUIESCE_HOLD 1 /* Spin all calls at entry */
>  #define QUIESCE_REJECT   2 /* Fail all calls with 
> OPAL_BUSY */
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index a0cf8fba4d12..03392dc3f5e2 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -298,6 +298,13 @@ int opal_sensor_group_clear(u32 group_hndl, int token);
>  int opal_sensor_group_enable(u32 group_hndl, int token, bool enable);
>  int opal_nx_coproc_init(uint32_t chip_id, uint32_t ct);
>  
> +int opal_secvar_get(const char *key, uint64_t key_len, u8 *data,
> + uint64_t *data_size);
> +int opal_secvar_get_next(const char *key, uint64_t *key_len,
> +  uint64_t key_buf_size);
> +int opal_secvar_enqueue_update(const char *key, uint64_t key_len, u8 *data,
> +uint64_t data_size);
> +
>  s64 opal_mpipl_update(enum opal_mpipl_ops op, u64 src, u64 dest, u64 size);
>  s64 opal_mpipl_register_tag(enum opal_mpipl_tags tag, u64 addr);
>  s64 opal_mpipl_query_tag(enum opal_mpipl_tags tag, u64 *addr);
> @@ -392,6 +399,7 @@ void opal_wake_poller(void);
>  void opal_powercap_init(void);
>  void opal_psr_init(void);
>  void opal_sensor_groups_init(void);
> +void opal_secvar_init(void);
>  
>  #endif /* __ASSEMBLY__ */
>  
> diff --git a/arch/powerpc/include/asm/powernv.h 
> b/arch/powerpc/include/asm/powernv.h
> index e1a858718716..cff980a85dd2 100644
> --- a/arch/powerpc/include/asm/powernv.h
> +++ b/arch/powerpc/include/asm/powernv.h
> @@ -12,10 +12,12 @@ extern void powernv_set_nmmu_ptcr(unsigned long ptcr);
>  void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val);
>  
>  void pnv_tm_init(void);
> +
>  #else
>  static inline void powernv_set_nmmu_ptcr(unsigned long ptcr) { }
>  
>  static inline void pnv_tm_init(void) { }
> +
>  #endif
>  
>  #endif /* _ASM_POWERNV_H */
> diff --git a/arch/powerpc/include/asm/secvar.h 
> b/arch/powerpc/include/asm/secvar.h
> new file mode 100644
> index ..4cc35b58b986
> --- /dev/null
> +++ b/arch/powerpc/include/asm/secvar.h
> @@ -0,0 +1,35 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2019 IBM Corporation
> + * Author: Nayna Jain
> + *
> + * PowerPC secure variable operations.
> + */
> +#ifndef SECVAR_OPS_H
> +#define SECVAR_OPS_H
> +
> +#include 
> +#include 
> +
> +extern const struct secvar_operations *secvar_ops;
> +
> +struct secvar_operations {
> + int (*get)(const char *key, uint64_t key_len, u8 *data,
> +uint64_t *data_size);
> + int (*get_next)(const char *key, uint64_t *key_len,
> + uint64_t keybufsize);
> + int (*set)(const char *key, uint64_t key_len, u8 *data,
> +uint64_t data_size);
> +};
> +
> +#ifdef CONFIG_PPC_SECURE_BOOT
> +
> +extern void set_secvar_ops(const struct secvar_operations *ops);
> +
> +#else
> +
> +static inline void 

Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-15 Thread Christoph Hellwig
On Tue, Oct 15, 2019 at 09:56:20AM +0800, Pingfan Liu wrote:
> Agree. For the consistency of the whole fs, we need grub to be aware of
> the log, while this patch just assumes that the files accessed by grub are
> known and that consistency is forced only on those files.
> > get by freezing and unfreezing the file system using the FIFREEZE and
> > FITHAW ioctls.  And if my memory is serving me correctly Dave has been
> freeze will block any further modification to the fs. That is different
> from my patch, which does not have such a limitation.

So you freeze and immediately unfreeze.
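
I.e. from userspace, roughly (sketch; the fd can be any descriptor on the
target filesystem, "/boot" is just an example):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	int fd = open("/boot", O_RDONLY);

	ioctl(fd, FIFREEZE, 0);	/* flushes dirty data and quiesces the fs */
	ioctl(fd, FITHAW, 0);	/* immediately allows modifications again */
	close(fd);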


Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-15 Thread Christoph Hellwig
On Mon, Oct 14, 2019 at 03:09:48PM -0500, Eric Sandeen wrote:
> We're in agreement here.  ;)  I only worry about implementing things like this
> which sound like guarantees, but aren't, and end up encouraging bad behavior
> or promoting misconceptions.
> 
> More and more, I think we should reconsider Darrick's "bootfs" (ext2 by 
> another
> name, but with extra-sync-iness) proposal...

Having a separate simple file system for the boot loader makes a lot of
sense.  Not that the vfat of EFI is the best choice, but at least it is
something.  SysV Unix from the 90s actually had a special file system just
for that, and fs/bfs/ in Linux supports that.  So this isn't really a new
thing either.


Re: [PATCH RFC 0/5] ARM: Raspberry Pi 4 DMA support

2019-10-15 Thread Nicolas Saenz Julienne
On Mon, 2019-10-14 at 21:59 +0100, Catalin Marinas wrote:
> On Mon, Oct 14, 2019 at 08:31:02PM +0200, Nicolas Saenz Julienne wrote:
> > the Raspberry Pi 4 offers up to 4GB of memory, of which only the first
> > GB is DMA capable device wide. This forces us to use bounce buffers,
> > which are currently not very well supported by ARM's custom DMA ops.
> > Among other things the current mechanism (see dmabounce.c) isn't
> > suitable for high memory. Instead of fixing it, this series introduces a
> > way of selecting dma-direct as the default DMA ops provider which allows
> > for the Raspberry Pi to make use of swiotlb.
> 
> I presume these patches go on top of this series:
> 
> http://lkml.kernel.org/r/20190911182546.17094-1-nsaenzjulie...@suse.de

Yes, forgot to mention it. It's relevant for the first patch.

> 
> which I queued here:
> 
> 
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/zone-dma

Thanks!

A little off topic but I was wondering if you have a preferred way to refer to
the arm architecture in a way that unambiguously excludes arm64 (for example
arm32 would work).

Regards,
Nicolas





Re: [PATCH 2/2] virtio_ring: Use DMA API if memory is encrypted

2019-10-15 Thread Christoph Hellwig
On Fri, Oct 11, 2019 at 06:25:19PM -0700, Ram Pai wrote:
> From: Thiago Jung Bauermann 
> 
> Normally, virtio enables DMA API with VIRTIO_F_IOMMU_PLATFORM, which must
> be set by both device and guest driver. However, as a hack, when DMA API
> returns physical addresses, the guest driver can use the DMA API even though
> the device does not set VIRTIO_F_IOMMU_PLATFORM and just uses physical
> addresses.

Sorry, but this is a complete bullshit hack.  Drivers must always use
the DMA API if they do DMA, and if virtio devices use physical addresses,
that needs to be returned through the platform firmware interfaces for
the DMA setup.  If you don't do that yet (which based on previous
information you don't), you need to fix it, and we can then quirk
old implementations that already are out in the field.

In other words: we finally need to fix that virtio mess and not pile
hacks on top of hacks.


Re: [PATCH 1/2] dma-mapping: Add dma_addr_is_phys_addr()

2019-10-15 Thread Christoph Hellwig
On Mon, Oct 14, 2019 at 11:29:24AM +0100, Robin Murphy wrote:
>> However, I would like to see the commit message (and maybe the inline
>> comments) expanded a bit on what the distinction here is about.  Some
>> of the text from the next patch would be suitable, about DMA addresses
>> usually being in a different address space but not in the case of
>> bounce buffering.
>
> Right, this needs a much tighter definition. "DMA address happens to be a 
> valid physical address" is true of various IOMMU setups too, but I can't 
> believe it's meaningful in such cases.
>
> If what you actually want is "DMA is direct or SWIOTLB" - i.e. "DMA address 
> is physical address of DMA data (not necessarily the original buffer)" - 
> wouldn't dma_is_direct() suffice?

It would.  But drivers have absolutely no business knowing any of this.


RE: [PATCH 1/2] dma-mapping: Add dma_addr_is_phys_addr()

2019-10-15 Thread Ram Pai
On Mon, Oct 14, 2019 at 11:29:24AM +0100, Robin Murphy wrote:
> On 14/10/2019 05:51, David Gibson wrote:
> >On Fri, Oct 11, 2019 at 06:25:18PM -0700, Ram Pai wrote:
> >>From: Thiago Jung Bauermann 
> >>
> >>In order to safely use the DMA API, virtio needs to know whether DMA
> >>addresses are in fact physical addresses and for that purpose,
> >>dma_addr_is_phys_addr() is introduced.
> >>
> >>cc: Benjamin Herrenschmidt 
> >>cc: David Gibson 
> >>cc: Michael Ellerman 
> >>cc: Paul Mackerras 
> >>cc: Michael Roth 
> >>cc: Alexey Kardashevskiy 
> >>cc: Paul Burton 
> >>cc: Robin Murphy 
> >>cc: Bartlomiej Zolnierkiewicz 
> >>cc: Marek Szyprowski 
> >>cc: Christoph Hellwig 
> >>Suggested-by: Michael S. Tsirkin 
> >>Signed-off-by: Ram Pai 
> >>Signed-off-by: Thiago Jung Bauermann 
> >
> >The change itself looks ok, so
> >
> >Reviewed-by: David Gibson 
> >
> >However, I would like to see the commit message (and maybe the inline
> >comments) expanded a bit on what the distinction here is about.  Some
> >of the text from the next patch would be suitable, about DMA addresses
> >usually being in a different address space but not in the case of
> >bounce buffering.
> 
> Right, this needs a much tighter definition. "DMA address happens to
> be a valid physical address" is true of various IOMMU setups too,
> but I can't believe it's meaningful in such cases.

The definition by itself is meaningful AFAICT. At its core it's just
indicating whether DMA addresses are physical addresses or not.

However it's up to the caller to use it meaningfully. For a non-virtio
caller, dma_addr_is_phys_addr() by itself may or may not be meaningful.

But for a virtio caller, this information, when paired with
mem_encrypt_active(), is meaningful.
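
To make that concrete, here is a sketch of how a virtio caller might
combine the two; this only illustrates the idea and is not the code in
patch 2/2 (dma_addr_is_phys_addr() is the API proposed in patch 1/2,
mem_encrypt_active() and virtio_has_iommu_quirk() are existing helpers):

static bool vring_use_dma_api(struct virtio_device *vdev)
{
	/* device negotiated VIRTIO_F_IOMMU_PLATFORM: always use the DMA API */
	if (!virtio_has_iommu_quirk(vdev))
		return true;

	/*
	 * Encrypted guest where DMA addresses are physical addresses:
	 * bounce buffering through the DMA API is both needed and safe.
	 */
	if (mem_encrypt_active() && dma_addr_is_phys_addr(vdev->dev.parent))
		return true;

	return false;
}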

For IOMMU setups the DMA API will get used regardless of whether the
"DMA address happens to be a valid physical address".


> 
> If what you actually want is "DMA is direct or SWIOTLB" - i.e. "DMA
> address is physical address of DMA data (not necessarily the
> original buffer)" - wouldn't dma_is_direct() suffice?

This may also work, I think.  MST: thoughts?

RP



Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-15 Thread Daniel Axtens


> There is a potential problem here, as Will Deacon wrote up at:
>
>   
> https://lore.kernel.org/linux-arm-kernel/20190827131818.14724-1-w...@kernel.org/
>
> ... in the section starting:
>
> | *** Other architecture maintainers -- start here! ***
>
> ... whereby the CPU can spuriously fault on an access after observing a
> valid PTE.
>
> For arm64 we handle the spurious fault, and it looks like x86 would need
> something like its vmalloc_fault() applying to the shadow region to
> cater for this.

I'm not really up on x86 - my first thought would be that their stronger
memory ordering might be sufficient but I really don't know. Reading the
thread I see arm and powerpc discussions but nothing from anyone else,
so I'm none the wiser there...

Andy, do you have any thoughts?

Regards,
Daniel

>
> Thanks,
> Mark.


Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-15 Thread Daniel Axtens
>>> @@ -2497,6 +2533,9 @@ void *__vmalloc_node_range(unsigned long size, 
>>> unsigned long align,
>>> if (!addr)
>>> return NULL;
>>>  
>>> +   if (kasan_populate_vmalloc(real_size, area))
>>> +   return NULL;
>>> +
>>
>> KASAN itself uses __vmalloc_node_range() to allocate and map shadow in 
>> memory online callback.
>> So we should either skip non-vmalloc and non-module addresses here or teach 
>> kasan's memory online/offline
>> callbacks to not use __vmalloc_node_range() (do something similar to 
>> kasan_populate_vmalloc() perhaps?). 
>
> Ah, right you are. I haven't been testing that.
>
> I am a bit nervous about further restricting kasan_populate_vmalloc: I
> seem to remember having problems with code using the vmalloc family of
> functions to map memory that doesn't lie within vmalloc space but which
> still has instrumented accesses.

I was wrong or remembering early implementation bugs.

If the memory we're allocating in __vmalloc_node_range falls outside of
vmalloc and module space, it shouldn't get shadow mapped
by kasan_populate_vmalloc. For v9, I've guarded the call with
is_vmalloc_or_module. It seems to work fine when tested with hotplugged
memory.
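
The guard amounts to something like (sketch):

	if (is_vmalloc_or_module_addr(addr) &&
	    kasan_populate_vmalloc(real_size, area))
		return NULL;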

Thanks again.

Regards,
Daniel

> On the other hand, I'm not keen on rewriting any of the memory
> on/offline code if I can avoid it!
>
> I'll have a look and get back you as soon as I can.
>
> Thanks for catching this.
>
> Kind regards,
> Daniel
>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "kasan-dev" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to kasan-dev+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/kasan-dev/352cb4fa-2e57-7e3b-23af-898e113bbe22%40virtuozzo.com.


Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-15 Thread Daniel Axtens
Mark Rutland  writes:

> On Tue, Oct 01, 2019 at 04:58:30PM +1000, Daniel Axtens wrote:
>> Hook into vmalloc and vmap, and dynamically allocate real shadow
>> memory to back the mappings.
>> 
>> Most mappings in vmalloc space are small, requiring less than a full
>> page of shadow space. Allocating a full shadow page per mapping would
>> therefore be wasteful. Furthermore, to ensure that different mappings
>> use different shadow pages, mappings would have to be aligned to
>> KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.
>> 
>> Instead, share backing space across multiple mappings. Allocate a
>> backing page when a mapping in vmalloc space uses a particular page of
>> the shadow region. This page can be shared by other vmalloc mappings
>> later on.
>> 
>> We hook in to the vmap infrastructure to lazily clean up unused shadow
>> memory.
>> 
>> To avoid the difficulties around swapping mappings around, this code
>> expects that the part of the shadow region that covers the vmalloc
>> space will not be covered by the early shadow page, but will be left
>> unmapped. This will require changes in arch-specific code.
>> 
>> This allows KASAN with VMAP_STACK, and may be helpful for architectures
>> that do not have a separate module space (e.g. powerpc64, which I am
>> currently working on). It also allows relaxing the module alignment
>> back to PAGE_SIZE.
>> 
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202009
>> Acked-by: Vasily Gorbik 
>> Signed-off-by: Daniel Axtens 
>> [Mark: rework shadow allocation]
>> Signed-off-by: Mark Rutland 
>
> Sorry to point this out so late, but your S-o-B should come last in the
> chain per Documentation/process/submitting-patches.rst. Judging by the
> rest of that, I think you want something like:
>
> Co-developed-by: Mark Rutland 
> Signed-off-by: Mark Rutland  [shadow rework]
> Signed-off-by: Daniel Axtens 
>
> ... leaving yourself as the Author in the headers.

no worries, I wasn't really sure how best to arrange them, so thanks for
clarifying!

>
> Sorry to have made that more complicated!
>
> [...]
>
>> +static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
>> +void *unused)
>> +{
>> +unsigned long page;
>> +
>> +page = (unsigned long)__va(pte_pfn(*ptep) << PAGE_SHIFT);
>> +
>> +spin_lock(&init_mm.page_table_lock);
>> +
>> +if (likely(!pte_none(*ptep))) {
>> +pte_clear(&init_mm, addr, ptep);
>> +free_page(page);
>> +}
>
> There should be TLB maintenance between clearing the PTE and freeing the
> page here.

Fixed for v9.
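
E.g. by flushing the kernel TLB for the cleared range before the page is
freed (sketch; the actual fix may batch the flush differently):

	if (likely(!pte_none(*ptep))) {
		pte_clear(&init_mm, addr, ptep);
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
		free_page(page);
	}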

Regards,
Daniel

>
> Thanks,
> Mark.