Re: [PATCH 2/4] powerpc/32: activate ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE

2019-07-15 Thread Oliver O'Halloran
On Mon, Jul 15, 2019 at 4:49 PM Michael Ellerman  wrote:
>
> Christophe Leroy  writes:
> > PPC32 also have flush_dcache_range() so it can also support
> > ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE without changes.
> >
> > Signed-off-by: Christophe Leroy 
> > ---
> >  arch/powerpc/Kconfig | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > index d7996cfaceca..cf6e30f637be 100644
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -127,13 +127,13 @@ config PPC
> >   select ARCH_HAS_KCOV
> >   select ARCH_HAS_MMIOWB  if PPC64
> >   select ARCH_HAS_PHYS_TO_DMA
> > - select ARCH_HAS_PMEM_APIif PPC64
> > + select ARCH_HAS_PMEM_API
> >   select ARCH_HAS_PTE_SPECIAL
> >   select ARCH_HAS_MEMBARRIER_CALLBACKS
> >   select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
> > && PPC64
> >   select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) 
> > && !RELOCATABLE && !HIBERNATION)
> >   select ARCH_HAS_TICK_BROADCAST  if 
> > GENERIC_CLOCKEVENTS_BROADCAST
> > - select ARCH_HAS_UACCESS_FLUSHCACHE  if PPC64
> > + select ARCH_HAS_UACCESS_FLUSHCACHE
> >   select ARCH_HAS_UBSAN_SANITIZE_ALL
> >   select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64
> >   select ARCH_HAVE_NMI_SAFE_CMPXCHG
>
> This didn't build for me, probably due to something that's changed in
> the long period between you posting it and me applying it?
>
> corenet32_smp_defconfig:
>
>   powerpc64-unknown-linux-gnu-ld: lib/iov_iter.o: in function 
> `_copy_from_iter_flushcache':
>   powerpc64-unknown-linux-gnu-ld: 
> /scratch/michael/build/maint/build~/../lib/iov_iter.c:825: undefined 
> reference to `memcpy_page_flushcache'
>   powerpc64-unknown-linux-gnu-ld: 
> /scratch/michael/build/maint/build~/../lib/iov_iter.c:825: undefined 
> reference to `memcpy_flushcache'
>   powerpc64-unknown-linux-gnu-ld: 
> /scratch/michael/build/maint/build~/../lib/iov_iter.c:825: undefined 
> reference to `__copy_from_user_flushcache'
>   powerpc64-unknown-linux-gnu-ld: 
> /scratch/michael/build/maint/build~/../lib/iov_iter.c:825: undefined 
> reference to `memcpy_flushcache'

I think lib/pmem.c just needs to be moved out of obj64-y.

>
> cheers


Re: [PATCH] powerpc: remove meaningless KBUILD_ARFLAGS addition

2019-07-15 Thread Michael Ellerman
Segher Boessenkool  writes:
> On Sun, Jul 14, 2019 at 07:45:15AM +0900, Masahiro Yamada wrote:
>> On Sat, Jul 13, 2019 at 10:17 PM Segher Boessenkool
>>  wrote:
>> > On Sat, Jul 13, 2019 at 07:47:44AM -0500, Segher Boessenkool wrote:
>> > > On Sat, Jul 13, 2019 at 12:21:06PM +0900, Masahiro Yamada wrote:
>> > > > The KBUILD_ARFLAGS addition in arch/powerpc/Makefile has never worked
>> > > > in a useful way because it is always overridden by the following code
>> > > > in the top Makefile:
>> > > >
>> > > >   # use the deterministic mode of AR if available
>> > > >   KBUILD_ARFLAGS := $(call ar-option,D)
>> > > >
>> > > > The code in the top Makefile was added in 2011, by commit 40df759e2b9e
>> > > > ("kbuild: Fix build with binutils <= 2.19").
>> > > >
>> > > > The KBUILD_ARFLAGS addition for ppc has always been dead code from the
>> > > > beginning.
>> > >
>> > > That was added in 43c9127d94d6 to replace my 8995ac870273 from 2007.
>> > >
>> > > Is it no longer supported to build a 64-bit kernel with a toolchain
>> > > that defaults to 32-bit, or the other way around?  And with non-native
>> > > toolchains (this one didn't run on Linux, even).
>> >
>> > It was an --enable-targets=all toolchain, somewhat common for crosses,
>> > if that matters.
>> 
>> I always use the same toolchain
>> for compile-testing PPC32/64.
>> 
>> I have never been hit by the issue you mention.
>> Somebody would have reported it if it were still a problem.
>
> But did you use --enable-targets=all?

I do. And I don't see any errors with this patch applied.

> The problem was empty archives IIRC.  Not a problem anymore with thin
> archives, maybe?

Maybe? Though I can't get it to break even before we switched to them.

>> Moreover, commit 43c9127d94d6
>> translated the environment variable "GNUTARGET"
>> into the command option "--target="
>> 
>> My powerpc-linux-ar does not know it:
>> 
>> powerpc-linux-ar: -t: No such file or directory
>
> Yes, that is why I used the environment variable, all binutils work
> with that.  There was no --target option in GNU ar before 2.22.

Yeah, we're not very good at testing with really old binutils, so I
guess we broke that.

I'm inclined to merge this, it doesn't seem to break anything, and it
fixes using --target on old binutils that don't have it.

cheers


[PATCH kernel v2] powerpc/xive: Drop deregistered irqs

2019-07-15 Thread Alexey Kardashevskiy
There is a race between releasing an irq on one cpu and fetching it
from XIVE on another cpu as there does not seem to be any locking between
these, probably because xive_irq_chip::irq_shutdown() is supposed to
remove the irq from all queues in the system which it does not do.

As a result, when such released irq appears in a queue, we take it
from the queue but we do not change the current priority on that cpu and
since there is no handler for the irq, EOI is never called and the cpu
current priority remains elevated (7 vs. 0xff==unmasked). If another irq
is assigned to the same cpu, then that device stops working until irq
is moved to another cpu or the device is reset.

This adds a new ppc_md.orphan_irq callback which is called if no irq
descriptor is found. The XIVE implementation drops the current priority
to 0xff which effectively unmasks interrupts in a current CPU.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* added ppc_md.orphan_irq

---

Found it on P9 system with:
- a host with 8 cpus online
- a boot disk on ahci with its msix on cpu#0
- a guest with 2xGPUs + 6xNVLink + 4 cpus
- GPU#0 from the guest is bound to the same cpu#0.

Killing a guest killed ahci and therefore the host because of the race.
Note that VFIO masks interrupts first and only then resets the device.
---
 arch/powerpc/include/asm/machdep.h |  3 +++
 arch/powerpc/kernel/irq.c  |  9 ++---
 arch/powerpc/sysdev/xive/common.c  | 10 ++
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index c43d6eca9edd..6cc14e28e89a 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -59,6 +59,9 @@ struct machdep_calls {
/* Return an irq, or 0 to indicate there are none pending. */
unsigned int(*get_irq)(void);
 
+   /* Drops irq if it does not have a valid descriptor */
+   void(*orphan_irq)(unsigned int irq);
+
/* PCI stuff */
/* Called after allocating resources */
void(*pcibios_fixup)(void);
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index bc68c53af67c..b4e06d05bdba 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -632,10 +632,13 @@ void __do_irq(struct pt_regs *regs)
may_hard_irq_enable();
 
/* And finally process it */
-   if (unlikely(!irq))
+   if (unlikely(!irq)) {
__this_cpu_inc(irq_stat.spurious_irqs);
-   else
-   generic_handle_irq(irq);
+   } else if (generic_handle_irq(irq)) {
+   if (ppc_md.orphan_irq)
+   ppc_md.orphan_irq(irq);
+   __this_cpu_inc(irq_stat.spurious_irqs);
+   }
 
trace_irq_exit(regs);
 
diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 082c7e1c20f0..b4054091999a 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -283,6 +283,15 @@ static unsigned int xive_get_irq(void)
return irq;
 }
 
+static void xive_orphan_irq(unsigned int irq)
+{
+   struct xive_cpu *xc = __this_cpu_read(xive_cpu);
+
+   xc->cppr = 0xff;
+   out_8(xive_tima + xive_tima_offset + TM_CPPR, 0xff);
+   DBG_VERBOSE("orphan_irq: irq %d, adjusting CPPR to 0xff\n", irq);
+}
+
 /*
  * After EOI'ing an interrupt, we need to re-check the queue
  * to see if another interrupt is pending since multiple
@@ -1419,6 +1428,7 @@ bool __init xive_core_init(const struct xive_ops *ops, 
void __iomem *area, u32 o
xive_irq_priority = max_prio;
 
ppc_md.get_irq = xive_get_irq;
+   ppc_md.orphan_irq = xive_orphan_irq;
__xive_enabled = true;
 
pr_devel("Initializing host..\n");
-- 
2.17.1



Re: [PATCH v3 5/7] kexec_elf: remove elf_addr_to_cpu macro

2019-07-15 Thread Sven Schnelle
Hi Michael,

On Thu, Jul 11, 2019 at 09:08:51PM +1000, Michael Ellerman wrote:
> Sven Schnelle  writes:
> > On Wed, Jul 10, 2019 at 05:09:29PM +0200, Christophe Leroy wrote:
> >> Le 10/07/2019 à 16:29, Sven Schnelle a écrit :
> >> > It had only one definition, so just use the function directly.
> >> 
> >> It had only one definition because it was for ppc64 only.
> >> But as far as I understand (at least from the name of the new file), you
> >> want it to be generic, don't you ? Therefore I get on 32 bits it would be
> >> elf32_to_cpu().
> >
> > That brings up the question whether we need those endianess conversions. I 
> > would
> > assume that the ELF file has always the same endianess as the running 
> > kernel. So
> > i think we could just drop them. What do you think?
> 
> We should be able to kexec from big to little endian or vice versa, so
> they are necessary.

I'll update the patch to check for a needed 32/64 bit conversion during runtime,
so we can also kexec from 32 to 64 bit kernels and vice versa. Don't know
whether that's possible on powerpc, but at least on parisc it is.

Regards
Sven


Re: [PATCH] powerpc: remove meaningless KBUILD_ARFLAGS addition

2019-07-15 Thread Segher Boessenkool
On Mon, Jul 15, 2019 at 05:05:34PM +1000, Michael Ellerman wrote:
> Segher Boessenkool  writes:
> > Yes, that is why I used the environment variable, all binutils work
> > with that.  There was no --target option in GNU ar before 2.22.
> 
> Yeah, we're not very good at testing with really old binutils, so I
> guess we broke that.
> 
> I'm inclined to merge this, it doesn't seem to break anything, and it
> fixes using --target on old binutils that don't have it.

But we don't set the target any other way either.  I don't think this
will work with a 32-bit toolchain (default target 32 bit) and a 64-bit
kernel, or the other way around.

Then again, does that work at *all* nowadays?  Do we even consider that
important, *should* it work?


Segher


[PATCH v4 7/7] kexec_elf: support 32 bit ELF files

2019-07-15 Thread Sven Schnelle
The powerpc version only supported 64 bit. Add some
code to switch decoding of fields during runtime so
we can kexec a 32 bit kernel from a 64 bit kernel and
vice versa.

Signed-off-by: Sven Schnelle 
---
 kernel/kexec_elf.c | 57 ++
 1 file changed, 42 insertions(+), 15 deletions(-)

diff --git a/kernel/kexec_elf.c b/kernel/kexec_elf.c
index 9421eebbacf0..a39d01154829 100644
--- a/kernel/kexec_elf.c
+++ b/kernel/kexec_elf.c
@@ -8,8 +8,6 @@
 #include 
 #include 
 
-#define elf_addr_to_cpuelf64_to_cpu
-
 static inline bool elf_is_elf_file(const struct elfhdr *ehdr)
 {
return memcmp(ehdr->e_ident, ELFMAG, SELFMAG) == 0;
@@ -139,9 +137,6 @@ static int elf_read_ehdr(const char *buf, size_t len, 
struct elfhdr *ehdr)
ehdr->e_type  = elf16_to_cpu(ehdr, buf_ehdr->e_type);
ehdr->e_machine   = elf16_to_cpu(ehdr, buf_ehdr->e_machine);
ehdr->e_version   = elf32_to_cpu(ehdr, buf_ehdr->e_version);
-   ehdr->e_entry = elf_addr_to_cpu(ehdr, buf_ehdr->e_entry);
-   ehdr->e_phoff = elf_addr_to_cpu(ehdr, buf_ehdr->e_phoff);
-   ehdr->e_shoff = elf_addr_to_cpu(ehdr, buf_ehdr->e_shoff);
ehdr->e_flags = elf32_to_cpu(ehdr, buf_ehdr->e_flags);
ehdr->e_phentsize = elf16_to_cpu(ehdr, buf_ehdr->e_phentsize);
ehdr->e_phnum = elf16_to_cpu(ehdr, buf_ehdr->e_phnum);
@@ -149,6 +144,24 @@ static int elf_read_ehdr(const char *buf, size_t len, 
struct elfhdr *ehdr)
ehdr->e_shnum = elf16_to_cpu(ehdr, buf_ehdr->e_shnum);
ehdr->e_shstrndx  = elf16_to_cpu(ehdr, buf_ehdr->e_shstrndx);
 
+   switch (ehdr->e_ident[EI_CLASS]) {
+   case ELFCLASS64:
+   ehdr->e_entry = elf64_to_cpu(ehdr, buf_ehdr->e_entry);
+   ehdr->e_phoff = elf64_to_cpu(ehdr, buf_ehdr->e_phoff);
+   ehdr->e_shoff = elf64_to_cpu(ehdr, buf_ehdr->e_shoff);
+   break;
+
+   case ELFCLASS32:
+   ehdr->e_entry = elf32_to_cpu(ehdr, buf_ehdr->e_entry);
+   ehdr->e_phoff = elf32_to_cpu(ehdr, buf_ehdr->e_phoff);
+   ehdr->e_shoff = elf32_to_cpu(ehdr, buf_ehdr->e_shoff);
+   break;
+
+   default:
+   pr_debug("Unknown ELF class.\n");
+   return -EINVAL;
+   }
+
return elf_is_ehdr_sane(ehdr, len) ? 0 : -ENOEXEC;
 }
 
@@ -179,6 +192,7 @@ static int elf_read_phdr(const char *buf, size_t len,
 {
/* Override the const in proghdrs, we are the ones doing the loading. */
struct elf_phdr *phdr = (struct elf_phdr *) &elf_info->proghdrs[idx];
+   const struct elfhdr *ehdr = elf_info->ehdr;
const char *pbuf;
struct elf_phdr *buf_phdr;
 
@@ -186,18 +200,31 @@ static int elf_read_phdr(const char *buf, size_t len,
buf_phdr = (struct elf_phdr *) pbuf;
 
phdr->p_type   = elf32_to_cpu(elf_info->ehdr, buf_phdr->p_type);
-   phdr->p_offset = elf_addr_to_cpu(elf_info->ehdr, buf_phdr->p_offset);
-   phdr->p_paddr  = elf_addr_to_cpu(elf_info->ehdr, buf_phdr->p_paddr);
-   phdr->p_vaddr  = elf_addr_to_cpu(elf_info->ehdr, buf_phdr->p_vaddr);
phdr->p_flags  = elf32_to_cpu(elf_info->ehdr, buf_phdr->p_flags);
 
-   /*
-* The following fields have a type equivalent to Elf_Addr
-* both in 32 bit and 64 bit ELF.
-*/
-   phdr->p_filesz = elf_addr_to_cpu(elf_info->ehdr, buf_phdr->p_filesz);
-   phdr->p_memsz  = elf_addr_to_cpu(elf_info->ehdr, buf_phdr->p_memsz);
-   phdr->p_align  = elf_addr_to_cpu(elf_info->ehdr, buf_phdr->p_align);
+   switch (ehdr->e_ident[EI_CLASS]) {
+   case ELFCLASS64:
+   phdr->p_offset = elf64_to_cpu(ehdr, buf_phdr->p_offset);
+   phdr->p_paddr  = elf64_to_cpu(ehdr, buf_phdr->p_paddr);
+   phdr->p_vaddr  = elf64_to_cpu(ehdr, buf_phdr->p_vaddr);
+   phdr->p_filesz = elf64_to_cpu(ehdr, buf_phdr->p_filesz);
+   phdr->p_memsz  = elf64_to_cpu(ehdr, buf_phdr->p_memsz);
+   phdr->p_align  = elf64_to_cpu(ehdr, buf_phdr->p_align);
+   break;
+
+   case ELFCLASS32:
+   phdr->p_offset = elf32_to_cpu(ehdr, buf_phdr->p_offset);
+   phdr->p_paddr  = elf32_to_cpu(ehdr, buf_phdr->p_paddr);
+   phdr->p_vaddr  = elf32_to_cpu(ehdr, buf_phdr->p_vaddr);
+   phdr->p_filesz = elf32_to_cpu(ehdr, buf_phdr->p_filesz);
+   phdr->p_memsz  = elf32_to_cpu(ehdr, buf_phdr->p_memsz);
+   phdr->p_align  = elf32_to_cpu(ehdr, buf_phdr->p_align);
+   break;
+
+   default:
+   pr_debug("Unknown ELF class.\n");
+   return -EINVAL;
+   }
 
return elf_is_phdr_sane(phdr, len) ? 0 : -ENOEXEC;
 }
-- 
2.20.1



[PATCH v4 4/7] kexec_elf: remove PURGATORY_STACK_SIZE

2019-07-15 Thread Sven Schnelle
It's not used anywhere so just drop it.

Signed-off-by: Sven Schnelle 
---
 kernel/kexec_elf.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/kexec_elf.c b/kernel/kexec_elf.c
index effe9dc0b055..70d31b8feeae 100644
--- a/kernel/kexec_elf.c
+++ b/kernel/kexec_elf.c
@@ -8,8 +8,6 @@
 #include 
 #include 
 
-#define PURGATORY_STACK_SIZE   (16 * 1024)
-
 #define elf_addr_to_cpuelf64_to_cpu
 
 #ifndef Elf_Rel
-- 
2.20.1



[PATCH v4 6/7] kexec_elf: remove unused variable in kexec_elf_load()

2019-07-15 Thread Sven Schnelle
base was never assigned, so we can remove it.

Reviewed-by: Christophe Leroy 
Signed-off-by: Sven Schnelle 
---
 kernel/kexec_elf.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/kernel/kexec_elf.c b/kernel/kexec_elf.c
index e346659af324..9421eebbacf0 100644
--- a/kernel/kexec_elf.c
+++ b/kernel/kexec_elf.c
@@ -350,7 +350,7 @@ int kexec_elf_load(struct kimage *image, struct elfhdr 
*ehdr,
 struct kexec_buf *kbuf,
 unsigned long *lowest_load_addr)
 {
-   unsigned long base = 0, lowest_addr = UINT_MAX;
+   unsigned long lowest_addr = UINT_MAX;
int ret;
size_t i;
 
@@ -372,7 +372,7 @@ int kexec_elf_load(struct kimage *image, struct elfhdr 
*ehdr,
kbuf->bufsz = size;
kbuf->memsz = phdr->p_memsz;
kbuf->buf_align = phdr->p_align;
-   kbuf->buf_min = phdr->p_paddr + base;
+   kbuf->buf_min = phdr->p_paddr;
ret = kexec_add_buffer(kbuf);
if (ret)
goto out;
@@ -382,9 +382,6 @@ int kexec_elf_load(struct kimage *image, struct elfhdr 
*ehdr,
lowest_addr = load_addr;
}
 
-   /* Update entry point to reflect new load address. */
-   ehdr->e_entry += base;
-
*lowest_load_addr = lowest_addr;
ret = 0;
  out:
-- 
2.20.1



[PATCH v4 5/7] kexec_elf: remove Elf_Rel macro

2019-07-15 Thread Sven Schnelle
It wasn't used anywhere, so lets drop it.

Reviewed-by: Christophe Leroy 
Signed-off-by: Sven Schnelle 
---
 kernel/kexec_elf.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/kernel/kexec_elf.c b/kernel/kexec_elf.c
index 70d31b8feeae..e346659af324 100644
--- a/kernel/kexec_elf.c
+++ b/kernel/kexec_elf.c
@@ -10,10 +10,6 @@
 
 #define elf_addr_to_cpuelf64_to_cpu
 
-#ifndef Elf_Rel
-#define Elf_RelElf64_Rel
-#endif /* Elf_Rel */
-
 static inline bool elf_is_elf_file(const struct elfhdr *ehdr)
 {
return memcmp(ehdr->e_ident, ELFMAG, SELFMAG) == 0;
-- 
2.20.1



[PATCH v4 2/7] kexec_elf: change order of elf_*_to_cpu() functions

2019-07-15 Thread Sven Schnelle
Change the order to have a 64/32/16 order, no functional change.

Signed-off-by: Sven Schnelle 
---
 kernel/kexec_elf.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/kexec_elf.c b/kernel/kexec_elf.c
index 6e9f52171ede..76e7df64d715 100644
--- a/kernel/kexec_elf.c
+++ b/kernel/kexec_elf.c
@@ -31,22 +31,22 @@ static uint64_t elf64_to_cpu(const struct elfhdr *ehdr, 
uint64_t value)
return value;
 }
 
-static uint16_t elf16_to_cpu(const struct elfhdr *ehdr, uint16_t value)
+static uint32_t elf32_to_cpu(const struct elfhdr *ehdr, uint32_t value)
 {
if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
-   value = le16_to_cpu(value);
+   value = le32_to_cpu(value);
else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
-   value = be16_to_cpu(value);
+   value = be32_to_cpu(value);
 
return value;
 }
 
-static uint32_t elf32_to_cpu(const struct elfhdr *ehdr, uint32_t value)
+static uint16_t elf16_to_cpu(const struct elfhdr *ehdr, uint16_t value)
 {
if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
-   value = le32_to_cpu(value);
+   value = le16_to_cpu(value);
else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
-   value = be32_to_cpu(value);
+   value = be16_to_cpu(value);
 
return value;
 }
-- 
2.20.1



[PATCH v4 3/7] kexec_elf: remove parsing of section headers

2019-07-15 Thread Sven Schnelle
We're not using them, so we can drop the parsing.

Signed-off-by: Sven Schnelle 
---
 include/linux/kexec.h |   1 -
 kernel/kexec_elf.c| 137 --
 2 files changed, 138 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index da2a6b1d69e7..f0b809258ed3 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -226,7 +226,6 @@ struct kexec_elf_info {
 
const struct elfhdr *ehdr;
const struct elf_phdr *proghdrs;
-   struct elf_shdr *sechdrs;
 };
 
 int kexec_build_elf_info(const char *buf, size_t len, struct elfhdr *ehdr,
diff --git a/kernel/kexec_elf.c b/kernel/kexec_elf.c
index 76e7df64d715..effe9dc0b055 100644
--- a/kernel/kexec_elf.c
+++ b/kernel/kexec_elf.c
@@ -244,134 +244,6 @@ static int elf_read_phdrs(const char *buf, size_t len,
return 0;
 }
 
-/**
- * elf_is_shdr_sane - check that it is safe to use the section header
- * @buf_len:   size of the buffer in which the ELF file is loaded.
- */
-static bool elf_is_shdr_sane(const struct elf_shdr *shdr, size_t buf_len)
-{
-   bool size_ok;
-
-   /* SHT_NULL headers have undefined values, so we can't check them. */
-   if (shdr->sh_type == SHT_NULL)
-   return true;
-
-   /* Now verify sh_entsize */
-   switch (shdr->sh_type) {
-   case SHT_SYMTAB:
-   size_ok = shdr->sh_entsize == sizeof(Elf_Sym);
-   break;
-   case SHT_RELA:
-   size_ok = shdr->sh_entsize == sizeof(Elf_Rela);
-   break;
-   case SHT_DYNAMIC:
-   size_ok = shdr->sh_entsize == sizeof(Elf_Dyn);
-   break;
-   case SHT_REL:
-   size_ok = shdr->sh_entsize == sizeof(Elf_Rel);
-   break;
-   case SHT_NOTE:
-   case SHT_PROGBITS:
-   case SHT_HASH:
-   case SHT_NOBITS:
-   default:
-   /*
-* This is a section whose entsize requirements
-* I don't care about.  If I don't know about
-* the section I can't care about it's entsize
-* requirements.
-*/
-   size_ok = true;
-   break;
-   }
-
-   if (!size_ok) {
-   pr_debug("ELF section with wrong entry size.\n");
-   return false;
-   } else if (shdr->sh_addr + shdr->sh_size < shdr->sh_addr) {
-   pr_debug("ELF section address wraps around.\n");
-   return false;
-   }
-
-   if (shdr->sh_type != SHT_NOBITS) {
-   if (shdr->sh_offset + shdr->sh_size < shdr->sh_offset) {
-   pr_debug("ELF section location wraps around.\n");
-   return false;
-   } else if (shdr->sh_offset + shdr->sh_size > buf_len) {
-   pr_debug("ELF section not in file.\n");
-   return false;
-   }
-   }
-
-   return true;
-}
-
-static int elf_read_shdr(const char *buf, size_t len,
-struct kexec_elf_info *elf_info,
-int idx)
-{
-   struct elf_shdr *shdr = &elf_info->sechdrs[idx];
-   const struct elfhdr *ehdr = elf_info->ehdr;
-   const char *sbuf;
-   struct elf_shdr *buf_shdr;
-
-   sbuf = buf + ehdr->e_shoff + idx * sizeof(*buf_shdr);
-   buf_shdr = (struct elf_shdr *) sbuf;
-
-   shdr->sh_name  = elf32_to_cpu(ehdr, buf_shdr->sh_name);
-   shdr->sh_type  = elf32_to_cpu(ehdr, buf_shdr->sh_type);
-   shdr->sh_addr  = elf_addr_to_cpu(ehdr, buf_shdr->sh_addr);
-   shdr->sh_offset= elf_addr_to_cpu(ehdr, buf_shdr->sh_offset);
-   shdr->sh_link  = elf32_to_cpu(ehdr, buf_shdr->sh_link);
-   shdr->sh_info  = elf32_to_cpu(ehdr, buf_shdr->sh_info);
-
-   /*
-* The following fields have a type equivalent to Elf_Addr
-* both in 32 bit and 64 bit ELF.
-*/
-   shdr->sh_flags = elf_addr_to_cpu(ehdr, buf_shdr->sh_flags);
-   shdr->sh_size  = elf_addr_to_cpu(ehdr, buf_shdr->sh_size);
-   shdr->sh_addralign = elf_addr_to_cpu(ehdr, buf_shdr->sh_addralign);
-   shdr->sh_entsize   = elf_addr_to_cpu(ehdr, buf_shdr->sh_entsize);
-
-   return elf_is_shdr_sane(shdr, len) ? 0 : -ENOEXEC;
-}
-
-/**
- * elf_read_shdrs - read the section headers from the buffer
- *
- * This function assumes that the section header table was checked for sanity.
- * Use elf_is_ehdr_sane() if it wasn't.
- */
-static int elf_read_shdrs(const char *buf, size_t len,
- struct kexec_elf_info *elf_info)
-{
-   size_t shdr_size, i;
-
-   /*
-* e_shnum is at most 65536 so calculating
-* the size of the section header cannot overflow.
-*/
-   shdr_size = sizeof(struct elf_shdr) * elf_info->ehdr->e_shnum;
-
-   elf_info->sechdrs = kzalloc(shdr_size, GFP_KERNEL);
-   if (!elf_info->sechdrs)
-   return -ENOMEM;
-
-   for (i =

[PATCH v4 0/7] kexec: add generic support for elf kernel images

2019-07-15 Thread Sven Schnelle
Changes to v3:
 - add support for 32-bit ELF files

Changes to v2:
 - use git format-patch -C

Changes to v1:
 - split up patch into smaller pieces
 - rebase onto powerpc/next
 - remove unused variable in kexec_elf_load()

Changes to RFC version:
 - remove unused Elf_Rel macro
 - remove section header parsing
 - remove PURGATORY_STACK_SIZE
 - change order of elf_*_to_cpu() functions
 - remove elf_addr_to_cpu macro

Sven Schnelle (7):
  kexec: add KEXEC_ELF
  kexec_elf: change order of elf_*_to_cpu() functions
  kexec_elf: remove parsing of section headers
  kexec_elf: remove PURGATORY_STACK_SIZE
  kexec_elf: remove Elf_Rel macro
  kexec_elf: remove unused variable in kexec_elf_load()
  kexec_elf: support 32 bit ELF files

 arch/Kconfig   |   3 +
 arch/powerpc/Kconfig   |   1 +
 arch/powerpc/kernel/kexec_elf_64.c | 551 +
 include/linux/kexec.h  |  23 ++
 kernel/Makefile|   1 +
 kernel/kexec_elf.c | 418 ++
 6 files changed, 456 insertions(+), 541 deletions(-)
 create mode 100644 kernel/kexec_elf.c

-- 
2.20.1



[PATCH v4 1/7] kexec: add KEXEC_ELF

2019-07-15 Thread Sven Schnelle
Right now powerpc provides an implementation to read elf files
with the kexec_file() syscall. Make that available as a public
kexec interface so it can be re-used on other architectures.

Signed-off-by: Sven Schnelle 
---
 arch/Kconfig  |   3 +
 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/kernel/kexec_elf_64.c| 551 +-
 include/linux/kexec.h |  24 +
 kernel/Makefile   |   1 +
 .../kexec_elf_64.c => kernel/kexec_elf.c  | 199 ++-
 6 files changed, 75 insertions(+), 704 deletions(-)
 copy arch/powerpc/kernel/kexec_elf_64.c => kernel/kexec_elf.c (71%)

diff --git a/arch/Kconfig b/arch/Kconfig
index c47b328eada0..30694aca4316 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -18,6 +18,9 @@ config KEXEC_CORE
select CRASH_CORE
bool
 
+config KEXEC_ELF
+   bool
+
 config HAVE_IMA_KEXEC
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 12cee37f15c4..addc2dad78e0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -510,6 +510,7 @@ config KEXEC_FILE
select KEXEC_CORE
select HAVE_IMA_KEXEC
select BUILD_BIN2C
+   select KEXEC_ELF
depends on PPC64
depends on CRYPTO=y
depends on CRYPTO_SHA256=y
diff --git a/arch/powerpc/kernel/kexec_elf_64.c 
b/arch/powerpc/kernel/kexec_elf_64.c
index ba4f18a43ee8..30bd57a93c17 100644
--- a/arch/powerpc/kernel/kexec_elf_64.c
+++ b/arch/powerpc/kernel/kexec_elf_64.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-only
 /*
  * Load ELF vmlinux file for the kexec_file_load syscall.
  *
@@ -10,15 +11,6 @@
  * Based on kexec-tools' kexec-elf-exec.c and kexec-elf-ppc64.c.
  * Heavily modified for the kernel by
  * Thiago Jung Bauermann .
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation (version 2 of the License).
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
  */
 
 #define pr_fmt(fmt)"kexec_elf: " fmt
@@ -39,532 +31,6 @@
 #define Elf_RelElf64_Rel
 #endif /* Elf_Rel */
 
-struct elf_info {
-   /*
-* Where the ELF binary contents are kept.
-* Memory managed by the user of the struct.
-*/
-   const char *buffer;
-
-   const struct elfhdr *ehdr;
-   const struct elf_phdr *proghdrs;
-   struct elf_shdr *sechdrs;
-};
-
-static inline bool elf_is_elf_file(const struct elfhdr *ehdr)
-{
-   return memcmp(ehdr->e_ident, ELFMAG, SELFMAG) == 0;
-}
-
-static uint64_t elf64_to_cpu(const struct elfhdr *ehdr, uint64_t value)
-{
-   if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
-   value = le64_to_cpu(value);
-   else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
-   value = be64_to_cpu(value);
-
-   return value;
-}
-
-static uint16_t elf16_to_cpu(const struct elfhdr *ehdr, uint16_t value)
-{
-   if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
-   value = le16_to_cpu(value);
-   else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
-   value = be16_to_cpu(value);
-
-   return value;
-}
-
-static uint32_t elf32_to_cpu(const struct elfhdr *ehdr, uint32_t value)
-{
-   if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
-   value = le32_to_cpu(value);
-   else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
-   value = be32_to_cpu(value);
-
-   return value;
-}
-
-/**
- * elf_is_ehdr_sane - check that it is safe to use the ELF header
- * @buf_len:   size of the buffer in which the ELF file is loaded.
- */
-static bool elf_is_ehdr_sane(const struct elfhdr *ehdr, size_t buf_len)
-{
-   if (ehdr->e_phnum > 0 && ehdr->e_phentsize != sizeof(struct elf_phdr)) {
-   pr_debug("Bad program header size.\n");
-   return false;
-   } else if (ehdr->e_shnum > 0 &&
-  ehdr->e_shentsize != sizeof(struct elf_shdr)) {
-   pr_debug("Bad section header size.\n");
-   return false;
-   } else if (ehdr->e_ident[EI_VERSION] != EV_CURRENT ||
-  ehdr->e_version != EV_CURRENT) {
-   pr_debug("Unknown ELF version.\n");
-   return false;
-   }
-
-   if (ehdr->e_phoff > 0 && ehdr->e_phnum > 0) {
-   size_t phdr_size;
-
-   /*
-* e_phnum is at most 65535 so calculating the size of the
-* program header cannot overflow.
-*/
-   phdr_size = sizeof(struct elf_phdr) * ehdr->e_phnum;
-
-   /* Sanity check the program header table location. */
-   if (ehdr->e_phoff + phdr_size < ehdr->e_

Re: [PATCH kernel v4 2/4] powerpc/iommu: Allow bypass-only for DMA

2019-07-15 Thread Alexey Kardashevskiy




On 13/07/2019 01:20, Christoph Hellwig wrote:

This skips the 32bit DMA setup check if the bypass is can be selected.


That sentence does not parse.  I think you need to dop the "can be"
based on the actual patch.



"the 32bit DMA setup check" is
"if (!(tbl = get_iommu_table_base(dev)))".

I can rephrase though.


--
Alexey


Re: [PATCH kernel v4 0/4 repost] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59

2019-07-15 Thread Alexey Kardashevskiy




On 13/07/2019 01:28, Christoph Hellwig wrote:

On Fri, Jul 12, 2019 at 07:45:05PM +1000, Alexey Kardashevskiy wrote:

This is an attempt to allow DMA masks between 32..59 which are not large
enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
on the max order, up to 40 is usually available.


Can you elaborate what you man with supported in detail?  In the end
a DMA devices DMA capability is only really interesting as a lower
bound.

e.g. if you have a DMA that supports 40-bit DMA addressing we could
always treat it as if supports 32-bit addressing,and I thought the
powerpc code does that, 


powerpc does that and this is what the patchset is changing as people 
complained that 2GB DMA window has bad effects on AMD GPUs (cannot 
allocate enough buffers) and 40/100Gbit devices (lower performance), I 
do not have the details handy.



as the DMA API now relies on that.


Relies on what precisely? If a device cannot do full 64bit, then it has 
to be no more than just 32bit?




 Did I miss
something and it explicitly rejected that (in which case I didn't spot
the fix in this series), or is this just an optimization to handle these
devices more optimally, in which case maybe the changelog could be
improved a bit.



4/4 did this essentially:

-   const u64 window_size = min((u64)pe->table_group.tce32_size, 
max_memory);
+   const u64 maxblock = 1UL << (PAGE_SHIFT + MAX_ORDER - 1);
+   const u64 window_size = min((maxblock * 8) << PAGE_SHIFT, max_memory);

where tce32_size==2<<30. The 4/4 commit log has the details, what is 
missing there? Thanks,




--
Alexey


Misc EEH fixes

2019-07-15 Thread Oliver O'Halloran
Fixes of no particular importance. I just wanted to cut down the pile of
patches I've got hanging around.




[PATCH 1/5] powerpc/eeh_cache: Don't use pci_dn when inserting new ranges

2019-07-15 Thread Oliver O'Halloran
At the point where we start inserting ranges into the EEH address cache the
binding between pci_dev and eeh_dev has already been set up. Instead of
consulting the pci_dn tree we can retrieve the eeh_dev directly using
pci_dev_to_eeh_dev().

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/kernel/eeh_cache.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_cache.c b/arch/powerpc/kernel/eeh_cache.c
index 3204723..908ba69 100644
--- a/arch/powerpc/kernel/eeh_cache.c
+++ b/arch/powerpc/kernel/eeh_cache.c
@@ -156,18 +156,10 @@ eeh_addr_cache_insert(struct pci_dev *dev, 
resource_size_t alo,
 
 static void __eeh_addr_cache_insert_dev(struct pci_dev *dev)
 {
-   struct pci_dn *pdn;
struct eeh_dev *edev;
int i;
 
-   pdn = pci_get_pdn_by_devfn(dev->bus, dev->devfn);
-   if (!pdn) {
-   pr_warn("PCI: no pci dn found for dev=%s\n",
-   pci_name(dev));
-   return;
-   }
-
-   edev = pdn_to_eeh_dev(pdn);
+   edev = pci_dev_to_eeh_dev(dev);
if (!edev) {
pr_warn("PCI: no EEH dev found for %s\n",
pci_name(dev));
-- 
2.9.5



[PATCH 2/5] powerpc/eeh_sysfs: Fix incorrect comment

2019-07-15 Thread Oliver O'Halloran
The EEH_ATTR_SHOW() helper is used to display fields from struct eeh_dev
not struct pci_dn.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/kernel/eeh_sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh_sysfs.c b/arch/powerpc/kernel/eeh_sysfs.c
index 3fa04dd..6a2c2886f 100644
--- a/arch/powerpc/kernel/eeh_sysfs.c
+++ b/arch/powerpc/kernel/eeh_sysfs.c
@@ -30,7 +30,7 @@
 /**
  * EEH_SHOW_ATTR -- Create sysfs entry for eeh statistic
  * @_name: name of file in sysfs directory
- * @_memb: name of member in struct pci_dn to access
+ * @_memb: name of member in struct eeh_dev to access
  * @_format: printf format for display
  *
  * All of the attributes look very similar, so just
-- 
2.9.5



[PATCH 3/5] powerpc/eeh_sysfs: ifdef pseries sr-iov sysfs properties

2019-07-15 Thread Oliver O'Halloran
There are several EEH sysfs properties that only exists when the
"ibm,is-open-sriov-pf" property appears in the device tree node of the PCI
device. This used on pseries to indicate to the guest that the hypervisor
allows the guest to configure the SR-IOV capability. Doing this requires
some handshaking between the guest, hypervisor and userspace when a VF is
EEH frozen which is why these properties exist.

This is all dead code on non-pseries platforms so wrap it in an #ifdef
CONFIG_PPC_PSERIES to make the dependency clearer.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/kernel/eeh_sysfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_sysfs.c b/arch/powerpc/kernel/eeh_sysfs.c
index 6a2c2886f..3adf8cd 100644
--- a/arch/powerpc/kernel/eeh_sysfs.c
+++ b/arch/powerpc/kernel/eeh_sysfs.c
@@ -91,7 +91,7 @@ static ssize_t eeh_pe_state_store(struct device *dev,
 
 static DEVICE_ATTR_RW(eeh_pe_state);
 
-#ifdef CONFIG_PCI_IOV
+#if defined(CONFIG_PCI_IOV) && defined(CONFIG_PPC_PSERIES)
 static ssize_t eeh_notify_resume_show(struct device *dev,
  struct device_attribute *attr, char *buf)
 {
@@ -148,7 +148,7 @@ static void eeh_notify_resume_remove(struct pci_dev *pdev)
 #else
 static inline int eeh_notify_resume_add(struct pci_dev *pdev) { return 0; }
 static inline void eeh_notify_resume_remove(struct pci_dev *pdev) { }
-#endif /* CONFIG_PCI_IOV */
+#endif /* CONFIG_PCI_IOV && CONFIG PPC_PSERIES*/
 
 void eeh_sysfs_add_device(struct pci_dev *pdev)
 {
-- 
2.9.5



[PATCH 4/5] powerpc/eeh_sysfs: Remove double pci_dn lookup.

2019-07-15 Thread Oliver O'Halloran
In eeh_notify_resume_show() the pci_dn for the device is looked up once in
the declaration block and then once after checking for a NULL eeh_dev.
Remove the second lookup since it's pointless.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/kernel/eeh_sysfs.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh_sysfs.c b/arch/powerpc/kernel/eeh_sysfs.c
index 3adf8cd..c4cc8fc 100644
--- a/arch/powerpc/kernel/eeh_sysfs.c
+++ b/arch/powerpc/kernel/eeh_sysfs.c
@@ -102,7 +102,6 @@ static ssize_t eeh_notify_resume_show(struct device *dev,
if (!edev || !edev->pe)
return -ENODEV;
 
-   pdn = pci_get_pdn(pdev);
return sprintf(buf, "%d\n", pdn->last_allow_rc);
 }
 
-- 
2.9.5



[PATCH 5/5] powerpc/eeh_sysfs: Make clearing EEH_DEV_SYSFS saner

2019-07-15 Thread Oliver O'Halloran
The eeh_sysfs_remove_device() function is supposed to clear the
EEH_DEV_SYSFS flag since it indicates the EEH sysfs entries have been added
for a pci_dev.

When the sysfs files are removed eeh_remove_device() the eeh_dev and the
pci_dev have already been de-associated. This then causes the
pci_dev_to_eeh_dev() call in eeh_sysfs_remove_device() to return NULL so
the flag can't be cleared from the still-live eeh_dev. This problem is
worked around in the caller by clearing the flag manually. However, this
behaviour doesn't make a whole lot of sense, so this patch fixes it by:

a) Re-ordering eeh_remove_device() so that eeh_sysfs_remove_device() is
   called before de-associating the pci_dev and eeh_dev.

b) Making eeh_sysfs_remove_device() emit a warning if there's no
   corresponding eeh_dev for a pci_dev. The paths where the sysfs
   files are only reachable if EEH was setup for the device
   for the device in the first place so hitting this warning
   indicates a programming error.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/kernel/eeh.c   | 30 +-
 arch/powerpc/kernel/eeh_sysfs.c | 15 ---
 2 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index f192d57..6e24896 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1203,7 +1203,6 @@ void eeh_add_device_late(struct pci_dev *dev)
eeh_rmv_from_parent_pe(edev);
eeh_addr_cache_rmv_dev(edev->pdev);
eeh_sysfs_remove_device(edev->pdev);
-   edev->mode &= ~EEH_DEV_SYSFS;
 
/*
 * We definitely should have the PCI device removed
@@ -1306,17 +1305,11 @@ void eeh_remove_device(struct pci_dev *dev)
edev->pdev = NULL;
 
/*
-* The flag "in_error" is used to trace EEH devices for VFs
-* in error state or not. It's set in eeh_report_error(). If
-* it's not set, eeh_report_{reset,resume}() won't be called
-* for the VF EEH device.
+* eeh_sysfs_remove_device() uses pci_dev_to_eeh_dev() so we need to
+* remove the sysfs files before clearing dev.archdata.edev
 */
-   edev->in_error = false;
-   dev->dev.archdata.edev = NULL;
-   if (!(edev->pe->state & EEH_PE_KEEP))
-   eeh_rmv_from_parent_pe(edev);
-   else
-   edev->mode |= EEH_DEV_DISCONNECTED;
+   if (edev->mode & EEH_DEV_SYSFS)
+   eeh_sysfs_remove_device(dev);
 
/*
 * We're removing from the PCI subsystem, that means
@@ -1327,8 +1320,19 @@ void eeh_remove_device(struct pci_dev *dev)
edev->mode |= EEH_DEV_NO_HANDLER;
 
eeh_addr_cache_rmv_dev(dev);
-   eeh_sysfs_remove_device(dev);
-   edev->mode &= ~EEH_DEV_SYSFS;
+
+   /*
+* The flag "in_error" is used to trace EEH devices for VFs
+* in error state or not. It's set in eeh_report_error(). If
+* it's not set, eeh_report_{reset,resume}() won't be called
+* for the VF EEH device.
+*/
+   edev->in_error = false;
+   dev->dev.archdata.edev = NULL;
+   if (!(edev->pe->state & EEH_PE_KEEP))
+   eeh_rmv_from_parent_pe(edev);
+   else
+   edev->mode |= EEH_DEV_DISCONNECTED;
 }
 
 int eeh_unfreeze_pe(struct eeh_pe *pe)
diff --git a/arch/powerpc/kernel/eeh_sysfs.c b/arch/powerpc/kernel/eeh_sysfs.c
index c4cc8fc..5614fd83 100644
--- a/arch/powerpc/kernel/eeh_sysfs.c
+++ b/arch/powerpc/kernel/eeh_sysfs.c
@@ -175,22 +175,23 @@ void eeh_sysfs_remove_device(struct pci_dev *pdev)
 {
struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
 
+   if (!edev) {
+   WARN_ON(eeh_enabled());
+   return;
+   }
+
+   edev->mode &= ~EEH_DEV_SYSFS;
+
/*
 * The parent directory might have been removed. We needn't
 * continue for that case.
 */
-   if (!pdev->dev.kobj.sd) {
-   if (edev)
-   edev->mode &= ~EEH_DEV_SYSFS;
+   if (!pdev->dev.kobj.sd)
return;
-   }
 
device_remove_file(&pdev->dev, &dev_attr_eeh_mode);
device_remove_file(&pdev->dev, &dev_attr_eeh_pe_config_addr);
device_remove_file(&pdev->dev, &dev_attr_eeh_pe_state);
 
eeh_notify_resume_remove(pdev);
-
-   if (edev)
-   edev->mode &= ~EEH_DEV_SYSFS;
 }
-- 
2.9.5



Re: [PATCH v3 03/11] s390x/mm: Implement arch_remove_memory()

2019-07-15 Thread David Hildenbrand
On 01.07.19 14:47, Michal Hocko wrote:
> On Mon 01-07-19 09:45:03, Michal Hocko wrote:
>> On Mon 27-05-19 13:11:44, David Hildenbrand wrote:
>>> Will come in handy when wanting to handle errors after
>>> arch_add_memory().
>>
>> I do not understand this. Why do you add a code for something that is
>> not possible on this HW (based on the comment - is it still valid btw?)
> 
> Same as the previous patch (drop it).

No. As the description says, this will be needed to handle errors in
patch 6 cleanly.

And BTW, with paravirtualied devices like virtio-pmem and virtio-mem,
this will also see some other users in the future.

Thanks.

> 
>>> Cc: Martin Schwidefsky 
>>> Cc: Heiko Carstens 
>>> Cc: Andrew Morton 
>>> Cc: Michal Hocko 
>>> Cc: Mike Rapoport 
>>> Cc: David Hildenbrand 
>>> Cc: Vasily Gorbik 
>>> Cc: Oscar Salvador 
>>> Signed-off-by: David Hildenbrand 
>>> ---
>>>  arch/s390/mm/init.c | 13 +++--
>>>  1 file changed, 7 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
>>> index d552e330fbcc..14955e0a9fcf 100644
>>> --- a/arch/s390/mm/init.c
>>> +++ b/arch/s390/mm/init.c
>>> @@ -243,12 +243,13 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>>  void arch_remove_memory(int nid, u64 start, u64 size,
>>> struct vmem_altmap *altmap)
>>>  {
>>> -   /*
>>> -* There is no hardware or firmware interface which could trigger a
>>> -* hot memory remove on s390. So there is nothing that needs to be
>>> -* implemented.
>>> -*/
>>> -   BUG();
>>> +   unsigned long start_pfn = start >> PAGE_SHIFT;
>>> +   unsigned long nr_pages = size >> PAGE_SHIFT;
>>> +   struct zone *zone;
>>> +
>>> +   zone = page_zone(pfn_to_page(start_pfn));
>>> +   __remove_pages(zone, start_pfn, nr_pages, altmap);
>>> +   vmem_remove_mapping(start, size);
>>>  }
>>>  #endif
>>>  #endif /* CONFIG_MEMORY_HOTPLUG */
>>> -- 
>>> 2.20.1
>>>
>>
>> -- 
>> Michal Hocko
>> SUSE Labs
> 


-- 

Thanks,

David / dhildenb


Re: [PATCH v3 02/11] s390x/mm: Fail when an altmap is used for arch_add_memory()

2019-07-15 Thread David Hildenbrand
On 01.07.19 14:46, Michal Hocko wrote:
> On Mon 01-07-19 09:43:06, Michal Hocko wrote:
>> On Mon 27-05-19 13:11:43, David Hildenbrand wrote:
>>> ZONE_DEVICE is not yet supported, fail if an altmap is passed, so we
>>> don't forget arch_add_memory()/arch_remove_memory() when unlocking
>>> support.
>>
>> Why do we need this? Sure ZONE_DEVICE is not supported for s390 and so
>> might be the case for other arches which support hotplug. I do not see
>> much point in adding warning to each of them.
> 
> I would drop this one. If there is a strong reason to have something
> like that it should come with a better explanation and it can be done on
> top.
> 

This was requested by Dan and I agree it is the right thing to do. In
the context of paravirtualized devices (e.g., virtio-pmem), it makes
sense to block functionality an arch does not support.

I'll leave the decision to Andrew.

-- 

Thanks,

David / dhildenb


Re: [PATCH v3 06/11] mm/memory_hotplug: Allow arch_remove_pages() without CONFIG_MEMORY_HOTREMOVE

2019-07-15 Thread David Hildenbrand
On 01.07.19 14:51, Michal Hocko wrote:
> On Mon 01-07-19 10:01:41, Michal Hocko wrote:
>> On Mon 27-05-19 13:11:47, David Hildenbrand wrote:
>>> We want to improve error handling while adding memory by allowing
>>> to use arch_remove_memory() and __remove_pages() even if
>>> CONFIG_MEMORY_HOTREMOVE is not set to e.g., implement something like:
>>>
>>> arch_add_memory()
>>> rc = do_something();
>>> if (rc) {
>>> arch_remove_memory();
>>> }
>>>
>>> We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require
>>> quite some dependencies for memory offlining.
>>
>> If we cannot really remove CONFIG_MEMORY_HOTREMOVE altogether then why
>> not simply add an empty placeholder for arch_remove_memory when the
>> config is disabled?
> 
> In other words, can we replace this by something as simple as:
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index ae892eef8b82..0329027fe740 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -128,6 +128,20 @@ extern void arch_remove_memory(int nid, u64 start, u64 
> size,
>  struct vmem_altmap *altmap);
>  extern void __remove_pages(struct zone *zone, unsigned long start_pfn,
>  unsigned long nr_pages, struct vmem_altmap *altmap);
> +#else
> +/*
> + * Allow code using
> + * arch_add_memory();
> + * rc = do_something();
> + * if (rc)
> + *   arch_remove_memory();
> + *
> + * without ifdefery.
> + */
> +static inline void arch_remove_memory(int nid, u64 start, u64 size,
> +struct vmem_altmap *altmap)
> +{
> +}
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
>  /*
> 

A system configured without CONFIG_MEMORY_HOTREMOVE should not suddenly
behave worse than before when adding of memory fails. What you suggest
result in that.

The goal should be to force architectures to properly implement
arch_remove_memory() right from the start - which is the case for all
architectures after this patch set *except* arm, for which a proper
implementation is on the way.

So I'm leaving it like it is. arch_remove_memory() will be mandatory for
architectures implementing arch_add_memory().

-- 

Thanks,

David / dhildenb


Re: [PATCH v3 09/11] mm/memory_hotplug: Remove memory block devices before arch_remove_memory()

2019-07-15 Thread David Hildenbrand
On 01.07.19 10:41, Michal Hocko wrote:
> On Mon 27-05-19 13:11:50, David Hildenbrand wrote:
>> Let's factor out removing of memory block devices, which is only
>> necessary for memory added via add_memory() and friends that created
>> memory block devices. Remove the devices before calling
>> arch_remove_memory().
>>
>> This finishes factoring out memory block device handling from
>> arch_add_memory() and arch_remove_memory().
> 
> OK, this makes sense again. Just a nit. Calling find_memory_block_by_id
> for each memory block looks a bit suboptimal, especially when we are
> removing consequent physical memblocks. I have to confess that I do not
> know how expensive is the search and I also expect that there won't be
> that many memblocks in the removed range anyway as large setups have
> large memblocks.
> 

The devices are not allocated sequentially, so there is no easy way to
look them up.

There is a comment for find_memory_block():

"For now, we have a linear search to go find the appropriate
memory_block corresponding to a particular phys_index. If this gets to
be a real problem, we can always use a radix tree or something here."

So if this becomes a problem, we need a separate data structure to speed
up the lookup. (IOW, this was already the same in the old code)

Thanks!

-- 

Thanks,

David / dhildenb


Re: [PATCH v3 10/11] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail

2019-07-15 Thread David Hildenbrand
On 01.07.19 12:27, Michal Hocko wrote:
> On Mon 01-07-19 11:36:44, Oscar Salvador wrote:
>> On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote:
>>> Yeah, we do not allow to offline multi zone (node) ranges so the current
>>> code seems to be over engineered.
>>>
>>> Anyway, I am wondering why do we have to strictly check for already
>>> removed nodes links. Is the sysfs code going to complain we we try to
>>> remove again?
>>
>> No, sysfs will silently "fail" if the symlink has already been removed.
>> At least that is what I saw last time I played with it.
>>
>> I guess the question is what if sysfs handling changes in the future
>> and starts dropping warnings when trying to remove a symlink is not there.
>> Maybe that is unlikely to happen?
> 
> And maybe we handle it then rather than have a static allocation that
> everybody with hotremove configured has to pay for.
> 

So what's the suggestion? Dropping the nodemask_t completely and calling
sysfs_remove_link() on already potentially removed links?

Of course, we can also just use mem_blk->nid and rest assured that it
will never be called for memory blocks belonging to multiple nodes.

-- 

Thanks,

David / dhildenb


Re: [PATCH] powerpc: remove meaningless KBUILD_ARFLAGS addition

2019-07-15 Thread Masahiro Yamada
On Mon, Jul 15, 2019 at 4:30 PM Segher Boessenkool
 wrote:
>
> On Mon, Jul 15, 2019 at 05:05:34PM +1000, Michael Ellerman wrote:
> > Segher Boessenkool  writes:
> > > Yes, that is why I used the environment variable, all binutils work
> > > with that.  There was no --target option in GNU ar before 2.22.

I use binutils 2.30
It does not understand --target option.

$ powerpc-linux-ar --version
GNU ar (GNU Binutils) 2.30
Copyright (C) 2018 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) any later version.
This program has absolutely no warranty.

If I give --target=elf$(BITS)-$(GNUTARGET) option, I see this:
powerpc-linux-ar: -t: No such file or directory



> > Yeah, we're not very good at testing with really old binutils, so I
> > guess we broke that.
> >
> > I'm inclined to merge this, it doesn't seem to break anything, and it
> > fixes using --target on old binutils that don't have it.
>
> But we don't set the target any other way either.  I don't think this
> will work with a 32-bit toolchain (default target 32 bit) and a 64-bit
> kernel, or the other way around.
>
> Then again, does that work at *all* nowadays?  Do we even consider that
> important, *should* it work?


Let me confirm if I understood this discussion.


[1] KBUILD_ARFLAGS += --target=elf$(BITS)-$(GNUTARGET)
is pointless since it is always overridden by another
KBUILD_ARFLAGS assignment.

[2] If we stop overriding it, it would cause build errors.
So, --target is not only useless, but it is rather harmful.


So, we all agreed with this patch, right?


We are discussing whether or not to revive
GNUTARGET=elf$(BITS)-$(GNUTARGET)
in a *separate* patch, correct?


-- 
Best Regards
Masahiro Yamada


Re: [PATCH kernel v4 2/4] powerpc/iommu: Allow bypass-only for DMA

2019-07-15 Thread Christoph Hellwig
On Mon, Jul 15, 2019 at 06:33:00PM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 13/07/2019 01:20, Christoph Hellwig wrote:
> > > This skips the 32bit DMA setup check if the bypass is can be selected.
> > 
> > That sentence does not parse.  I think you need to dop the "can be"
> > based on the actual patch.
> 
> 
> "the 32bit DMA setup check" is
> "if (!(tbl = get_iommu_table_base(dev)))".
> 
> I can rephrase though.

What I mean is to replace

"This skips the 32bit DMA setup check if the bypass is can be selected."

with

"This skips the 32bit DMA setup check if the bypass is selected."

or alternatively:

"This skips the 32bit DMA setup check if the bypass can be selected."

but I think the first version is more accurate.


Re: [PATCH kernel v4 0/4 repost] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59

2019-07-15 Thread Christoph Hellwig
On Mon, Jul 15, 2019 at 06:43:12PM +1000, Alexey Kardashevskiy wrote:
> > e.g. if you have a DMA that supports 40-bit DMA addressing we could
> > always treat it as if supports 32-bit addressing,and I thought the
> > powerpc code does that,
> 
> powerpc does that and this is what the patchset is changing as people
> complained that 2GB DMA window has bad effects on AMD GPUs (cannot allocate
> enough buffers) and 40/100Gbit devices (lower performance), I do not have
> the details handy.

Make sense.  I'm just surprised about the complains from the habalabs
folks, which sounded like a 40something bit DMA mask did not work at all
for them on power9, which did not fit my reading of the code.

> 
> > as the DMA API now relies on that.
> 
> Relies on what precisely? If a device cannot do full 64bit, then it has to
> be no more than just 32bit?

The fact that if say you iommu only supports mode that return up to
32-bit iova and a driver sets a 48 or 64-bit mask you still return
success instead of letting the driver handle the failure and set a
32-bit mask in the fallback code.  As said I think the powerpc code
is fine based on my reading from it.

> >  Did I miss
> > something and it explicitly rejected that (in which case I didn't spot
> > the fix in this series), or is this just an optimization to handle these
> > devices more optimally, in which case maybe the changelog could be
> > improved a bit.
> 
> 
> 4/4 did this essentially:

As long as the above is fine (which I think it is) just make it a little
mor clear that this is a simple optimization, not a bug fix for DMA API
usage.


Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-15 Thread Christian Borntraeger
I think Vasily already has a clone3 patch for s390x with 435. 


On 14.07.19 21:22, Christian Brauner wrote:
> A while ago Arnd made it possible to give new system calls the same
> syscall number on all architectures (except alpha). To not break this
> nice new feature let's mark 435 for clone3 as reserved on all
> architectures that do not yet implement it.
> Even if an architecture does not plan to implement it this ensures that
> new system calls coming after clone3 will have the same number on all
> architectures.
> 
> Signed-off-by: Christian Brauner 
> Cc: Arnd Bergmann 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-al...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-i...@vger.kernel.org
> Cc: linux-m...@lists.linux-m68k.org
> Cc: linux-m...@vger.kernel.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> ---
>  arch/alpha/kernel/syscalls/syscall.tbl| 1 +
>  arch/ia64/kernel/syscalls/syscall.tbl | 1 +
>  arch/m68k/kernel/syscalls/syscall.tbl | 1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
>  arch/parisc/kernel/syscalls/syscall.tbl   | 1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl  | 1 +
>  arch/s390/kernel/syscalls/syscall.tbl | 1 +
>  arch/sh/kernel/syscalls/syscall.tbl   | 1 +
>  arch/sparc/kernel/syscalls/syscall.tbl| 1 +
>  11 files changed, 11 insertions(+)
> 
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl 
> b/arch/alpha/kernel/syscalls/syscall.tbl
> index 1db9bbcfb84e..728fe028c02c 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -474,3 +474,4 @@
>  542  common  fsmount sys_fsmount
>  543  common  fspick  sys_fspick
>  544  common  pidfd_open  sys_pidfd_open
> +# 545 reserved for clone3
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl 
> b/arch/ia64/kernel/syscalls/syscall.tbl
> index ecc44926737b..36d5faf4c86c 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -355,3 +355,4 @@
>  432  common  fsmount sys_fsmount
>  433  common  fspick  sys_fspick
>  434  common  pidfd_open  sys_pidfd_open
> +# 435 reserved for clone3
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl 
> b/arch/m68k/kernel/syscalls/syscall.tbl
> index 9a3eb2558568..a88a285a0e5f 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -434,3 +434,4 @@
>  432  common  fsmount sys_fsmount
>  433  common  fspick  sys_fspick
>  434  common  pidfd_open  sys_pidfd_open
> +# 435 reserved for clone3
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl 
> b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index 97035e19ad03..c9c879ec9b6d 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -373,3 +373,4 @@
>  432  n32 fsmount sys_fsmount
>  433  n32 fspick  sys_fspick
>  434  n32 pidfd_open  sys_pidfd_open
> +# 435 reserved for clone3
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl 
> b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index d7292722d3b0..bbce9159caa1 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -349,3 +349,4 @@
>  432  n64 fsmount sys_fsmount
>  433  n64 fspick  sys_fspick
>  434  n64 pidfd_open  sys_pidfd_open
> +# 435 reserved for clone3
> diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl 
> b/arch/mips/kernel/syscalls/syscall_o32.tbl
> index dba084c92f14..9653591428ec 100644
> --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> @@ -422,3 +422,4 @@
>  432  o32 fsmount sys_fsmount
>  433  o32 fspick  sys_fspick
>  434  o32 pidfd_open  sys_pidfd_open
> +# 435 reserved for clone3
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl 
> b/arch/parisc/kernel/syscalls/syscall.tbl
> index 5022b9e179c2..c7aadfef5386 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -431,3 +431,4 @@
>  432  common  fsmount sys_fsmount
>  433  common  fspick  sys_fspick
>  434  common  pidfd_open  sys_pidfd_open
> +# 435 reserved for clone3
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl 
> b/arch/powerpc/kernel/syscalls/syscall.tbl
> index f2c3bda2d39f..3331749a

Re: [PATCH 3/3] fs/core/vmcore: Move sev_active() reference to x86 arch code

2019-07-15 Thread Halil Pasic
On Fri, 12 Jul 2019 18:55:47 -0300
Thiago Jung Bauermann  wrote:

> 
> [ Cc'ing Tom Lendacky which I forgot to do earlier. Sorry about that. ]
> 
> Hello Halil,
> 
> Thanks for the quick review.
> 
> Halil Pasic  writes:
> 
> > On Fri, 12 Jul 2019 02:36:31 -0300
> > Thiago Jung Bauermann  wrote:
> >
> >> Secure Encrypted Virtualization is an x86-specific feature, so it shouldn't
> >> appear in generic kernel code because it forces non-x86 architectures to
> >> define the sev_active() function, which doesn't make a lot of sense.
> >
> > sev_active() might be just bad (too specific) name for a general
> > concept. s390 code defines it drives the right behavior in
> > kernel/dma/direct.c (which uses it).
> 
> I thought about that but couldn't put my finger on a general concept.
> Is it "guest with memory inaccessible to the host"?
> 

Well, force_dma_unencrypted() is a much better name thatn sev_active():
s390 has no AMD SEV, that is sure, but for virtio to work we do need to
make our dma accessible to the hypervisor. Yes, your "guest with memory
inaccessible to the host" shows into the right direction IMHO.
Unfortunately I don't have too many cycles to spend on this right now.

> Since your proposed definiton for force_dma_unencrypted() is simply to
> make it equivalent to sev_active(), I thought it was more
> straightforward to make each arch define force_dma_unencrypted()
> directly.

I did not mean to propose equivalence. I intended to say the name
sev_active() is not suitable for a common concept. On the other hand
we do have a common concept -- as common code needs to do or not do
things depending on whether "memory is protected/encrypted" or not. I'm
fine with the name force_dma_unencrypted(), especially because I don't
have a better name.

> 
> Also, does sev_active() drive the right behavior for s390 in
> elfcorehdr_read() as well?
> 

AFAIU, since s390 does not override it boils down to the same, whether
sev_active() returns true or false. I'm no expert in that area, but I
strongly hope that is the right behavior. @Janosch: can you help me
out with this one?

> >> To solve this problem, add an x86 elfcorehdr_read() function to override
> >> the generic weak implementation. To do that, it's necessary to make
> >> read_from_oldmem() public so that it can be used outside of vmcore.c.
> >>
> >> Signed-off-by: Thiago Jung Bauermann 
> >> ---
> >>  arch/x86/kernel/crash_dump_64.c |  5 +
> >>  fs/proc/vmcore.c|  8 
> >>  include/linux/crash_dump.h  | 14 ++
> >>  include/linux/mem_encrypt.h |  1 -
> >>  4 files changed, 23 insertions(+), 5 deletions(-)
> >
> > Does not seem to apply to today's or yesterdays master.
> 
> It assumes the presence of the two patches I mentioned in the cover
> letter. Only one of them is in master.
> 
> I hadn't realized the s390 virtio patches were on their way to upstream.
> I was keeping an eye on the email thread but didn't see they were picked
> up in the s390 pull request. I'll add a new patch to this series making
> the corresponding changes to s390's  as well.
> 

Being on cc for your patch made me realize that things got broken on
s390. Thanks! I've sent out a patch that fixes protvirt, but we are going
to benefit from your cleanups. I think with your cleanups and that patch
of mine both sev_active() and sme_active() can be removed. Feel free to
do so. If not, I can attend to it as well.

Regards,
Halil



Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-15 Thread Christian Brauner
On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
> I think Vasily already has a clone3 patch for s390x with 435. 

Excellent. I'll leave the # 435 reserved for clone3 on s390x in until
this patch has landed. It shouldn't be a merge conflict and if so it
should be trivial.

Christian


Re: [PATCH 3/3] fs/core/vmcore: Move sev_active() reference to x86 arch code

2019-07-15 Thread Christoph Hellwig
On Mon, Jul 15, 2019 at 04:03:17PM +0200, Halil Pasic wrote:
> > I thought about that but couldn't put my finger on a general concept.
> > Is it "guest with memory inaccessible to the host"?
> > 
> 
> Well, force_dma_unencrypted() is a much better name thatn sev_active():
> s390 has no AMD SEV, that is sure, but for virtio to work we do need to
> make our dma accessible to the hypervisor. Yes, your "guest with memory
> inaccessible to the host" shows into the right direction IMHO.
> Unfortunately I don't have too many cycles to spend on this right now.

In x86 it means that we need to remove dma encryption using
set_memory_decrypted before using it for DMA purposes.  In the SEV
case that seems to be so that the hypervisor can access it, in the SME
case that Tom just fixes it is because there is an encrypted bit set
in the physical address, and if the device doesn't support a large
enough DMA address the direct mapping code has to encrypt the pages
used for the contigous allocation.

> Being on cc for your patch made me realize that things got broken on
> s390. Thanks! I've sent out a patch that fixes protvirt, but we are going
> to benefit from your cleanups. I think with your cleanups and that patch
> of mine both sev_active() and sme_active() can be removed. Feel free to
> do so. If not, I can attend to it as well.

Yes, I think with the dma-mapping fix and this series sme_active and
sev_active should be gone from common code.  We should also be able
to remove the exports x86 has for them.

I'll wait a few days and will then feed the dma-mapping fix to Linus,
it might make sense to either rebase Thiagos series on top of the
dma-mapping for-next branch, or wait a few days before reposting.


Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-07-15 Thread Michael S. Tsirkin
On Sun, Jul 14, 2019 at 02:51:18AM -0300, Thiago Jung Bauermann wrote:
> 
> 
> Michael S. Tsirkin  writes:
> 
> > On Thu, Jun 27, 2019 at 10:58:40PM -0300, Thiago Jung Bauermann wrote:
> >>
> >> Michael S. Tsirkin  writes:
> >>
> >> > On Mon, Jun 03, 2019 at 10:13:59PM -0300, Thiago Jung Bauermann wrote:
> >> >>
> >> >>
> >> >> Michael S. Tsirkin  writes:
> >> >>
> >> >> > On Wed, Apr 17, 2019 at 06:42:00PM -0300, Thiago Jung Bauermann wrote:
> >> >> >> I rephrased it in terms of address translation. What do you think of
> >> >> >> this version? The flag name is slightly different too:
> >> >> >>
> >> >> >>
> >> >> >> VIRTIO_F_ACCESS_PLATFORM_NO_TRANSLATION This feature has the same
> >> >> >> meaning as VIRTIO_F_ACCESS_PLATFORM both when set and when not 
> >> >> >> set,
> >> >> >> with the exception that address translation is guaranteed to be
> >> >> >> unnecessary when accessing memory addresses supplied to the 
> >> >> >> device
> >> >> >> by the driver. Which is to say, the device will always use 
> >> >> >> physical
> >> >> >> addresses matching addresses used by the driver (typically 
> >> >> >> meaning
> >> >> >> physical addresses used by the CPU) and not translated further. 
> >> >> >> This
> >> >> >> flag should be set by the guest if offered, but to allow for
> >> >> >> backward-compatibility device implementations allow for it to be
> >> >> >> left unset by the guest. It is an error to set both this flag and
> >> >> >> VIRTIO_F_ACCESS_PLATFORM.
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > OK so VIRTIO_F_ACCESS_PLATFORM is designed to allow unpriveledged
> >> >> > drivers. This is why devices fail when it's not negotiated.
> >> >>
> >> >> Just to clarify, what do you mean by unprivileged drivers? Is it drivers
> >> >> implemented in guest userspace such as with VFIO? Or unprivileged in
> >> >> some other sense such as needing to use bounce buffers for some reason?
> >> >
> >> > I had drivers in guest userspace in mind.
> >>
> >> Great. Thanks for clarifying.
> >>
> >> I don't think this flag would work for guest userspace drivers. Should I
> >> add a note about that in the flag definition?
> >
> > I think you need to clarify access protection rules. Is it only
> > translation that is bypassed or is any platform-specific
> > protection mechanism bypassed too?
> 
> It is only translation. In a secure guest, if the device tries to access
> a memory address that wasn't provided by the driver then the
> architecture will deny that access. If the device accesses addresses
> provided to it by the driver, then there's no protection mechanism or
> translation to get in the way.
> 
> >> >> > This confuses me.
> >> >> > If driver is unpriveledged then what happens with this flag?
> >> >> > It can supply any address it wants. Will that corrupt kernel
> >> >> > memory?
> >> >>
> >> >> Not needing address translation doesn't necessarily mean that there's no
> >> >> IOMMU. On powerpc we don't use VIRTIO_F_ACCESS_PLATFORM but there's
> >> >> always an IOMMU present. And we also support VFIO drivers. The VFIO API
> >> >> for pseries (sPAPR section in Documentation/vfio.txt) has extra ioctls
> >> >> to program the IOMMU.
> >> >>
> >> >> For our use case, we don't need address translation because we set up an
> >> >> identity mapping in the IOMMU so that the device can use guest physical
> >> >> addresses.
> >
> > OK so I think I am beginning to see it in a different light.  Right now the 
> > specific
> > platform creates an identity mapping. That in turn means DMA API can be
> > fast - it does not need to do anything.  What you are looking for is a
> > way to tell host it's an identity mapping - just as an optimization.
> >
> > Is that right?
> 
> Almost. Theoretically it is just an optimization. But in practice the
> pseries boot firmware (SLOF) doesn't support IOMMU_PLATFORM so it's not
> possible to boot a guest from a device with that flag set.
> 
> > So this is what I would call this option:
> >
> > VIRTIO_F_ACCESS_PLATFORM_IDENTITY_ADDRESS
> >
> > and the explanation should state that all device
> > addresses are translated by the platform to identical
> > addresses.
> >
> > In fact this option then becomes more, not less restrictive
> > than VIRTIO_F_ACCESS_PLATFORM - it's a promise
> > by guest to only create identity mappings,
> > and only before driver_ok is set.
> > This option then would always be negotiated together with
> > VIRTIO_F_ACCESS_PLATFORM.
> >
> > Host then must verify that
> > 1. full 1:1 mappings are created before driver_ok
> > or can we make sure this happens before features_ok?
> > that would be ideal as we could require that features_ok fails
> > 2. mappings are not modified between driver_ok and reset
> > i guess attempts to change them will fail -
> > possibly by causing a guest crash
> > or some other kind of platform-specific error
> 
> I think VIRTIO_F_ACCESS_PLATFORM_IDENTITY_ADDRESS is good, but requiring
> it

Re: [PATCH 1/3] x86, s390: Move ARCH_HAS_MEM_ENCRYPT definition to arch/Kconfig

2019-07-15 Thread janani

On 2019-07-12 23:45, Thiago Jung Bauermann wrote:
powerpc is also going to use this feature, so put it in a generic 
location.


Signed-off-by: Thiago Jung Bauermann 
Reviewed-by: Thomas Gleixner 
---
 arch/Kconfig  | 3 +++
 arch/s390/Kconfig | 3 ---
 arch/x86/Kconfig  | 4 +---
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index c47b328eada0..4ef3499d4480 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -927,6 +927,9 @@ config LOCK_EVENT_COUNTS
  the chance of application behavior change because of timing
  differences. The counts are reported via debugfs.

+config ARCH_HAS_MEM_ENCRYPT
+   bool
+
 source "kernel/gcov/Kconfig"

 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 5d8570ed6cab..f820e631bf89 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -1,7 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
-config ARCH_HAS_MEM_ENCRYPT
-def_bool y
-


 Since you are removing the "def_bool y" when ARCH_HAS_MEM_ENCRYPT is 
moved to arch/Kconfig, does the s390/Kconfig need "select 
ARCH_HAS_MEM_ENCRYPT" added like you do for x86/Kconfig?


 - Janani


 config MMU
def_bool y

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c9f331bb538b..5d3295f2df94 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -68,6 +68,7 @@ config X86
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOVif X86_64
+   select ARCH_HAS_MEM_ENCRYPT
select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_PMEM_APIif X86_64
select ARCH_HAS_PTE_SPECIAL
@@ -1520,9 +1521,6 @@ config X86_CPA_STATISTICS
  helps to determine the effectiveness of preserving large and huge
  page mappings when mapping protections are changed.

-config ARCH_HAS_MEM_ENCRYPT
-   def_bool y
-
 config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD


Re: [PATCH 3/3] fs/core/vmcore: Move sev_active() reference to x86 arch code

2019-07-15 Thread Lendacky, Thomas
On 7/15/19 9:30 AM, Christoph Hellwig wrote:
> On Mon, Jul 15, 2019 at 04:03:17PM +0200, Halil Pasic wrote:
>>> I thought about that but couldn't put my finger on a general concept.
>>> Is it "guest with memory inaccessible to the host"?
>>>
>>
>> Well, force_dma_unencrypted() is a much better name thatn sev_active():
>> s390 has no AMD SEV, that is sure, but for virtio to work we do need to
>> make our dma accessible to the hypervisor. Yes, your "guest with memory
>> inaccessible to the host" shows into the right direction IMHO.
>> Unfortunately I don't have too many cycles to spend on this right now.
> 
> In x86 it means that we need to remove dma encryption using
> set_memory_decrypted before using it for DMA purposes.  In the SEV
> case that seems to be so that the hypervisor can access it, in the SME
> case that Tom just fixes it is because there is an encrypted bit set
> in the physical address, and if the device doesn't support a large
> enough DMA address the direct mapping code has to encrypt the pages
> used for the contigous allocation.

Just a correction/clarification...

For SME, when a device doesn't support a large enough DMA address to
accommodate the encryption bit as part of the DMA address, the direct
mapping code has to provide un-encrypted pages. For un-encrypted pages,
the DMA address now does not include the encryption bit, making it
acceptable to the device. Since the device is now using a DMA address
without the encryption bit, the physical address in the CPU page table
must match (the call to set_memory_decrypted) so that both the device and
the CPU interact in the same way with the memory.

Thanks,
Tom

> 
>> Being on cc for your patch made me realize that things got broken on
>> s390. Thanks! I've sent out a patch that fixes protvirt, but we are going
>> to benefit from your cleanups. I think with your cleanups and that patch
>> of mine both sev_active() and sme_active() can be removed. Feel free to
>> do so. If not, I can attend to it as well.
> 
> Yes, I think with the dma-mapping fix and this series sme_active and
> sev_active should be gone from common code.  We should also be able
> to remove the exports x86 has for them.
> 
> I'll wait a few days and will then feed the dma-mapping fix to Linus,
> it might make sense to either rebase Thiagos series on top of the
> dma-mapping for-next branch, or wait a few days before reposting.
> 


Re: [PATCH 2/4] powerpc/32: activate ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE

2019-07-15 Thread Christophe Leroy

Michael Ellerman  a écrit :


Christophe Leroy  writes:

PPC32 also have flush_dcache_range() so it can also support
ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE without changes.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d7996cfaceca..cf6e30f637be 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -127,13 +127,13 @@ config PPC
select ARCH_HAS_KCOV
select ARCH_HAS_MMIOWB  if PPC64
select ARCH_HAS_PHYS_TO_DMA
-   select ARCH_HAS_PMEM_APIif PPC64
+   select ARCH_HAS_PMEM_API
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC64
 	select ARCH_HAS_STRICT_KERNEL_RWX	if ((PPC_BOOK3S_64 || PPC32) &&  
!RELOCATABLE && !HIBERNATION)

select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
-   select ARCH_HAS_UACCESS_FLUSHCACHE  if PPC64
+   select ARCH_HAS_UACCESS_FLUSHCACHE
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG


This didn't build for me, probably due to something that's changed in
the long period between you posting it and me applying it?

corenet32_smp_defconfig:

  powerpc64-unknown-linux-gnu-ld: lib/iov_iter.o: in function  
`_copy_from_iter_flushcache':
  powerpc64-unknown-linux-gnu-ld:  
/scratch/michael/build/maint/build~/../lib/iov_iter.c:825: undefined  
reference to `memcpy_page_flushcache'
  powerpc64-unknown-linux-gnu-ld:  
/scratch/michael/build/maint/build~/../lib/iov_iter.c:825: undefined  
reference to `memcpy_flushcache'
  powerpc64-unknown-linux-gnu-ld:  
/scratch/michael/build/maint/build~/../lib/iov_iter.c:825: undefined  
reference to `__copy_from_user_flushcache'
  powerpc64-unknown-linux-gnu-ld:  
/scratch/michael/build/maint/build~/../lib/iov_iter.c:825: undefined  
reference to `memcpy_flushcache'





Looks like arch/powerpc/lib/Makefile only builds pmem.o for ppc64

Moving it from obj64-y to obj-y should do it.

I'll update the patch when I'm back in two weeks unless you can fix it before.

Christophe




Re: [PATCH v2 0/5] mm: Enable CONFIG_NODES_SPAN_OTHER_NODES by default for NUMA

2019-07-15 Thread Hoan Tran OS
Hi,

On 7/12/19 10:00 PM, Michal Hocko wrote:
> On Fri 12-07-19 15:37:30, Will Deacon wrote:
>> Hi all,
>>
>> On Fri, Jul 12, 2019 at 02:12:23PM +0200, Michal Hocko wrote:
>>> On Fri 12-07-19 10:56:47, Hoan Tran OS wrote:
>>> [...]
 It would be good if we can enable it by-default. Otherwise, let arch
 enables it by them-self. Do you have any suggestions?
>>>
>>> I can hardly make any suggestions when it is not really clear _why_ you
>>> want to remove this config option in the first place. Please explain
>>> what motivated you to make this change.
>>
>> Sorry, I think this confusion might actually be my fault and Hoan has just
>> been implementing my vague suggestion here:
>>
>> https://lore.kernel.org/linux-arm-kernel/20190625101245.s4vxfosoop52gl4e@willie-the-truck/
>>
>> If the preference of the mm folks is to leave CONFIG_NODES_SPAN_OTHER_NODES
>> as it is, then we can define it for arm64. I just find it a bit weird that
>> the majority of NUMA-capable architectures have to add a symbol in the arch
>> Kconfig file, for what appears to be a performance optimisation applicable
>> only to ia64, mips and sh.
>>
>> At the very least we could make the thing selectable.
> 
> Hmm, I thought this was selectable. But I am obviously wrong here.
> Looking more closely, it seems that this is indeed only about
> __early_pfn_to_nid and as such not something that should add a config
> symbol. This should have been called out in the changelog though.

Yes, do you have any other comments about my patch?

> 
> Also while at it, does HAVE_MEMBLOCK_NODE_MAP fall into a similar
> bucket? Do we have any NUMA architecture that doesn't enable it?
> 

As I checked with arch Kconfig files, there are 2 architectures, riscv 
and microblaze, do not support NUMA but enable this config.

And 1 architecture, alpha, supports NUMA but does not enable this config.

Thanks and Regards
Hoan

> Thanks!
> 


Re: [PATCH] powerpc: remove meaningless KBUILD_ARFLAGS addition

2019-07-15 Thread Segher Boessenkool
On Mon, Jul 15, 2019 at 09:03:46PM +0900, Masahiro Yamada wrote:
> On Mon, Jul 15, 2019 at 4:30 PM Segher Boessenkool
>  wrote:
> >
> > On Mon, Jul 15, 2019 at 05:05:34PM +1000, Michael Ellerman wrote:
> > > Segher Boessenkool  writes:
> > > > Yes, that is why I used the environment variable, all binutils work
> > > > with that.  There was no --target option in GNU ar before 2.22.
> 
> I use binutils 2.30
> It does not understand --target option.
> 
> $ powerpc-linux-ar --version
> GNU ar (GNU Binutils) 2.30
> Copyright (C) 2018 Free Software Foundation, Inc.
> This program is free software; you may redistribute it under the terms of
> the GNU General Public License version 3 or (at your option) any later 
> version.
> This program has absolutely no warranty.
> 
> If I give --target=elf$(BITS)-$(GNUTARGET) option, I see this:
> powerpc-linux-ar: -t: No such file or directory

You need to provide a valid command line, like

$ powerpc-linux-ar tv smth.a --target=elf32-powerpc

ar is a bit weird.


Segher


[PATCH v4] powerpc/setup_64: fix -Wempty-body warnings

2019-07-15 Thread Qian Cai
At the beginning of setup_64.c, it has,

  #ifdef DEBUG
  #define DBG(fmt...) udbg_printf(fmt)
  #else
  #define DBG(fmt...)
  #endif

where DBG() could be compiled away, and generate warnings,

arch/powerpc/kernel/setup_64.c: In function 'initialize_cache_info':
arch/powerpc/kernel/setup_64.c:579:49: warning: suggest braces around
empty body in an 'if' statement [-Wempty-body]
DBG("Argh, can't find dcache properties !\n");
 ^
arch/powerpc/kernel/setup_64.c:582:49: warning: suggest braces around
empty body in an 'if' statement [-Wempty-body]
DBG("Argh, can't find icache properties !\n");

Fix it by using the suggestions from Michael:

"Neither of those sites should use DBG(), that's not really early boot
code, they should just use pr_warn().

And the other uses of DBG() in initialize_cache_info() should just be
removed.

In smp_release_cpus() the entry/exit DBG's should just be removed, and
the spinning_secondaries line should just be pr_debug().

That would just leave the two calls in early_setup(). If we taught
udbg_printf() to return early when udbg_putc is NULL, then we could just
call udbg_printf() unconditionally and get rid of the DBG macro
entirely."

Suggested-by: Michael Ellerman 
Signed-off-by: Qian Cai 
---

v4: Use the suggestions from Michael and __func__ per checkpatch.
v3: Use no_printk() macro, and make sure that format and argument are always
verified by the compiler using a more generic form ##__VA_ARGS__ per Joe.
v2: Fix it by using a NOP while loop per Tyrel.

 arch/powerpc/kernel/setup_64.c | 26 ++
 arch/powerpc/kernel/udbg.c | 14 --
 2 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 44b4c432a273..d2af4c228970 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -68,12 +68,6 @@
 
 #include "setup.h"
 
-#ifdef DEBUG
-#define DBG(fmt...) udbg_printf(fmt)
-#else
-#define DBG(fmt...)
-#endif
-
 int spinning_secondaries;
 u64 ppc64_pft_size;
 
@@ -305,7 +299,7 @@ void __init early_setup(unsigned long dt_ptr)
/* Enable early debugging if any specified (see udbg.h) */
udbg_early_init();
 
-   DBG(" -> early_setup(), dt_ptr: 0x%lx\n", dt_ptr);
+   udbg_printf(" -> %s(), dt_ptr: 0x%lx\n", __func__, dt_ptr);
 
/*
 * Do early initialization using the flattened device
@@ -362,11 +356,11 @@ void __init early_setup(unsigned long dt_ptr)
 */
this_cpu_enable_ftrace();
 
-   DBG(" <- early_setup()\n");
+   udbg_printf(" <- %s()\n", __func__);
 
 #ifdef CONFIG_PPC_EARLY_DEBUG_BOOTX
/*
-* This needs to be done *last* (after the above DBG() even)
+* This needs to be done *last* (after the above udbg_printf() even)
 *
 * Right after we return from this function, we turn on the MMU
 * which means the real-mode access trick that btext does will
@@ -436,8 +430,6 @@ void smp_release_cpus(void)
if (!use_spinloop())
return;
 
-   DBG(" -> smp_release_cpus()\n");
-
/* All secondary cpus are spinning on a common spinloop, release them
 * all now so they can start to spin on their individual paca
 * spinloops. For non SMP kernels, the secondary cpus never get out
@@ -456,9 +448,7 @@ void smp_release_cpus(void)
break;
udelay(1);
}
-   DBG("spinning_secondaries = %d\n", spinning_secondaries);
-
-   DBG(" <- smp_release_cpus()\n");
+   pr_debug("spinning_secondaries = %d\n", spinning_secondaries);
 }
 #endif /* CONFIG_SMP || CONFIG_KEXEC_CORE */
 
@@ -551,8 +541,6 @@ void __init initialize_cache_info(void)
struct device_node *cpu = NULL, *l2, *l3 = NULL;
u32 pvr;
 
-   DBG(" -> initialize_cache_info()\n");
-
/*
 * All shipping POWER8 machines have a firmware bug that
 * puts incorrect information in the device-tree. This will
@@ -576,10 +564,10 @@ void __init initialize_cache_info(void)
 */
if (cpu) {
if (!parse_cache_info(cpu, false, &ppc64_caches.l1d))
-   DBG("Argh, can't find dcache properties !\n");
+   pr_warn("Argh, can't find dcache properties !\n");
 
if (!parse_cache_info(cpu, true, &ppc64_caches.l1i))
-   DBG("Argh, can't find icache properties !\n");
+   pr_warn("Argh, can't find icache properties !\n");
 
/*
 * Try to find the L2 and L3 if any. Assume they are
@@ -604,8 +592,6 @@ void __init initialize_cache_info(void)
 
cur_cpu_spec->dcache_bsize = dcache_bsize;
cur_cpu_spec->icache_bsize = icache_bsize;
-
-   DBG(" <- initialize_cache_info()\n");
 }
 
 /*
diff --git a/arch/powerpc/kernel/udbg.c b/arch/powerpc/kernel/udbg.c
index a384e7c8b01c..01595e8c

Re: [PATCH v2] powerpc/imc: Dont create debugfs files for cpu-less nodes

2019-07-15 Thread Qian Cai
On Thu, 2019-07-11 at 14:53 +1000, Michael Ellerman wrote:
> Hi Maddy,
> 
> Madhavan Srinivasan  writes:
> > diff --git a/arch/powerpc/platforms/powernv/opal-imc.c
> > b/arch/powerpc/platforms/powernv/opal-imc.c
> > index 186109bdd41b..e04b20625cb9 100644
> > --- a/arch/powerpc/platforms/powernv/opal-imc.c
> > +++ b/arch/powerpc/platforms/powernv/opal-imc.c
> > @@ -69,20 +69,20 @@ static void export_imc_mode_and_cmd(struct device_node
> > *node,
> >     if (of_property_read_u32(node, "cb_offset", &cb_offset))
> >     cb_offset = IMC_CNTL_BLK_OFFSET;
> >  
> > -   for_each_node(nid) {
> > -   loc = (u64)(pmu_ptr->mem_info[chip].vbase) + cb_offset;
> > +   while (ptr->vbase != NULL) {
> 
> This means you'll bail out as soon as you find a node with no vbase, but
> it's possible we could have a CPU-less node intermingled with other
> nodes.
> 
> So I think you want to keep the for loop, but continue if you see a NULL
> vbase?

Not sure if this will also takes care of some of those messages during the boot
on today's linux-next even without this patch.


[   18.077780][T1] debugfs: Directory 'imc' with parent 'powerpc' already
present!

where it has the following layout,
# numactl -H
available: 6 nodes (0,8,252-255)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
53 54 55 56 57 58 59 60 61 62 63
node 0 size: 130197 MB
node 0 free: 127453 MB
node 8 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 8 size: 130778 MB
node 8 free: 128119 MB
node 252 cpus:
node 252 size: 0 MB
node 252 free: 0 MB
node 253 cpus:
node 253 size: 0 MB
node 253 free: 0 MB
node 254 cpus:
node 254 size: 0 MB
node 254 free: 0 MB
node 255 cpus:
node 255 size: 0 MB
node 255 free: 0 MB
node distances:
node   0   8  252  253  254  255 
  0:  10  40  80  80  80  80 
  8:  40  10  80  80  80  80 
 252:  80  80  10  80  80  80 
 253:  80  80  80  10  80  80 
 254:  80  80  80  80  10  80 
 255:  80  80  80  80  80  10

> 
> 
> > +   loc = (u64)(ptr->vbase) + cb_offset;
> >     imc_mode_addr = (u64 *)(loc + IMC_CNTL_BLK_MODE_OFFSET);
> > -   sprintf(mode, "imc_mode_%d", nid);
> > +   sprintf(mode, "imc_mode_%d", (u32)(ptr->id));
> >     if (!imc_debugfs_create_x64(mode, 0600, imc_debugfs_parent,
> >     imc_mode_addr))
> >     goto err;
> >  
> >     imc_cmd_addr = (u64 *)(loc + IMC_CNTL_BLK_CMD_OFFSET);
> > -   sprintf(cmd, "imc_cmd_%d", nid);
> > +   sprintf(cmd, "imc_cmd_%d", (u32)(ptr->id));
> >     if (!imc_debugfs_create_x64(cmd, 0600, imc_debugfs_parent,
> >     imc_cmd_addr))
> >     goto err;
> > -   chip++;
> > +   ptr++;
> >     }
> >     return;
> 
> cheers


Re: [PATCH v2 3/5] x86: Kconfig: Remove CONFIG_NODES_SPAN_OTHER_NODES

2019-07-15 Thread Thomas Gleixner
On Thu, 11 Jul 2019, Hoan Tran OS wrote:

> Remove CONFIG_NODES_SPAN_OTHER_NODES as it's enabled
> by default with NUMA.

As I told you before this does not mention that the option is now enabled
even for x86(32bit) configurations which did not enable it before and does
not longer depend on X86_64_ACPI_NUMA.

And there is still no rationale why this makes sense.

Thanks,

tglx


Re: [PATCH 1/3] x86, s390: Move ARCH_HAS_MEM_ENCRYPT definition to arch/Kconfig

2019-07-15 Thread Thiago Jung Bauermann


Hello Janani,

Thanks for reviewing the patch.

janani  writes:

> On 2019-07-12 23:45, Thiago Jung Bauermann wrote:
>> powerpc is also going to use this feature, so put it in a generic location.
>>
>> Signed-off-by: Thiago Jung Bauermann 
>> Reviewed-by: Thomas Gleixner 
>> ---
>>  arch/Kconfig  | 3 +++
>>  arch/s390/Kconfig | 3 ---
>>  arch/x86/Kconfig  | 4 +---
>>  3 files changed, 4 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index c47b328eada0..4ef3499d4480 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -927,6 +927,9 @@ config LOCK_EVENT_COUNTS
>>the chance of application behavior change because of timing
>>differences. The counts are reported via debugfs.
>>
>> +config ARCH_HAS_MEM_ENCRYPT
>> +bool
>> +
>>  source "kernel/gcov/Kconfig"
>>
>>  source "scripts/gcc-plugins/Kconfig"
>> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
>> index 5d8570ed6cab..f820e631bf89 100644
>> --- a/arch/s390/Kconfig
>> +++ b/arch/s390/Kconfig
>> @@ -1,7 +1,4 @@
>>  # SPDX-License-Identifier: GPL-2.0
>> -config ARCH_HAS_MEM_ENCRYPT
>> -def_bool y
>> -
>
>  Since you are removing the "def_bool y" when ARCH_HAS_MEM_ENCRYPT is moved to
> arch/Kconfig, does the s390/Kconfig need "select ARCH_HAS_MEM_ENCRYPT" added
> like you do for x86/Kconfig?

Indeed, I missed that. Thanks for spotting it!

>
>  - Janani
>
>>  config MMU
>>  def_bool y
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index c9f331bb538b..5d3295f2df94 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -68,6 +68,7 @@ config X86
>>  select ARCH_HAS_FORTIFY_SOURCE
>>  select ARCH_HAS_GCOV_PROFILE_ALL
>>  select ARCH_HAS_KCOVif X86_64
>> +select ARCH_HAS_MEM_ENCRYPT
>>  select ARCH_HAS_MEMBARRIER_SYNC_CORE
>>  select ARCH_HAS_PMEM_APIif X86_64
>>  select ARCH_HAS_PTE_SPECIAL
>> @@ -1520,9 +1521,6 @@ config X86_CPA_STATISTICS
>>helps to determine the effectiveness of preserving large and huge
>>page mappings when mapping protections are changed.
>>
>> -config ARCH_HAS_MEM_ENCRYPT
>> -def_bool y
>> -
>>  config AMD_MEM_ENCRYPT
>>  bool "AMD Secure Memory Encryption (SME) support"
>>  depends on X86_64 && CPU_SUP_AMD


-- 
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [PATCH 3/3] fs/core/vmcore: Move sev_active() reference to x86 arch code

2019-07-15 Thread Thiago Jung Bauermann


Christoph Hellwig  writes:

> On Mon, Jul 15, 2019 at 04:03:17PM +0200, Halil Pasic wrote:
>> > I thought about that but couldn't put my finger on a general concept.
>> > Is it "guest with memory inaccessible to the host"?
>> >
>>
>> Well, force_dma_unencrypted() is a much better name thatn sev_active():
>> s390 has no AMD SEV, that is sure, but for virtio to work we do need to
>> make our dma accessible to the hypervisor. Yes, your "guest with memory
>> inaccessible to the host" shows into the right direction IMHO.
>> Unfortunately I don't have too many cycles to spend on this right now.
>
> In x86 it means that we need to remove dma encryption using
> set_memory_decrypted before using it for DMA purposes.  In the SEV
> case that seems to be so that the hypervisor can access it, in the SME
> case that Tom just fixes it is because there is an encrypted bit set
> in the physical address, and if the device doesn't support a large
> enough DMA address the direct mapping code has to encrypt the pages
> used for the contigous allocation.
>
>> Being on cc for your patch made me realize that things got broken on
>> s390. Thanks! I've sent out a patch that fixes protvirt, but we are going
>> to benefit from your cleanups. I think with your cleanups and that patch
>> of mine both sev_active() and sme_active() can be removed. Feel free to
>> do so. If not, I can attend to it as well.
>
> Yes, I think with the dma-mapping fix and this series sme_active and
> sev_active should be gone from common code.  We should also be able
> to remove the exports x86 has for them.
>
> I'll wait a few days and will then feed the dma-mapping fix to Linus,
> it might make sense to either rebase Thiagos series on top of the
> dma-mapping for-next branch, or wait a few days before reposting.

I'll rebase on top of dma-mapping/for-next and do the break up of patch
2 that you mentioned as well.

--
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-07-15 Thread Thiago Jung Bauermann


Michael S. Tsirkin  writes:

> On Sun, Jul 14, 2019 at 02:51:18AM -0300, Thiago Jung Bauermann wrote:
>>
>>
>> Michael S. Tsirkin  writes:
>>
>> > So this is what I would call this option:
>> >
>> > VIRTIO_F_ACCESS_PLATFORM_IDENTITY_ADDRESS
>> >
>> > and the explanation should state that all device
>> > addresses are translated by the platform to identical
>> > addresses.
>> >
>> > In fact this option then becomes more, not less restrictive
>> > than VIRTIO_F_ACCESS_PLATFORM - it's a promise
>> > by guest to only create identity mappings,
>> > and only before driver_ok is set.
>> > This option then would always be negotiated together with
>> > VIRTIO_F_ACCESS_PLATFORM.
>> >
>> > Host then must verify that
>> > 1. full 1:1 mappings are created before driver_ok
>> > or can we make sure this happens before features_ok?
>> > that would be ideal as we could require that features_ok fails
>> > 2. mappings are not modified between driver_ok and reset
>> > i guess attempts to change them will fail -
>> > possibly by causing a guest crash
>> > or some other kind of platform-specific error
>>
>> I think VIRTIO_F_ACCESS_PLATFORM_IDENTITY_ADDRESS is good, but requiring
>> it to be accompanied by ACCESS_PLATFORM can be a problem. One reason is
>> SLOF as I mentioned above, another is that we would be requiring all
>> guests running on the machine (secure guests or not, since we would use
>> the same configuration for all guests) to support it. But
>> ACCESS_PLATFORM is relatively recent so it's a bit early for that. For
>> instance, Ubuntu 16.04 LTS (which is still supported) doesn't know about
>> it and wouldn't be able to use the device.
>
> OK and your target is to enable use with kernel drivers within
> guests, right?

Right.

> My question is, we are defining a new flag here, I guess old guests
> then do not set it. How does it help old guests? Or maybe it's
> not designed to ...

Indeed. The idea is that QEMU can offer the flag, old guests can reject
it (or even new guests can reject it, if they decide not to convert into
secure VMs) and the feature negotiation will succeed with the flag
unset.

--
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-07-15 Thread Michael S. Tsirkin
On Mon, Jul 15, 2019 at 05:29:06PM -0300, Thiago Jung Bauermann wrote:
> 
> Michael S. Tsirkin  writes:
> 
> > On Sun, Jul 14, 2019 at 02:51:18AM -0300, Thiago Jung Bauermann wrote:
> >>
> >>
> >> Michael S. Tsirkin  writes:
> >>
> >> > So this is what I would call this option:
> >> >
> >> > VIRTIO_F_ACCESS_PLATFORM_IDENTITY_ADDRESS
> >> >
> >> > and the explanation should state that all device
> >> > addresses are translated by the platform to identical
> >> > addresses.
> >> >
> >> > In fact this option then becomes more, not less restrictive
> >> > than VIRTIO_F_ACCESS_PLATFORM - it's a promise
> >> > by guest to only create identity mappings,
> >> > and only before driver_ok is set.
> >> > This option then would always be negotiated together with
> >> > VIRTIO_F_ACCESS_PLATFORM.
> >> >
> >> > Host then must verify that
> >> > 1. full 1:1 mappings are created before driver_ok
> >> > or can we make sure this happens before features_ok?
> >> > that would be ideal as we could require that features_ok fails
> >> > 2. mappings are not modified between driver_ok and reset
> >> > i guess attempts to change them will fail -
> >> > possibly by causing a guest crash
> >> > or some other kind of platform-specific error
> >>
> >> I think VIRTIO_F_ACCESS_PLATFORM_IDENTITY_ADDRESS is good, but requiring
> >> it to be accompanied by ACCESS_PLATFORM can be a problem. One reason is
> >> SLOF as I mentioned above, another is that we would be requiring all
> >> guests running on the machine (secure guests or not, since we would use
> >> the same configuration for all guests) to support it. But
> >> ACCESS_PLATFORM is relatively recent so it's a bit early for that. For
> >> instance, Ubuntu 16.04 LTS (which is still supported) doesn't know about
> >> it and wouldn't be able to use the device.
> >
> > OK and your target is to enable use with kernel drivers within
> > guests, right?
> 
> Right.
> 
> > My question is, we are defining a new flag here, I guess old guests
> > then do not set it. How does it help old guests? Or maybe it's
> > not designed to ...
> 
> Indeed. The idea is that QEMU can offer the flag, old guests can reject
> it (or even new guests can reject it, if they decide not to convert into
> secure VMs) and the feature negotiation will succeed with the flag
> unset.

OK. And then what does QEMU do? Assume guest is not encrypted I guess?

> --
> Thiago Jung Bauermann
> IBM Linux Technology Center


Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-07-15 Thread Thiago Jung Bauermann


Michael S. Tsirkin  writes:

> On Mon, Jul 15, 2019 at 05:29:06PM -0300, Thiago Jung Bauermann wrote:
>>
>> Michael S. Tsirkin  writes:
>>
>> > On Sun, Jul 14, 2019 at 02:51:18AM -0300, Thiago Jung Bauermann wrote:
>> >>
>> >>
>> >> Michael S. Tsirkin  writes:
>> >>
>> >> > So this is what I would call this option:
>> >> >
>> >> > VIRTIO_F_ACCESS_PLATFORM_IDENTITY_ADDRESS
>> >> >
>> >> > and the explanation should state that all device
>> >> > addresses are translated by the platform to identical
>> >> > addresses.
>> >> >
>> >> > In fact this option then becomes more, not less restrictive
>> >> > than VIRTIO_F_ACCESS_PLATFORM - it's a promise
>> >> > by guest to only create identity mappings,
>> >> > and only before driver_ok is set.
>> >> > This option then would always be negotiated together with
>> >> > VIRTIO_F_ACCESS_PLATFORM.
>> >> >
>> >> > Host then must verify that
>> >> > 1. full 1:1 mappings are created before driver_ok
>> >> > or can we make sure this happens before features_ok?
>> >> > that would be ideal as we could require that features_ok fails
>> >> > 2. mappings are not modified between driver_ok and reset
>> >> > i guess attempts to change them will fail -
>> >> > possibly by causing a guest crash
>> >> > or some other kind of platform-specific error
>> >>
>> >> I think VIRTIO_F_ACCESS_PLATFORM_IDENTITY_ADDRESS is good, but requiring
>> >> it to be accompanied by ACCESS_PLATFORM can be a problem. One reason is
>> >> SLOF as I mentioned above, another is that we would be requiring all
>> >> guests running on the machine (secure guests or not, since we would use
>> >> the same configuration for all guests) to support it. But
>> >> ACCESS_PLATFORM is relatively recent so it's a bit early for that. For
>> >> instance, Ubuntu 16.04 LTS (which is still supported) doesn't know about
>> >> it and wouldn't be able to use the device.
>> >
>> > OK and your target is to enable use with kernel drivers within
>> > guests, right?
>>
>> Right.
>>
>> > My question is, we are defining a new flag here, I guess old guests
>> > then do not set it. How does it help old guests? Or maybe it's
>> > not designed to ...
>>
>> Indeed. The idea is that QEMU can offer the flag, old guests can reject
>> it (or even new guests can reject it, if they decide not to convert into
>> secure VMs) and the feature negotiation will succeed with the flag
>> unset.
>
> OK. And then what does QEMU do? Assume guest is not encrypted I guess?

There's nothing different that QEMU needs to do, with or without the
flag. the perspective of the host, a secure guest and a regular guest
work the same way with respect to virtio.

--
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-07-15 Thread Michael S. Tsirkin
On Mon, Jul 15, 2019 at 07:03:03PM -0300, Thiago Jung Bauermann wrote:
> 
> Michael S. Tsirkin  writes:
> 
> > On Mon, Jul 15, 2019 at 05:29:06PM -0300, Thiago Jung Bauermann wrote:
> >>
> >> Michael S. Tsirkin  writes:
> >>
> >> > On Sun, Jul 14, 2019 at 02:51:18AM -0300, Thiago Jung Bauermann wrote:
> >> >>
> >> >>
> >> >> Michael S. Tsirkin  writes:
> >> >>
> >> >> > So this is what I would call this option:
> >> >> >
> >> >> > VIRTIO_F_ACCESS_PLATFORM_IDENTITY_ADDRESS
> >> >> >
> >> >> > and the explanation should state that all device
> >> >> > addresses are translated by the platform to identical
> >> >> > addresses.
> >> >> >
> >> >> > In fact this option then becomes more, not less restrictive
> >> >> > than VIRTIO_F_ACCESS_PLATFORM - it's a promise
> >> >> > by guest to only create identity mappings,
> >> >> > and only before driver_ok is set.
> >> >> > This option then would always be negotiated together with
> >> >> > VIRTIO_F_ACCESS_PLATFORM.
> >> >> >
> >> >> > Host then must verify that
> >> >> > 1. full 1:1 mappings are created before driver_ok
> >> >> > or can we make sure this happens before features_ok?
> >> >> > that would be ideal as we could require that features_ok fails
> >> >> > 2. mappings are not modified between driver_ok and reset
> >> >> > i guess attempts to change them will fail -
> >> >> > possibly by causing a guest crash
> >> >> > or some other kind of platform-specific error
> >> >>
> >> >> I think VIRTIO_F_ACCESS_PLATFORM_IDENTITY_ADDRESS is good, but requiring
> >> >> it to be accompanied by ACCESS_PLATFORM can be a problem. One reason is
> >> >> SLOF as I mentioned above, another is that we would be requiring all
> >> >> guests running on the machine (secure guests or not, since we would use
> >> >> the same configuration for all guests) to support it. But
> >> >> ACCESS_PLATFORM is relatively recent so it's a bit early for that. For
> >> >> instance, Ubuntu 16.04 LTS (which is still supported) doesn't know about
> >> >> it and wouldn't be able to use the device.
> >> >
> >> > OK and your target is to enable use with kernel drivers within
> >> > guests, right?
> >>
> >> Right.
> >>
> >> > My question is, we are defining a new flag here, I guess old guests
> >> > then do not set it. How does it help old guests? Or maybe it's
> >> > not designed to ...
> >>
> >> Indeed. The idea is that QEMU can offer the flag, old guests can reject
> >> it (or even new guests can reject it, if they decide not to convert into
> >> secure VMs) and the feature negotiation will succeed with the flag
> >> unset.
> >
> > OK. And then what does QEMU do? Assume guest is not encrypted I guess?
> 
> There's nothing different that QEMU needs to do, with or without the
> flag. the perspective of the host, a secure guest and a regular guest
> work the same way with respect to virtio.

OK. So now let's get back to implementation. What will
Linux guest driver do? It can't activate DMA API blindly since that
will assume translation also works, right?
Or do we somehow limit it to just a specific platform?

> --
> Thiago Jung Bauermann
> IBM Linux Technology Center


Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-07-15 Thread Thiago Jung Bauermann


Michael S. Tsirkin  writes:

> On Mon, Jul 15, 2019 at 07:03:03PM -0300, Thiago Jung Bauermann wrote:
>> 
>> Michael S. Tsirkin  writes:
>> 
>> > On Mon, Jul 15, 2019 at 05:29:06PM -0300, Thiago Jung Bauermann wrote:
>> >>
>> >> Michael S. Tsirkin  writes:
>> >>
>> >> > On Sun, Jul 14, 2019 at 02:51:18AM -0300, Thiago Jung Bauermann wrote:
>> >> >>
>> >> >>
>> >> >> Michael S. Tsirkin  writes:
>> >> >>
>> >> >> > So this is what I would call this option:
>> >> >> >
>> >> >> > VIRTIO_F_ACCESS_PLATFORM_IDENTITY_ADDRESS
>> >> >> >
>> >> >> > and the explanation should state that all device
>> >> >> > addresses are translated by the platform to identical
>> >> >> > addresses.
>> >> >> >
>> >> >> > In fact this option then becomes more, not less restrictive
>> >> >> > than VIRTIO_F_ACCESS_PLATFORM - it's a promise
>> >> >> > by guest to only create identity mappings,
>> >> >> > and only before driver_ok is set.
>> >> >> > This option then would always be negotiated together with
>> >> >> > VIRTIO_F_ACCESS_PLATFORM.
>> >> >> >
>> >> >> > Host then must verify that
>> >> >> > 1. full 1:1 mappings are created before driver_ok
>> >> >> > or can we make sure this happens before features_ok?
>> >> >> > that would be ideal as we could require that features_ok fails
>> >> >> > 2. mappings are not modified between driver_ok and reset
>> >> >> > i guess attempts to change them will fail -
>> >> >> > possibly by causing a guest crash
>> >> >> > or some other kind of platform-specific error
>> >> >>
>> >> >> I think VIRTIO_F_ACCESS_PLATFORM_IDENTITY_ADDRESS is good, but 
>> >> >> requiring
>> >> >> it to be accompanied by ACCESS_PLATFORM can be a problem. One reason is
>> >> >> SLOF as I mentioned above, another is that we would be requiring all
>> >> >> guests running on the machine (secure guests or not, since we would use
>> >> >> the same configuration for all guests) to support it. But
>> >> >> ACCESS_PLATFORM is relatively recent so it's a bit early for that. For
>> >> >> instance, Ubuntu 16.04 LTS (which is still supported) doesn't know 
>> >> >> about
>> >> >> it and wouldn't be able to use the device.
>> >> >
>> >> > OK and your target is to enable use with kernel drivers within
>> >> > guests, right?
>> >>
>> >> Right.
>> >>
>> >> > My question is, we are defining a new flag here, I guess old guests
>> >> > then do not set it. How does it help old guests? Or maybe it's
>> >> > not designed to ...
>> >>
>> >> Indeed. The idea is that QEMU can offer the flag, old guests can reject
>> >> it (or even new guests can reject it, if they decide not to convert into
>> >> secure VMs) and the feature negotiation will succeed with the flag
>> >> unset.
>> >
>> > OK. And then what does QEMU do? Assume guest is not encrypted I guess?
>> 
>> There's nothing different that QEMU needs to do, with or without the
>> flag. the perspective of the host, a secure guest and a regular guest
>> work the same way with respect to virtio.
>
> OK. So now let's get back to implementation. What will
> Linux guest driver do? It can't activate DMA API blindly since that
> will assume translation also works, right?

It can on pseries, because we always have a 1:1 window mapping the whole
guest memory.

> Or do we somehow limit it to just a specific platform?

Yes, we want to accept the new flag only on secure pseries guests.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-07-15 Thread Benjamin Herrenschmidt
On Mon, 2019-07-15 at 19:03 -0300, Thiago Jung Bauermann wrote:
> > > Indeed. The idea is that QEMU can offer the flag, old guests can
> > > reject
> > > it (or even new guests can reject it, if they decide not to
> > > convert into
> > > secure VMs) and the feature negotiation will succeed with the
> > > flag
> > > unset.
> > 
> > OK. And then what does QEMU do? Assume guest is not encrypted I
> > guess?
> 
> There's nothing different that QEMU needs to do, with or without the
> flag. the perspective of the host, a secure guest and a regular guest
> work the same way with respect to virtio.

This is *precisely* why I was against adding a flag and touch the
protocol negociation with qemu in the first place, back when I cared
about that stuff...

Guys, this has gone in circles over and over again.

This has nothing to do with qemu. Qemu doesn't need to know about this.
It's entirely guest local. This is why the one-liner in virtio was a
far better and simpler solution.

This is something the guest does to itself (with the participation of a
ultravisor but that's not something qemu cares about at this stage, at
least not as far as virtio is concerned).

Basically, the guest "hides" its memory from the host using a HW secure
memory facility. As a result, it needs to ensure that all of its DMA
pages are bounced through insecure pages that aren't hidden. That's it,
it's all guest side. Qemu shouldn't have to care about it at all.

Cheers,
Ben.




[PATCH] PPC: Set reserved PCR bits

2019-07-15 Thread Jordan Niethe
Currently the reserved bits of the Processor Compatibility Register
(PCR) are cleared as per the Programming Note in Section 1.3.3 of the
ISA.  An update is planned for the ISA so that PCR reserved bits should
be set. Set the reserved bits of the PCR as required.

Acked-by: Alistair Popple 
Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/reg.h  |  3 +++
 arch/powerpc/kernel/cpu_setup_power.S   |  6 ++
 arch/powerpc/kernel/dt_cpu_ftrs.c   |  3 ++-
 arch/powerpc/kvm/book3s_hv.c| 11 +++
 arch/powerpc/kvm/book3s_hv_nested.c |  6 +++---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 10 ++
 6 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 10caa145f98b..2e0815994f4d 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -475,6 +475,7 @@
 #define   PCR_VEC_DIS  (1ul << (63-0)) /* Vec. disable (bit NA since POWER8) */
 #define   PCR_VSX_DIS  (1ul << (63-1)) /* VSX disable (bit NA since POWER8) */
 #define   PCR_TM_DIS   (1ul << (63-2)) /* Trans. memory disable (POWER8) */
+#define   PCR_HIGH_BITS(PCR_VEC_DIS | PCR_VSX_DIS | PCR_TM_DIS)
 /*
  * These bits are used in the function kvmppc_set_arch_compat() to specify and
  * determine both the compatibility level which we want to emulate and the
@@ -483,6 +484,8 @@
 #define   PCR_ARCH_207 0x8 /* Architecture 2.07 */
 #define   PCR_ARCH_206 0x4 /* Architecture 2.06 */
 #define   PCR_ARCH_205 0x2 /* Architecture 2.05 */
+#define   PCR_LOW_BITS (PCR_ARCH_207 | PCR_ARCH_206 | PCR_ARCH_205)
+#define   PCR_MASK ~(PCR_HIGH_BITS | PCR_LOW_BITS) /* PCR Reserved Bits */
 #defineSPRN_HEIR   0x153   /* Hypervisor Emulated Instruction 
Register */
 #define SPRN_TLBINDEXR 0x154   /* P7 TLB control register */
 #define SPRN_TLBVPNR   0x155   /* P7 TLB control register */
diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
b/arch/powerpc/kernel/cpu_setup_power.S
index 3239a9fe6c1c..a460298c7ddb 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -23,6 +23,7 @@ _GLOBAL(__setup_cpu_power7)
beqlr
li  r0,0
mtspr   SPRN_LPID,r0
+   LOAD_REG_IMMEDIATE(r0, PCR_MASK)
mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
li  r4,(LPCR_LPES1 >> LPCR_LPES_SH)
@@ -37,6 +38,7 @@ _GLOBAL(__restore_cpu_power7)
beqlr
li  r0,0
mtspr   SPRN_LPID,r0
+   LOAD_REG_IMMEDIATE(r0, PCR_MASK)
mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
li  r4,(LPCR_LPES1 >> LPCR_LPES_SH)
@@ -54,6 +56,7 @@ _GLOBAL(__setup_cpu_power8)
beqlr
li  r0,0
mtspr   SPRN_LPID,r0
+   LOAD_REG_IMMEDIATE(r0, PCR_MASK)
mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
ori r3, r3, LPCR_PECEDH
@@ -76,6 +79,7 @@ _GLOBAL(__restore_cpu_power8)
beqlr
li  r0,0
mtspr   SPRN_LPID,r0
+   LOAD_REG_IMMEDIATE(r0, PCR_MASK)
mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
ori r3, r3, LPCR_PECEDH
@@ -98,6 +102,7 @@ _GLOBAL(__setup_cpu_power9)
mtspr   SPRN_PSSCR,r0
mtspr   SPRN_LPID,r0
mtspr   SPRN_PID,r0
+   LOAD_REG_IMMEDIATE(r0, PCR_MASK)
mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
LOAD_REG_IMMEDIATE(r4, LPCR_PECEDH | LPCR_PECE_HVEE | LPCR_HVICE  | 
LPCR_HEIC)
@@ -123,6 +128,7 @@ _GLOBAL(__restore_cpu_power9)
mtspr   SPRN_PSSCR,r0
mtspr   SPRN_LPID,r0
mtspr   SPRN_PID,r0
+   LOAD_REG_IMMEDIATE(r0, PCR_MASK)
mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
LOAD_REG_IMMEDIATE(r4, LPCR_PECEDH | LPCR_PECE_HVEE | LPCR_HVICE | 
LPCR_HEIC)
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 0e4c7c1f5c3e..7f2858e3e56a 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -102,7 +102,7 @@ static void __restore_cpu_cpufeatures(void)
if (hv_mode) {
mtspr(SPRN_LPID, 0);
mtspr(SPRN_HFSCR, system_registers.hfscr);
-   mtspr(SPRN_PCR, 0);
+   mtspr(SPRN_PCR, PCR_MASK);
}
mtspr(SPRN_FSCR, system_registers.fscr);
 
@@ -145,6 +145,7 @@ static void __init cpufeatures_setup_cpu(void)
mtspr(SPRN_HFSCR, 0);
}
mtspr(SPRN_FSCR, 0);
+   mtspr(SPRN_PCR, PCR_MASK);
 
/*
 * LPCR does not get cleared, to match behaviour with secondaries
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 76b1801aa44a..fb1debaa5a7c 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -401,8 +401,11 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, 
u32 arch_compat)
 
spin_lock(&vc->lock);
vc->arch_compat = arch_compat;
-   /* Set all PCR bits for whi

Re: [PATCH] powerpc/nvdimm: Pick the nearby online node if the device node is not online

2019-07-15 Thread Aneesh Kumar K.V
"Aneesh Kumar K.V"  writes:

> This is similar to what ACPI does. Nvdimm layer doesn't bring the SCM device
> numa node online. Hence we need to make sure we always use an online node
> as ndr_desc.numa_node. Otherwise this result in kernel crashes. The target
> node is used by dax/kmem and that will bring up the numa node online 
> correctly.
>
> Without this patch, we do hit kernel crash as below because we try to access
> uninitialized NODE_DATA in different code paths.
>
> cpu 0x0: Vector: 300 (Data Access) at [c000fac53170]
> pc: c04bbc50: ___slab_alloc+0x120/0xca0
> lr: c04bc834: __slab_alloc+0x64/0xc0
> sp: c000fac53400
>msr: 82009033
>dar: 73e8
>  dsisr: 8
>   current = 0xc000fabb6d80
>   paca= 0xc387   irqmask: 0x03   irq_happened: 0x01
> pid   = 7, comm = kworker/u16:0
> Linux version 5.2.0-06234-g76bd729b2644 (kvaneesh@ltc-boston123) (gcc version 
> 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #135 SMP Thu Jul 11 05:36:30 CDT 2019
> enter ? for help
> [link register   ] c04bc834 __slab_alloc+0x64/0xc0
> [c000fac53400] c000fac53480 (unreliable)
> [c000fac53500] c04bc818 __slab_alloc+0x48/0xc0
> [c000fac53560] c04c30a0 __kmalloc_node_track_caller+0x3c0/0x6b0
> [c000fac535d0] c0cfafe4 devm_kmalloc+0x74/0xc0
> [c000fac53600] c0d69434 nd_region_activate+0x144/0x560
> [c000fac536d0] c0d6b19c nd_region_probe+0x17c/0x370
> [c000fac537b0] c0d6349c nvdimm_bus_probe+0x10c/0x230
> [c000fac53840] c0cf3cc4 really_probe+0x254/0x4e0
> [c000fac538d0] c0cf429c driver_probe_device+0x16c/0x1e0
> [c000fac53950] c0cf0b44 bus_for_each_drv+0x94/0x130
> [c000fac539b0] c0cf392c __device_attach+0xdc/0x200
> [c000fac53a50] c0cf231c bus_probe_device+0x4c/0xf0
> [c000fac53a90] c0ced268 device_add+0x528/0x810
> [c000fac53b60] c0d62a58 nd_async_device_register+0x28/0xa0
> [c000fac53bd0] c01ccb8c async_run_entry_fn+0xcc/0x1f0
> [c000fac53c50] c01bcd9c process_one_work+0x46c/0x860
> [c000fac53d20] c01bd4f4 worker_thread+0x364/0x5f0
> [c000fac53db0] c01c7260 kthread+0x1b0/0x1c0
> [c000fac53e20] c000b954 ret_from_kernel_thread+0x5c/0x68
>
> With the patch we get
>
>  # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31
> node 1 size: 130865 MB
> node 1 free: 129130 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>  # cat /sys/bus/nd/devices/region0/numa_node
> 0
>  # dmesg | grep papr_scm
> [   91.332305] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Region 
> registered with target node 2 and online node 0
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 30 +--
>  1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index c8ec670ee924..4abb0ecda30a 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -255,12 +255,32 @@ static const struct attribute_group 
> *papr_scm_dimm_groups[] = {
>   NULL,
>  };
>  
> +static inline int papr_scm_node(int node)
> +{
> + int min_dist = INT_MAX, dist;
> + int nid, min_node;
> +
> + if (node_online(node))
> + return node;

We should handle NUMA_NO_NODE here.

modified   arch/powerpc/platforms/pseries/papr_scm.c
@@ -260,7 +260,7 @@ static inline int papr_scm_node(int node)
int min_dist = INT_MAX, dist;
int nid, min_node;
 
-   if (node_online(node))
+   if ((node == NUMA_NO_NODE) || node_online(node))
return node;
 
min_node = first_online_node;

Will send an updated patch.

> +
> + min_node = first_online_node;
> + for_each_online_node(nid) {
> + dist = node_distance(node, nid);
> + if (dist < min_dist) {
> + min_dist = dist;
> + min_node = nid;
> + }
> + }
> + return min_node;
> +}
> +
>  static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>  {
>   struct device *dev = &p->pdev->dev;
>   struct nd_mapping_desc mapping;
>   struct nd_region_desc ndr_desc;
>   unsigned long dimm_flags;
> + int target_nid, online_nid;
>  
>   p->bus_desc.ndctl = papr_scm_ndctl;
>   p->bus_desc.module = THIS_MODULE;
> @@ -299,8 +319,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>  
>   memset(&ndr_desc, 0, sizeof(ndr_desc));
>   ndr_desc.attr_groups = region_attr_groups;
> - ndr_desc.numa_node = dev_to_node(&p->pdev->dev);
> - ndr_desc.target_node = ndr_desc.numa_node;
> + target_nid = dev_to_n

Re: [PATCH 1/5] powerpc/eeh_cache: Don't use pci_dn when inserting new ranges

2019-07-15 Thread Sam Bobroff
On Mon, Jul 15, 2019 at 06:56:08PM +1000, Oliver O'Halloran wrote:
> At the point where we start inserting ranges into the EEH address cache the
> binding between pci_dev and eeh_dev has already been set up. Instead of
> consulting the pci_dn tree we can retrieve the eeh_dev directly using
> pci_dev_to_eeh_dev().
> 
> Signed-off-by: Oliver O'Halloran 

In fact the binding between pci_dev and eeh_dev is set up right before
the calls to eeh_addr_cache_insert_dev() (see eeh_addr_cache_build() and
eeh_add_device_tree_late()) so this looks clearly correct to me.

A few simple tests of EEH recovery still succeed as well.

Reviewed-by: Sam Bobroff 
Tested-by: Sam Bobroff 

> ---
>  arch/powerpc/kernel/eeh_cache.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/eeh_cache.c b/arch/powerpc/kernel/eeh_cache.c
> index 3204723..908ba69 100644
> --- a/arch/powerpc/kernel/eeh_cache.c
> +++ b/arch/powerpc/kernel/eeh_cache.c
> @@ -156,18 +156,10 @@ eeh_addr_cache_insert(struct pci_dev *dev, 
> resource_size_t alo,
>  
>  static void __eeh_addr_cache_insert_dev(struct pci_dev *dev)
>  {
> - struct pci_dn *pdn;
>   struct eeh_dev *edev;
>   int i;
>  
> - pdn = pci_get_pdn_by_devfn(dev->bus, dev->devfn);
> - if (!pdn) {
> - pr_warn("PCI: no pci dn found for dev=%s\n",
> - pci_name(dev));
> - return;
> - }
> -
> - edev = pdn_to_eeh_dev(pdn);
> + edev = pci_dev_to_eeh_dev(dev);
>   if (!edev) {
>   pr_warn("PCI: no EEH dev found for %s\n",
>   pci_name(dev));
> -- 
> 2.9.5
> 


signature.asc
Description: PGP signature


Re: [PATCH 2/5] powerpc/eeh_sysfs: Fix incorrect comment

2019-07-15 Thread Sam Bobroff
On Mon, Jul 15, 2019 at 06:56:09PM +1000, Oliver O'Halloran wrote:
> The EEH_ATTR_SHOW() helper is used to display fields from struct eeh_dev
> not struct pci_dn.
> 
> Signed-off-by: Oliver O'Halloran 

Good.

Reviewed-by: Sam Bobroff 

> ---
>  arch/powerpc/kernel/eeh_sysfs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/eeh_sysfs.c b/arch/powerpc/kernel/eeh_sysfs.c
> index 3fa04dd..6a2c2886f 100644
> --- a/arch/powerpc/kernel/eeh_sysfs.c
> +++ b/arch/powerpc/kernel/eeh_sysfs.c
> @@ -30,7 +30,7 @@
>  /**
>   * EEH_SHOW_ATTR -- Create sysfs entry for eeh statistic
>   * @_name: name of file in sysfs directory
> - * @_memb: name of member in struct pci_dn to access
> + * @_memb: name of member in struct eeh_dev to access
>   * @_format: printf format for display
>   *
>   * All of the attributes look very similar, so just
> -- 
> 2.9.5
> 


signature.asc
Description: PGP signature


Re: [PATCH 3/5] powerpc/eeh_sysfs: ifdef pseries sr-iov sysfs properties

2019-07-15 Thread Sam Bobroff
On Mon, Jul 15, 2019 at 06:56:10PM +1000, Oliver O'Halloran wrote:
> There are several EEH sysfs properties that only exists when the
> "ibm,is-open-sriov-pf" property appears in the device tree node of the PCI
> device. This used on pseries to indicate to the guest that the hypervisor
> allows the guest to configure the SR-IOV capability. Doing this requires
> some handshaking between the guest, hypervisor and userspace when a VF is
> EEH frozen which is why these properties exist.
> 
> This is all dead code on non-pseries platforms so wrap it in an #ifdef
> CONFIG_PPC_PSERIES to make the dependency clearer.

Looks good and a test compile without pSeries platform support
configured seemed to work, so:

Tested-by: Sam Bobroff 
Reviewed-by: Sam Bobroff 

> 
> Signed-off-by: Oliver O'Halloran 
> ---
>  arch/powerpc/kernel/eeh_sysfs.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/eeh_sysfs.c b/arch/powerpc/kernel/eeh_sysfs.c
> index 6a2c2886f..3adf8cd 100644
> --- a/arch/powerpc/kernel/eeh_sysfs.c
> +++ b/arch/powerpc/kernel/eeh_sysfs.c
> @@ -91,7 +91,7 @@ static ssize_t eeh_pe_state_store(struct device *dev,
>  
>  static DEVICE_ATTR_RW(eeh_pe_state);
>  
> -#ifdef CONFIG_PCI_IOV
> +#if defined(CONFIG_PCI_IOV) && defined(CONFIG_PPC_PSERIES)
>  static ssize_t eeh_notify_resume_show(struct device *dev,
> struct device_attribute *attr, char *buf)
>  {
> @@ -148,7 +148,7 @@ static void eeh_notify_resume_remove(struct pci_dev *pdev)
>  #else
>  static inline int eeh_notify_resume_add(struct pci_dev *pdev) { return 0; }
>  static inline void eeh_notify_resume_remove(struct pci_dev *pdev) { }
> -#endif /* CONFIG_PCI_IOV */
> +#endif /* CONFIG_PCI_IOV && CONFIG PPC_PSERIES*/
>  
>  void eeh_sysfs_add_device(struct pci_dev *pdev)
>  {
> -- 
> 2.9.5
> 


signature.asc
Description: PGP signature


Re: [PATCH 4/5] powerpc/eeh_sysfs: Remove double pci_dn lookup.

2019-07-15 Thread Sam Bobroff
On Mon, Jul 15, 2019 at 06:56:11PM +1000, Oliver O'Halloran wrote:
> In eeh_notify_resume_show() the pci_dn for the device is looked up once in
> the declaration block and then once after checking for a NULL eeh_dev.
> Remove the second lookup since it's pointless.
> 
> Signed-off-by: Oliver O'Halloran 

Reviewed-by: Sam Bobroff 
Tested-by: Sam Bobroff 

> ---
>  arch/powerpc/kernel/eeh_sysfs.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/eeh_sysfs.c b/arch/powerpc/kernel/eeh_sysfs.c
> index 3adf8cd..c4cc8fc 100644
> --- a/arch/powerpc/kernel/eeh_sysfs.c
> +++ b/arch/powerpc/kernel/eeh_sysfs.c
> @@ -102,7 +102,6 @@ static ssize_t eeh_notify_resume_show(struct device *dev,
>   if (!edev || !edev->pe)
>   return -ENODEV;
>  
> - pdn = pci_get_pdn(pdev);
>   return sprintf(buf, "%d\n", pdn->last_allow_rc);
>  }
>  
> -- 
> 2.9.5
> 


signature.asc
Description: PGP signature


Re: [PATCH 5/5] powerpc/eeh_sysfs: Make clearing EEH_DEV_SYSFS saner

2019-07-15 Thread Sam Bobroff
On Mon, Jul 15, 2019 at 06:56:12PM +1000, Oliver O'Halloran wrote:
> The eeh_sysfs_remove_device() function is supposed to clear the
> EEH_DEV_SYSFS flag since it indicates the EEH sysfs entries have been added
> for a pci_dev.
> 
> When the sysfs files are removed eeh_remove_device() the eeh_dev and the
> pci_dev have already been de-associated. This then causes the
> pci_dev_to_eeh_dev() call in eeh_sysfs_remove_device() to return NULL so
> the flag can't be cleared from the still-live eeh_dev. This problem is
> worked around in the caller by clearing the flag manually. However, this
> behaviour doesn't make a whole lot of sense, so this patch fixes it by:
> 
> a) Re-ordering eeh_remove_device() so that eeh_sysfs_remove_device() is
>called before de-associating the pci_dev and eeh_dev.
> 
> b) Making eeh_sysfs_remove_device() emit a warning if there's no
>corresponding eeh_dev for a pci_dev. The paths where the sysfs
>files are only reachable if EEH was setup for the device
>for the device in the first place so hitting this warning
>indicates a programming error.
> 
> Signed-off-by: Oliver O'Halloran 

Good cleanup, although it looks like "for the device" got duplicated in
the last part of the commit message.

Simple EEH tests still succeeed.

Reviewed-by: Sam Bobroff 
Tested-by: Sam Bobroff 

> + if (!(edev->pe->state & EEH_PE_KEEP))
> + eeh_rmv_from_parent_pe(edev);
> + else
> + edev->mode |= EEH_DEV_DISCONNECTED;

> ---
>  arch/powerpc/kernel/eeh.c   | 30 +-
>  arch/powerpc/kernel/eeh_sysfs.c | 15 ---
>  2 files changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index f192d57..6e24896 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1203,7 +1203,6 @@ void eeh_add_device_late(struct pci_dev *dev)
>   eeh_rmv_from_parent_pe(edev);
>   eeh_addr_cache_rmv_dev(edev->pdev);
>   eeh_sysfs_remove_device(edev->pdev);
> - edev->mode &= ~EEH_DEV_SYSFS;
>  
>   /*
>* We definitely should have the PCI device removed
> @@ -1306,17 +1305,11 @@ void eeh_remove_device(struct pci_dev *dev)
>   edev->pdev = NULL;
>  
>   /*
> -  * The flag "in_error" is used to trace EEH devices for VFs
> -  * in error state or not. It's set in eeh_report_error(). If
> -  * it's not set, eeh_report_{reset,resume}() won't be called
> -  * for the VF EEH device.
> +  * eeh_sysfs_remove_device() uses pci_dev_to_eeh_dev() so we need to
> +  * remove the sysfs files before clearing dev.archdata.edev
>*/
> - edev->in_error = false;
> - dev->dev.archdata.edev = NULL;
> - if (!(edev->pe->state & EEH_PE_KEEP))
> - eeh_rmv_from_parent_pe(edev);
> - else
> - edev->mode |= EEH_DEV_DISCONNECTED;
> + if (edev->mode & EEH_DEV_SYSFS)
> + eeh_sysfs_remove_device(dev);
>  
>   /*
>* We're removing from the PCI subsystem, that means
> @@ -1327,8 +1320,19 @@ void eeh_remove_device(struct pci_dev *dev)
>   edev->mode |= EEH_DEV_NO_HANDLER;
>  
>   eeh_addr_cache_rmv_dev(dev);
> - eeh_sysfs_remove_device(dev);
> - edev->mode &= ~EEH_DEV_SYSFS;
> +
> + /*
> +  * The flag "in_error" is used to trace EEH devices for VFs
> +  * in error state or not. It's set in eeh_report_error(). If
> +  * it's not set, eeh_report_{reset,resume}() won't be called
> +  * for the VF EEH device.
> +  */
> + edev->in_error = false;
> + dev->dev.archdata.edev = NULL;
> + if (!(edev->pe->state & EEH_PE_KEEP))
> + eeh_rmv_from_parent_pe(edev);
> + else
> + edev->mode |= EEH_DEV_DISCONNECTED;
>  }
>  
>  int eeh_unfreeze_pe(struct eeh_pe *pe)
> diff --git a/arch/powerpc/kernel/eeh_sysfs.c b/arch/powerpc/kernel/eeh_sysfs.c
> index c4cc8fc..5614fd83 100644
> --- a/arch/powerpc/kernel/eeh_sysfs.c
> +++ b/arch/powerpc/kernel/eeh_sysfs.c
> @@ -175,22 +175,23 @@ void eeh_sysfs_remove_device(struct pci_dev *pdev)
>  {
>   struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
>  
> + if (!edev) {
> + WARN_ON(eeh_enabled());
> + return;
> + }
> +
> + edev->mode &= ~EEH_DEV_SYSFS;
> +
>   /*
>* The parent directory might have been removed. We needn't
>* continue for that case.
>*/
> - if (!pdev->dev.kobj.sd) {
> - if (edev)
> - edev->mode &= ~EEH_DEV_SYSFS;
> + if (!pdev->dev.kobj.sd)
>   return;
> - }
>  
>   device_remove_file(&pdev->dev, &dev_attr_eeh_mode);
>   device_remove_file(&pdev->dev, &dev_attr_eeh_pe_config_addr);
>   device_remove_file(&pdev->dev, &dev_attr_eeh_pe_state);
>  
>   eeh_notify_resume_remove(pdev);
> -
> - if (edev)
> - edev->mode &= ~EEH_DEV_SYSFS;
>  }
> -- 
> 2.9.5
> 

[PATCH V3] cpufreq: Make cpufreq_generic_init() return void

2019-07-15 Thread Viresh Kumar
It always returns 0 (success) and its return type should really be void.
Over that, many drivers have added error handling code based on its
return value, which is not required at all.

change its return type to void and update all the callers.

Signed-off-by: Viresh Kumar 
---
V2->V3:
- Update bmips cpufreq driver to avoid "warning: 'ret' may be used
  uninitialized".
- Build bot reported this issue almost after 4 days of posting this
  patch, I was expecting this a lot earlier :)

 drivers/cpufreq/bmips-cpufreq.c | 17 ++---
 drivers/cpufreq/cpufreq.c   |  4 +---
 drivers/cpufreq/davinci-cpufreq.c   |  3 ++-
 drivers/cpufreq/imx6q-cpufreq.c |  6 ++
 drivers/cpufreq/kirkwood-cpufreq.c  |  3 ++-
 drivers/cpufreq/loongson1-cpufreq.c |  8 +++-
 drivers/cpufreq/loongson2_cpufreq.c |  3 ++-
 drivers/cpufreq/maple-cpufreq.c |  3 ++-
 drivers/cpufreq/omap-cpufreq.c  | 15 +--
 drivers/cpufreq/pasemi-cpufreq.c|  3 ++-
 drivers/cpufreq/pmac32-cpufreq.c|  3 ++-
 drivers/cpufreq/pmac64-cpufreq.c|  3 ++-
 drivers/cpufreq/s3c2416-cpufreq.c   |  9 ++---
 drivers/cpufreq/s3c64xx-cpufreq.c   | 15 +++
 drivers/cpufreq/s5pv210-cpufreq.c   |  3 ++-
 drivers/cpufreq/sa1100-cpufreq.c|  3 ++-
 drivers/cpufreq/sa1110-cpufreq.c|  3 ++-
 drivers/cpufreq/spear-cpufreq.c |  3 ++-
 drivers/cpufreq/tegra20-cpufreq.c   |  8 +---
 include/linux/cpufreq.h |  2 +-
 20 files changed, 46 insertions(+), 71 deletions(-)

diff --git a/drivers/cpufreq/bmips-cpufreq.c b/drivers/cpufreq/bmips-cpufreq.c
index 56a4ebbf00e0..f7c23fa468f0 100644
--- a/drivers/cpufreq/bmips-cpufreq.c
+++ b/drivers/cpufreq/bmips-cpufreq.c
@@ -131,23 +131,18 @@ static int bmips_cpufreq_exit(struct cpufreq_policy 
*policy)
 static int bmips_cpufreq_init(struct cpufreq_policy *policy)
 {
struct cpufreq_frequency_table *freq_table;
-   int ret;
 
freq_table = bmips_cpufreq_get_freq_table(policy);
if (IS_ERR(freq_table)) {
-   ret = PTR_ERR(freq_table);
-   pr_err("%s: couldn't determine frequency table (%d).\n",
-   BMIPS_CPUFREQ_NAME, ret);
-   return ret;
+   pr_err("%s: couldn't determine frequency table (%ld).\n",
+   BMIPS_CPUFREQ_NAME, PTR_ERR(freq_table));
+   return PTR_ERR(freq_table);
}
 
-   ret = cpufreq_generic_init(policy, freq_table, TRANSITION_LATENCY);
-   if (ret)
-   bmips_cpufreq_exit(policy);
-   else
-   pr_info("%s: registered\n", BMIPS_CPUFREQ_NAME);
+   cpufreq_generic_init(policy, freq_table, TRANSITION_LATENCY);
+   pr_info("%s: registered\n", BMIPS_CPUFREQ_NAME);
 
-   return ret;
+   return 0;
 }
 
 static struct cpufreq_driver bmips_cpufreq_driver = {
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 4d6043ee7834..8dda62367816 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -159,7 +159,7 @@ EXPORT_SYMBOL_GPL(arch_set_freq_scale);
  * - set policies transition latency
  * - policy->cpus with all possible CPUs
  */
-int cpufreq_generic_init(struct cpufreq_policy *policy,
+void cpufreq_generic_init(struct cpufreq_policy *policy,
struct cpufreq_frequency_table *table,
unsigned int transition_latency)
 {
@@ -171,8 +171,6 @@ int cpufreq_generic_init(struct cpufreq_policy *policy,
 * share the clock and voltage and clock.
 */
cpumask_setall(policy->cpus);
-
-   return 0;
 }
 EXPORT_SYMBOL_GPL(cpufreq_generic_init);
 
diff --git a/drivers/cpufreq/davinci-cpufreq.c 
b/drivers/cpufreq/davinci-cpufreq.c
index 3de48ae60c29..297d23cad8b5 100644
--- a/drivers/cpufreq/davinci-cpufreq.c
+++ b/drivers/cpufreq/davinci-cpufreq.c
@@ -90,7 +90,8 @@ static int davinci_cpu_init(struct cpufreq_policy *policy)
 * Setting the latency to 2000 us to accommodate addition of drivers
 * to pre/post change notification list.
 */
-   return cpufreq_generic_init(policy, freq_table, 2000 * 1000);
+   cpufreq_generic_init(policy, freq_table, 2000 * 1000);
+   return 0;
 }
 
 static struct cpufreq_driver davinci_driver = {
diff --git a/drivers/cpufreq/imx6q-cpufreq.c b/drivers/cpufreq/imx6q-cpufreq.c
index 47ccfa6b17b7..648a09a1778a 100644
--- a/drivers/cpufreq/imx6q-cpufreq.c
+++ b/drivers/cpufreq/imx6q-cpufreq.c
@@ -190,14 +190,12 @@ static int imx6q_set_target(struct cpufreq_policy 
*policy, unsigned int index)
 
 static int imx6q_cpufreq_init(struct cpufreq_policy *policy)
 {
-   int ret;
-
policy->clk = clks[ARM].clk;
-   ret = cpufreq_generic_init(policy, freq_table, transition_latency);
+   cpufreq_generic_init(policy, freq_table, transition_latency);
policy->suspend_freq = max_freq;
dev_pm_opp_of_register_em(policy->cpus);
 
-   return ret;
+   return 0;
 }
 
 static struct cpufreq_driver imx6

Re: [PATCH] PPC: Set reserved PCR bits

2019-07-15 Thread Joel Stanley
On Tue, 16 Jul 2019 at 02:55, Jordan Niethe  wrote:
>
> Currently the reserved bits of the Processor Compatibility Register
> (PCR) are cleared as per the Programming Note in Section 1.3.3 of the
> ISA.  An update is planned for the ISA so that PCR reserved bits should
> be set. Set the reserved bits of the PCR as required.
>
> Acked-by: Alistair Popple 
> Signed-off-by: Jordan Niethe 

Tested-by: Joel Stanley 

I gave a powernv_defconfig build a spin in a qemu powernv machine.

Cheers,

Joel


[PATCH v2] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings

2019-07-15 Thread Bharata B Rao
We could potentially hit the following BUG_ON when memory hotplugged
before reboot is unplugged after reboot:

kernel BUG at arch/powerpc/mm/pgtable-frag.c:113!

 remove_pagetable+0x594/0x6a0
 (unreliable)
 remove_pagetable+0x94/0x6a0
 vmemmap_free+0x394/0x410
 sparse_remove_one_section+0x26c/0x2e8
 __remove_pages+0x428/0x540
 arch_remove_memory+0xd0/0x170
 __remove_memory+0xd4/0x1a0
 dlpar_remove_lmb+0xbc/0x110
 dlpar_memory+0xa80/0xd20
 handle_dlpar_errorlog+0xa8/0x160
 pseries_hp_work_fn+0x2c/0x60
 process_one_work+0x46c/0x860
 worker_thread+0x364/0x5e0
 kthread+0x1b0/0x1c0
 ret_from_kernel_thread+0x5c/0x68

This occurs because, during reboot-after-hotplug, the hotplugged
memory range gets initialized as regular memory and page
tables are setup using memblock allocator. This means that we
wouldn't have initialized the PMD or PTE fragment count for
those PMD or PTE pages.

This BUG_ON was hitting quite regularly until we changed the vmemmap
mapping size from 64K to 2M via the commit
'89a3496e0664 ("powerpc/mm/radix: Use the right page size for
vmemmap mapping")'

After this commit, the bug still exists but it is hard to hit now
because vmemmap array is now mapped at PMD level (level 3) and we would
have to unplug a huge amount of memory to free one PMD table.

A point to note here is that we encounter this bug only while removing
the mappings for vmemmap and not when unmapping the memory that we
are unplugging. This is because for the memory that is to be unplugged
we split the mapping into smaller 2M mappings and hence PMD and PTE
fragments will be initialized properly as it is no longer early mapping.

Fixing this includes 3 parts:

- Re-walk the init_mm page tables from mem_init() and initialize
  the PMD and PTE fragment counts appropriately. So PMD and PTE
  table pages allocated during early allocation will have a
  fragment count of 1.
- Convert the pages from memblock pages to regular pages so that
  they can be freed back to buddy allocator seamlessly. However
  we do this for only PMD and PTE pages and not for PUD pages.
  PUD pages are freed using kmem_cache_free() and we need to
  identify memblock pages and free them differently.
- When we do early memblock based allocation of PMD and PUD pages,
  allocate in PAGE_SIZE granularity so that we are sure the
  complete page is used as pagetable page. PAGE_SIZE allocations will
  have an implication on the amount of memory used for page tables,
  an example of which is shown below:

Since we now do PAGE_SIZE allocations for both PUD table and
PMD table (Note that PTE table allocation is already of PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is an example of how much more we end up allocating for
page tables in case of 64T system RAM:

1. Mapping system RAM

With each PGD entry spanning 512G, 64TB RAM would need 128 entries
and hence 128 PUD tables. We use 1G mapping at PUD level (level 2)

With default PUD_TABLE_SIZE(4K), 128*4K=512K (8 64K pages)
With PAGE_SIZE(64K) allocations, 128*64K=8192K (128 64K pages)

2. Mapping struct pages (memmap)

64T RAM would need 64G for memmap with struct page size being 64B.
Since memmap array is mapped using 64K mappings, we would need
64 PUD entries or 64 PMD tables (level 3) in total.

With default PMD_TABLE_SIZE(4K), 64*4K=256K (4 64K pages)
With PAGE_SIZE(64K) allocations, 64*64K=4096K (64 64K pages)

There is no change in PTE table (level 4) allocation requirement as
early page table allocation is already using PAGE_SIZE PTE tables.

So essentially with this change we would use 180 64K pages
more for 64T system.

Reported-by: Srikanth Aithal 
Signed-off-by: Bharata B Rao 
---
v1: https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg152511.html
Changes in v2:
- Replaced [pgd,pud,pmd]_huge() with [pgd,pud,pmd]_is_leaf()
- Updated commit description

 arch/powerpc/include/asm/book3s/64/pgalloc.h |  7 +-
 arch/powerpc/include/asm/book3s/64/radix.h   |  1 +
 arch/powerpc/include/asm/sparsemem.h |  1 +
 arch/powerpc/mm/book3s64/pgtable.c   | 15 +++-
 arch/powerpc/mm/book3s64/radix_pgtable.c | 79 +++-
 arch/powerpc/mm/mem.c|  5 ++
 6 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index d45e4449619f..282f6620c514 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -114,7 +114,12 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, 
unsigned long addr)
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-   kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
+   struct page *page = virt_to_page(pud);
+
+   if (PageReserved(page))
+   free_reserved_page(page);
+   else
+   kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, p

Re: [EXTERNAL] Re: [PATCH v2 3/6] powerpc/eeh: Improve debug messages around device addition

2019-07-15 Thread Sam Bobroff
On Thu, Jun 20, 2019 at 01:45:24PM +1000, Oliver O'Halloran wrote:
> On Thu, Jun 20, 2019 at 12:40 PM Alexey Kardashevskiy  wrote:
> >
> > On 19/06/2019 14:27, Sam Bobroff wrote:
> > > On Tue, Jun 11, 2019 at 03:47:58PM +1000, Alexey Kardashevskiy wrote:
> > >>
> > >> On 07/05/2019 14:30, Sam Bobroff wrote:
> > >>> Also remove useless comment.
> > >>>
> > >>> Signed-off-by: Sam Bobroff 
> > >>> Reviewed-by: Alexey Kardashevskiy 
> > >>> ---
> > *snip*
> > >
> > > I can see that edev will be non-NULL here, but that pr_debug() pattern
> > > (using the PDN information to form the PCI address) is quite common
> > > across the EEH code, so I think rather than changing a couple of
> > > specific cases, I should do a separate cleanup patch and introduce
> > > something like pdn_debug(pdn, ""). What do you think?
> >
> > I'd switch them all to already existing dev_dbg/pci_debug rather than
> > adding pdn_debug as imho it should not have been used in the first place
> > really...
> >
> > > (I don't know exactly when edev->pdev can be NULL.)
> >
> > ... and if you switch to dev_dbg/pci_debug, I think quite soon you'll
> > know if it can or cannot be NULL :)
> 
> As far as I can tell edev->pdev is NULL in two cases:
> 
> 1. Before eeh_device_add_late() has been called on the pdev. The late
> part of the add maps the pdev to an edev and sets the pdev's edev
> pointer and vis a vis.
> 2. While recoverying EEH unaware devices. Unaware devices are
> destroyed and rescanned and the edev->pdev pointer is cleared by
> pcibios_device_release()
> 
> In most of these cases it should be safe to use the pci_*() functions
> rather than making a new one up for printing pdns. In the cases where
> we might not have a PCI dev i'd make a new set of prints that take an
> EEH dev rather than a pci_dn since i'd like pci_dn to die sooner
> rather than later.
> 
> Oliver

I'll change the calls in {pnv,pseries}_pcibios_bus_add_device() and
eeh_add_device_late() to use dev_dbg() and post a new version.

For {pnv,pseries}_eeh_probe() I'm not sure what we can do; there's no
pci_dev available yet and while it would be nice to use the eeh_dev
rather than the pdn, it doesn't seem to have the bus/device/fn
information we need. Am I missing something there?  (The code in the
probe functions seems to get it from the pci_dn.)

If there isn't an easy way around this, would it therefore be reasonable
to just leave them open-coded as they are?

Cheers,
Sam.


signature.asc
Description: PGP signature