Re: [PATCH] powerpc: Ensure .mem(init|exit).text are within _stext/_etext

2016-09-14 Thread Anton Blanchard
Hi,

> In our linker script we open code the list of text sections, because
> we need to include the __ftr_alt sections, which are arch-specific.
> 
> This means we can't use TEXT_TEXT as defined in vmlinux.lds.h, and so
> we don't have the MEM_KEEP() logic for memory hotplug sections.
> 
> If we build the kernel with the gold linker, and with
> CONFIG_MEMORY_HOTPLUG=y, we see that functions marked __meminit can
> end up outside of the _stext/_etext range, and also outside of
> _sinittext/_einittext, eg:
> 
> c000000000000000 T _stext
> c0000000009e0000 A _etext
> c0000000009e3f18 T hash__vmemmap_create_mapping
> c000000000ca0000 T _sinittext
> c000000000d00844 T _einittext
> 
> This causes them to not be recognised as text by is_kernel_text(), and
> prevents them being patched by jump_label (and presumably
> ftrace/kprobes etc.).
> 
> Fix it by adding MEM_KEEP() directives, mirroring what TEXT_TEXT does.
> 
> This isn't a problem when CONFIG_MEMORY_HOTPLUG=n, because we use the
> standard INIT_TEXT_SECTION() and EXIT_TEXT macros from vmlinux.lds.h.

Thanks Michael, looks good:

Tested-by: Anton Blanchard 

Anton
--
 
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/kernel/vmlinux.lds.S | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
> index b5fba689fca6..b59d75e194a5 100644
> --- a/arch/powerpc/kernel/vmlinux.lds.S
> +++ b/arch/powerpc/kernel/vmlinux.lds.S
> @@ -56,6 +56,8 @@ SECTIONS
>   KPROBES_TEXT
>   IRQENTRY_TEXT
>   SOFTIRQENTRY_TEXT
> + MEM_KEEP(init.text)
> + MEM_KEEP(exit.text)
>  
>  #ifdef CONFIG_PPC32
>   *(.got1)



[PATCH] powerpc: Ensure .mem(init|exit).text are within _stext/_etext

2016-09-14 Thread Michael Ellerman
In our linker script we open code the list of text sections, because we
need to include the __ftr_alt sections, which are arch-specific.

This means we can't use TEXT_TEXT as defined in vmlinux.lds.h, and so we
don't have the MEM_KEEP() logic for memory hotplug sections.

If we build the kernel with the gold linker, and with CONFIG_MEMORY_HOTPLUG=y,
we see that functions marked __meminit can end up outside of the
_stext/_etext range, and also outside of _sinittext/_einittext, eg:

c000000000000000 T _stext
c0000000009e0000 A _etext
c0000000009e3f18 T hash__vmemmap_create_mapping
c000000000ca0000 T _sinittext
c000000000d00844 T _einittext

This causes them to not be recognised as text by is_kernel_text(), and
prevents them being patched by jump_label (and presumably ftrace/kprobes
etc.).

Fix it by adding MEM_KEEP() directives, mirroring what TEXT_TEXT does.

This isn't a problem when CONFIG_MEMORY_HOTPLUG=n, because we use the
standard INIT_TEXT_SECTION() and EXIT_TEXT macros from vmlinux.lds.h.
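
For reference, the MEM_KEEP()/MEM_DISCARD() pair in
include/asm-generic/vmlinux.lds.h looks roughly like this (a paraphrased
sketch, not the literal header -- check the tree for the exact text):

    /* Sketch: with hotplug enabled, .meminit/.memexit text must be
     * kept with the regular text; otherwise it can be discarded. */
    #ifdef CONFIG_MEMORY_HOTPLUG
    #define MEM_KEEP(sec)    *(.mem##sec)
    #define MEM_DISCARD(sec)
    #else
    #define MEM_KEEP(sec)
    #define MEM_DISCARD(sec) *(.mem##sec)
    #endif

TEXT_TEXT pulls in the .text input sections plus MEM_KEEP(init.text) and
MEM_KEEP(exit.text), which is exactly what the open-coded powerpc list
was missing.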

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/vmlinux.lds.S | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index b5fba689fca6..b59d75e194a5 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -56,6 +56,8 @@ SECTIONS
KPROBES_TEXT
IRQENTRY_TEXT
SOFTIRQENTRY_TEXT
+   MEM_KEEP(init.text)
+   MEM_KEEP(exit.text)
 
 #ifdef CONFIG_PPC32
*(.got1)
-- 
2.7.4



Re: [PATCH] powerpc/64: whitelist unresolved modversions CRCs

2016-09-14 Thread Michael Ellerman
Nicholas Piggin  writes:

> These are a symptom of CRC generation failure in generic
> build code, and not powerpc specific.
>
> Signed-off-by: Nicholas Piggin 

Acked-by: Michael Ellerman 

cheers


Re: [PATCH 0/2] Enable MSR_TM lazily

2016-09-14 Thread Nicholas Piggin
On Wed, 14 Sep 2016 11:10:22 -0300
Carlos Eduardo Seo  wrote:

> On 9/14/16 8:28 AM, Nicholas Piggin wrote:
> >
> > How common it is for glibc to be built with elision?
> >  
> 
> Not that common. We have it built with TLE support in Ubuntu (starting 
> in 15.04), SLES 12 (since SP2) and AT 9.0-0.
> 
> However, it is only enabled by default in Ubuntu. For SLES and AT 9.0, 
> the user has to set an env var to enable it (it's a hack).
> 
> There is some work upstream to add a tunables framework to glibc. That 
> will allow us to properly provide a way to users enable/disable TLE as 
> they wish. That patch is almost in, and as soon as it's committed, we'll 
> start working on the tunable for TLE.

Okay, but for the TLE-enabled case we still want to skip the tabort
before syscall on recent kernels, so we could add that now with a
relatively small patch, couldn't we?

Thanks,
Nick


Re: [PATCH] powerpc: Don't change the section in _GLOBAL()

2016-09-14 Thread Nicholas Piggin
On Thu, 15 Sep 2016 10:40:20 +1000
Michael Ellerman  wrote:

> Currently the _GLOBAL() macro unilaterally sets the assembler section to
> ".text" at the start of the macro. This is rude as the caller may be
> using a different section.
> 
> So let the caller decide which section to emit the code into. On big
> endian we do need to switch to the ".opd" section to emit the OPD, but
> do that with pushsection/popsection, thereby leaving the original
> section intact.
> 
> The only place I could find where this requires changes to the code is
> in misc_32.S, where we need to switch back to ".text" after
> flush_icache_range() which is in ".kprobes.text".
> 
> I verified that the order of all entries in System.map is unchanged
> after this patch. The actual addresses shift around slightly so you
> can't just diff the System.map.
> 
> Signed-off-by: Michael Ellerman 

Excellent, thanks for going through it.

Reviewed-by: Nicholas Piggin 

> ---
> 
> If anyone can think of a better method to verify we are still emitting
> everything in the same sections let me know.
> 
> 
>  arch/powerpc/include/asm/ppc_asm.h | 8 ++--
>  arch/powerpc/kernel/misc_32.S  | 3 +++
>  2 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
> index d5d5b5e348f2..479287045166 100644
> --- a/arch/powerpc/include/asm/ppc_asm.h
> +++ b/arch/powerpc/include/asm/ppc_asm.h
> @@ -201,14 +201,12 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
>  #ifdef PPC64_ELF_ABI_v2
>  
>  #define _GLOBAL(name) \
> - .section ".text"; \
>   .align 2 ; \
>   .type name,@function; \
>   .globl name; \
>  name:
>  
>  #define _GLOBAL_TOC(name) \
> - .section ".text"; \
>   .align 2 ; \
>   .type name,@function; \
>   .globl name; \
> @@ -232,16 +230,15 @@ name:
>  #define GLUE(a,b) XGLUE(a,b)
>  
>  #define _GLOBAL(name) \
> - .section ".text"; \
>   .align 2 ; \
>   .globl name; \
>   .globl GLUE(.,name); \
> - .section ".opd","aw"; \
> + .pushsection ".opd","aw"; \
>  name: \
>   .quad GLUE(.,name); \
>   .quad .TOC.@tocbase; \
>   .quad 0; \
> - .previous; \
> + .popsection; \

I think you can still use .section and .previous here, but it's
much of a muchness.


[PATCH 4/4] drivers/pci/hotplug: Support surprise hotplug

2016-09-14 Thread Gavin Shan
This supports PCI surprise hotplug. The design is highlighted as
below:

   * The PCI slot's surprise hotplug capability is exposed through the
     device node property "ibm,slot-surprise-pluggable", meaning PCI
     surprise hotplug will be disabled if skiboot doesn't support it yet
     (see the sketch below).
   * An interrupt is raised on presence or link state change, signalling
     a surprise hotplug event. One event is allocated and queued to the
     PCI slot for the workqueue to pick up and process in a serialized
     fashion. The code flow for surprise hotplug is the same as that for
     managed hotplug, except that the affected PEs are put into frozen
     state to avoid unexpected EEH error reporting in the surprise hot
     remove path.
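
A minimal sketch of how a driver could test that property (hypothetical
helper, assuming the usual of_get_property() API; the actual check in
pnv_php.c may differ):

    static bool pnv_php_slot_surprise_pluggable(struct device_node *dn)
    {
            const __be32 *prop;

            /* skiboot sets the property to 1 when the slot supports
             * surprise hotplug. */
            prop = of_get_property(dn, "ibm,slot-surprise-pluggable", NULL);
            return prop && be32_to_cpup(prop) == 1;
    }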

Signed-off-by: Gavin Shan 
---
 arch/powerpc/include/asm/pnv-pci.h |   9 ++
 drivers/pci/hotplug/pnv_php.c  | 219 +
 2 files changed, 228 insertions(+)

diff --git a/arch/powerpc/include/asm/pnv-pci.h b/arch/powerpc/include/asm/pnv-pci.h
index 0cbd813..4ccd2b4 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -48,6 +48,12 @@ void pnv_cxl_phb_set_peer_afu(struct pci_dev *dev, struct cxl_afu *afu);
 
 #endif
 
+struct pnv_php_event {
+   bool                added;
+   struct pnv_php_slot *php_slot;
+   struct work_struct  work;
+};
+
 struct pnv_php_slot {
struct hotplug_slot slot;
struct hotplug_slot_info    slot_info;
@@ -60,6 +66,9 @@ struct pnv_php_slot {
#define PNV_PHP_STATE_POPULATED    2
 #define PNV_PHP_STATE_OFFLINE  3
int state;
+   int irq;
+   struct workqueue_struct *wq;
+   struct pnv_php_event    *event;
struct device_node  *dn;
struct pci_dev  *pdev;
struct pci_bus  *bus;
diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
index 21f1f9d..0358aa7 100644
--- a/drivers/pci/hotplug/pnv_php.c
+++ b/drivers/pci/hotplug/pnv_php.c
@@ -30,13 +30,42 @@ static void pnv_php_register(struct device_node *dn);
 static void pnv_php_unregister_one(struct device_node *dn);
 static void pnv_php_unregister(struct device_node *dn);
 
+static void pnv_php_disable_irq(struct pnv_php_slot *php_slot)
+{
+   struct pci_dev *pdev = php_slot->pdev;
+   u16 ctrl;
+
+   if (php_slot->irq > 0) {
+   pcie_capability_read_word(pdev, PCI_EXP_SLTCTL, &ctrl);
+   ctrl &= ~(PCI_EXP_SLTCTL_HPIE |
+ PCI_EXP_SLTCTL_PDCE |
+ PCI_EXP_SLTCTL_DLLSCE);
+   pcie_capability_write_word(pdev, PCI_EXP_SLTCTL, ctrl);
+
+   free_irq(php_slot->irq, php_slot);
+   php_slot->irq = 0;
+   }
+
+   if (php_slot->wq) {
+   destroy_workqueue(php_slot->wq);
+   php_slot->wq = NULL;
+   }
+
+   if (pdev->msix_enabled)
+   pci_disable_msix(pdev);
+   else if (pdev->msi_enabled)
+   pci_disable_msi(pdev);
+}
+
 static void pnv_php_free_slot(struct kref *kref)
 {
struct pnv_php_slot *php_slot = container_of(kref,
struct pnv_php_slot, kref);
 
WARN_ON(!list_empty(&php_slot->children));
+   pnv_php_disable_irq(php_slot);
kfree(php_slot->name);
+   kfree(php_slot->event);
kfree(php_slot);
 }
 
@@ -536,9 +565,16 @@ static struct pnv_php_slot *pnv_php_alloc_slot(struct device_node *dn)
if (unlikely(!php_slot))
return NULL;
 
+   php_slot->event = kzalloc(sizeof(struct pnv_php_event), GFP_KERNEL);
+   if (unlikely(!php_slot->event)) {
+   kfree(php_slot);
+   return NULL;
+   }
+
php_slot->name = kstrdup(label, GFP_KERNEL);
if (unlikely(!php_slot->name)) {
+   kfree(php_slot->event);
kfree(php_slot);
return NULL;
}
 
@@ -616,6 +652,184 @@ static int pnv_php_register_slot(struct pnv_php_slot *php_slot)
return 0;
 }
 
+static int pnv_php_enable_msix(struct pnv_php_slot *php_slot)
+{
+   struct pci_dev *pdev = php_slot->pdev;
+   struct msix_entry entry;
+   int nr_entries, ret;
+   u16 pcie_flag;
+
+   /* Get total number of MSIx entries */
+   nr_entries = pci_msix_vec_count(pdev);
+   if (nr_entries < 0)
+   return nr_entries;
+
+   /* Check hotplug MSIx entry is in range */
+   pcie_capability_read_word(pdev, PCI_EXP_FLAGS, &pcie_flag);
+   entry.entry = (pcie_flag & PCI_EXP_FLAGS_IRQ) >> 9;
+   if (entry.entry >= nr_entries)
+   return -ERANGE;
+
+   /* Enable MSIx */
+   ret = pci_enable_msix_exact(pdev, &entry, 1);
+   if (ret) {
+   dev_warn(&pdev->dev, "Error %d enabling MSIx\n", ret);
+   return ret;
+   }
+
+ 

[PATCH 3/4] powerpc/powernv: Unfreeze PE on allocation

2016-09-14 Thread Gavin Shan
This unfreezes the PE when it's initialized, because the PE might have
been put into frozen state by the last hot remove path. It's not
harmful to do so if the PE is already unfrozen.

Signed-off-by: Gavin Shan 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index c38a6a1..5122257 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -126,9 +126,21 @@ static inline bool pnv_pci_is_m64(struct pnv_phb *phb, struct resource *r)
 
 static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb, int pe_no)
 {
+   s64 rc;
+
phb->ioda.pe_array[pe_no].phb = phb;
phb->ioda.pe_array[pe_no].pe_number = pe_no;
 
+   /* Clear the PE frozen state as it might be put into frozen state
+* in the last PCI remove path. It's not harmful to do so when the
+* PE is already in unfrozen state.
+*/
+   rc = opal_pci_eeh_freeze_clear(phb->opal_id, pe_no,
+  OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);
+   if (rc != OPAL_SUCCESS)
+   pr_warn("%s: Error %lld unfreezing PHB#%d-PE#%d\n",
+   __func__, rc, phb->hose->global_number, pe_no);
+
return &phb->ioda.pe_array[pe_no];
 }
 
-- 
2.1.0



[PATCH 0/4] powerpc/powernv: PCI Surprise Hotplug Support

2016-09-14 Thread Gavin Shan
This series of patches supports PCI surprise hotplug on PowerNV platform.

   * This newly added functionality depends on skiboot's changes. However,
     the functionality is simply disabled when skiboot doesn't support it.
     For one specific slot, the property "ibm,slot-surprise-pluggable" of
     the slot's device node is set to 1 when surprise hotplug is claimed
     by skiboot.
   * The interrupts caused by presence and link state changes are enabled
     in order to support PCI surprise hotplug. The surprise hotplug events
     are queued to the PCI slot and picked up for further processing in a
     serialized fashion. Surprise and managed hotplug share the same code
     flow, except that the affected PEs are put into frozen state to avoid
     unexpected EEH error reporting in the surprise hot remove path.

PATCH[1/4] and PATCH[2/4] allow freezing PEs to avoid unexpected EEH error
reporting in the PCI surprise hot remove path. PATCH[3/4] clears a PE's
frozen state when initializing it, because the PE might have been put into
frozen state by the last PCI surprise hot remove. PATCH[4/4] adds PCI
surprise hotplug support to the PowerNV PCI hotplug driver.

Gavin Shan (4):
  powerpc/eeh: Allow to freeze PE in eeh_pe_set_option()
  powerpc/eeh: Export eeh_pe_state_mark()
  powerpc/powernv: Unfreeze PE on allocation
  drivers/pci/hotplug: Support surprise hotplug

 arch/powerpc/include/asm/pnv-pci.h|   9 ++
 arch/powerpc/kernel/eeh.c |   1 +
 arch/powerpc/kernel/eeh_pe.c  |   1 +
 arch/powerpc/platforms/powernv/pci-ioda.c |  12 ++
 drivers/pci/hotplug/pnv_php.c | 219 ++
 5 files changed, 242 insertions(+)

-- 
2.1.0



[PATCH 2/4] powerpc/eeh: Export eeh_pe_state_mark()

2016-09-14 Thread Gavin Shan
This exports eeh_pe_state_mark(). It will be used to mark a surprise
hot removed PE as isolated, to avoid unexpected EEH error reporting in
the surprise remove path.
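
The intended module-side use is a one-liner along these lines
(illustrative, based on the description above):

    /* Mark the removed PE isolated so EEH ignores errors from it. */
    eeh_pe_state_mark(pe, EEH_PE_ISOLATED);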

Signed-off-by: Gavin Shan 
---
 arch/powerpc/kernel/eeh_pe.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index f0520da..de7d091 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -581,6 +581,7 @@ void eeh_pe_state_mark(struct eeh_pe *pe, int state)
 {
eeh_pe_traverse(pe, __eeh_pe_state_mark, &state);
 }
+EXPORT_SYMBOL_GPL(eeh_pe_state_mark);
 
 static void *__eeh_pe_dev_mode_mark(void *data, void *flag)
 {
-- 
2.1.0



[PATCH 1/4] powerpc/eeh: Allow to freeze PE in eeh_pe_set_option()

2016-09-14 Thread Gavin Shan
Function eeh_pe_set_option() is used to apply the requested options
(enable, disable, unfreeze) in the EEH virtualization path. Its
semantics aren't complete until freezing is supported.

This patch allows freezing the indicated PE. The new semantics will be
used in the PCI surprise hot remove path, to freeze removed PCI devices
(PEs) and avoid unexpected EEH error reporting.
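
A minimal caller sketch under the new semantics (illustrative only,
error handling elided):

    /* Freeze the PE backing a surprise-removed device so EEH doesn't
     * report errors from it. */
    ret = eeh_pe_set_option(pe, EEH_OPT_FREEZE_PE);
    if (ret)
            pr_warn("Failed to freeze PE#%x: %d\n", pe->addr, ret);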

Signed-off-by: Gavin Shan 
---
 arch/powerpc/kernel/eeh.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 7429556..0699f15 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1502,6 +1502,7 @@ int eeh_pe_set_option(struct eeh_pe *pe, int option)
break;
case EEH_OPT_THAW_MMIO:
case EEH_OPT_THAW_DMA:
+   case EEH_OPT_FREEZE_PE:
if (!eeh_ops || !eeh_ops->set_option) {
ret = -ENOENT;
break;
-- 
2.1.0



Re: [PATCH v5 1/5] kexec_file: Include the purgatory segment in the kexec image checksum.

2016-09-14 Thread Thiago Jung Bauermann
Hello Stephen,

On Thursday, 15 September 2016 at 11:43:08, Stephen Rothwell wrote:
> Hi Thiago,
> 
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 2a1f0ce7c59a..dcd1679f3005 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1792,6 +1792,11 @@ config SECCOMP
> > 
> >  source kernel/Kconfig.hz
> > 
> > +# x86 needs to relocate the purgatory after the checksum is calculated,
> > +# therefore the purgatory cannot be part of the kexec image checksum.
> > +config ARCH_MODIFIES_KEXEC_PURGATORY
> > +   bool
> > +
> 
> The above should probably be in arch/Kconfig (with an appropriately
> changed comment) since it is used in generic code.

Thanks for your quick response! I'll make that change tomorrow and send an 
updated version of just this patch.

-- 
[]'s
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [PATCH v5 1/5] kexec_file: Include the purgatory segment in the kexec image checksum.

2016-09-14 Thread Stephen Rothwell
Hi Thiago,

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2a1f0ce7c59a..dcd1679f3005 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1792,6 +1792,11 @@ config SECCOMP
>  
>  source kernel/Kconfig.hz
>  
> +# x86 needs to relocate the purgatory after the checksum is calculated,
> +# therefore the purgatory cannot be part of the kexec image checksum.
> +config ARCH_MODIFIES_KEXEC_PURGATORY
> + bool
> +

The above should probably be in arch/Kconfig (with an appropriately
changed comment) since it is used in generic code.
-- 
Cheers,
Stephen Rothwell


[PATCH v5 5/5] IMA: Demonstration code for kexec buffer passing.

2016-09-14 Thread Thiago Jung Bauermann
This shows how kernel code can use the kexec buffer passing mechanism
to pass information to the next kernel.

This patch is not intended to be committed.

[a...@linux-foundation.org: coding-style fixes]
Signed-off-by: Thiago Jung Bauermann 
Signed-off-by: Andrew Morton 

Signed-off-by: Thiago Jung Bauermann 
---
 include/linux/ima.h   | 11 +
 kernel/kexec_file.c   |  4 ++
 security/integrity/ima/ima.h  |  5 +++
 security/integrity/ima/ima_init.c | 26 +++
 security/integrity/ima/ima_template.c | 85 +++
 5 files changed, 131 insertions(+)

diff --git a/include/linux/ima.h b/include/linux/ima.h
index 0eb7c2e7f0d6..96528d007139 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -11,6 +11,7 @@
 #define _LINUX_IMA_H
 
 #include 
+#include 
 struct linux_binprm;
 
 #ifdef CONFIG_IMA
@@ -23,6 +24,10 @@ extern int ima_post_read_file(struct file *file, void *buf, loff_t size,
  enum kernel_read_file_id id);
 extern void ima_post_path_mknod(struct dentry *dentry);
 
+#ifdef CONFIG_KEXEC_FILE
+extern void ima_add_kexec_buffer(struct kimage *image);
+#endif
+
 #else
 static inline int ima_bprm_check(struct linux_binprm *bprm)
 {
@@ -60,6 +65,12 @@ static inline void ima_post_path_mknod(struct dentry *dentry)
return;
 }
 
+#ifdef CONFIG_KEXEC_FILE
+static inline void ima_add_kexec_buffer(struct kimage *image)
+{
+}
+#endif
+
 #endif /* CONFIG_IMA */
 
 #ifdef CONFIG_IMA_APPRAISE
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index fbcec07bb3f5..0146619479a6 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -252,6 +253,9 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
}
}
 
+   /* IMA needs to pass the measurement list to the next kernel. */
+   ima_add_kexec_buffer(image);
+
/* Call arch image load handlers */
ldata = arch_kexec_kernel_image_load(image);
 
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index db25f54a04fe..0334001055d7 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -102,6 +102,11 @@ struct ima_queue_entry {
 };
 extern struct list_head ima_measurements;  /* list of all measurements */
 
+#ifdef CONFIG_KEXEC_FILE
+extern void *kexec_buffer;
+extern size_t kexec_buffer_size;
+#endif
+
 /* Internal IMA function definitions */
 int ima_init(void);
 int ima_fs_init(void);
diff --git a/security/integrity/ima/ima_init.c b/security/integrity/ima/ima_init.c
index 32912bd54ead..a1924d0f3b2b 100644
--- a/security/integrity/ima/ima_init.c
+++ b/security/integrity/ima/ima_init.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ima.h"
 
@@ -104,6 +105,29 @@ void __init ima_load_x509(void)
 }
 #endif
 
+#ifdef CONFIG_KEXEC_FILE
+static void ima_load_kexec_buffer(void)
+{
+   int rc;
+
+   /* Fetch the buffer from the previous kernel, if any. */
+   rc = kexec_get_handover_buffer(&kexec_buffer, &kexec_buffer_size);
+   if (rc == 0) {
+   /* Demonstrate that buffer handover works. */
+   pr_err("kexec buffer contents: %s\n", (char *) kexec_buffer);
+   pr_err("kexec buffer contents after update: %s\n",
+  (char *) kexec_buffer + 4 * PAGE_SIZE + 10);
+
+   kexec_free_handover_buffer();
+   } else if (rc == -ENOENT)
+   pr_debug("No kexec buffer from the previous kernel.\n");
+   else
+   pr_debug("Error restoring kexec buffer: %d\n", rc);
+}
+#else
+static void ima_load_kexec_buffer(void) { }
+#endif
+
 int __init ima_init(void)
 {
u8 pcr_i[TPM_DIGEST_SIZE];
@@ -134,5 +158,7 @@ int __init ima_init(void)
 
ima_init_policy();
 
+   ima_load_kexec_buffer();
+
return ima_fs_init();
 }
diff --git a/security/integrity/ima/ima_template.c b/security/integrity/ima/ima_template.c
index febd12ed9b55..92ea3afd9a1f 100644
--- a/security/integrity/ima/ima_template.c
+++ b/security/integrity/ima/ima_template.c
@@ -15,6 +15,8 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include 
+#include 
 #include "ima.h"
 #include "ima_template_lib.h"
 
@@ -182,6 +184,89 @@ static int template_desc_init_fields(const char *template_fmt,
return 0;
 }
 
+#ifdef CONFIG_KEXEC_FILE
+void *kexec_buffer;
+size_t kexec_buffer_size;
+
+/* Physical address of the measurement buffer in the next kernel. */
+unsigned long kexec_buffer_load_addr;
+
+/*
+ * Called during reboot. IMA can add here new events that were generated after
+ * the kexec image was loaded.
+ */
+static int ima_update_kexec_buffer(struct notifier_block *self,
+  unsigned long action, void *data)
+{
+   int 

[PATCH v5 4/5] kexec_file: Add mechanism to update kexec segments.

2016-09-14 Thread Thiago Jung Bauermann
kexec_update_segment allows a given segment in kexec_image to have
its contents updated. This is useful if the current kernel wants to
send information to the next kernel that is up-to-date at the time of
reboot.

Before modifying the segment the image checksum is verified, and after
the segment is updated the checksum is recalculated and updated in the
kexec image.
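
A hypothetical caller sketch using the signature added below (the
buffer and address variables are illustrative; the IMA patch later in
this series is the real user):

    static int update_buffer_on_reboot(struct notifier_block *nb,
                                       unsigned long action, void *data)
    {
            int ret;

            /* Push up-to-date contents into the already-loaded segment.
             * The image checksum is verified before the update and
             * recalculated afterwards. */
            ret = kexec_update_segment(fresh_buf, fresh_len,
                                       segment_load_addr, segment_memsz);
            if (ret)
                    pr_err("kexec_update_segment failed: %d\n", ret);

            return NOTIFY_DONE;
    }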

Suggested-by: Mimi Zohar 
Signed-off-by: Thiago Jung Bauermann 
---
 include/linux/kexec.h   |   2 +
 kernel/kexec_core.c |   5 -
 kernel/kexec_file.c | 331 
 kernel/kexec_internal.h |   6 +
 4 files changed, 339 insertions(+), 5 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 768245aa76bf..81aca6acc3b0 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -183,6 +183,8 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
   int (*func)(u64, u64, void *));
 extern int kexec_add_buffer(struct kexec_buf *kbuf);
 int kexec_locate_mem_hole(struct kexec_buf *kbuf);
+int kexec_update_segment(const char *buffer, size_t bufsz,
+unsigned long load_addr, size_t memsz);
 #endif /* CONFIG_KEXEC_FILE */
 
 struct kimage {
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 561675589511..a86596984454 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -551,11 +551,6 @@ void kimage_terminate(struct kimage *image)
*image->entry = IND_DONE;
 }
 
-#define for_each_kimage_entry(image, ptr, entry) \
-   for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
-   ptr = (entry & IND_INDIRECTION) ? \
-   boot_phys_to_virt((entry & PAGE_MASK)) : ptr + 1)
-
 static void kimage_free_entry(kimage_entry_t entry)
 {
struct page *page;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 35b04296484b..fbcec07bb3f5 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -591,6 +592,336 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
return 0;
 }
 
+/**
+ * @kexec_image_visit_segments() - call function on each segment page
+ * @image: kexec image to inspect.
+ * @func:  Function to call on each page.
+ * @data:  Data pointer to pass to @func.
+ *
+ * Iterate through the @image entries, calling @func with the given @data
+ * on each segment page. dest is the start address of the page in the next
+ * kernel's address space, and addr is the address of the page in this kernel's
+ * address space.
+ *
+ * Stop iterating if @func returns non-zero, and return that value.
+ *
+ * Return: zero if all pages were visited, @func return value if non-zero.
+ */
+static int kexec_image_visit_segments(struct kimage *image,
+ int (*func)(void *data,
+ unsigned long dest,
+ void *addr),
+ void *data)
+{
+   int ret;
+   unsigned long entry, dest = 0;
+   unsigned long *ptr = NULL;
+
+   for_each_kimage_entry(image, ptr, entry) {
+   void *addr = (void *) (entry & PAGE_MASK);
+
+   switch (entry & IND_FLAGS) {
+   case IND_DESTINATION:
+   dest = (unsigned long) addr;
+   break;
+   case IND_SOURCE:
+   /* Shouldn't happen, but verify just to be safe. */
+   if (WARN_ON(!dest)) {
+   pr_err("Invalid kexec entries list.");
+   return -EINVAL;
+   }
+
+   ret = func(data, dest, addr);
+   if (ret)
+   break;
+
+   dest += PAGE_SIZE;
+   }
+
+   /* Shouldn't happen, but verify just to be safe. */
+   if (WARN_ON(ptr == NULL)) {
+   pr_err("Invalid kexec entries list.");
+   return -EINVAL;
+   }
+   }
+
+   return ret;
+}
+
+struct image_digest_data {
+   unsigned long digest_load_addr;
+   struct shash_desc *desc;
+};
+
+static int calculate_image_digest(void *data, unsigned long dest, void *addr)
+{
+   struct image_digest_data *d = (struct image_digest_data *) data;
+   void *page_addr;
+   unsigned long offset;
+   int ret;
+
+   /* Assumption: the digest segment is PAGE_SIZE long. */
+   if (dest == d->digest_load_addr)
+   return 0;
+
+   page_addr = kmap_atomic(kmap_to_page(addr));
+
+   offset = dest & ~PAGE_MASK;
+   ret = crypto_shash_update(d->desc, page_addr + offset,
+ PAGE_SIZE - offset);
+
+   kunmap_atomic(page_addr);
+
+  

[PATCH v5 3/5] powerpc: kexec_file: Add buffer hand-over support for the next kernel

2016-09-14 Thread Thiago Jung Bauermann
The buffer hand-over mechanism allows the currently running kernel to pass
data to the kernel that will be kexec'd, via a kexec segment. The second
kernel can check whether the previous kernel sent data and retrieve it.

This is the architecture-specific part.

Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/kexec.h   |  12 +-
 arch/powerpc/kernel/kexec_elf_64.c |   2 +-
 arch/powerpc/kernel/machine_kexec_64.c | 274 +++--
 3 files changed, 240 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 73f88b5f9bd1..b8e32194ce63 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -92,12 +92,20 @@ static inline bool kdump_in_progress(void)
 }
 
 #ifdef CONFIG_KEXEC_FILE
+#define ARCH_HAS_KIMAGE_ARCH
+
+struct kimage_arch {
+   phys_addr_t handover_buffer_addr;
+   unsigned long handover_buffer_size;
+};
+
 int setup_purgatory(struct kimage *image, const void *slave_code,
const void *fdt, unsigned long kernel_load_addr,
unsigned long fdt_load_addr, unsigned long stack_top,
int debug);
-int setup_new_fdt(void *fdt, unsigned long initrd_load_addr,
- unsigned long initrd_len, const char *cmdline);
+int setup_new_fdt(const struct kimage *image, void *fdt,
+ unsigned long initrd_load_addr, unsigned long initrd_len,
+ const char *cmdline);
 bool find_debug_console(const void *fdt);
 #endif /* CONFIG_KEXEC_FILE */
 
diff --git a/arch/powerpc/kernel/kexec_elf_64.c b/arch/powerpc/kernel/kexec_elf_64.c
index 3cc8ebce1a86..0c576e300384 100644
--- a/arch/powerpc/kernel/kexec_elf_64.c
+++ b/arch/powerpc/kernel/kexec_elf_64.c
@@ -208,7 +208,7 @@ void *elf64_load(struct kimage *image, char *kernel_buf,
goto out;
}
 
-   ret = setup_new_fdt(fdt, initrd_load_addr, initrd_len, cmdline);
+   ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len, cmdline);
if (ret)
goto out;
 
diff --git a/arch/powerpc/kernel/machine_kexec_64.c b/arch/powerpc/kernel/machine_kexec_64.c
index 3879b6d91c0b..d6077898200a 100644
--- a/arch/powerpc/kernel/machine_kexec_64.c
+++ b/arch/powerpc/kernel/machine_kexec_64.c
@@ -489,6 +489,77 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
return image->fops->cleanup(image->image_loader_data);
 }
 
+bool kexec_can_hand_over_buffer(void)
+{
+   return true;
+}
+
+int arch_kexec_add_handover_buffer(struct kimage *image,
+  unsigned long load_addr, unsigned long size)
+{
+   image->arch.handover_buffer_addr = load_addr;
+   image->arch.handover_buffer_size = size;
+
+   return 0;
+}
+
+int kexec_get_handover_buffer(void **addr, unsigned long *size)
+{
+   int ret;
+   u64 start_addr, end_addr;
+
+   ret = of_property_read_u64(of_chosen,
+  "linux,kexec-handover-buffer-start",
+  &start_addr);
+   if (ret == -EINVAL)
+   return -ENOENT;
+   else if (ret)
+   return -EINVAL;
+
+   ret = of_property_read_u64(of_chosen, "linux,kexec-handover-buffer-end",
+  &end_addr);
+   if (ret == -EINVAL)
+   return -ENOENT;
+   else if (ret)
+   return -EINVAL;
+
+   *addr =  __va(start_addr);
+   /* -end is the first address after the buffer. */
+   *size = end_addr - start_addr;
+
+   return 0;
+}
+
+int kexec_free_handover_buffer(void)
+{
+   int ret;
+   void *addr;
+   unsigned long size;
+   struct property *prop;
+
+   ret = kexec_get_handover_buffer(&addr, &size);
+   if (ret)
+   return ret;
+
+   ret = memblock_free(__pa(addr), size);
+   if (ret)
+   return ret;
+
+   prop = of_find_property(of_chosen, "linux,kexec-handover-buffer-start",
+   NULL);
+   ret = of_remove_property(of_chosen, prop);
+   if (ret)
+   return ret;
+
+   prop = of_find_property(of_chosen, "linux,kexec-handover-buffer-end",
+   NULL);
+   ret = of_remove_property(of_chosen, prop);
+   if (ret)
+   return ret;
+
+   return 0;
+}
+
 /**
  * arch_kexec_walk_mem() - call func(data) for each unreserved memory block
  * @kbuf:  Context info for the search. Also passed to @func.
@@ -686,26 +757,16 @@ int setup_purgatory(struct kimage *image, const void *slave_code,
return 0;
 }
 
-/*
- * setup_new_fdt() - modify /chosen and memory reservation for the next kernel
- * @fdt:
- * @initrd_load_addr:  Address where the next initrd will be loaded.
- * @initrd_len:   Size of the next initrd, or 0 if there will be none.
- * @cmdline:   Command line for the next kernel, 

[PATCH v5 2/5] kexec_file: Add buffer hand-over support for the next kernel

2016-09-14 Thread Thiago Jung Bauermann
The buffer hand-over mechanism allows the currently running kernel to pass
data to the kernel that will be kexec'd, via a kexec segment. The second
kernel can check whether the previous kernel sent data and retrieve it.

This is the architecture-independent part of the feature.
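
A minimal producer-side sketch, assuming the struct kexec_buf helpers
from the kexec_file_load series this applies on top of (illustrative,
not from the patch):

    struct kexec_buf kbuf = { .image = image, .buffer = data,
                              .bufsz = len, .memsz = len,
                              .buf_align = PAGE_SIZE, .top_down = true };
    int ret = kexec_add_handover_buffer(&kbuf);

    if (!ret)
            pr_debug("handover buffer at 0x%lx\n", kbuf.mem);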

Signed-off-by: Thiago Jung Bauermann 
---
 include/linux/kexec.h | 31 +++
 kernel/kexec_file.c   | 68 +++
 2 files changed, 99 insertions(+)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 2a96292ee544..768245aa76bf 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -389,6 +389,37 @@ static inline void *boot_phys_to_virt(unsigned long entry)
return phys_to_virt(boot_phys_to_phys(entry));
 }
 
+#ifdef CONFIG_KEXEC_FILE
+bool __weak kexec_can_hand_over_buffer(void);
+int __weak arch_kexec_add_handover_buffer(struct kimage *image,
+ unsigned long load_addr,
+ unsigned long size);
+int kexec_add_handover_buffer(struct kexec_buf *kbuf);
+int __weak kexec_get_handover_buffer(void **addr, unsigned long *size);
+int __weak kexec_free_handover_buffer(void);
+#else
+struct kexec_buf;
+
+static inline bool kexec_can_hand_over_buffer(void)
+{
+   return false;
+}
+
+static inline int kexec_add_handover_buffer(struct kexec_buf *kbuf)
+{
+   return -ENOTSUPP;
+}
+
+static inline int kexec_get_handover_buffer(void **addr, unsigned long *size)
+{
+   return -ENOTSUPP;
+}
+
+static inline int kexec_free_handover_buffer(void)
+{
+   return -ENOTSUPP;
+}
+#endif /* CONFIG_KEXEC_FILE */
 #else /* !CONFIG_KEXEC_CORE */
 struct pt_regs;
 struct task_struct;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 6f7fa8901171..35b04296484b 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -116,6 +116,74 @@ void kimage_file_post_load_cleanup(struct kimage *image)
image->image_loader_data = NULL;
 }
 
+/**
+ * kexec_can_hand_over_buffer() - can we pass data to the kexec'd kernel?
+ */
+bool __weak kexec_can_hand_over_buffer(void)
+{
+   return false;
+}
+
+/**
+ * arch_kexec_add_handover_buffer() - do arch-specific steps to handover buffer
+ *
+ * Architectures should use this function to pass on the handover buffer
+ * information to the next kernel.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int __weak arch_kexec_add_handover_buffer(struct kimage *image,
+ unsigned long load_addr,
+ unsigned long size)
+{
+   return -ENOTSUPP;
+}
+
+/**
+ * kexec_add_handover_buffer() - add buffer to be used by the next kernel
+ * @kbuf:  Buffer contents and memory parameters.
+ *
+ * This function assumes that kexec_mutex is held.
+ * On successful return, @kbuf->mem will have the physical address of
+ * the buffer in the next kernel.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int kexec_add_handover_buffer(struct kexec_buf *kbuf)
+{
+   int ret;
+
+   if (!kexec_can_hand_over_buffer())
+   return -ENOTSUPP;
+
+   ret = kexec_add_buffer(kbuf);
+   if (ret)
+   return ret;
+
+   return arch_kexec_add_handover_buffer(kbuf->image, kbuf->mem,
+ kbuf->memsz);
+}
+
+/**
+ * kexec_get_handover_buffer() - get handover buffer from the previous kernel
+ * @addr:  On successful return, set to point to the buffer contents.
+ * @size:  On successful return, set to the buffer size.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int __weak kexec_get_handover_buffer(void **addr, unsigned long *size)
+{
+   return -ENOTSUPP;
+}
+
+/**
+ * kexec_free_handover_buffer() - free memory used by the handover buffer
+ */
+int __weak kexec_free_handover_buffer(void)
+{
+   return -ENOTSUPP;
+}
+
 /*
  * In file mode list of segments is prepared by kernel. Copy relevant
  * data from user space, do error checking, prepare segment list
-- 
1.9.1



[PATCH v5 1/5] kexec_file: Include the purgatory segment in the kexec image checksum.

2016-09-14 Thread Thiago Jung Bauermann
Currently, the purgatory segment is skipped from the kexec image checksum
because it is modified to include the calculated digest.

By putting the digest in a separate kexec segment, we can include the
purgatory segment in the kexec image verification since it won't need
to be modified anymore.

With this change, the only part of the kexec image that is not covered
by the checksum is the digest itself.

Even with the digest stored separately, x86 needs to leave the purgatory
segment out of the checksum calculation because it modifies the purgatory
code in relocate_kernel. We use CONFIG_ARCH_MODIFIES_KEXEC_PURGATORY to
allow the powerpc purgatory to be protected by the checksum while still
preserving x86 behavior.
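
Conceptually, the per-segment rule described above reduces to the
following skip logic in the hashing loop (a sketch, not the literal
hunk):

    /* Never checksum the digest segment itself. */
    if (ksegment->kbuf == pi->digest_buf)
            continue;

    /* x86 relocates the purgatory after checksumming, so skip it
     * there; powerpc can include it. */
    if (IS_ENABLED(CONFIG_ARCH_MODIFIES_KEXEC_PURGATORY) &&
        ksegment->kbuf == pi->purgatory_buf)
            continue;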

Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/purgatory/purgatory.c |   4 +-
 arch/x86/Kconfig   |   6 +++
 arch/x86/purgatory/purgatory.c |   2 +-
 include/linux/kexec.h  |   6 +++
 kernel/kexec_file.c| 100 +
 5 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/purgatory/purgatory.c b/arch/powerpc/purgatory/purgatory.c
index 5b006d685cf2..f19ac3d5a7d5 100644
--- a/arch/powerpc/purgatory/purgatory.c
+++ b/arch/powerpc/purgatory/purgatory.c
@@ -17,7 +17,7 @@
 #include "kexec-sha256.h"
 
 struct kexec_sha_region sha_regions[SHA256_REGIONS] = {};
-u8 sha256_digest[SHA256_DIGEST_SIZE] = { 0 };
+u8 *sha256_digest = NULL;
 
 int verify_sha256_digest(void)
 {
@@ -40,7 +40,7 @@ int verify_sha256_digest(void)
printf("\n");
 
printf("sha256_digest: ");
-   for (i = 0; i < sizeof(sha256_digest); i++)
+   for (i = 0; i < SHA256_DIGEST_SIZE; i++)
printf("%hhx ", sha256_digest[i]);
 
printf("\n");
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2a1f0ce7c59a..dcd1679f3005 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1792,6 +1792,11 @@ config SECCOMP
 
 source kernel/Kconfig.hz
 
+# x86 needs to relocate the purgatory after the checksum is calculated,
+# therefore the purgatory cannot be part of the kexec image checksum.
+config ARCH_MODIFIES_KEXEC_PURGATORY
+   bool
+
 config KEXEC
bool "kexec system call"
select KEXEC_CORE
@@ -1812,6 +1817,7 @@ config KEXEC
 config KEXEC_FILE
bool "kexec file based system call"
select KEXEC_CORE
+   select ARCH_MODIFIES_KEXEC_PURGATORY
select BUILD_BIN2C
depends on X86_64
depends on CRYPTO=y
diff --git a/arch/x86/purgatory/purgatory.c b/arch/x86/purgatory/purgatory.c
index 25e068ba3382..391c6a66cb03 100644
--- a/arch/x86/purgatory/purgatory.c
+++ b/arch/x86/purgatory/purgatory.c
@@ -22,7 +22,7 @@ unsigned long backup_dest = 0;
 unsigned long backup_src = 0;
 unsigned long backup_sz = 0;
 
-u8 sha256_digest[SHA256_DIGEST_SIZE] = { 0 };
+u8 *sha256_digest = NULL;
 
 struct sha_region sha_regions[16] = {};
 
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d419d0e51fe5..2a96292ee544 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -124,8 +124,14 @@ struct purgatory_info {
 */
void *purgatory_buf;
 
+   /* Digest of the contents of segments. */
+   void *digest_buf;
+
/* Address where purgatory is finally loaded and is executed from */
unsigned long purgatory_load_addr;
+
+   /* Address where the digest is loaded. */
+   unsigned long digest_load_addr;
 };
 
 typedef int (kexec_probe_t)(const char *kernel_buf, unsigned long kernel_size);
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 0c2df7f73792..6f7fa8901171 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -98,6 +98,9 @@ void kimage_file_post_load_cleanup(struct kimage *image)
vfree(pi->purgatory_buf);
pi->purgatory_buf = NULL;
 
+   kfree(pi->digest_buf);
+   pi->digest_buf = NULL;
+
vfree(pi->sechdrs);
pi->sechdrs = NULL;
 
@@ -527,7 +530,6 @@ static int kexec_calculate_store_digests(struct kimage *image)
struct shash_desc *desc;
int ret = 0, i, j, zero_buf_sz, sha_region_sz;
size_t desc_size, nullsz;
-   char *digest;
void *zero_buf;
struct kexec_sha_region *sha_regions;
struct purgatory_info *pi = &image->purgatory_info;
@@ -553,6 +555,37 @@ static int kexec_calculate_store_digests(struct kimage *image)
if (!sha_regions)
goto out_free_desc;
 
+   /*
+* Set sha_regions early so that we can write it to the purgatory
+* and include it in the checksum.
+*/
+   for (j = i = 0; i < image->nr_segments; i++) {
+   struct kexec_segment *ksegment = &image->segment[i];
+
+   if (ksegment->kbuf == pi->digest_buf)
+   continue;
+
+   if (IS_ENABLED(CONFIG_ARCH_MODIFIES_KEXEC_PURGATORY) &&
+   

[PATCH v5 0/5] kexec_file: Add buffer hand-over for the next kernel

2016-09-14 Thread Thiago Jung Bauermann
Hello,

This version of the patch series fixes two issues:

1. The previous version modified struct kexec_segment, but that broke
   the ABI for at least 32-bit ARM.

2. The previous version didn't include the hand-over buffer in the kexec
   image checksum verification.

Now the kexec image checksum covers the hand-over buffer, and
kexec_update_segment verifies the checksum before updating the segment
and calculates the new checksum afterwards.

In fact, for powerpc the image checksum now covers everything except the
digest itself so even the purgatory is verified. This is accomplished in
patch 1, and is an improvement that can be considered separately from the
other patches in the series. Unfortunately, x86 modifies the purgatory
segment during kernel_kexec so it still has to skip the purgatory from
the checksum.

Original cover letter:

This patch series implements a mechanism which allows the kernel to pass
on a buffer to the kernel that will be kexec'd. This buffer is passed
as a segment which is added to the kimage when it is being prepared
by kexec_file_load.

How the second kernel is informed of this buffer is architecture-specific.
On powerpc, this is done via the device tree, by checking
the properties /chosen/linux,kexec-handover-buffer-start and
/chosen/linux,kexec-handover-buffer-end, which is analogous to how the
kernel finds the initrd.

This is needed because the Integrity Measurement Architecture subsystem
needs to preserve its measurement list across the kexec reboot. The
following patch series for the IMA subsystem uses this feature for that
purpose:

https://lists.infradead.org/pipermail/kexec/2016-August/016745.html

This is so that IMA can implement trusted boot support on the OpenPower
platform, because on such systems an intermediary Linux instance running
as part of the firmware is used to boot the target operating system via
kexec. Using this mechanism, IMA on this intermediary instance can
hand over to the target OS the measurements of the components that were
used to boot it.

Because there could be additional measurement events between the
kexec_file_load call and the actual reboot, IMA needs a way to update the
buffer with those additional events before rebooting. One can minimize
the interval between the kexec_file_load and the reboot syscalls, but as
small as it can be, there is always the possibility that the measurement
list will be out of date at the time of reboot.

To address this issue, this patch series also introduces
kexec_update_segment, which allows a reboot notifier to change the
contents of the image segment during the reboot process.

The last patch is not intended to be merged, it just demonstrates how
this feature can be used.

This series applies on top of v8 of the "kexec_file_load implementation
for PowerPC" patch series (which applies on top of v4.8-rc5 and -rc6):

https://lists.infradead.org/pipermail/kexec/2016-September/017123.html

Changes for v5:
- Rebased series on kexec_file_load patch series v8.
- Patch "kexec_file: Include the purgatory segment in the kexec image checksum."
  - New patch.
- Patch "kexec_file: Allow skipping checksum calculation for some segments."
  - Dropped patch.
- Patch "kexec_file: Add mechanism to update kexec segments."
  - Mostly rewritten.
  - Verify the kexec image checksum before updating the segment, and calculate
the new checksum afterwards.

Changes for v4:
- Rebased series on kexec_file_load patch series v7.
- Patch "powerpc: kexec_file: Add buffer hand-over support for the next kernel"
  - Convert hand-over buffer address to physical address when calling
memblock_free in kexec_free_handover_buffer.
  - Delete hand-over buffer properties from the live device tree in
kexec_free_handover_buffer.
  - Remove the memory reservation and the properties for the hand-over
buffer received from the previous kernel in setup_handover_buffer.
- Patch "IMA: Demonstration code for kexec buffer passing."
  - Fix checkpatch warnings. (Andrew Morton)

Changes for v3:
- Rebased series on kexec_file_load patch series v6.
  Both patch series apply cleanly on todays' Linus master branch, except
  for a few lines of fuzz in arch/powerpc/Makefile and arch/powerpc/Kconfig.
- Patch "kexec_file: Add buffer hand-over support for the next kernel"
  - Fix compilation warning in  by adding a struct kexec_buf
forward declaration when CONFIG_KEXEC_FILE=n. (Fenguang Wu)
- Patch "kexec_file: Allow skipping checksum calculation for some segments."
  - Substitute checksum argument in kexec_add_buffer with skip_checksum
member in struct kexec_buf, as suggested by Dave Young.
- Patch "kexec_file: Add mechanism to update kexec segments."
  - Use kmap_atomic in kexec_update_segment, as suggested by Andrew Morton.
  - Fix build warning on m68k by passing unsigned long value to __va instead
of void *. (Fenguang Wu)
  - Change bufsz and memsz arguments of kexec_update_segment to size_t to fix
compilation warning. (Fenguang 

[PATCH] powerpc: Don't change the section in _GLOBAL()

2016-09-14 Thread Michael Ellerman
Currently the _GLOBAL() macro unilaterally sets the assembler section to
".text" at the start of the macro. This is rude as the caller may be
using a different section.

So let the caller decide which section to emit the code into. On big
endian we do need to switch to the ".opd" section to emit the OPD, but
do that with pushsection/popsection, thereby leaving the original
section intact.

The only place I could find where this requires changes to the code is
in misc_32.S, where we need to switch back to ".text" after
flush_icache_range() which is in ".kprobes.text".

I verified that the order of all entries in System.map is unchanged
after this patch. The actual addresses shift around slightly so you
can't just diff the System.map.

Signed-off-by: Michael Ellerman 
---

If anyone can think of a better method to verify we are still emitting
everything in the same sections let me know.


 arch/powerpc/include/asm/ppc_asm.h | 8 ++--
 arch/powerpc/kernel/misc_32.S  | 3 +++
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
index d5d5b5e348f2..479287045166 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -201,14 +201,12 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
 #ifdef PPC64_ELF_ABI_v2
 
 #define _GLOBAL(name) \
-   .section ".text"; \
.align 2 ; \
.type name,@function; \
.globl name; \
 name:
 
 #define _GLOBAL_TOC(name) \
-   .section ".text"; \
.align 2 ; \
.type name,@function; \
.globl name; \
@@ -232,16 +230,15 @@ name:
 #define GLUE(a,b) XGLUE(a,b)
 
 #define _GLOBAL(name) \
-   .section ".text"; \
.align 2 ; \
.globl name; \
.globl GLUE(.,name); \
-   .section ".opd","aw"; \
+   .pushsection ".opd","aw"; \
 name: \
.quad GLUE(.,name); \
.quad .TOC.@tocbase; \
.quad 0; \
-   .previous; \
+   .popsection; \
.type GLUE(.,name),@function; \
 GLUE(.,name):
 
@@ -272,7 +269,6 @@ GLUE(.,name):
 n:
 
 #define _GLOBAL(n) \
-   .text;  \
.stabs __stringify(n:F-1),N_FUN,0,0,n;\
.globl n;   \
 n:
diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index d9c912b6e632..64fb1138961f 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -358,6 +358,9 @@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_ICACHE)
sync/* additional sync needed on g4 */
isync
blr
+
+.previous
+
 /*
  * Flush a particular page from the data cache to RAM.
  * Note: this is necessary because the instruction cache does *not*
-- 
2.7.4



Re: [PATCH] powerpc/64: whitelist unresolved modversions CRCs

2016-09-14 Thread Stephen Rothwell
Hi Nick,

On Wed, 14 Sep 2016 12:45:07 +1000 Nicholas Piggin  wrote:
>
> These are a symptom of CRC generation failure in generic
> build code, and not powerpc specific.
> 
> Signed-off-by: Nicholas Piggin 

This fixes my build problems.

Tested-by: Stephen Rothwell 

-- 
Cheers,
Stephen Rothwell


Re: [PATCH] powerpc/64: whitelist unresolved modversions CRCs

2016-09-14 Thread Stephen Rothwell
Hi Nick,

On Wed, 14 Sep 2016 12:45:07 +1000 Nicholas Piggin  wrote:
>
> These are a symptom of CRC generation failure in generic
> build code, and not powerpc specific.
> 
> Signed-off-by: Nicholas Piggin 

OK, so I will use this as a merge fix patch for the kbuild tree today
(instead of the revert I have been doing).

-- 
Cheers,
Stephen Rothwell


[PATCH v2 3/3] mm: enable CONFIG_MOVABLE_NODE on powerpc

2016-09-14 Thread Reza Arbab
Onlining memory into ZONE_MOVABLE requires CONFIG_MOVABLE_NODE. Enable
the use of this config option on PPC64 platforms.

Signed-off-by: Reza Arbab 
---
 Documentation/kernel-parameters.txt | 2 +-
 mm/Kconfig  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index a4f4d69..3d8460d 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2344,7 +2344,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.
 
-   movable_node    [KNL,X86] Boot-time switch to enable the effects
+   movable_node    [KNL,X86,PPC] Boot-time switch to enable the effects
of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
 
MTD_Partition=  [MTD]
diff --git a/mm/Kconfig b/mm/Kconfig
index be0ee11..4b19cd3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -153,7 +153,7 @@ config MOVABLE_NODE
bool "Enable to assign a node which has only movable memory"
depends on HAVE_MEMBLOCK
depends on NO_BOOTMEM
-   depends on X86_64
+   depends on X86_64 || PPC64
depends on NUMA
default n
help
-- 
1.8.3.1



[PATCH v2 1/3] drivers/of: recognize status property of dt memory nodes

2016-09-14 Thread Reza Arbab
Respect the standard dt "status" property when scanning memory nodes in
early_init_dt_scan_memory(), so that if the property is present and not
"okay", no memory will be added.

The use case at hand is accelerator or device memory, which may be
unusable until post-boot initialization of the memory link. Such a node
can be described in the dt as any other, given its status is "disabled".
Per the device tree specification,

"disabled"
Indicates that the device is not presently operational, but it
might become operational in the future (for example, something
is not plugged in, or switched off).

Once such memory is made operational, it can then be hotplugged.

Signed-off-by: Reza Arbab 
---
 drivers/of/fdt.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index 085c638..fc19590 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -1022,8 +1022,10 @@ int __init early_init_dt_scan_memory(unsigned long node, const char *uname,
 int depth, void *data)
 {
const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+   const char *status;
const __be32 *reg, *endp;
int l;
+   bool add_memory;
 
/* We are scanning "memory" nodes only */
if (type == NULL) {
@@ -1044,6 +1046,9 @@ int __init early_init_dt_scan_memory(unsigned long node, const char *uname,
 
endp = reg + (l / sizeof(__be32));
 
+   status = of_get_flat_dt_prop(node, "status", NULL);
+   add_memory = !status || !strcmp(status, "okay");
+
pr_debug("memory scan node %s, reg size %d,\n", uname, l);
 
while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
@@ -1057,6 +1062,9 @@ int __init early_init_dt_scan_memory(unsigned long node, const char *uname,
pr_debug(" - %llx ,  %llx\n", (unsigned long long)base,
(unsigned long long)size);
 
+   if (!add_memory)
+   continue;
+
early_init_dt_add_memory_arch(base, size);
}
 
-- 
1.8.3.1



[PATCH v2 2/3] powerpc/mm: allow memory hotplug into a memoryless node

2016-09-14 Thread Reza Arbab
Remove the check which prevents us from hotplugging into an empty node.

This limitation has been questioned before [1], and judging by the
response, there doesn't seem to be a reason we can't remove it. No issues
have been found in light testing.

[1] 
http://lkml.kernel.org/r/cagzkibrmksa1yyhbf5hwgxubcjse5smksmy4tpanerme2ug...@mail.gmail.com
http://lkml.kernel.org/r/20160511215051.gf22...@arbab-laptop.austin.ibm.com

Signed-off-by: Reza Arbab 
Acked-by: Balbir Singh 
Cc: Nathan Fontenot 
Cc: Bharata B Rao 
---
 arch/powerpc/mm/numa.c | 13 +
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 75b9cd6..d7ac419 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1121,7 +1121,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
 int hot_add_scn_to_nid(unsigned long scn_addr)
 {
struct device_node *memory = NULL;
-   int nid, found = 0;
+   int nid;
 
if (!numa_enabled || (min_common_depth < 0))
return first_online_node;
@@ -1137,17 +1137,6 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
if (nid < 0 || !node_online(nid))
nid = first_online_node;
 
-   if (NODE_DATA(nid)->node_spanned_pages)
-   return nid;
-
-   for_each_online_node(nid) {
-   if (NODE_DATA(nid)->node_spanned_pages) {
-   found = 1;
-   break;
-   }
-   }
-
-   BUG_ON(!found);
return nid;
 }
 
-- 
1.8.3.1



[PATCH v2 0/3] powerpc/mm: movable hotplug memory nodes

2016-09-14 Thread Reza Arbab
These changes enable onlining memory into ZONE_MOVABLE on power, and the 
creation of discrete nodes of movable memory.

We provide a way to describe the extents and numa associativity of such 
a node in the device tree, yet still defer the memory addition to take 
place post-boot through hotplug.

In v1, this patchset introduced a new dt compatible id to explicitly 
create a memoryless node at boot. Here, things have been simplified to 
be applicable regardless of the status of node hotplug on power. We 
still intend to enable hotadding a pgdat, but that's now untangled as a 
separate topic.

v2:
* Use the "status" property of standard dt memory nodes instead of 
  introducing a new "ibm,hotplug-aperture" compatible id.

* Remove the patch which explicitly creates a memoryless node. This set 
  no longer has any bearing on whether the pgdat is created at boot or 
  at the time of memory addition.

v1:
* 
http://lkml.kernel.org/r/1470680843-28702-1-git-send-email-ar...@linux.vnet.ibm.com

Reza Arbab (3):
  drivers/of: recognize status property of dt memory nodes
  powerpc/mm: allow memory hotplug into a memoryless node
  mm: enable CONFIG_MOVABLE_NODE on powerpc

 Documentation/kernel-parameters.txt |  2 +-
 arch/powerpc/mm/numa.c  | 13 +
 drivers/of/fdt.c|  8 
 mm/Kconfig  |  2 +-
 4 files changed, 11 insertions(+), 14 deletions(-)

-- 
1.8.3.1



[PATCH V5 8/8] powerpc: Enable support for new DRC devtree properties

2016-09-14 Thread Michael Bringmann
prom_init.c: Enable support for new DRC device tree properties
"ibm,drc-info" and "ibm,dynamic-memory-v2" in initial handshake
between the Linux kernel and the front end processor.

[V2: Revise constant names.]
[V3: No change.]
[V4: Update comments]
[V5: Resynchronize/resubmit]

Signed-off-by: Michael Bringmann 
---
diff -Naur linux-rhel/arch/powerpc/kernel/prom_init.c linux-rhel-patch/arch/powerpc/kernel/prom_init.c
--- linux-rhel/arch/powerpc/kernel/prom_init.c  2016-03-03 07:36:25.0 -0600
+++ linux-rhel-patch/arch/powerpc/kernel/prom_init.c  2016-06-20 15:59:58.016373676 -0500
@@ -695,7 +695,7 @@ unsigned char ibm_architecture_vec[] = {
OV4_MIN_ENT_CAP,/* minimum VP entitled capacity */
 
/* option vector 5: PAPR/OF options */
-   VECTOR_LENGTH(18),  /* length */
+   VECTOR_LENGTH(22),  /* length */
0,  /* don't ignore, don't halt */
OV5_FEAT(OV5_LPAR) | OV5_FEAT(OV5_SPLPAR) | OV5_FEAT(OV5_LARGE_PAGES) |
OV5_FEAT(OV5_DRCONF_MEMORY) | OV5_FEAT(OV5_DONATE_DEDICATE_CPU) |
@@ -728,6 +728,10 @@ unsigned char ibm_architecture_vec[] = {
OV5_FEAT(OV5_PFO_HW_RNG) | OV5_FEAT(OV5_PFO_HW_ENCR) |
OV5_FEAT(OV5_PFO_HW_842),
OV5_FEAT(OV5_SUB_PROCESSORS),
+   0,
+   0,
+   0,
+   OV5_FEAT(OV5_DYN_MEM_V2) | OV5_FEAT(OV5_DRC_INFO),
 
/* option vector 6: IBM PAPR hints */
VECTOR_LENGTH(3),   /* length */



[PATCH V5 7/8] powerpc: Check arch.vec earlier during boot for memory features

2016-09-14 Thread Michael Bringmann
architecture.vec5 features: The boot-time memory management needs to
know the form of the "ibm,dynamic-memory-v2" property early, during
scanning of the flattened device tree.  This patch moves the call to
pseries_probe_fw_features() early enough that it runs before the
memory properties in the device tree are scanned, so that the
supported properties are recognized.

[V2: No change]
[V3: Updated after commit 3808a88985b4f5f5e947c364debce4441a380fb8.]
[V4: Update comments]
[V5: Resynchronize/resubmit]

Signed-off-by: Michael Bringmann 
---
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 946e34f..2034edc 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -753,6 +753,9 @@ void __init early_init_devtree(void *params)
 */
of_scan_flat_dt(early_init_dt_scan_chosen_ppc, boot_command_line);
 
+   /* Now try to figure out if we are running on LPAR and so on */
+   pseries_probe_fw_features();
+
/* Scan memory nodes and rebuild MEMBLOCKs */
of_scan_flat_dt(early_init_dt_scan_root, NULL);
of_scan_flat_dt(early_init_dt_scan_memory_ppc, NULL);
@@ -823,9 +823,6 @@ void __init early_init_devtree(void *params)
 #endif
epapr_paravirt_early_init();
 
-   /* Now try to figure out if we are running on LPAR and so on */
-   pseries_probe_fw_features();
-
 #ifdef CONFIG_PPC_PS3
/* Identify PS3 firmware */
if (of_flat_dt_is_compatible(of_get_flat_dt_root(), "sony,ps3"))
 



[PATCH V5 6/8] hotplug/drc-info: Add code to search new devtree properties

2016-09-14 Thread Michael Bringmann
rpadlpar_core.c: Provide parallel routines to search the older device-
tree properties ("ibm,drc-indexes", "ibm,drc-names", "ibm,drc-types"
and "ibm,drc-power-domains"), or the new property "ibm,drc-info".

The interface to examine the DRC information is changed from a "get"
function that returns values for local verification elsewhere, to a
"check" function that validates the 'name' and/or 'type' of a device
node.  This update hides the format of the underlying device-tree
properties, and concentrates the value checks into a single function
without requiring the user to verify whether a search was successful.
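
Concretely, the call sites collapse roughly as follows (an illustrative
before/after condensed from the diff below, with error handling elided):

	/* before: fetch the strings, then compare locally */
	rc = rpaphp_get_drc_props(dn, NULL, &name, &type, NULL);
	if (rc == 0 && !strcmp(drc_name, name) && !strcmp(drc_type, type))
		/* matched */;

	/* after: a single validating call */
	if (rpaphp_check_drc_props(dn, drc_name, drc_type) == 0)
		/* matched */;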

[V2: Revise constant names.]
[V3: Amend comments.  Simplify code cleanup.]
[V4: Update comments.]
[V5: Resynchronize/resubmit]

Signed-off-by: Michael Bringmann 
---
diff --git a/drivers/pci/hotplug/rpadlpar_core.c 
b/drivers/pci/hotplug/rpadlpar_core.c
index dc67f39..bea9723 100644
--- a/drivers/pci/hotplug/rpadlpar_core.c
+++ b/drivers/pci/hotplug/rpadlpar_core.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "../pci.h"
 #include "rpaphp.h"
@@ -44,15 +45,14 @@ static struct device_node *find_vio_slot_node(char 
*drc_name)
 {
struct device_node *parent = of_find_node_by_name(NULL, "vdevice");
struct device_node *dn = NULL;
-   char *name;
int rc;
 
if (!parent)
return NULL;
 
while ((dn = of_get_next_child(parent, dn))) {
-   rc = rpaphp_get_drc_props(dn, NULL, &name, NULL, NULL);
-   if ((rc == 0) && (!strcmp(drc_name, name)))
+   rc = rpaphp_check_drc_props(dn, drc_name, NULL);
+   if (rc == 0)
break;
}
 
@@ -64,15 +64,12 @@ static struct device_node *find_php_slot_pci_node(char 
*drc_name,
  char *drc_type)
 {
struct device_node *np = NULL;
-   char *name;
-   char *type;
int rc;
 
while ((np = of_find_node_by_name(np, "pci"))) {
-   rc = rpaphp_get_drc_props(np, NULL, &name, &type, NULL);
+   rc = rpaphp_check_drc_props(np, drc_name, drc_type);
if (rc == 0)
-   if (!strcmp(drc_name, name) && !strcmp(drc_type, type))
-   break;
+   break;
}
 
return np;
diff --git a/drivers/pci/hotplug/rpaphp.h b/drivers/pci/hotplug/rpaphp.h
index 7db024e..8db5f2e 100644
--- a/drivers/pci/hotplug/rpaphp.h
+++ b/drivers/pci/hotplug/rpaphp.h
@@ -91,8 +91,8 @@ int rpaphp_get_sensor_state(struct slot *slot, int *state);
 
 /* rpaphp_core.c */
 int rpaphp_add_slot(struct device_node *dn);
-int rpaphp_get_drc_props(struct device_node *dn, int *drc_index,
-   char **drc_name, char **drc_type, int *drc_power_domain);
+int rpaphp_check_drc_props(struct device_node *dn, char *drc_name,
+   char *drc_type);
 
 /* rpaphp_slot.c */
 void dealloc_slot_struct(struct slot *slot);
diff --git a/drivers/pci/hotplug/rpaphp_core.c 
b/drivers/pci/hotplug/rpaphp_core.c
index 8d13202..0cfdbd9 100644
--- a/drivers/pci/hotplug/rpaphp_core.c
+++ b/drivers/pci/hotplug/rpaphp_core.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include/* for eeh_add_device() */
 #include   /* rtas_call */
 #include /* for pci_controller */
@@ -196,25 +188,21 @@ static int get_children_props(struct device_node *dn, 
const int **drc_indexes,
return 0;
 }
 
-/* To get the DRC props describing the current node, first obtain it's
- * my-drc-index property.  Next obtain the DRC list from it's parent.  Use
- * the my-drc-index for correlation, and obtain the requested properties.
+
+/* Verify the existence of 'drc_name' and/or 'drc_type' within the
+ * current node.  First obtain its my-drc-index property.  Next,
+ * obtain the DRC info from its parent.  Use the my-drc-index for
+ * correlation, and obtain/validate the requested properties.
  */
-int rpaphp_get_drc_props(struct device_node *dn, int *drc_index,
-   char **drc_name, char **drc_type, int *drc_power_domain)
+
+static int rpaphp_check_drc_props_v1(struct device_node *dn, char *drc_name,
+   char *drc_type, unsigned int my_index)
 {
+   char *name_tmp, *type_tmp;
const int *indexes, *names;
const int *types, *domains;
-   const unsigned int *my_index;
-   char *name_tmp, *type_tmp;
int i, rc;
 
-   my_index = of_get_property(dn, "ibm,my-drc-index", NULL);
-   if (!my_index) {
-   /* Node isn't DLPAR/hotplug capable */
-   return -EINVAL;
-   }
-
rc = get_children_props(dn->parent, &indexes, &names, &types, &domains);
if (rc < 0) {
return -EINVAL;
@@ -225,24 +213,83 @@ int rpaphp_get_drc_props(struct device_node *dn, int 
*drc_index,
 
/* Iterate through parent properties, looking for my-drc-index */
for (i = 0; i < 

[PATCH V5 5/8] pseries/drc-info: Search new DRC properties for CPU indexes

2016-09-14 Thread Michael Bringmann
pseries/drc-info: Provide parallel routines to convert between
drc_index and CPU numbers at runtime, using the older device-tree
properties ("ibm,drc-indexes", "ibm,drc-names", "ibm,drc-types"
and "ibm,drc-power-domains"), or the new property "ibm,drc-info".

[V2: Revise constant names.]
[V3: No change.]
[V4: No change.]
[V5: Resynchronize/resubmit]

Signed-off-by: Michael Bringmann 
---
diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
b/arch/powerpc/platforms/pseries/pseries_energy.c
index 9276779..10c4200 100644
--- a/arch/powerpc/platforms/pseries/pseries_energy.c
+++ b/arch/powerpc/platforms/pseries/pseries_energy.c
@@ -35,10 +35,68 @@ static int sysfs_entries;
 
 /* Helper Routines to convert between drc_index to cpu numbers */
 
+void read_one_drc_info(int **info, char **dtype, char **dname,
+   unsigned long int *fdi_p, unsigned long int *nsl_p,
+   unsigned long int *si_p, unsigned long int *ldi_p)
+{
+   char *drc_type, *drc_name, *pc;
+   u32 fdi, nsl, si, ldi;
+
+   fdi = nsl = si = ldi = 0;
+
+   /* Get drc-type:encode-string */
+   pc = (char *)*info;
+   drc_type = pc;
+   pc += (strlen(drc_type) + 1);
+
+   /* Get drc-name-prefix:encode-string */
+   drc_name = (char *)pc;
+   pc += (strlen(drc_name) + 1);
+
+   /* Get drc-index-start:encode-int */
+   memcpy(&fdi, pc, 4);
+   fdi = be32_to_cpu(fdi);
+   pc += 4;
+
+   /* Get/skip drc-name-suffix-start:encode-int */
+   pc += 4;
+
+   /* Get number-sequential-elements:encode-int */
+   memcpy(&nsl, pc, 4);
+   nsl = be32_to_cpu(nsl);
+   pc += 4;
+
+   /* Get sequential-increment:encode-int */
+   memcpy(&si, pc, 4);
+   si = be32_to_cpu(si);
+   pc += 4;
+
+   /* Get/skip drc-power-domain:encode-int */
+   pc += 4;
+
+   /* Should now know end of current entry */
+   ldi = fdi + ((nsl-1)*si);
+
+   (*info) = (int *)pc;
+
+   if (dtype)
+   *dtype = drc_type;
+   if (dname)
+   *dname = drc_name;
+   if (fdi_p)
+   *fdi_p = fdi;
+   if (nsl_p)
+   *nsl_p = nsl;
+   if (si_p)
+   *si_p = si;
+   if (ldi_p)
+   *ldi_p = ldi;
+}
+EXPORT_SYMBOL(read_one_drc_info);
+
 static u32 cpu_to_drc_index(int cpu)
 {
struct device_node *dn = NULL;
-   const int *indexes;
int i;
int rc = 1;
u32 ret = 0;
@@ -46,18 +104,54 @@ static u32 cpu_to_drc_index(int cpu)
dn = of_find_node_by_path("/cpus");
if (dn == NULL)
goto err;
-   indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
-   if (indexes == NULL)
-   goto err_of_node_put;
+
/* Convert logical cpu number to core number */
i = cpu_core_index_of_thread(cpu);
-   /*
-* The first element indexes[0] is the number of drc_indexes
-* returned in the list.  Hence i+1 will get the drc_index
-* corresponding to core number i.
-*/
-   WARN_ON(i > indexes[0]);
-   ret = indexes[i + 1];
+
+   if (firmware_has_feature(FW_FEATURE_DRC_INFO)) {
+   int *info = NULL;
+   unsigned long int num_set_entries, j, iw = i, fdi = 0;
+   unsigned long int ldi = 0, nsl = 0, si = 0;
+   char *dtype;
+   char *dname;
+
+   info = (int *)of_get_property(dn, "ibm,drc-info", NULL);
+   if (info == NULL)
+   goto err_of_node_put;
+
+   num_set_entries = be32_to_cpu(*info++);
+
+   for (j = 0; j < num_set_entries; j++) {
+
+   read_one_drc_info(&info, &dtype, &dname, &fdi,
+   &nsl, &si, &ldi);
+   if (strcmp(dtype, "CPU"))
+   goto err;
+
+   if (iw < ldi)
+   break;
+
+   WARN_ON(((iw-fdi)%si) != 0);
+   }
+   WARN_ON((nsl == 0) || (si == 0));
+
+   ret = fdi + (iw*si);
+   } else {
+   const int *indexes;
+
+   indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
+   if (indexes == NULL)
+   goto err_of_node_put;
+
+   /*
+* The first element indexes[0] is the number of drc_indexes
+* returned in the list.  Hence i+1 will get the drc_index
+* corresponding to core number i.
+*/
+   WARN_ON(i > indexes[0]);
+   ret = indexes[i + 1];
+   }
+
rc = 0;
 
 err_of_node_put:
@@ -78,21 +172,51 @@ static int drc_index_to_cpu(u32 drc_index)
dn = of_find_node_by_path("/cpus");
if (dn == NULL)
goto err;
-   indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
-   if (indexes == NULL)
-   goto 

[PATCH V5 4/8] pseries/hotplug init: Convert new DRC memory property for hotplug runtime

2016-09-14 Thread Michael Bringmann
hotplug_init: Simplify the code needed for runtime memory hotplug and
maintenance with a conversion routine that transforms the compressed
property "ibm,dynamic-memory-v2" to the form of "ibm,dynamic-memory"
within the "ibm,dynamic-reconfiguration-memory" property.  Thus only
a single set of routines should be required at runtime to parse, edit,
and manipulate the memory representation in the device tree.  Similarly,
any userspace applications that need this information will only need
to recognize the older format to be able to continue to operate.

[V2: Revise constant names.]
[V3: Replace use of an in-code compile flag encompassing the file with a Makefile change.]
[V4: Remove unneeded code braces.
 Simplify allocation of a couple of loop index variables.]
[V5: Resynchronize/resubmit]

Signed-off-by: Michael Bringmann 
---
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 2ce1385..0c46fbc 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -24,6 +24,8 @@
 #include 
 #include "pseries.h"
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+
 static bool rtas_hp_event;
 
 unsigned long pseries_memory_block_size(void)
@@ -887,11 +889,102 @@ static int pseries_memory_notifier(struct notifier_block 
*nb,
 static struct notifier_block pseries_mem_nb = {
.notifier_call = pseries_memory_notifier,
 };
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+static int pseries_rewrite_dynamic_memory_v2(void)
+{
+   unsigned long memblock_size;
+   struct device_node *dn;
+   struct property *prop, *prop_v2;
+   __be32 *p;
+   struct of_drconf_cell *lmbs;
+   u32 num_lmb_desc_sets, num_lmbs;
+   int i, j, k;
+
+   dn = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+   if (!dn)
+   return -EINVAL;
+
+   prop_v2 = of_find_property(dn, "ibm,dynamic-memory-v2", NULL);
+   if (!prop_v2)
+   return -EINVAL;
+
+   memblock_size = pseries_memory_block_size();
+   if (!memblock_size)
+   return -EINVAL;
+
+   /* The first int of the property is the number of lmb sets
+* described by the property.
+*/
+   p = (__be32 *)prop_v2->value;
+   num_lmb_desc_sets = be32_to_cpu(*p++);
+
+   /* Count the number of LMBs for generating the alternate format
+*/
+   for (i = 0, num_lmbs = 0; i < num_lmb_desc_sets; i++) {
+   struct of_drconf_cell_v2 drmem;
+
+   read_drconf_cell_v2(&drmem, (const __be32 **)&p);
+   num_lmbs += drmem.num_seq_lmbs;
+   }
+
+   /* Create an empty copy of the new 'ibm,dynamic-memory' property
+*/
+   prop = kzalloc(sizeof(*prop), GFP_KERNEL);
+   if (!prop)
+   return -ENOMEM;
+   prop->name = kstrdup("ibm,dynamic-memory", GFP_KERNEL);
+   prop->length = dyn_mem_v2_len(num_lmbs);
+   prop->value = kzalloc(prop->length, GFP_KERNEL);
+
+   /* Copy/expand the ibm,dynamic-memory-v2 format to produce the
+* ibm,dynamic-memory format.
+*/
+   p = (__be32 *)prop->value;
+   *p = cpu_to_be32(num_lmbs);
+   p++;
+   lmbs = (struct of_drconf_cell *)p;
+
+   p = (__be32 *)prop_v2->value;
+   p++;
+
+   for (i = 0, k = 0; i < num_lmb_desc_sets; i++) {
+   struct of_drconf_cell_v2 drmem;
+
+   read_drconf_cell_v2(&drmem, (const __be32 **)&p);
+
+   for (j = 0; j < drmem.num_seq_lmbs; j++) {
+   lmbs[k+j].base_addr = be64_to_cpu(drmem.base_addr);
+   lmbs[k+j].drc_index = be32_to_cpu(drmem.drc_index);
+   lmbs[k+j].reserved  = 0;
+   lmbs[k+j].aa_index  = be32_to_cpu(drmem.aa_index);
+   lmbs[k+j].flags = be32_to_cpu(drmem.flags);
+
+   drmem.base_addr += memblock_size;
+   drmem.drc_index++;
+   }
+
+   k += drmem.num_seq_lmbs;
+   }
+
+   of_remove_property(dn, prop_v2);
+
+   of_add_property(dn, prop);
+
+   /* And disable feature flag since the property has gone away */
+   powerpc_firmware_features &= ~FW_FEATURE_DYN_MEM_V2;
+
+   return 0;
+}
 
 static int __init pseries_memory_hotplug_init(void)
 {
+   if (firmware_has_feature(FW_FEATURE_DYN_MEM_V2))
+   pseries_rewrite_dynamic_memory_v2();
+#ifdef CONFIG_MEMORY_HOTPLUG
if (firmware_has_feature(FW_FEATURE_LPAR))
of_reconfig_notifier_register(_mem_nb);
+#endif /* CONFIG_MEMORY_HOTPLUG */
 
return 0;
 }
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index fedc2ccf0..e74cf6c 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -5,14 +5,14 @@ obj-y := lpar.o hvCall.o nvram.o reconfig.o \
 

[PATCH V5 2/8] powerpc/memory: Parse new memory property to register blocks.

2016-09-14 Thread Michael Bringmann
powerpc/memory: Add parallel routines to parse the new property
"ibm,dynamic-memory-v2" property when it is present, and then to
register the relevant memory blocks with the operating system.
This property format is intended to provide a more compact
representation of memory when communicating with the front end
processor, especially when describing vast amounts of RAM.

[V2: Revise constant names.]
[V3: Fix error parsing the new memory block sets.]
[V4: Move a couple of function prototypes from the header file to
 a later patch where first used.
 Amend some comments.
 Change a firmware architecture vec check to scan the actual device tree.
 Compress some common code.]
[V5: Resynchronize/resubmit]

Signed-off-by: Michael Bringmann 
---
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 7f436ba..b9a1534 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -69,6 +69,8 @@ struct boot_param_header {
  * OF address retreival & translation
  */
 
+extern int n_mem_addr_cells;
+
 /* Parse the ibm,dma-window property of an OF node into the busno, phys and
  * size parameters.
  */
@@ -81,8 +83,9 @@ extern void of_instantiate_rtc(void);
 extern int of_get_ibm_chip_id(struct device_node *np);
 
 /* The of_drconf_cell struct defines the layout of the LMB array
- * specified in the device tree property
- * ibm,dynamic-reconfiguration-memory/ibm,dynamic-memory
+ * specified in the device tree properties,
+ * ibm,dynamic-reconfiguration-memory/ibm,dynamic-memory
+ * ibm,dynamic-reconfiguration-memory/ibm,dynamic-memory-v2
  */
 struct of_drconf_cell {
u64 base_addr;
u32 drc_index;
u32 reserved;
u32 aa_index;
u32 flags;
 };
 
-#define DRCONF_MEM_ASSIGNED0x0008
-#define DRCONF_MEM_AI_INVALID  0x0040
-#define DRCONF_MEM_RESERVED0x0080
+#define DRCONF_MEM_ASSIGNED0x0008
+#define DRCONF_MEM_AI_INVALID  0x0040
+#define DRCONF_MEM_RESERVED0x0080
+
+struct of_drconf_cell_v2 {
+   u32 num_seq_lmbs;
+   u64 base_addr;
+   u32 drc_index;
+   u32 aa_index;
+   u32 flags;
+} __attribute__((packed));
+
+extern void read_drconf_cell_v2(struct of_drconf_cell_v2 *drmem,
+   const __be32 **cellp);
 
 /*
  * There are two methods for telling firmware what our capabilities are.
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 669a15e..ad294ce 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -57,8 +57,10 @@
 EXPORT_SYMBOL(node_data);
 
 static int min_common_depth;
-static int n_mem_addr_cells, n_mem_size_cells;
+int n_mem_addr_cells;
+static int n_mem_size_cells;
 static int form1_affinity;
+EXPORT_SYMBOL(n_mem_addr_cells);
 
 #define MAX_DISTANCE_REF_POINTS 4
 static int distance_ref_points_depth;
@@ -405,6 +405,24 @@ static void read_drconf_cell(struct of_drconf_cell *drmem, 
const __be32 **cellp)
 
*cellp = cp + 4;
 }
+
+/*
+ * Read the next memory block set entry from the ibm,dynamic-memory-v2
+ * property and return the information in the provided of_drconf_cell_v2
+ * structure.
+ */
+void read_drconf_cell_v2(struct of_drconf_cell_v2 *drmem, const __be32 **cellp)
+{
+   const __be32 *cp = (const __be32 *)*cellp;
+   drmem->num_seq_lmbs = be32_to_cpu(*cp++);
+   drmem->base_addr = read_n_cells(n_mem_addr_cells, );
+   drmem->drc_index = be32_to_cpu(*cp++);
+   drmem->aa_index = be32_to_cpu(*cp++);
+   drmem->flags = be32_to_cpu(*cp++);
+
+   *cellp = cp;
+}
+EXPORT_SYMBOL(read_drconf_cell_v2);
 
 /*
  * Retrieve and validate the ibm,dynamic-memory property of the device tree.
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index b0245be..51330bc 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -443,23 +443,34 @@ static int __init early_init_dt_scan_chosen_ppc(unsigned 
long node,
 
 #ifdef CONFIG_PPC_PSERIES
 /*
- * Interpret the ibm,dynamic-memory property in the
- * /ibm,dynamic-reconfiguration-memory node.
+ * Retrieve and validate the ibm,lmb-size property for drconf memory
+ * from the flattened device tree.
+ */
+static u64 __init get_lmb_size(unsigned long node)
+{
+   const __be32 *ls;
+   int len;
+   ls = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
+   if (!ls || len < dt_root_size_cells * sizeof(__be32))
+   return 0;
+   return dt_mem_next_cell(dt_root_size_cells, &ls);
+}
+
+/*
+ * Interpret the ibm,dynamic-memory property/ibm,dynamic-memory-v2
+ * in the /ibm,dynamic-reconfiguration-memory node.
  * This contains a list of memory blocks along with NUMA affinity
  * information.
  */
-static int __init early_init_dt_scan_drconf_memory(unsigned long node)
+static int __init early_init_dt_scan_drconf_memory_v1(unsigned long node)
 {
-   const 

[PATCH V5 3/8] powerpc/memory: Parse new memory property to initialize structures.

2016-09-14 Thread Michael Bringmann
powerpc/memory: Add parallel routines to parse the new property
"ibm,dynamic-memory-v2" property when it is present, and then to
finish initialization of the relevant memory structures with the
operating system.  This code is shared between the boot-time
initialization functions and the runtime functions for memory
hotplug, so it needs to be able to handle both formats.

[V2: Revise constant names.]
[V3: Fix loop that needed to scan all blocks defined by the new, compressed
 memory definition.]
[V4: Added external function prototype definitions to header file
 "prom.h" for use in other files.
 Change a firmware architecture vec check to scan the actual device tree.
 Delete an unused variable.
 Small cleanups to comments.]
[V5: Resynchronize/resubmit]

Signed-off-by: Michael Bringmann 
---
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 7f436ba..b9a1534 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -109,6 +109,18 @@ struct of_drconf_cell_v2 {
  
 extern void read_drconf_cell_v2(struct of_drconf_cell_v2 *drmem,
const __be32 **cellp);
+
+extern void read_one_drc_info(int **info, char **drc_type, char **drc_name,
+   unsigned long int *fdi_p, unsigned long int *nsl_p,
+   unsigned long int *si_p, unsigned long int *ldi_p);
+
+static inline int dyn_mem_v2_len(int entries)
+{
+   int drconf_v2_cells = (n_mem_addr_cells + 4);
+   int drconf_v2_cells_len = (drconf_v2_cells * sizeof(unsigned int));
+   return (((entries) * drconf_v2_cells_len) +
+(1 * sizeof(unsigned int)));
+}
 
 /*
  * There are two methods for telling firmware what our capabilities are.
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 669a15e..18b4ee7 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -427,30 +426,55 @@
 EXPORT_SYMBOL(read_drconf_cell_v2);
 
 /*
- * Retrieve and validate the ibm,dynamic-memory property of the device tree.
+ * Retrieve and validate the ibm,dynamic-memory[-v2] property of the
+ * device tree.
+ *
+ * The layout of the ibm,dynamic-memory property is a count N of memory
+ * block description entries followed by the N entries themselves.  Each
+ * memory block description entry contains information as laid out in the
+ * of_drconf_cell struct above.
  *
- * The layout of the ibm,dynamic-memory property is a number N of memblock
- * list entries followed by N memblock list entries.  Each memblock list entry
- * contains information as laid out in the of_drconf_cell struct above.
+ * The layout of the ibm,dynamic-memory-v2 property is a count N of memory
+ * block set descriptions, followed by the N memory block set description
+ * entries themselves.
  */
 static int of_get_drconf_memory(struct device_node *memory, const __be32 **dm)
 {
const __be32 *prop;
u32 len, entries;
 
-   prop = of_get_property(memory, "ibm,dynamic-memory", &len);
-   if (!prop || len < sizeof(unsigned int))
-   return 0;
+   if (firmware_has_feature(FW_FEATURE_DYN_MEM_V2)) {
 
-   entries = of_read_number(prop++, 1);
+   prop = of_get_property(memory, "ibm,dynamic-memory-v2", &len);
+   if (!prop || len < sizeof(unsigned int))
+   return 0;
 
-   /* Now that we know the number of entries, revalidate the size
-* of the property read in to ensure we have everything
-*/
-   if (len < (entries * (n_mem_addr_cells + 4) + 1) * sizeof(unsigned int))
-   return 0;
+   entries = of_read_number(prop++, 1);
+
+   /* Now that we know the number of set entries, revalidate the
+* size of the property read in to ensure we have everything.
+*/
+   if (len < dyn_mem_v2_len(entries))
+   return 0;
+
+   *dm = prop;
+   } else {
+   prop = of_get_property(memory, "ibm,dynamic-memory", &len);
+   if (!prop || len < sizeof(unsigned int))
+   return 0;
+
+   entries = of_read_number(prop++, 1);
+
+   /* Now that we know the number of entries, revalidate the size
+* of the property read in to ensure we have everything
+*/
+   if (len < (entries * (n_mem_addr_cells + 4) + 1) *
+  sizeof(unsigned int))
+   return 0;
+
+   *dm = prop;
+   }
 
-   *dm = prop;
return entries;
 }
 
@@ -513,7 +537,7 @@
  * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
  */
-static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
+static int of_drconf_to_nid_single(u32 drmem_flags, u32 drmem_aa_index,
   struct assoc_arrays 

[PATCH V5 1/8] powerpc/firmware: Add definitions for new firmware features.

2016-09-14 Thread Michael Bringmann
Firmware Features: Define new bit flags representing the presence of
new device tree properties "ibm,drc-info", and "ibm,dynamic-memory-v2".
These flags are used to tell the front end processor when the Linux
kernel supports the new properties, and by the front end processor to
tell the Linux kernel that the new properties are present in the device
tree.

[V2: Revise constant names for improved clarity.]
[V3: Fix comments]
[V4: Fix some spacing]
[V5: Resynchronize/resubmit]

Signed-off-by: Michael Bringmann 
---
diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index b062924..a9d66d5 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -51,6 +51,8 @@
 #define FW_FEATURE_BEST_ENERGY ASM_CONST(0x0000000080000000)
 #define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0000000100000000)
 #define FW_FEATURE_PRRN ASM_CONST(0x0000000200000000)
+#define FW_FEATURE_DYN_MEM_V2  ASM_CONST(0x0000000400000000)
+#define FW_FEATURE_DRC_INFO ASM_CONST(0x0000000800000000)
 
 #ifndef __ASSEMBLY__
 
@@ -66,7 +68,8 @@ enum {
FW_FEATURE_MULTITCE | FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN,
+   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+   FW_FEATURE_DYN_MEM_V2 | FW_FEATURE_DRC_INFO,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL,
FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 7f436ba..b9a1534 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -155,6 +203,8 @@ struct of_drconf_cell {
 #define OV5_PFO_HW_842 0x0E40  /* PFO Compression Accelerator */
 #define OV5_PFO_HW_ENCR0x0E20  /* PFO Encryption Accelerator */
 #define OV5_SUB_PROCESSORS 0x0F01  /* 1,2,or 4 Sub-Processors supported */
+#define OV5_DYN_MEM_V2 0x1680  /* Redef Prop Structures: dyn-mem-v2 */
+#define OV5_DRC_INFO   0x1640  /* Redef Prop Structures: drc-info   */
 
 /* Option Vector 6: IBM PAPR hints */
 #define OV6_LINUX  0x02/* Linux is our OS */
diff --git a/arch/powerpc/platforms/pseries/firmware.c 
b/arch/powerpc/platforms/pseries/firmware.c
index 8c80588..00243ee 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -113,6 +113,8 @@ static __initdata struct vec5_fw_feature
 vec5_fw_features_table[] = {
{FW_FEATURE_TYPE1_AFFINITY, OV5_TYPE1_AFFINITY},
{FW_FEATURE_PRRN,   OV5_PRRN},
+   {FW_FEATURE_DYN_MEM_V2, OV5_DYN_MEM_V2},
+   {FW_FEATURE_DRC_INFO,   OV5_DRC_INFO},
 };
 
 void __init fw_vec5_feature_init(const char *vec5, unsigned long len)



[Patch V5 0/8] powerpc/devtree: Add support for 2 new DRC properties

2016-09-14 Thread Michael Bringmann
Several properties in the DRC device tree format are replaced by
more compact representations to allow, for example, the encoding
of vast amounts of memory and/or reduced duplication of information
in related data structures.

"ibm,drc-info": This property, when present, replaces the following
four properties: "ibm,drc-indexes", "ibm,drc-names", "ibm,drc-types"
and "ibm,drc-power-domains".  This property is defined for all
dynamically reconfigurable platform nodes.  The "ibm,drc-info" elements
are intended to provide a more compact representation, and reduce some
search overhead.
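
For illustration, one "ibm,drc-info" set entry is a packed byte stream
with the layout below (this is the order in which read_one_drc_info()
in patch 5/8 parses it; the helper function is a made-up sketch, not
part of the property):

	/*
	 *   drc-type                    encode-string (e.g. "CPU", "MEM")
	 *   drc-name-prefix             encode-string
	 *   drc-index-start             encode-int
	 *   drc-name-suffix-start       encode-int
	 *   number-sequential-elements  encode-int
	 *   sequential-increment        encode-int
	 *   drc-power-domain            encode-int
	 */

	/* drc-index of the i-th element described by one set entry */
	static inline u32 drc_info_index(u32 drc_index_start,
					 u32 sequential_inc, u32 i)
	{
		return drc_index_start + i * sequential_inc;
	}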

"ibm,dynamic-memory-v2": This property replaces the "ibm,dynamic-memory"
node representation within the "ibm,dynamic-reconfiguration-memory"
property provided by the BMC.  This element format is intended to provide
a more compact representation of memory, especially, for systems with
massive amounts of RAM.  To simplify portability, this property is
converted to the "ibm,dynamic-memory" property during system boot.
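
To make the conversion concrete, here is a minimal stand-alone sketch of
the expansion performed by patch 4/8 (simplified types; in the kernel
these are of_drconf_cell and of_drconf_cell_v2, and the endian swaps
done by the real code are omitted):

	#include <stdint.h>

	struct lmb_v1 {
		uint64_t base_addr;
		uint32_t drc_index, reserved, aa_index, flags;
	};

	struct lmb_set_v2 {
		uint32_t num_seq_lmbs;
		uint64_t base_addr;
		uint32_t drc_index, aa_index, flags;
	};

	/* Expand one v2 set entry into num_seq_lmbs v1 entries; returns
	 * the number of entries written to out[]. */
	static uint32_t expand_v2_set(const struct lmb_set_v2 *set,
				      struct lmb_v1 *out, uint64_t lmb_size)
	{
		uint32_t j;

		for (j = 0; j < set->num_seq_lmbs; j++) {
			out[j].base_addr = set->base_addr + j * lmb_size;
			out[j].drc_index = set->drc_index + j;
			out[j].reserved  = 0;
			out[j].aa_index  = set->aa_index;
			out[j].flags     = set->flags;
		}
		return set->num_seq_lmbs;
	}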

"ibm,architecture.vec": Bidirectional communication mechanism between
the host system and the front end processor indicating what features
the host system supports and what features the front end processor will
actually provide.  In this case, we are indicating that the host system
can support the new device tree structures "ibm,drc-info" and
"ibm,dynamic-memory-v2".

[V1: Initial presentation of PAPR 2.7 changes to device tree.]
[V2: Revise constant names.  Fix some syntax errors.  Improve comments.]
[V3: Revise tests for presence of new properties to always scan devicetree
 instead of depending upon architecture vec, due to reboot issues.]
[V4: Rearrange some code changes in patches to better match application,
 and other code cleanup.]
[V5: Resynchronize patches.]

Signed-off-by: Michael Bringmann 

Michael Bringmann (8):
  powerpc/firmware: Add definitions for new firmware features.
  powerpc/memory: Parse new memory property to register blocks.
  powerpc/memory: Parse new memory property to initialize structures.
  pseries/hotplug init: Convert new DRC memory property for hotplug runtime
  pseries/drc-info: Search new DRC properties for CPU indexes
  hotplug/drc-info: Add code to search new devtree properties
  powerpc: Check arch.vec earlier during boot for memory features
  powerpc: Enable support for new DRC devtree properties




Re: [PATCH 3/6] cxlflash: Fix to avoid EEH and host reset collisions

2016-09-14 Thread Martin K. Petersen
> "Uma" == Uma Krishnan  writes:

Applied patches 3 through 6 to 4.9/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH V2 1/7] dt-bindings: Update QorIQ TMU thermal bindings

2016-09-14 Thread Leo Li
On Wed, Jun 8, 2016 at 2:52 PM, Rob Herring  wrote:
> On Tue, Jun 07, 2016 at 11:27:34AM +0800, Jia Hongtao wrote:
>> For different types of SoC the sensor id and endianness may vary.
>> "#thermal-sensor-cells" is used to provide sensor id information.
>> "little-endian" property is to tell the endianness of TMU.
>>
>> Signed-off-by: Jia Hongtao 
>> ---
>> Changes for V2:
>> * Remove formatting changes.
>>
>>  Documentation/devicetree/bindings/thermal/qoriq-thermal.txt | 7 +++
>>  1 file changed, 7 insertions(+)
>
> Acked-by: Rob Herring 

Hi Zhang Rui,

Since you have applied the driver patch, can you also apply the
binding patch?  The binding is supposed to go with the driver.

Regards,
Leo


Re: [PATCH 0/2] Enable MSR_TM lazily

2016-09-14 Thread Carlos Eduardo Seo



On 9/14/16 8:28 AM, Nicholas Piggin wrote:


How common is it for glibc to be built with elision?



Not that common. We have it built with TLE support in Ubuntu (starting 
in 15.04), SLES 12 (since SP2) and AT 9.0-0.


However, it is only enabled by default in Ubuntu. For SLES and AT 9.0, 
the user has to set an env var to enable it (it's a hack).


There is some work upstream to add a tunables framework to glibc. That 
will allow us to properly provide a way to users enable/disable TLE as 
they wish. That patch is almost in, and as soon as it's committed, we'll 
start working on the tunable for TLE.


--
Carlos Eduardo Seo
Software Engineer - Linux on Power Toolchain
c...@linux.vnet.ibm.com



Re: [PATCH] drivers/dma: NO_IRQ removal from powerpc-only drivers

2016-09-14 Thread Vinod Koul
On Sat, Sep 10, 2016 at 07:56:04PM +1000, Michael Ellerman wrote:
> We'd like to eventually remove NO_IRQ on powerpc, so remove usages of it
> from powerpc-only drivers.

Applied after fixing subsystem name

-- 
~Vinod


Re: [PATCH 0/2] Enable MSR_TM lazily

2016-09-14 Thread Michael Neuling
On 14 Sep. 2016 10:12 pm, "Nicholas Piggin"  wrote:
>
> On Wed, 14 Sep 2016 21:46:39 +1000
> Michael Neuling  wrote:
>
> > On Wed, 2016-09-14 at 21:28 +1000, Nicholas Piggin wrote:
> > > Cc'ing Carlos
> > >
> > > On Wed, 14 Sep 2016 18:02:14 +1000

> > I think we might be able to detect this case in the kernel. If it's a
> > tabort that's trapped on, we can't have been transactional.  Hence we
> > can safely PC+=4 and leave TM off.
> >
> > It would cost us a get_user(inst, regs->nip); but it might be worth
> > it for this special but common case.
>
> That would take an extra trap for every syscall, I think.

You're right. That wouldn't work.

Mikey


Re: [PATCH 0/2] Enable MSR_TM lazily

2016-09-14 Thread Nicholas Piggin
On Wed, 14 Sep 2016 21:46:39 +1000
Michael Neuling  wrote:

> On Wed, 2016-09-14 at 21:28 +1000, Nicholas Piggin wrote:
> > Cc'ing Carlos
> > 
> > On Wed, 14 Sep 2016 18:02:14 +1000
> > Cyril Bur  wrote:
> >   
> > > 
> > > Currently the kernel checks to see if the hardware is transactional
> > > memory capable and always enables the MSR_TM bit. The problem with
> > > this is that the TM related SPRs become available to userspace,
> > > requiring them to be switched between processes. It turns out these
> > > SPRs are expensive to read and write and if a thread doesn't use TM
> > > (or worse yet isn't even TM aware) then context switching incurs this
> > > penalty for nothing.
> > > 
> > > The solution here is to leave the MSR_TM bit disabled and enable it
> > > more 'on demand'. Leaving MSR_TM disabled causes a thread to take a
> > > facility unavailable fault if and when it does decide to use TM. As
> > > with recent updates to the FPU, VMX and VSX units the MSR_TM bit will
> > > be enabled upon taking the fault and left on for some time afterwards
> > > as the assumption is that if a thread used TM once it may well use it
> > > again. The kernel will turn the MSR_TM bit off after some number of
> > > context switches of that thread.
> > > 
> > > Performance numbers haven't been completely gathered as yet but early
> > > runs of tools/testing/selftests/powerpc/benchmarks/context_switch
> > > (which doesn't use TM) yields a jump from ~16 switches per second
> > > to ~18 switches per second with patch 3/3 applied.  
> > Cool!
> > 
> > Question: glibc when built with lock elision seems like it will
> > execute tabort. before every syscall, to work around old kernel
> > behaviour. That's always going to fault TM on, isn't it?  
> 
> I think we might be able to detect this case in the kernel. If it's a tabort
> that's trapped on, we can't have been transactional.  Hence we can safely 
> PC+=4
> and leave TM off.
> 
> It would cost us a get_user(inst, regs->nip); but it might be worth it for 
> this
> special but common case.

That would take an extra trap for every syscall, I think.


> > How common is it for glibc to be built with elision?
> 
> IIRC Ubuntu uses it on 16.04 (and maybe 15.10).

Ah yes, but I was wrong: it also has to be linked against -lpthread
because it depends on r13 != 0. That's why I couldn't see it in my
trace. Now I do when using -lpthread. On 16.04.


> > We should probably be testing PPC_FEATURE2_HTM_NOSC to skip the
> > tabort.  
> 
> Agree, that would be ideal. Binary patching glibc at runtime.

That would be nice. Does glibc support binary patching? I'm not very
familiar with the code. Current syscall code ends up something like
this:

	cmpdi	13,0
	beq	1f
	lwz	0,TM_CAPABLE(13)
	cmpwi	0,0
	beq	1f
	li	11,_ABORT_SYSCALL
	tabort.	11
	.align	4
1:
	li	0,syscall
	sc

Without runtime patching, if we had another variable that meant we are
TM capable *and* need to issue a tabort., we could do the same sequence
without extra instructions. That might be the first step.

Thanks,
Nick


Re: [PATCH 0/2] Enable MSR_TM lazily

2016-09-14 Thread Michael Neuling
On Wed, 2016-09-14 at 21:28 +1000, Nicholas Piggin wrote:
> Cc'ing Carlos
> 
> On Wed, 14 Sep 2016 18:02:14 +1000
> Cyril Bur  wrote:
> 
> > 
> > Currently the kernel checks to see if the hardware is transactional
> > memory capable and always enables the MSR_TM bit. The problem with
> > this is that the TM related SPRs become available to userspace,
> > requiring them to be switched between processes. It turns out these
> > SPRs are expensive to read and write and if a thread doesn't use TM
> > (or worse yet isn't even TM aware) then context switching incurs this
> > penalty for nothing.
> > 
> > The solution here is to leave the MSR_TM bit disabled and enable it
> > more 'on demand'. Leaving MSR_TM disabled causes a thread to take a
> > facility unavailable fault if and when it does decide to use TM. As
> > with recent updates to the FPU, VMX and VSX units the MSR_TM bit will
> > be enabled upon taking the fault and left on for some time afterwards
> > as the assumption is that if a thread used TM once it may well use it
> > again. The kernel will turn the MSR_TM bit off after some number of
> > context switches of that thread.
> > 
> > Performance numbers haven't been completely gathered as yet but early
> > runs of tools/testing/selftests/powerpc/benchmarks/context_switch
> > (which doesn't use TM) yields a jump from ~16 switches per second
> > to ~18 switches per second with patch 3/3 applied.
> Cool!
> 
> Question: glibc when built with lock elision seems like it will
> execute tabort. before every syscall, to work around old kernel
> behaviour. That's always going to fault TM on, isn't it?

I think we might be able to detect this case in the kernel. If it's a tabort
that's trapped on, we can't have been transactional.  Hence we can safely PC+=4
and leave TM off.

It would cost us a get_user(inst, regs->nip); but it might be worth it for this
special but common case.
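
Something like the following in tm_unavailable(), perhaps (an untested
sketch; it assumes PPC_INST_TABORT from asm/ppc-opcode.h and masks the
RA field out of the instruction word):

	u32 inst;

	/* A tabort. taken with MSR_TM clear can't have been transactional:
	 * step over it and leave MSR_TM off. */
	if (user_mode(regs) &&
	    get_user(inst, (u32 __user *)regs->nip) == 0 &&
	    (inst & ~(0x1fUL << 16)) == PPC_INST_TABORT) {
		regs->nip += 4;
		return;
	}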

> How common is it for glibc to be built with elision?

IIRC Ubuntu uses it on 16.04 (and maybe 15.10).

> We should probably be testing PPC_FEATURE2_HTM_NOSC to skip the
> tabort.

Agree, that would be ideal. Binary patching glibc at runtime.

> (BTW, this is a pretty good writeup, would you consider adding
> a bit more of it to patch 2 so it gets into the changelog?)

Agreed.

Mikey


Re: [PATCH] powernv/pci: Fix m64 checks for SR-IOV and window alignment

2016-09-14 Thread Gavin Shan
On Wed, Sep 14, 2016 at 05:51:08PM +1000, Benjamin Herrenschmidt wrote:
>On Wed, 2016-09-14 at 16:37 +1000, Russell Currey wrote:
>> Commit 5958d19a143e checks for prefetchable m64 BARs by comparing the
>> addresses instead of using resource flags.  This broke SR-IOV as the
>> m64
>> check in pnv_pci_ioda_fixup_iov_resources() fails.
>> 
>> The condition in pnv_pci_window_alignment() also changed to checking
>> only IORESOURCE_MEM_64 instead of both IORESOURCE_MEM_64 and
>> IORESOURCE_PREFETCH.
>
>CC'ing Gavin who might have some insight in the matter.
>
>Why do we check for prefetch ? On PCIe, any 64-bit BAR can live under a
>prefetchable region afaik... Gavin, any idea ?
>

Ben, this is what I have understood for a long time: a non-prefetchable
BAR cannot live under a prefetchable region (window), but any BAR can
live under a non-prefetchable region (window).
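
In code terms, the rule reads something like this (just an illustrative
predicate over the generic resource flags, not code from the patch):

	static bool bar_allowed_in_window(unsigned long bar_flags,
					  unsigned long win_flags)
	{
		/* A prefetchable window may only host prefetchable BARs... */
		if (win_flags & IORESOURCE_PREFETCH)
			return (bar_flags & IORESOURCE_PREFETCH) != 0;

		/* ...while a non-prefetchable window can host any BAR. */
		return true;
	}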

>
>> Revert these cases to the previous behaviour, adding a new helper
>> function
>> to do so.  This is named pnv_pci_is_m64_flags() to make it clear this
>> function is only looking at resource flags and should not be relied
>> on for
>> non-SRIOV resources.
>> 
>> Fixes: 5958d19a143e ("Fix incorrect PE reservation attempt on some
>> 64-bit BARs")
>> Reported-by: Alexey Kardashevskiy 
>> Signed-off-by: Russell Currey 
>> ---
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 11 +--
>>  1 file changed, 9 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c
>> b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index c16d790..2f25622 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -124,6 +124,13 @@ static inline bool pnv_pci_is_m64(struct pnv_phb
>> *phb, struct resource *r)
>>  r->start < (phb->ioda.m64_base + phb->ioda.m64_size));
>>  }
>>  
>> +static inline bool pnv_pci_is_m64_flags(unsigned long
>> resource_flags)
>> +{
>> +unsigned long flags = (IORESOURCE_MEM_64 |
>> IORESOURCE_PREFETCH);
>> +
>> +return (resource_flags & flags) == flags;
>> +}
>> 
>I don't agree. See below.
>
>>  static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb, int
>> pe_no)
>>  {
>>  phb->ioda.pe_array[pe_no].phb = phb;
>> @@ -2871,7 +2878,7 @@ static void
>> pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>>  res = &pdev->resource[i + PCI_IOV_RESOURCES];
>>  if (!res->flags || res->parent)
>>  continue;
>> -if (!pnv_pci_is_m64(phb, res)) {
>> +if (!pnv_pci_is_m64_flags(res->flags)) {
>>  dev_warn(&pdev->dev, "Don't support SR-IOV
>> with"
>>  " non M64 VF BAR%d: %pR.
>> \n",
>>   i, res);
>
>What is that function actually doing ? Having IORESOURCE_64 and
>PREFETCHABLE is completely orthogonal to being in the M64 region. This
>is the bug my original patch was fixing in fact as it's possible for
>the allocator to put a 64-bit resource in the M32 region.
>

This function is called before the resources are resized and assigned,
so using the resource's start/end addresses to judge whether it's in the
M64 or M32 window is not reliable. Currently, all IOV BARs are required
to have (IORESOURCE_64 | PREFETCHABLE), which is covered by the bridge's
M64 window and the PHB's M64 windows (BARs).

>> @@ -3096,7 +3103,7 @@ static resource_size_t
>> pnv_pci_window_alignment(struct pci_bus *bus,
>>   * alignment for any 64-bit resource, PCIe doesn't care and
>>   * bridges only do 64-bit prefetchable anyway.
>>   */
>> -if (phb->ioda.m64_segsize && (type & IORESOURCE_MEM_64))
>> +if (phb->ioda.m64_segsize && pnv_pci_is_m64_flags(type))
>>  return phb->ioda.m64_segsize;
>
>I disagree similarly. 64-bit non-prefetchable resources should live in
>the M64 space as well.
>

As I understand it, 64-bit non-prefetchable BARs cannot live behind
M64 (64-bit prefetchable) windows.

>>  if (type & IORESOURCE_MEM)
>>  return phb->ioda.m32_segsize;
>
>Something seems to be deeply wrong here and this patch looks to me that
>it's just papering over the problem in way that could bring back the
>bugs I've seen if the generic allocator decides to put things in the
>M32 window.
>
>We need to look at this more closely and understand WTF that code
>intends means to do.
>

Yeah, it seems it partially reverts your changes. The start/end addresses
are usable after resource resizing/assignment is finished. Before that,
we still need to use the flags.

Thanks,
Gavin


>Cheers,
>Ben.
>



Re: [PATCH 0/2] Enable MSR_TM lazily

2016-09-14 Thread Nicholas Piggin
Cc'ing Carlos

On Wed, 14 Sep 2016 18:02:14 +1000
Cyril Bur  wrote:

> Currently the kernel checks to see if the hardware is transactional
> memory capable and always enables the MSR_TM bit. The problem with
> this is that the TM related SPRs become available to userspace,
> requiring them to be switched between processes. It turns out these
> SPRs are expensive to read and write and if a thread doesn't use TM
> (or worse yet isn't even TM aware) then context switching incurs this
> penalty for nothing.
> 
> The solution here is to leave the MSR_TM bit disabled and enable it
> more 'on demand'. Leaving MSR_TM disabled causes a thread to take a
> facility unavailable fault if and when it does decide to use TM. As
> with recent updates to the FPU, VMX and VSX units the MSR_TM bit will
> be enabled upon taking the fault and left on for some time afterwards
> as the assumption is that if a thread used TM once it may well use it
> again. The kernel will turn the MSR_TM bit off after some number of
> context switches of that thread.
> 
> Performance numbers haven't been completely gathered as yet but early
> runs of tools/testing/selftests/powerpc/benchmarks/context_switch
> (which doesn't use TM) yields a jump from ~16 switches per second
> to ~18 switches per second with patch 3/3 applied.

Cool!

Question: glibc when built with lock elision seems like it will
execute tabort. before every syscall, to work around old kernel
behaviour. That's always going to fault TM on, isn't it?

How common is it for glibc to be built with elision?

We should probably be testing PPC_FEATURE2_HTM_NOSC to skip the
tabort.

(BTW, this is a pretty good writeup, would you consider adding
a bit more of it to patch 2 so it gets into the changelog?)

Thanks,
Nick


Re: [V2] powerpc/Kconfig: Update config option based on page size.

2016-09-14 Thread santhosh



Michael Ellerman  writes:


On Fri, 2016-19-02 at 05:38:47 UTC, Rashmica Gupta wrote:

Currently on PPC64 changing kernel pagesize from 4K to 64K leaves
FORCE_MAX_ZONEORDER set to 13 - which produces a compile error.


...

So, update the range of FORCE_MAX_ZONEORDER from 9-64 to 8-9 for 64K pages
and from 13-64 to 9-13 for 4K pages.

Signed-off-by: Rashmica Gupta 
Reviewed-by: Balbir Singh 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a7ee539584acf4a565b7439cea


HPAGE_PMD_ORDER is not something we should check w.r.t. the 4K Linux page
size. We do have the below constraint w.r.t. hugetlb pages

static inline bool hstate_is_gigantic(struct hstate *h)
{
	return huge_page_order(h) >= MAX_ORDER;
}

That requires MAX_ORDER to be greater than 12: the default 16 MB hugepage
is 2^24 bytes, so with a 4K (2^12) base page huge_page_order() is 12, and
it is treated as non-gigantic only while MAX_ORDER is at least 13.

Did we test the hugetlbfs 4K config with this patch? Will it work if we
start marking the hugepage as a gigantic page?

-aneesh


Hello Rashmica,

With the upstream Linux kernel 4.8.0-rc1-6-gbae9cc6 compiled with a 4K
page size, we are not able to set hugepages. Aneesh had a look at the
problem and mentioned that this commit is causing the issue.


*Details:*
We are using a PowerKVM Ubuntu 16.04 guest with the upstream kernel
[4.8.0-rc1-6-gbae9cc6] compiled with a 4K page size.


o/p from guest:
HugePages_Total:   0
HugePages_Free:0
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:  16384 kB

Page sizes from device-tree: [dmesg]
[0.00] base_shift=12: shift=12, sllp=0x, avpnm=0x, 
tlbiel=1, penc=0
[0.00] base_shift=12: shift=24, sllp=0x, avpnm=0x, 
tlbiel=1, penc=56
[0.00] base_shift=24: shift=24, sllp=0x0100, avpnm=0x0001, 
tlbiel=0, penc=0


While trying to configure hugepages inside the guest, it throws the
below error:


echo 100 > /proc/sys/vm/nr_hugepages
-bash: echo: write error: Invalid argument

*Note*: we do not see the problem when the Linux page size is 64K.

Thanks,
Santhosh G



Re: [PATCH] powernv/pci: Fix m64 checks for SR-IOV and window alignment

2016-09-14 Thread Benjamin Herrenschmidt
On Wed, 2016-09-14 at 16:37 +1000, Russell Currey wrote:
> Commit 5958d19a143e checks for prefetchable m64 BARs by comparing the
> addresses instead of using resource flags.  This broke SR-IOV as the
> m64
> check in pnv_pci_ioda_fixup_iov_resources() fails.
> 
> The condition in pnv_pci_window_alignment() also changed to checking
> only IORESOURCE_MEM_64 instead of both IORESOURCE_MEM_64 and
> IORESOURCE_PREFETCH.

CC'ing Gavin who might have some insight in the matter.

Why do we check for prefetch ? On PCIe, any 64-bit BAR can live under a
prefetchable region afaik... Gavin, any idea ?

Also:

> Revert these cases to the previous behaviour, adding a new helper
> function
> to do so.  This is named pnv_pci_is_m64_flags() to make it clear this
> function is only looking at resource flags and should not be relied
> on for
> non-SRIOV resources.
> 
> Fixes: 5958d19a143e ("Fix incorrect PE reservation attempt on some
> 64-bit BARs")
> Reported-by: Alexey Kardashevskiy 
> Signed-off-by: Russell Currey 
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index c16d790..2f25622 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -124,6 +124,13 @@ static inline bool pnv_pci_is_m64(struct pnv_phb
> *phb, struct resource *r)
>   r->start < (phb->ioda.m64_base + phb->ioda.m64_size));
>  }
>  
> +static inline bool pnv_pci_is_m64_flags(unsigned long
> resource_flags)
> +{
> + unsigned long flags = (IORESOURCE_MEM_64 |
> IORESOURCE_PREFETCH);
> +
> + return (resource_flags & flags) == flags;
> +}
> 
I don't agree. See below.

>  static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb, int
> pe_no)
>  {
>   phb->ioda.pe_array[pe_no].phb = phb;
> @@ -2871,7 +2878,7 @@ static void
> pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>   res = &pdev->resource[i + PCI_IOV_RESOURCES];
>   if (!res->flags || res->parent)
>   continue;
> - if (!pnv_pci_is_m64(phb, res)) {
> + if (!pnv_pci_is_m64_flags(res->flags)) {
>   dev_warn(&pdev->dev, "Don't support SR-IOV
> with"
>   " non M64 VF BAR%d: %pR.
> \n",
>    i, res);

What is that function actually doing ? Having IORESOURCE_64 and
PREFETCHABLE is completely orthogonal to being in the M64 region. This
is the bug my original patch was fixing in fact as it's possible for
the allocator to put a 64-bit resource in the M32 region.

> @@ -3096,7 +3103,7 @@ static resource_size_t
> pnv_pci_window_alignment(struct pci_bus *bus,
>    * alignment for any 64-bit resource, PCIe doesn't care and
>    * bridges only do 64-bit prefetchable anyway.
>    */
> - if (phb->ioda.m64_segsize && (type & IORESOURCE_MEM_64))
> + if (phb->ioda.m64_segsize && pnv_pci_is_m64_flags(type))
>   return phb->ioda.m64_segsize;

I disagree similarly. 64-bit non-prefetchable resources should live in
the M64 space as well.

>   if (type & IORESOURCE_MEM)
>   return phb->ioda.m32_segsize;

Something seems to be deeply wrong here and this patch looks to me that
it's just papering over the problem in way that could bring back the
bugs I've seen if the generic allocator decides to put things in the
M32 window.

We need to look at this more closely and understand WTF that code
intends to do.

Cheers,
Ben.



[PATCH 1/2] powerpc: tm: Add TM Unavailable Exception

2016-09-14 Thread Cyril Bur
If the kernel disables transactional memory (TM) and userspace still
tries TM related actions (TM instructions or TM SPR accesses) TM aware
hardware will cause the kernel to take a facility unavailable
exception.

Add checks for the exception being caused by illegal TM access in
userspace.

Signed-off-by: Cyril Bur 
---
 arch/powerpc/kernel/traps.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 0eba74b..cd40130 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1372,6 +1372,13 @@ void vsx_unavailable_exception(struct pt_regs *regs)
 }
 
 #ifdef CONFIG_PPC64
+static void tm_unavailable(struct pt_regs *regs)
+{
+   pr_emerg("Unrecoverable TM Unavailable Exception "
+   "%lx at %lx\n", regs->trap, regs->nip);
+   die("Unrecoverable TM Unavailable Exception", regs, SIGABRT);
+}
+
 void facility_unavailable_exception(struct pt_regs *regs)
 {
static char *facility_strings[] = {
@@ -1451,6 +1458,23 @@ void facility_unavailable_exception(struct pt_regs *regs)
return;
}
 
+   /*
+* TM Unavailable
+*
+* If
+*  - firmware bits say don't do TM or
+*  - CONFIG_PPC_TRANSACTIONAL_MEM was not set and
+*  - hardware is actually TM aware
+* Then userspace can spam the console (even with the use of
+* _ratelimited), just send the SIGILL.
+*/
+   if (status == FSCR_TM_LG) {
+   if (!cpu_has_feature(CPU_FTR_TM))
+   goto out;
+   tm_unavailable(regs);
+   return;
+   }
+
if ((status < ARRAY_SIZE(facility_strings)) &&
facility_strings[status])
facility = facility_strings[status];
@@ -1463,6 +1487,7 @@ void facility_unavailable_exception(struct pt_regs *regs)
"%sFacility '%s' unavailable, exception at 0x%lx, MSR=%lx\n",
hv ? "Hypervisor " : "", facility, regs->nip, regs->msr);
 
+out:
if (user_mode(regs)) {
_exception(SIGILL, regs, ILL_ILLOPC, regs->nip);
return;
-- 
2.9.3



[PATCH 2/2] powerpc: tm: Enable transactional memory (TM) lazily for userspace

2016-09-14 Thread Cyril Bur
Currently the MSR TM bit is always set if the hardware is TM capable.
This adds extra overhead as it means the TM SPRS (TFHAR, TEXASR and
TFAIR) must be swapped for each process regardless of if they use TM.

For processes that don't use TM the TM MSR bit can be turned off
allowing the kernel to avoid the expensive swap of the TM registers.

A TM unavailable exception will occur if a thread does use TM and the
kernel will enable MSR_TM and leave it so for some time afterwards.

Signed-off-by: Cyril Bur 
---
 arch/powerpc/include/asm/processor.h |  1 +
 arch/powerpc/kernel/process.c| 28 +++-
 arch/powerpc/kernel/traps.c  |  9 +
 3 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index b3e0cfc..c07c31b 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -257,6 +257,7 @@ struct thread_struct {
int used_spe;   /* set if process has used spe */
 #endif /* CONFIG_SPE */
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   u8  load_tm;
u64 tm_tfhar;   /* Transaction fail handler addr */
u64 tm_texasr;  /* Transaction exception & summary */
u64 tm_tfiar;   /* Transaction fail instr address reg */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 11f7a64..cd81dd4 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -811,6 +811,12 @@ static inline bool hw_brk_match(struct arch_hw_breakpoint 
*a,
 }
 
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+
+static inline bool tm_enabled(struct task_struct *tsk)
+{
+   return tsk && tsk->thread.regs && (tsk->thread.regs->msr & MSR_TM);
+}
+
 static void tm_reclaim_thread(struct thread_struct *thr,
  struct thread_info *ti, uint8_t cause)
 {
@@ -891,6 +897,9 @@ void tm_recheckpoint(struct thread_struct *thread,
 {
unsigned long flags;
 
+   if (!(thread->regs->msr & MSR_TM))
+   return;
+
/* We really can't be interrupted here as the TEXASR registers can't
 * change and later in the trecheckpoint code, we have a userspace R1.
 * So let's hard disable over this region.
@@ -923,7 +932,7 @@ static inline void tm_recheckpoint_new_task(struct 
task_struct *new)
 * unavailable later, we are unable to determine which set of FP regs
 * need to be restored.
 */
-   if (!new->thread.regs)
+   if (!tm_enabled(new))
return;
 
if (!MSR_TM_ACTIVE(new->thread.regs->msr)){
@@ -954,8 +963,16 @@ static inline void __switch_to_tm(struct task_struct *prev,
struct task_struct *new)
 {
if (cpu_has_feature(CPU_FTR_TM)) {
-   tm_enable();
-   tm_reclaim_task(prev);
+   if (tm_enabled(prev) || tm_enabled(new))
+   tm_enable();
+
+   if (tm_enabled(prev)) {
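+   /*
+    * Note: load_tm is a u8 and is allowed to wrap; once it comes
+    * back around to zero (after 256 context switches) with no
+    * transaction active, MSR_TM is dropped again below.
+    */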
+   prev->thread.load_tm++;
+   tm_reclaim_task(prev);
+   if (!MSR_TM_ACTIVE(prev->thread.regs->msr) && 
prev->thread.load_tm == 0)
+   prev->thread.regs->msr &= ~MSR_TM;
+   }
+
tm_recheckpoint_new_task(new);
}
 }
@@ -1392,6 +1409,9 @@ int arch_dup_task_struct(struct task_struct *dst, struct 
task_struct *src)
 * transitions the CPU out of TM mode.  Hence we need to call
 * tm_recheckpoint_new_task() (on the same task) to restore the
 * checkpointed state back and the TM mode.
+*
+* Can't pass dst because it isn't ready. Doesn't matter, passing
+* dst is only important for __switch_to()
 */
__switch_to_tm(src, src);
 
@@ -1635,8 +1655,6 @@ void start_thread(struct pt_regs *regs, unsigned long 
start, unsigned long sp)
current->thread.used_spe = 0;
 #endif /* CONFIG_SPE */
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-   if (cpu_has_feature(CPU_FTR_TM))
-   regs->msr |= MSR_TM;
current->thread.tm_tfhar = 0;
current->thread.tm_texasr = 0;
current->thread.tm_tfiar = 0;
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index cd40130..9bb3895 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1374,6 +1374,15 @@ void vsx_unavailable_exception(struct pt_regs *regs)
 #ifdef CONFIG_PPC64
 static void tm_unavailable(struct pt_regs *regs)
 {
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   if (user_mode(regs)) {
+   current->thread.load_tm++;
+   regs->msr |= MSR_TM;
+   tm_enable();
+   tm_restore_sprs(&current->thread);
+   return;
+   }
+#endif
pr_emerg("Unrecoverable TM Unavailable Exception "
"%lx at %lx\n", regs->trap, regs->nip);
  

[PATCH 0/2] Enable MSR_TM lazily

2016-09-14 Thread Cyril Bur
Currently the kernel checks to see if the hardware is transactional
memory capable and always enables the MSR_TM bit. The problem with
this is that the TM related SPRs become available to userspace,
requiring them to be switched between processes. It turns out these
SPRs are expensive to read and write and if a thread doesn't use TM
(or worse yet isn't even TM aware) then context switching incurs this
penalty for nothing.

The solution here is to leave the MSR_TM bit disabled and enable it
more 'on demand'. Leaving MSR_TM disabled causes a thread to take a
facility unavailable fault if and when it does decide to use TM. As
with recent updates to the FPU, VMX and VSX units the MSR_TM bit will
be enabled upon taking the fault and left on for some time afterwards
as the assumption is that if a thread used TM once it may well use it
again. The kernel will turn the MSR_TM bit off after some number of
context switches of that thread.

Performance numbers haven't been completely gathered as yet but early
runs of tools/testing/selftests/powerpc/benchmarks/context_switch
(which doesn't use TM) yields a jump from ~16 switches per second
to ~18 switches per second with patch 3/3 applied.

These patches will need to be applied on top of my recent rework of
TM: http://patchwork.ozlabs.org/patch/666094/ 
I have pushed a branch to github to help with reviews:
https://github.com/cyrilbur-ibm/linux/tree/tm_lazy_v1

Changes since RFC:
Fixed my inability to use & and | correctly.
 - Spotted by Laurent Dufour, thanks
Dropped the selftest patch as it was merged as part of a previous
  series

Cyril Bur (2):
  powerpc: tm: Add TM Unavailable Exception
  powerpc: tm: Enable transactional memory (TM) lazily for userspace

 arch/powerpc/include/asm/processor.h |  1 +
 arch/powerpc/kernel/process.c| 28 +++-
 arch/powerpc/kernel/traps.c  | 34 ++
 3 files changed, 58 insertions(+), 5 deletions(-)

-- 
2.9.3



Re: [PATCH] powernv/pci: Fix m64 checks for SR-IOV and window alignment

2016-09-14 Thread Alexey Kardashevskiy
On 14/09/16 16:37, Russell Currey wrote:
> Commit 5958d19a143e checks for prefetchable m64 BARs by comparing the
> addresses instead of using resource flags.  This broke SR-IOV as the m64
> check in pnv_pci_ioda_fixup_iov_resources() fails.
> 
> The condition in pnv_pci_window_alignment() also changed to checking
> only IORESOURCE_MEM_64 instead of both IORESOURCE_MEM_64 and
> IORESOURCE_PREFETCH.
> 
> Revert these cases to the previous behaviour, adding a new helper function
> to do so.  This is named pnv_pci_is_m64_flags() to make it clear this
> function is only looking at resource flags and should not be relied on for
> non-SR-IOV resources.
> 
> Fixes: 5958d19a143e ("Fix incorrect PE reservation attempt on some 64-bit 
> BARs")
> Reported-by: Alexey Kardashevskiy 
> Signed-off-by: Russell Currey 

Tested-by: Alexey Kardashevskiy 


> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index c16d790..2f25622 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -124,6 +124,13 @@ static inline bool pnv_pci_is_m64(struct pnv_phb *phb, 
> struct resource *r)
>   r->start < (phb->ioda.m64_base + phb->ioda.m64_size));
>  }
>  
> +static inline bool pnv_pci_is_m64_flags(unsigned long resource_flags)
> +{
> + unsigned long flags = (IORESOURCE_MEM_64 | IORESOURCE_PREFETCH);
> +
> + return (resource_flags & flags) == flags;
> +}
> +
>  static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb, int pe_no)
>  {
>   phb->ioda.pe_array[pe_no].phb = phb;
> @@ -2871,7 +2878,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
> pci_dev *pdev)
>   res = &pdev->resource[i + PCI_IOV_RESOURCES];
>   if (!res->flags || res->parent)
>   continue;
> - if (!pnv_pci_is_m64(phb, res)) {
> + if (!pnv_pci_is_m64_flags(res->flags)) {
>   dev_warn(&pdev->dev, "Don't support SR-IOV with"
>   " non M64 VF BAR%d: %pR. \n",
>i, res);
> @@ -3096,7 +3103,7 @@ static resource_size_t pnv_pci_window_alignment(struct 
> pci_bus *bus,
>* alignment for any 64-bit resource, PCIe doesn't care and
>* bridges only do 64-bit prefetchable anyway.
>*/
> - if (phb->ioda.m64_segsize && (type & IORESOURCE_MEM_64))
> + if (phb->ioda.m64_segsize && pnv_pci_is_m64_flags(type))
>   return phb->ioda.m64_segsize;
>   if (type & IORESOURCE_MEM)
>   return phb->ioda.m32_segsize;
> 


-- 
Alexey


Re: [PATCH v14 00/15] selftests/powerpc: Add ptrace tests for ppc registers

2016-09-14 Thread Michael Ellerman
Cyril Bur  writes:
> It's messy, but I think the accepted solution for kselftests is to do:
>
> #include "../../../../../usr/include/linux/elf.h"
>
> which I believe will get the headers generated for the target by `make
> headers_install` and therefore should match that for which the
> kselftests are being compiled.

Don't put the path in the include line though, add it to CFLAGS.

See eg. tools/testing/selftests/powerpc/tm/Makefile:

tm-syscall: CFLAGS += -I../../../../../usr/include
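
A complete (hypothetical) Makefile for a new test following that
pattern might look like the sketch below; "my-test" is a placeholder
name and the rules are plain make, not the actual selftest harness:

PROGS := my-test

all: $(PROGS)

# Pick up the uapi headers exported by 'make headers_install' so the
# test is compiled against the target kernel's ABI, not the host's.
my-test: CFLAGS += -I../../../../../usr/include

clean:
	rm -f $(PROGS)

The target-specific CFLAGS assignment keeps the include path scoped to
the one test that needs the exported headers.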


cheers


[PATCH] powernv/pci: Fix m64 checks for SR-IOV and window alignment

2016-09-14 Thread Russell Currey
Commit 5958d19a143e checks for prefetchable m64 BARs by comparing the
addresses instead of using resource flags.  This broke SR-IOV as the m64
check in pnv_pci_ioda_fixup_iov_resources() fails.

The condition in pnv_pci_window_alignment() was also changed to check
only IORESOURCE_MEM_64 instead of both IORESOURCE_MEM_64 and
IORESOURCE_PREFETCH.

Revert these cases to the previous behaviour, adding a new helper function
to do so.  This is named pnv_pci_is_m64_flags() to make it clear this
function is only looking at resource flags and should not be relied on for
non-SR-IOV resources.

Fixes: 5958d19a143e ("Fix incorrect PE reservation attempt on some 64-bit BARs")
Reported-by: Alexey Kardashevskiy 
Signed-off-by: Russell Currey 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index c16d790..2f25622 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -124,6 +124,13 @@ static inline bool pnv_pci_is_m64(struct pnv_phb *phb, 
struct resource *r)
r->start < (phb->ioda.m64_base + phb->ioda.m64_size));
 }
 
+static inline bool pnv_pci_is_m64_flags(unsigned long resource_flags)
+{
+   unsigned long flags = (IORESOURCE_MEM_64 | IORESOURCE_PREFETCH);
+
+   return (resource_flags & flags) == flags;
+}
+
 static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb, int pe_no)
 {
phb->ioda.pe_array[pe_no].phb = phb;
@@ -2871,7 +2878,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
res = &pdev->resource[i + PCI_IOV_RESOURCES];
if (!res->flags || res->parent)
continue;
-   if (!pnv_pci_is_m64(phb, res)) {
+   if (!pnv_pci_is_m64_flags(res->flags)) {
dev_warn(&pdev->dev, "Don't support SR-IOV with"
" non M64 VF BAR%d: %pR. \n",
 i, res);
@@ -3096,7 +3103,7 @@ static resource_size_t pnv_pci_window_alignment(struct 
pci_bus *bus,
 * alignment for any 64-bit resource, PCIe doesn't care and
 * bridges only do 64-bit prefetchable anyway.
 */
-   if (phb->ioda.m64_segsize && (type & IORESOURCE_MEM_64))
+   if (phb->ioda.m64_segsize && pnv_pci_is_m64_flags(type))
return phb->ioda.m64_segsize;
if (type & IORESOURCE_MEM)
return phb->ioda.m32_segsize;
-- 
2.9.3
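
To make the new helper's semantics concrete, here is a standalone
sketch (not kernel code; the flag values are stand-ins for the real
include/linux/ioport.h definitions). A 64-bit but non-prefetchable
resource passes the old bare IORESOURCE_MEM_64 test yet fails
pnv_pci_is_m64_flags():

#include <stdbool.h>
#include <stdio.h>

/* Placeholder values; the real flags live in include/linux/ioport.h. */
#define IORESOURCE_PREFETCH 0x00002000
#define IORESOURCE_MEM_64   0x00100000

/* Same shape as the helper the patch adds. */
static bool pnv_pci_is_m64_flags(unsigned long resource_flags)
{
	unsigned long flags = (IORESOURCE_MEM_64 | IORESOURCE_PREFETCH);

	return (resource_flags & flags) == flags;
}

int main(void)
{
	unsigned long plain_64bit = IORESOURCE_MEM_64;
	unsigned long pref_64bit  = IORESOURCE_MEM_64 | IORESOURCE_PREFETCH;

	/* The old check accepted any 64-bit resource... */
	printf("old check: %d %d\n",
	       !!(plain_64bit & IORESOURCE_MEM_64),
	       !!(pref_64bit & IORESOURCE_MEM_64));
	/* ...the helper requires prefetchable as well. */
	printf("new check: %d %d\n",
	       pnv_pci_is_m64_flags(plain_64bit),
	       pnv_pci_is_m64_flags(pref_64bit));
	return 0;
}

Requiring both bits is what restores the pre-5958d19a143e behaviour
for the SR-IOV and window alignment paths.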