[PATCH] cxl: Fix leaking pid refs in some error paths

2016-10-20 Thread Vaibhav Jain
In some error paths in cxl_start_context() and afu_ioctl_start_work(),
the pid references taken on the current and group-leader tasks can leak.
This patch fixes these error paths to release those pid references before
exiting.
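
For illustration only, a minimal sketch (not the patch itself) of the
balanced take/release pattern the fix enforces on the error paths:

	/* Hypothetical sketch; names mirror the cxl context fields. */
	ctx->pid = get_task_pid(current, PIDTYPE_PID);
	ctx->glpid = get_task_pid(current->group_leader, PIDTYPE_PID);

	rc = do_attach(ctx);	/* stands in for cxl_ops->attach_process() */
	if (rc) {
		/* drop both references taken above before bailing out */
		put_pid(ctx->glpid);
		put_pid(ctx->pid);
		ctx->glpid = ctx->pid = NULL;
		goto out;
	}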

This patch is based on earlier patch "cxl: Prevent adapter reset
if an active context exists" at
https://patchwork.ozlabs.org/patch/682187/

Fixes: 7b8ad495 ("cxl: Fix DSI misses when the context owning task exits")
Reported-by: Frederic Barrat 
Signed-off-by: Vaibhav Jain 
---
 drivers/misc/cxl/api.c  |  2 ++
 drivers/misc/cxl/file.c | 22 +++++++++++++---------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index af23d7d..2e5233b 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -247,7 +247,9 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
cxl_ctx_get();
 
if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) {
+   put_pid(ctx->glpid);
put_pid(ctx->pid);
+   ctx->glpid = ctx->pid = NULL;
cxl_adapter_context_put(ctx->afu->adapter);
cxl_ctx_put();
goto out;
diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
index d0b421f..77080cc 100644
--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -194,6 +194,16 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
ctx->mmio_err_ff = !!(work.flags & CXL_START_WORK_ERR_FF);
 
/*
+* Increment the mapped context count for adapter. This also checks
+* if adapter_context_lock is taken.
+*/
+   rc = cxl_adapter_context_get(ctx->afu->adapter);
+   if (rc) {
+   afu_release_irqs(ctx, ctx);
+   goto out;
+   }
+
+   /*
 * We grab the PID here and not in the file open to allow for the case
 * where a process (master, some daemon, etc) has opened the chardev on
 * behalf of another process, so the AFU's mm gets bound to the process
@@ -205,15 +215,6 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
ctx->pid = get_task_pid(current, PIDTYPE_PID);
ctx->glpid = get_task_pid(current->group_leader, PIDTYPE_PID);
 
-   /*
-* Increment the mapped context count for adapter. This also checks
-* if adapter_context_lock is taken.
-*/
-   rc = cxl_adapter_context_get(ctx->afu->adapter);
-   if (rc) {
-   afu_release_irqs(ctx, ctx);
-   goto out;
-   }
 
trace_cxl_attach(ctx, work.work_element_descriptor, 
work.num_interrupts, amr);
 
@@ -221,6 +222,9 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
amr))) {
afu_release_irqs(ctx, ctx);
cxl_adapter_context_put(ctx->afu->adapter);
+   put_pid(ctx->glpid);
+   put_pid(ctx->pid);
+   ctx->glpid = ctx->pid = NULL;
goto out;
}
 
-- 
2.7.4



Re: [PATCH v5 0/9] implement vcpu preempted check

2016-10-20 Thread Peter Zijlstra
On Thu, Oct 20, 2016 at 05:27:45PM -0400, Pan Xinhui wrote:

> 
> This patch set aims to fix lock holder preemption issues.

Thanks, this looks very good. I'll wait for ACKs from at least the KVM
people, since that was I think the most contentious patch.
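
As a rough illustration of the idea (not code from this series), a spin
loop can stop busy-waiting once the lock holder's vcpu has been preempted;
the lock structure and owner_cpu field below are hypothetical:

	static void spin_on_owner(struct example_lock *lock)
	{
		int owner_cpu = READ_ONCE(lock->owner_cpu);	/* hypothetical */

		while (READ_ONCE(lock->locked)) {
			/* Holder lost its physical cpu: yield rather than
			 * keep burning cycles in the busy-wait loop. */
			if (vcpu_is_preempted(owner_cpu))
				break;
			cpu_relax();
		}
	}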


Re: [PATCH v5 7/9] x86, xen: support vcpu preempted check

2016-10-20 Thread Juergen Gross
Corrected xen-devel mailing list address, added other Xen maintainers

On 20/10/16 23:27, Pan Xinhui wrote:
> From: Juergen Gross 
> 
> Support the vcpu_is_preempted() functionality under Xen. This will
> enhance lock performance on overcommitted hosts (more runnable vcpus
> than physical cpus in the system) as doing busy waits for preempted
> vcpus will hurt system performance far worse than early yielding.
> 
> A quick test (4 vcpus on 1 physical cpu doing a parallel build job
> with "make -j 8") reduced system time by about 5% with this patch.
> 
> Signed-off-by: Juergen Gross 
> Signed-off-by: Pan Xinhui 
> ---
>  arch/x86/xen/spinlock.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
> index 3d6e006..74756bb 100644
> --- a/arch/x86/xen/spinlock.c
> +++ b/arch/x86/xen/spinlock.c
> @@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
>   per_cpu(irq_name, cpu) = NULL;
>  }
>  
> -
>  /*
>   * Our init of PV spinlocks is split in two init functions due to us
>   * using paravirt patching and jump labels patching and having to do
> @@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
>   pv_lock_ops.queued_spin_unlock = 
> PV_CALLEE_SAVE(__pv_queued_spin_unlock);
>   pv_lock_ops.wait = xen_qlock_wait;
>   pv_lock_ops.kick = xen_qlock_kick;
> +
> + pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
>  }
>  
>  /*
> 



Re: [PATCH] powerpc/book3s64: Always build for power4 or later

2016-10-20 Thread Balbir Singh


On 21/10/16 11:01, Michael Ellerman wrote:
> When we're not compiling for a specific CPU, ie. none of the
> CONFIG_POWERx_CPU options are set, and CONFIG_GENERIC_CPU *is* set, we
> currently don't pass any -mcpu option to the compiler. This means the
> compiler builds for a "generic" Power CPU.
> 
> But back in 2014 we dropped support for pre power4 CPUs in commit
> 468a33028edd ("powerpc: Drop support for pre-POWER4 cpus").
> 
> Given that, there's no point in building the kernel to run on pre power4
> cpus. So update the flags we pass to the compiler when
> CONFIG_GENERIC_CPU is set, to specify -mcpu=power4.
> 
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/Makefile | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index 617dece67924..041fda1e2a5d 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -121,6 +121,7 @@ CFLAGS-$(CONFIG_PPC32):= -ffixed-r2 $(MULTIPLEWORD)
>  
>  ifeq ($(CONFIG_PPC_BOOK3S_64),y)
>  CFLAGS-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=power7,-mtune=power4)
> +CFLAGS-$(CONFIG_GENERIC_CPU) += -mcpu=power4
>  else
>  CFLAGS-$(CONFIG_GENERIC_CPU) += -mcpu=powerpc64
>  endif
> 

Acked-by: Balbir Singh 


[PATCH v6 10/10] ima: platform-independent hash value

2016-10-20 Thread Thiago Jung Bauermann
From: Andreas Steffen 

For remote attestation it is important for the IMA measurement values
to be platform-independent. Therefore integer fields to be hashed
must be converted to canonical format.

Changelog:
- Define canonical format as little endian (Mimi)

Signed-off-by: Andreas Steffen 
Signed-off-by: Mimi Zohar 
---
 security/integrity/ima/ima_crypto.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_crypto.c 
b/security/integrity/ima/ima_crypto.c
index 38f2ed830dd6..802d5d20f36f 100644
--- a/security/integrity/ima/ima_crypto.c
+++ b/security/integrity/ima/ima_crypto.c
@@ -477,11 +477,13 @@ static int ima_calc_field_array_hash_tfm(struct 
ima_field_data *field_data,
u8 buffer[IMA_EVENT_NAME_LEN_MAX + 1] = { 0 };
u8 *data_to_hash = field_data[i].data;
u32 datalen = field_data[i].len;
+   u32 datalen_to_hash =
+   !ima_canonical_fmt ? datalen : cpu_to_le32(datalen);
 
if (strcmp(td->name, IMA_TEMPLATE_IMA_NAME) != 0) {
rc = crypto_shash_update(shash,
-   (const u8 *) &field_data[i].len,
-   sizeof(field_data[i].len));
+   (const u8 *) &datalen_to_hash,
+   sizeof(datalen_to_hash));
if (rc)
break;
} else if (strcmp(td->fields[i]->field_id, "n") == 0) {
-- 
2.7.4



[PATCH v6 09/10] ima: define a canonical binary_runtime_measurements list format

2016-10-20 Thread Thiago Jung Bauermann
From: Mimi Zohar 

The IMA binary_runtime_measurements list is currently in platform native
format.

To allow restoring a measurement list carried across kexec with a
different endianness than the targeted kernel, this patch defines
little-endian as the canonical format.  For big endian systems wanting
to save/restore the measurement list from a system with a different
endianness, a new boot command line parameter named "ima_canonical_fmt"
is defined.

Considerations: use of the "ima_canonical_fmt" boot command line
option will break existing userspace applications on big endian systems
expecting the binary_runtime_measurements list to be in platform native
format.
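
For example, on a big endian system the option is simply appended to the
kernel command line (illustrative fragment, the root= value is a
placeholder):

	root=/dev/sda2 ro quiet ima_canonical_fmt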

Changelog v3:
- restore PCR value properly

Signed-off-by: Mimi Zohar 
---
 Documentation/kernel-parameters.txt   |  4 
 security/integrity/ima/ima.h  |  6 ++
 security/integrity/ima/ima_fs.c   | 28 +---
 security/integrity/ima/ima_kexec.c| 11 +--
 security/integrity/ima/ima_template.c | 24 ++--
 security/integrity/ima/ima_template_lib.c |  7 +--
 6 files changed, 67 insertions(+), 13 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 37babf91f2cb..3ee81afad7e9 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1641,6 +1641,10 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
The builtin appraise policy appraises all files
owned by uid=0.
 
+   ima_canonical_fmt [IMA]
+   Use the canonical format for the binary runtime
+   measurements, instead of host native format.
+
ima_hash=   [IMA]
Format: { md5 | sha1 | rmd160 | sha256 | sha384
   | sha512 | ... }
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 6b0540ad189f..5e6180a4da7d 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -122,6 +122,12 @@ void ima_load_kexec_buffer(void);
 static inline void ima_load_kexec_buffer(void) {}
 #endif /* CONFIG_HAVE_IMA_KEXEC */
 
+/*
+ * The default binary_runtime_measurements list format is defined as the
+ * platform native format.  The canonical format is defined as little-endian.
+ */
+extern bool ima_canonical_fmt;
+
 /* Internal IMA function definitions */
 int ima_init(void);
 int ima_fs_init(void);
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index 66e5dd5e226f..2bcad99d434e 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -28,6 +28,16 @@
 
 static DEFINE_MUTEX(ima_write_mutex);
 
+bool ima_canonical_fmt;
+static int __init default_canonical_fmt_setup(char *str)
+{
+#ifdef __BIG_ENDIAN
+   ima_canonical_fmt = 1;
+#endif
+   return 1;
+}
+__setup("ima_canonical_fmt", default_canonical_fmt_setup);
+
 static int valid_policy = 1;
 #define TMPBUFLEN 12
 static ssize_t ima_show_htable_value(char __user *buf, size_t count,
@@ -122,7 +132,7 @@ int ima_measurements_show(struct seq_file *m, void *v)
struct ima_queue_entry *qe = v;
struct ima_template_entry *e;
char *template_name;
-   int namelen;
+   u32 pcr, namelen, template_data_len; /* temporary fields */
bool is_ima_template = false;
int i;
 
@@ -139,25 +149,29 @@ int ima_measurements_show(struct seq_file *m, void *v)
 * PCR used defaults to the same (config option) in
 * little-endian format, unless set in policy
 */
-   ima_putc(m, &e->pcr, sizeof(e->pcr));
+   pcr = !ima_canonical_fmt ? e->pcr : cpu_to_le32(e->pcr);
+   ima_putc(m, &pcr, sizeof(e->pcr));
 
/* 2nd: template digest */
ima_putc(m, e->digest, TPM_DIGEST_SIZE);
 
/* 3rd: template name size */
-   namelen = strlen(template_name);
+   namelen = !ima_canonical_fmt ? strlen(template_name) :
+   cpu_to_le32(strlen(template_name));
ima_putc(m, &namelen, sizeof(namelen));
 
/* 4th:  template name */
-   ima_putc(m, template_name, namelen);
+   ima_putc(m, template_name, strlen(template_name));
 
/* 5th:  template length (except for 'ima' template) */
if (strcmp(template_name, IMA_TEMPLATE_IMA_NAME) == 0)
is_ima_template = true;
 
-   if (!is_ima_template)
-   ima_putc(m, &e->template_data_len,
-sizeof(e->template_data_len));
+   if (!is_ima_template) {
+   template_data_len = !ima_canonical_fmt ? e->template_data_len :
+   cpu_to_le32(e->template_data_len);
+   ima_putc(m, &template_data_len, sizeof(e->template_data_len));
+   }
 
/* 6th:  template specific data */
for (i = 0; i < e->template_desc->num_fields; i++) {

[PATCH v6 08/10] ima: support restoring multiple template formats

2016-10-20 Thread Thiago Jung Bauermann
From: Mimi Zohar 

The configured IMA measurement list template format can be replaced at
runtime on the boot command line, including a custom template format.
This patch adds support for restoring a measurement list containing
multiple builtin/custom template formats.

Signed-off-by: Mimi Zohar 
---
 security/integrity/ima/ima_template.c | 53 +--
 1 file changed, 50 insertions(+), 3 deletions(-)

diff --git a/security/integrity/ima/ima_template.c 
b/security/integrity/ima/ima_template.c
index c0d808c20c40..e57b4682ff93 100644
--- a/security/integrity/ima/ima_template.c
+++ b/security/integrity/ima/ima_template.c
@@ -155,9 +155,14 @@ static int template_desc_init_fields(const char 
*template_fmt,
 {
const char *template_fmt_ptr;
struct ima_template_field *found_fields[IMA_TEMPLATE_NUM_FIELDS_MAX];
-   int template_num_fields = template_fmt_size(template_fmt);
+   int template_num_fields;
int i, len;
 
+   if (num_fields && *num_fields > 0) /* already initialized? */
+   return 0;
+
+   template_num_fields = template_fmt_size(template_fmt);
+
if (template_num_fields > IMA_TEMPLATE_NUM_FIELDS_MAX) {
pr_err("format string '%s' contains too many fields\n",
   template_fmt);
@@ -237,6 +242,35 @@ int __init ima_init_template(void)
return result;
 }
 
+static struct ima_template_desc *restore_template_fmt(char *template_name)
+{
+   struct ima_template_desc *template_desc = NULL;
+   int ret;
+
+   ret = template_desc_init_fields(template_name, NULL, NULL);
+   if (ret < 0) {
+   pr_err("attempting to initialize the template \"%s\" failed\n",
+   template_name);
+   goto out;
+   }
+
+   template_desc = kzalloc(sizeof(*template_desc), GFP_KERNEL);
+   if (!template_desc)
+   goto out;
+
+   template_desc->name = "";
+   template_desc->fmt = kstrdup(template_name, GFP_KERNEL);
+   if (!template_desc->fmt)
+   goto out;
+
+   spin_lock(&template_list);
+   list_add_tail_rcu(&template_desc->list, &defined_templates);
+   spin_unlock(&template_list);
+   synchronize_rcu();
+out:
+   return template_desc;
+}
+
 static int ima_restore_template_data(struct ima_template_desc *template_desc,
 void *template_data,
 int template_data_size,
@@ -367,10 +401,23 @@ int ima_restore_measurement_list(loff_t size, void *buf)
}
data_v1 = bufp += (u_int8_t)hdr_v1->template_name_len;
 
-   /* get template format */
template_desc = lookup_template_desc(template_name);
if (!template_desc) {
-   pr_err("template \"%s\" not found\n", template_name);
+   template_desc = restore_template_fmt(template_name);
+   if (!template_desc)
+   break;
+   }
+
+   /*
+* Only the running system's template format is initialized
+* on boot.  As needed, initialize the other template formats.
+*/
+   ret = template_desc_init_fields(template_desc->fmt,
+   &(template_desc->fields),
+   &(template_desc->num_fields));
+   if (ret < 0) {
+   pr_err("attempting to restore the template fmt \"%s\" \
+   failed\n", template_desc->fmt);
ret = -EINVAL;
break;
}
-- 
2.7.4



[PATCH v6 07/10] ima: store the builtin/custom template definitions in a list

2016-10-20 Thread Thiago Jung Bauermann
From: Mimi Zohar 

The builtin and single custom templates are currently stored in an
array.  In preparation for being able to restore a measurement list
containing multiple builtin/custom templates, this patch stores the
builtin and custom templates as a linked list.  This will permit
defining more than one custom template per boot.
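
For reference, a custom format is selected with the existing
ima_template_fmt= boot parameter; an illustrative example built from the
already supported field identifiers:

	ima_template_fmt=n-ng|d-ng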

Changelog v4:
- fix "spinlock bad magic" BUG - reported by Dmitry Vyukov

Changelog v3:
- initialize template format list in ima_template_desc_current(), as it
might be called during __setup before normal initialization. (kernel
test robot)
- remove __init annotation of ima_init_template_list()

Changelog v2:
- fix lookup_template_desc() preemption imbalance (kernel test robot)

Signed-off-by: Mimi Zohar 
---
 security/integrity/ima/ima.h  |  2 ++
 security/integrity/ima/ima_main.c |  1 +
 security/integrity/ima/ima_template.c | 52 +++
 3 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 139dec67dcbf..6b0540ad189f 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -85,6 +85,7 @@ struct ima_template_field {
 
 /* IMA template descriptor definition */
 struct ima_template_desc {
+   struct list_head list;
char *name;
char *fmt;
int num_fields;
@@ -146,6 +147,7 @@ int ima_restore_measurement_list(loff_t bufsize, void *buf);
 int ima_measurements_show(struct seq_file *m, void *v);
 unsigned long ima_get_binary_runtime_size(void);
 int ima_init_template(void);
+void ima_init_template_list(void);
 
 /*
  * used to protect h_table and sha_table
diff --git a/security/integrity/ima/ima_main.c 
b/security/integrity/ima/ima_main.c
index 423d111b3b94..50818c60538b 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -418,6 +418,7 @@ static int __init init_ima(void)
 {
int error;
 
+   ima_init_template_list();
hash_setup(CONFIG_IMA_DEFAULT_HASH);
error = ima_init();
if (!error) {
diff --git a/security/integrity/ima/ima_template.c 
b/security/integrity/ima/ima_template.c
index 37f972cb05fe..c0d808c20c40 100644
--- a/security/integrity/ima/ima_template.c
+++ b/security/integrity/ima/ima_template.c
@@ -15,16 +15,20 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include 
 #include "ima.h"
 #include "ima_template_lib.h"
 
-static struct ima_template_desc defined_templates[] = {
+static struct ima_template_desc builtin_templates[] = {
{.name = IMA_TEMPLATE_IMA_NAME, .fmt = IMA_TEMPLATE_IMA_FMT},
{.name = "ima-ng", .fmt = "d-ng|n-ng"},
{.name = "ima-sig", .fmt = "d-ng|n-ng|sig"},
{.name = "", .fmt = ""},/* placeholder for a custom format */
 };
 
+static LIST_HEAD(defined_templates);
+static DEFINE_SPINLOCK(template_list);
+
 static struct ima_template_field supported_fields[] = {
{.field_id = "d", .field_init = ima_eventdigest_init,
 .field_show = ima_show_template_digest},
@@ -53,6 +57,8 @@ static int __init ima_template_setup(char *str)
if (ima_template)
return 1;
 
+   ima_init_template_list();
+
/*
 * Verify that a template with the supplied name exists.
 * If not, use CONFIG_IMA_DEFAULT_TEMPLATE.
@@ -81,7 +87,7 @@ __setup("ima_template=", ima_template_setup);
 
 static int __init ima_template_fmt_setup(char *str)
 {
-   int num_templates = ARRAY_SIZE(defined_templates);
+   int num_templates = ARRAY_SIZE(builtin_templates);
 
if (ima_template)
return 1;
@@ -92,22 +98,28 @@ static int __init ima_template_fmt_setup(char *str)
return 1;
}
 
-   defined_templates[num_templates - 1].fmt = str;
-   ima_template = defined_templates + num_templates - 1;
+   builtin_templates[num_templates - 1].fmt = str;
+   ima_template = builtin_templates + num_templates - 1;
+
return 1;
 }
 __setup("ima_template_fmt=", ima_template_fmt_setup);
 
 static struct ima_template_desc *lookup_template_desc(const char *name)
 {
-   int i;
+   struct ima_template_desc *template_desc;
+   int found = 0;
 
-   for (i = 0; i < ARRAY_SIZE(defined_templates); i++) {
-   if (strcmp(defined_templates[i].name, name) == 0)
-   return defined_templates + i;
+   rcu_read_lock();
+   list_for_each_entry_rcu(template_desc, &defined_templates, list) {
+   if ((strcmp(template_desc->name, name) == 0) ||
+   (strcmp(template_desc->fmt, name) == 0)) {
+   found = 1;
+   break;
+   }
}
-
-   return NULL;
+   rcu_read_unlock();
+   return found ? template_desc : NULL;
 }
 
 static struct ima_template_field *lookup_template_field(const char *field_id)
@@ -183,11 +195,29 @@ static int 

[PATCH v6 06/10] ima: on soft reboot, save the measurement list

2016-10-20 Thread Thiago Jung Bauermann
From: Mimi Zohar 

The TPM PCRs are only reset on a hard reboot.  In order to validate a
TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
of the running kernel must be saved and restored on boot.

This patch uses the kexec buffer passing mechanism to pass the
serialized IMA binary_runtime_measurements to the next kernel.

Changelog v5:
- move writing the IMA measurement list to kexec load and remove
  from kexec execute.
- remove registering notifier to call update on kexec execute
- add includes needed by code in this patch to ima_kexec.c (Thiago)
- fold patch "ima: serialize the binary_runtime_measurements"
into this patch.

Changelog v4:
- Revert the skip_checksum change.  Instead calculate the checksum
with the measurement list segment, on update validate the existing
checksum before re-calculating a new checksum with the updated
measurement list.

Changelog v3:
- Request a kexec segment for storing the measurement list that is half a
page, not a full page, larger than currently needed, to leave room for
additional measurements.
- Added binary_runtime_size overflow test
- Limit maximum number of pages needed for kexec_segment_size to half
of totalram_pages. (Dave Young)

Changelog v2:
- Fix build issue by defining a stub ima_add_kexec_buffer and stub
  struct kimage when CONFIG_IMA=n and CONFIG_IMA_KEXEC=n. (Fenguang Wu)
- removed kexec_add_handover_buffer() checksum argument.
- added skip_checksum member to kexec_buf
- only register reboot notifier once

Changelog v1:
- updated to call IMA functions  (Mimi)
- move code from ima_template.c to ima_kexec.c (Mimi)

Signed-off-by: Thiago Jung Bauermann 
Signed-off-by: Mimi Zohar 
Acked-by: "Eric W. Biederman" 
---
 include/linux/ima.h|  12 
 kernel/kexec_file.c|   4 ++
 security/integrity/ima/ima.h   |   1 +
 security/integrity/ima/ima_fs.c|   2 +-
 security/integrity/ima/ima_kexec.c | 117 +
 5 files changed, 135 insertions(+), 1 deletion(-)

diff --git a/include/linux/ima.h b/include/linux/ima.h
index 0eb7c2e7f0d6..7f6952f8d6aa 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -11,6 +11,7 @@
 #define _LINUX_IMA_H
 
 #include 
+#include 
 struct linux_binprm;
 
 #ifdef CONFIG_IMA
@@ -23,6 +24,10 @@ extern int ima_post_read_file(struct file *file, void *buf, 
loff_t size,
  enum kernel_read_file_id id);
 extern void ima_post_path_mknod(struct dentry *dentry);
 
+#ifdef CONFIG_IMA_KEXEC
+extern void ima_add_kexec_buffer(struct kimage *image);
+#endif
+
 #else
 static inline int ima_bprm_check(struct linux_binprm *bprm)
 {
@@ -62,6 +67,13 @@ static inline void ima_post_path_mknod(struct dentry *dentry)
 
 #endif /* CONFIG_IMA */
 
+#ifndef CONFIG_IMA_KEXEC
+struct kimage;
+
+static inline void ima_add_kexec_buffer(struct kimage *image)
+{}
+#endif
+
 #ifdef CONFIG_IMA_APPRAISE
 extern void ima_inode_post_setattr(struct dentry *dentry);
 extern int ima_inode_setxattr(struct dentry *dentry, const char *xattr_name,
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 0c2df7f73792..b56a558e406d 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -132,6 +133,9 @@ kimage_file_prepare_segments(struct kimage *image, int 
kernel_fd, int initrd_fd,
return ret;
image->kernel_buf_len = size;
 
+   /* IMA needs to pass the measurement list to the next kernel. */
+   ima_add_kexec_buffer(image);
+
/* Call arch image probe handlers */
ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
image->kernel_buf_len);
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index ea1dcc452911..139dec67dcbf 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -143,6 +143,7 @@ void ima_print_digest(struct seq_file *m, u8 *digest, u32 
size);
 struct ima_template_desc *ima_template_desc_current(void);
 int ima_restore_measurement_entry(struct ima_template_entry *entry);
 int ima_restore_measurement_list(loff_t bufsize, void *buf);
+int ima_measurements_show(struct seq_file *m, void *v);
 unsigned long ima_get_binary_runtime_size(void);
 int ima_init_template(void);
 
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index c07a3844ea0a..66e5dd5e226f 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -116,7 +116,7 @@ void ima_putc(struct seq_file *m, void *data, int datalen)
  *   [eventdata length]
  *   eventdata[n]=template specific data
  */
-static int ima_measurements_show(struct seq_file *m, void *v)
+int ima_measurements_show(struct seq_file *m, void *v)
 {
/* the list never shrinks, so we don't need a lock here */
struct 

[PATCH v6 05/10] powerpc: ima: Send the kexec buffer to the next kernel

2016-10-20 Thread Thiago Jung Bauermann
The IMA kexec buffer allows the currently running kernel to pass
the measurement list via a kexec segment to the kernel that will be
kexec'd.

This is the architecture-specific part of setting up the IMA kexec
buffer for the next kernel. It will be used in the next patch.
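
Roughly, the generic IMA kexec code added in the next patch is expected to
place the serialized list in a kexec segment and then hand the segment's
address and size to this arch hook; the sketch below is illustrative and
the kexec_buf initialization is simplified:

	/* Illustrative sketch, not the actual generic-side code. */
	struct kexec_buf kbuf = { .image = image, .buffer = ima_buffer,
				  .bufsz = ima_buffer_size,
				  .memsz = ima_buffer_size };

	ret = kexec_add_buffer(&kbuf);
	if (!ret)
		ret = arch_ima_add_kexec_buffer(image, kbuf.mem, kbuf.memsz);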

Changelog v5:
- New patch in this version. This code was previously in the kexec buffer
  handover patch series.

Changelog relative to kexec handover patches v5:
- Moved code to arch/powerpc/kernel/ima_kexec.c.
- Renamed functions and struct members to variations of ima_kexec_buffer
  instead of variations of kexec_handover_buffer.
- Use a single property /chosen/linux,ima-kexec-buffer containing
  the buffer address and length, instead of
  /chosen/linux,kexec-handover-buffer-{start,end}.
- Use #address-cells and #size-cells to write the DT property.
- Use size_t instead of unsigned long for size arguments.
- Use CONFIG_IMA_KEXEC to build this code only when necessary.

Signed-off-by: Thiago Jung Bauermann 
Acked-by: "Eric W. Biederman" 
---
 arch/powerpc/include/asm/ima.h  | 16 +
 arch/powerpc/include/asm/kexec.h| 14 -
 arch/powerpc/kernel/ima_kexec.c | 91 +
 arch/powerpc/kernel/kexec_elf_64.c  |  2 +-
 arch/powerpc/kernel/machine_kexec_file_64.c | 12 +++-
 5 files changed, 129 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/ima.h b/arch/powerpc/include/asm/ima.h
index d5a72dd9b499..2313bdface34 100644
--- a/arch/powerpc/include/asm/ima.h
+++ b/arch/powerpc/include/asm/ima.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_POWERPC_IMA_H
 #define _ASM_POWERPC_IMA_H
 
+struct kimage;
+
 int ima_get_kexec_buffer(void **addr, size_t *size);
 int ima_free_kexec_buffer(void);
 
@@ -10,4 +12,18 @@ void remove_ima_buffer(void *fdt, int chosen_node);
 static inline void remove_ima_buffer(void *fdt, int chosen_node) {}
 #endif
 
+#ifdef CONFIG_IMA_KEXEC
+int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr,
+ size_t size);
+
+int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node);
+#else
+static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
+  int chosen_node)
+{
+   remove_ima_buffer(fdt, chosen_node);
+   return 0;
+}
+#endif /* CONFIG_IMA_KEXEC */
+
 #endif /* _ASM_POWERPC_IMA_H */
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 23056d2dc330..a49cab287acb 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -94,12 +94,22 @@ static inline bool kdump_in_progress(void)
 #ifdef CONFIG_KEXEC_FILE
 extern struct kexec_file_ops kexec_elf64_ops;
 
+#ifdef CONFIG_IMA_KEXEC
+#define ARCH_HAS_KIMAGE_ARCH
+
+struct kimage_arch {
+   phys_addr_t ima_buffer_addr;
+   size_t ima_buffer_size;
+};
+#endif
+
 int setup_purgatory(struct kimage *image, const void *slave_code,
const void *fdt, unsigned long kernel_load_addr,
unsigned long fdt_load_addr, unsigned long stack_top,
int debug);
-int setup_new_fdt(void *fdt, unsigned long initrd_load_addr,
- unsigned long initrd_len, const char *cmdline);
+int setup_new_fdt(const struct kimage *image, void *fdt,
+ unsigned long initrd_load_addr, unsigned long initrd_len,
+ const char *cmdline);
 bool find_debug_console(const void *fdt);
 int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
 #endif /* CONFIG_KEXEC_FILE */
diff --git a/arch/powerpc/kernel/ima_kexec.c b/arch/powerpc/kernel/ima_kexec.c
index 36e5a5df3804..5ea42c937ca9 100644
--- a/arch/powerpc/kernel/ima_kexec.c
+++ b/arch/powerpc/kernel/ima_kexec.c
@@ -130,3 +130,94 @@ void remove_ima_buffer(void *fdt, int chosen_node)
if (!ret)
pr_debug("Removed old IMA buffer reservation.\n");
 }
+
+#ifdef CONFIG_IMA_KEXEC
+/**
+ * arch_ima_add_kexec_buffer - do arch-specific steps to add the IMA buffer
+ *
+ * Architectures should use this function to pass on the IMA buffer
+ * information to the next kernel.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr,
+ size_t size)
+{
+   image->arch.ima_buffer_addr = load_addr;
+   image->arch.ima_buffer_size = size;
+
+   return 0;
+}
+
+static int write_number(void *p, u64 value, int cells)
+{
+   if (cells == 1) {
+   u32 tmp;
+
+   if (value > U32_MAX)
+   return -EINVAL;
+
+   tmp = cpu_to_be32(value);
+   memcpy(p, &tmp, sizeof(tmp));
+   } else if (cells == 2) {
+   u64 tmp;
+
+   tmp = cpu_to_be64(value);
+   memcpy(p, &tmp, sizeof(tmp));
+   } else
+   return -EINVAL;
+
+   

[PATCH v6 01/10] powerpc: ima: Get the kexec buffer passed by the previous kernel

2016-10-20 Thread Thiago Jung Bauermann
The IMA kexec buffer allows the currently running kernel to pass
the measurement list via a kexec segment to the kernel that will be
kexec'd. The second kernel can check whether the previous kernel sent
the buffer and retrieve it.

This is the architecture-specific part which enables IMA to receive the
measurement list passed by the previous kernel. It will be used in the
next patch.

The change in machine_kexec_64.c is to factor out the logic of removing
an FDT memory reservation so that it can be used by remove_ima_buffer.
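
A rough sketch of how remove_ima_buffer could combine the exported helper
with libfdt (the property name follows this patch's changelog;
decode_ima_prop() is a hypothetical helper that parses the address/size
pair using #address-cells and #size-cells):

	void remove_ima_buffer(void *fdt, int chosen_node)
	{
		unsigned long addr;
		size_t size;
		int len;
		const void *prop;

		prop = fdt_getprop(fdt, chosen_node, "linux,ima-kexec-buffer", &len);
		if (!prop)
			return;

		/* decode_ima_prop() is hypothetical, see note above */
		if (decode_ima_prop(prop, len, &addr, &size))
			return;

		fdt_delprop(fdt, chosen_node, "linux,ima-kexec-buffer");
		delete_fdt_mem_rsv(fdt, addr, size);
	}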

Changelog v6:
- The kexec_file_load patches v9 already define delete_fdt_mem_rsv,
  so now we just need to export it.

Changelog v5:
- New patch in this version. This code was previously in the kexec buffer
  handover patch series.

Changelog relative to kexec handover patches v5:
- Added CONFIG_HAVE_IMA_KEXEC.
- Added arch/powerpc/include/asm/ima.h.
- Moved code to arch/powerpc/kernel/ima_kexec.c.
- Renamed functions to variations of ima_kexec_buffer instead of
  variations of kexec_handover_buffer.
- Use a single property /chosen/linux,ima-kexec-buffer containing
  the buffer address and length, instead of
  /chosen/linux,kexec-handover-buffer-{start,end}.
- Use #address-cells and #size-cells to read the DT property.
- Use size_t instead of unsigned long for size arguments.
- Always remove linux,ima-kexec-buffer and its memory reservation
  when preparing a device tree for kexec_file_load.

Signed-off-by: Thiago Jung Bauermann 
Acked-by: "Eric W. Biederman" 
---
 arch/Kconfig|   3 +
 arch/powerpc/Kconfig|   1 +
 arch/powerpc/include/asm/ima.h  |  13 +++
 arch/powerpc/include/asm/kexec.h|   1 +
 arch/powerpc/kernel/Makefile|   4 +
 arch/powerpc/kernel/ima_kexec.c | 132 
 arch/powerpc/kernel/machine_kexec_file_64.c |   5 +-
 7 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 659bdd079277..e1605ff286a1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -5,6 +5,9 @@
 config KEXEC_CORE
bool
 
+config HAVE_IMA_KEXEC
+   bool
+
 config OPROFILE
tristate "OProfile system profiling"
depends on PROFILING
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 897d0f14447d..40ee044f1915 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -458,6 +458,7 @@ config KEXEC
 config KEXEC_FILE
bool "kexec file based system call"
select KEXEC_CORE
+   select HAVE_IMA_KEXEC
select BUILD_BIN2C
depends on PPC64
depends on CRYPTO=y
diff --git a/arch/powerpc/include/asm/ima.h b/arch/powerpc/include/asm/ima.h
new file mode 100644
index ..d5a72dd9b499
--- /dev/null
+++ b/arch/powerpc/include/asm/ima.h
@@ -0,0 +1,13 @@
+#ifndef _ASM_POWERPC_IMA_H
+#define _ASM_POWERPC_IMA_H
+
+int ima_get_kexec_buffer(void **addr, size_t *size);
+int ima_free_kexec_buffer(void);
+
+#ifdef CONFIG_IMA
+void remove_ima_buffer(void *fdt, int chosen_node);
+#else
+static inline void remove_ima_buffer(void *fdt, int chosen_node) {}
+#endif
+
+#endif /* _ASM_POWERPC_IMA_H */
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 4497db7555b0..23056d2dc330 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -101,6 +101,7 @@ int setup_purgatory(struct kimage *image, const void 
*slave_code,
 int setup_new_fdt(void *fdt, unsigned long initrd_load_addr,
  unsigned long initrd_len, const char *cmdline);
 bool find_debug_console(const void *fdt);
+int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
 #endif /* CONFIG_KEXEC_FILE */
 
 #else /* !CONFIG_KEXEC_CORE */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 424b13b1b2b0..c3b37171168c 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -111,6 +111,10 @@ obj-$(CONFIG_KEXEC_CORE)   += machine_kexec.o crash.o \
   machine_kexec_$(BITS).o
 obj-$(CONFIG_KEXEC_FILE)   += machine_kexec_file_$(BITS).o elf_util.o \
   kexec_elf_$(BITS).o
+ifeq ($(CONFIG_HAVE_IMA_KEXEC)$(CONFIG_IMA),yy)
+obj-y  += ima_kexec.o
+endif
+
 obj-$(CONFIG_AUDIT)+= audit.o
 obj64-$(CONFIG_AUDIT)  += compat_audit.o
 
diff --git a/arch/powerpc/kernel/ima_kexec.c b/arch/powerpc/kernel/ima_kexec.c
new file mode 100644
index ..36e5a5df3804
--- /dev/null
+++ b/arch/powerpc/kernel/ima_kexec.c
@@ -0,0 +1,132 @@
+/*
+ * Copyright (C) 2016 IBM Corporation
+ *
+ * Authors:
+ * Thiago Jung Bauermann 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either 

[PATCH v6 04/10] ima: maintain memory size needed for serializing the measurement list

2016-10-20 Thread Thiago Jung Bauermann
From: Mimi Zohar 

In preparation for serializing the binary_runtime_measurements, this patch
maintains a running total of the amount of memory required.

Changelog v5:
- replace CONFIG_KEXEC_FILE with architecture CONFIG_HAVE_IMA_KEXEC (Thiago)

Changelog v3:
- include the ima_kexec_hdr size in the binary_runtime_measurement size.

Signed-off-by: Mimi Zohar 
---
 security/integrity/ima/Kconfig | 12 +
 security/integrity/ima/ima.h   |  1 +
 security/integrity/ima/ima_queue.c | 53 --
 3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
index 5487827fa86c..370eb2f4dd37 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -27,6 +27,18 @@ config IMA
  to learn more about IMA.
  If unsure, say N.
 
+config IMA_KEXEC
+   bool "Enable carrying the IMA measurement list across a soft boot"
+   depends on IMA && TCG_TPM && HAVE_IMA_KEXEC
+   default n
+   help
+  TPM PCRs are only reset on a hard reboot.  In order to validate
+  a TPM's quote after a soft boot, the IMA measurement list of the
+  running kernel must be saved and restored on boot.
+
+  Depending on the IMA policy, the measurement list can grow to
+  be very large.
+
 config IMA_MEASURE_PCR_IDX
int
depends on IMA
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 51dc8d57d64d..ea1dcc452911 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -143,6 +143,7 @@ void ima_print_digest(struct seq_file *m, u8 *digest, u32 
size);
 struct ima_template_desc *ima_template_desc_current(void);
 int ima_restore_measurement_entry(struct ima_template_entry *entry);
 int ima_restore_measurement_list(loff_t bufsize, void *buf);
+unsigned long ima_get_binary_runtime_size(void);
 int ima_init_template(void);
 
 /*
diff --git a/security/integrity/ima/ima_queue.c 
b/security/integrity/ima/ima_queue.c
index 12d1b040bca9..3a3cc2a45645 100644
--- a/security/integrity/ima/ima_queue.c
+++ b/security/integrity/ima/ima_queue.c
@@ -29,6 +29,11 @@
 #define AUDIT_CAUSE_LEN_MAX 32
 
 LIST_HEAD(ima_measurements);   /* list of all measurements */
+#ifdef CONFIG_IMA_KEXEC
+static unsigned long binary_runtime_size;
+#else
+static unsigned long binary_runtime_size = ULONG_MAX;
+#endif
 
 /* key: inode (before secure-hashing a file) */
 struct ima_h_table ima_htable = {
@@ -64,6 +69,24 @@ static struct ima_queue_entry *ima_lookup_digest_entry(u8 
*digest_value,
return ret;
 }
 
+/*
+ * Calculate the memory required for serializing a single
+ * binary_runtime_measurement list entry, which contains a
+ * couple of variable length fields (e.g template name and data).
+ */
+static int get_binary_runtime_size(struct ima_template_entry *entry)
+{
+   int size = 0;
+
+   size += sizeof(u32);/* pcr */
+   size += sizeof(entry->digest);
+   size += sizeof(int);/* template name size field */
+   size += strlen(entry->template_desc->name);
+   size += sizeof(entry->template_data_len);
+   size += entry->template_data_len;
+   return size;
+}
+
 /* ima_add_template_entry helper function:
  * - Add template entry to the measurement list and hash table, for
  *   all entries except those carried across kexec.
@@ -90,9 +113,30 @@ static int ima_add_digest_entry(struct ima_template_entry 
*entry, int flags)
key = ima_hash_key(entry->digest);
hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
}
+
+   if (binary_runtime_size != ULONG_MAX) {
+   int size;
+
+   size = get_binary_runtime_size(entry);
+   binary_runtime_size = (binary_runtime_size < ULONG_MAX - size) ?
+binary_runtime_size + size : ULONG_MAX;
+   }
return 0;
 }
 
+/*
+ * Return the amount of memory required for serializing the
+ * entire binary_runtime_measurement list, including the ima_kexec_hdr
+ * structure.
+ */
+unsigned long ima_get_binary_runtime_size(void)
+{
+   if (binary_runtime_size >= (ULONG_MAX - sizeof(struct ima_kexec_hdr)))
+   return ULONG_MAX;
+   else
+   return binary_runtime_size + sizeof(struct ima_kexec_hdr);
+};
+
 static int ima_pcr_extend(const u8 *hash, int pcr)
 {
int result = 0;
@@ -106,8 +150,13 @@ static int ima_pcr_extend(const u8 *hash, int pcr)
return result;
 }
 
-/* Add template entry to the measurement list and hash table,
- * and extend the pcr.
+/*
+ * Add template entry to the measurement list and hash table, and
+ * extend the pcr.
+ *
+ * On systems which support carrying the IMA measurement list across
+ * kexec, maintain the total memory size required for serializing the
+ * binary_runtime_measurements.
  */
 int ima_add_template_entry(struct ima_template_entry 

[PATCH v6 00/10] ima: carry the measurement list across kexec

2016-10-20 Thread Thiago Jung Bauermann
Hello,

This is just a rebase on top of kexec_file_load patches v9 which I just
posted. The previous version of this series has some conflicts with it.

Original cover letter:

The TPM PCRs are only reset on a hard reboot.  In order to validate a
TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
of the running kernel must be saved and then restored on the subsequent
boot, possibly of a different architecture.

The existing securityfs binary_runtime_measurements file conveniently
provides a serialized format of the IMA measurement list. This patch
set serializes the measurement list in this format and restores it.

Up to now, the binary_runtime_measurements was defined as architecture
native format.  The assumption was that userspace could and would
handle any architecture conversions.  With the ability to carry the
measurement list across kexec, possibly from one architecture to a
different one, the per boot architecture information is lost and with it
the ability to recalculate the template digest hash.  To resolve this
problem, without breaking the existing ABI, this patch set introduces
the boot command line option "ima_canonical_fmt", which is arbitrarily
defined as little endian.

The need for this boot command line option will be limited to the
existing version 1 format of the binary_runtime_measurements.
Subsequent formats will be defined as canonical format (eg. TPM 2.0
support for larger digests).

A simplified method of Thiago Bauermann's "kexec buffer handover" patch
series for carrying the IMA measurement list across kexec is included
in this patch set.  The simplified method requires all file measurements
be taken prior to executing the kexec load, as subsequent measurements
will not be carried across the kexec and restored.
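
In practice this means taking the measurements of interest first and then
loading the next kernel via the kexec_file_load syscall, e.g. with
kexec-tools (illustrative commands, paths are placeholders):

	# kexec -s -l /boot/vmlinux --initrd=/boot/initrd.img --reuse-cmdline
	# kexec -e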

Changelog v6:
- Rebased on top of "kexec_file_load implementation for PowerPC"
  patches v9.

Changelog v5:
- Included patches from Thiago Bauermann's "kexec buffer handover"
patch series for carrying the IMA measurement list across kexec.
- Added CONFIG_HAVE_IMA_KEXEC
- Renamed functions to variations of ima_kexec_buffer instead of
variations of kexec_handover_buffer

Changelog v4:
- Fixed "spinlock bad magic" BUG - reported by Dmitry Vyukov
- Rebased on Thiago Bauermann's v5 patch set
- Removed the skip_checksum initialization  

Changelog v3:
- Cleaned up the code for calculating the requested kexec segment size
needed for the IMA measurement list, limiting the segment size to half
of the totalram_pages.
- Fixed kernel test robot reports as enumerated in the respective
patch changelog.

Changelog v2:
- Canonical measurement list support added
- Redefined the ima_kexec_hdr struct to use well defined sizes

Andreas Steffen (1):
  ima: platform-independent hash value

Mimi Zohar (7):
  ima: on soft reboot, restore the measurement list
  ima: permit duplicate measurement list entries
  ima: maintain memory size needed for serializing the measurement list
  ima: on soft reboot, save the measurement list
  ima: store the builtin/custom template definitions in a list
  ima: support restoring multiple template formats
  ima: define a canonical binary_runtime_measurements list format

Thiago Jung Bauermann (2):
  powerpc: ima: Get the kexec buffer passed by the previous kernel
  powerpc: ima: Send the kexec buffer to the next kernel

 Documentation/kernel-parameters.txt |   4 +
 arch/Kconfig|   3 +
 arch/powerpc/Kconfig|   1 +
 arch/powerpc/include/asm/ima.h  |  29 +++
 arch/powerpc/include/asm/kexec.h|  15 +-
 arch/powerpc/kernel/Makefile|   4 +
 arch/powerpc/kernel/ima_kexec.c | 223 +
 arch/powerpc/kernel/kexec_elf_64.c  |   2 +-
 arch/powerpc/kernel/machine_kexec_file_64.c |  15 +-
 include/linux/ima.h |  12 ++
 kernel/kexec_file.c |   4 +
 security/integrity/ima/Kconfig  |  12 ++
 security/integrity/ima/Makefile |   1 +
 security/integrity/ima/ima.h|  31 +++
 security/integrity/ima/ima_crypto.c |   6 +-
 security/integrity/ima/ima_fs.c |  30 ++-
 security/integrity/ima/ima_init.c   |   2 +
 security/integrity/ima/ima_kexec.c  | 168 
 security/integrity/ima/ima_main.c   |   1 +
 security/integrity/ima/ima_queue.c  |  76 +++-
 security/integrity/ima/ima_template.c   | 293 ++--
 security/integrity/ima/ima_template_lib.c   |   7 +-
 22 files changed, 901 insertions(+), 38 deletions(-)
 create mode 100644 arch/powerpc/include/asm/ima.h
 create mode 100644 arch/powerpc/kernel/ima_kexec.c
 create mode 100644 security/integrity/ima/ima_kexec.c

-- 
2.7.4



[PATCH v6 03/10] ima: permit duplicate measurement list entries

2016-10-20 Thread Thiago Jung Bauermann
From: Mimi Zohar 

Measurements carried across kexec need to be added to the IMA
measurement list, but should not prevent measurements of the newly
booted kernel from being added to the measurement list. This patch
adds support for allowing duplicate measurements.

The "boot_aggregate" measurement entry is the delimiter between soft
boots.

Signed-off-by: Mimi Zohar 
---
 security/integrity/ima/ima_queue.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/security/integrity/ima/ima_queue.c 
b/security/integrity/ima/ima_queue.c
index 4b1bb7787839..12d1b040bca9 100644
--- a/security/integrity/ima/ima_queue.c
+++ b/security/integrity/ima/ima_queue.c
@@ -65,11 +65,12 @@ static struct ima_queue_entry *ima_lookup_digest_entry(u8 
*digest_value,
 }
 
 /* ima_add_template_entry helper function:
- * - Add template entry to measurement list and hash table.
+ * - Add template entry to the measurement list and hash table, for
+ *   all entries except those carried across kexec.
  *
  * (Called with ima_extend_list_mutex held.)
  */
-static int ima_add_digest_entry(struct ima_template_entry *entry)
+static int ima_add_digest_entry(struct ima_template_entry *entry, int flags)
 {
struct ima_queue_entry *qe;
unsigned int key;
@@ -85,8 +86,10 @@ static int ima_add_digest_entry(struct ima_template_entry 
*entry)
list_add_tail_rcu(&qe->later, &ima_measurements);
 
atomic_long_inc(&ima_htable.len);
-   key = ima_hash_key(entry->digest);
-   hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
+   if (flags) {
+   key = ima_hash_key(entry->digest);
+   hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
+   }
return 0;
 }
 
@@ -126,7 +129,7 @@ int ima_add_template_entry(struct ima_template_entry 
*entry, int violation,
}
}
 
-   result = ima_add_digest_entry(entry);
+   result = ima_add_digest_entry(entry, 1);
if (result < 0) {
audit_cause = "ENOMEM";
audit_info = 0;
@@ -155,7 +158,7 @@ int ima_restore_measurement_entry(struct ima_template_entry 
*entry)
int result = 0;
 
mutex_lock(&ima_extend_list_mutex);
-   result = ima_add_digest_entry(entry);
+   result = ima_add_digest_entry(entry, 0);
mutex_unlock(&ima_extend_list_mutex);
return result;
 }
-- 
2.7.4



[PATCH v6 02/10] ima: on soft reboot, restore the measurement list

2016-10-20 Thread Thiago Jung Bauermann
From: Mimi Zohar 

The TPM PCRs are only reset on a hard reboot.  In order to validate a
TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
of the running kernel must be saved and restored on boot.  This patch
restores the measurement list.

Changelog v5:
- replace CONFIG_KEXEC_FILE with architecture CONFIG_HAVE_IMA_KEXEC (Thiago)
- replace kexec_get_handover_buffer() with ima_get_kexec_buffer() (Thiago)
- replace kexec_free_handover_buffer() with ima_free_kexec_buffer() (Thiago)
- remove unnecessary includes from ima_kexec.c (Thiago)
- fix off-by-one error when checking hdr_v1->template_name_len (Colin King)

Changelog v2:
- redefined ima_kexec_hdr to use types with well defined sizes (M. Ellerman)
- defined missing ima_load_kexec_buffer() stub function

Changelog v1:
- call ima_load_kexec_buffer() (Thiago)

Signed-off-by: Mimi Zohar 
---
 security/integrity/ima/Makefile   |   1 +
 security/integrity/ima/ima.h  |  21 +
 security/integrity/ima/ima_init.c |   2 +
 security/integrity/ima/ima_kexec.c|  44 +
 security/integrity/ima/ima_queue.c|  10 ++
 security/integrity/ima/ima_template.c | 170 ++
 6 files changed, 248 insertions(+)

diff --git a/security/integrity/ima/Makefile b/security/integrity/ima/Makefile
index 9aeaedad1e2b..29f198bde02b 100644
--- a/security/integrity/ima/Makefile
+++ b/security/integrity/ima/Makefile
@@ -8,4 +8,5 @@ obj-$(CONFIG_IMA) += ima.o
 ima-y := ima_fs.o ima_queue.o ima_init.o ima_main.o ima_crypto.o ima_api.o \
 ima_policy.o ima_template.o ima_template_lib.o
 ima-$(CONFIG_IMA_APPRAISE) += ima_appraise.o
+ima-$(CONFIG_HAVE_IMA_KEXEC) += ima_kexec.o
 obj-$(CONFIG_IMA_BLACKLIST_KEYRING) += ima_mok.o
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index db25f54a04fe..51dc8d57d64d 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -28,6 +28,10 @@
 
 #include "../integrity.h"
 
+#ifdef CONFIG_HAVE_IMA_KEXEC
+#include 
+#endif
+
 enum ima_show_type { IMA_SHOW_BINARY, IMA_SHOW_BINARY_NO_FIELD_LEN,
 IMA_SHOW_BINARY_OLD_STRING_FMT, IMA_SHOW_ASCII };
 enum tpm_pcrs { TPM_PCR0 = 0, TPM_PCR8 = 8 };
@@ -102,6 +106,21 @@ struct ima_queue_entry {
 };
 extern struct list_head ima_measurements;  /* list of all measurements */
 
+/* Some details preceding the binary serialized measurement list */
+struct ima_kexec_hdr {
+   u16 version;
+   u16 _reserved0;
+   u32 _reserved1;
+   u64 buffer_size;
+   u64 count;
+};
+
+#ifdef CONFIG_HAVE_IMA_KEXEC
+void ima_load_kexec_buffer(void);
+#else
+static inline void ima_load_kexec_buffer(void) {}
+#endif /* CONFIG_HAVE_IMA_KEXEC */
+
 /* Internal IMA function definitions */
 int ima_init(void);
 int ima_fs_init(void);
@@ -122,6 +141,8 @@ int ima_init_crypto(void);
 void ima_putc(struct seq_file *m, void *data, int datalen);
 void ima_print_digest(struct seq_file *m, u8 *digest, u32 size);
 struct ima_template_desc *ima_template_desc_current(void);
+int ima_restore_measurement_entry(struct ima_template_entry *entry);
+int ima_restore_measurement_list(loff_t bufsize, void *buf);
 int ima_init_template(void);
 
 /*
diff --git a/security/integrity/ima/ima_init.c 
b/security/integrity/ima/ima_init.c
index 32912bd54ead..3ba0ca49cba6 100644
--- a/security/integrity/ima/ima_init.c
+++ b/security/integrity/ima/ima_init.c
@@ -128,6 +128,8 @@ int __init ima_init(void)
if (rc != 0)
return rc;
 
+   ima_load_kexec_buffer();
+
rc = ima_add_boot_aggregate();  /* boot aggregate must be first entry */
if (rc != 0)
return rc;
diff --git a/security/integrity/ima/ima_kexec.c 
b/security/integrity/ima/ima_kexec.c
new file mode 100644
index ..36afd0fe9747
--- /dev/null
+++ b/security/integrity/ima/ima_kexec.c
@@ -0,0 +1,44 @@
+/*
+ * Copyright (C) 2016 IBM Corporation
+ *
+ * Authors:
+ * Thiago Jung Bauermann 
+ * Mimi Zohar 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#include "ima.h"
+
+/*
+ * Restore the measurement list from the previous kernel.
+ */
+void ima_load_kexec_buffer(void)
+{
+   void *kexec_buffer = NULL;
+   size_t kexec_buffer_size = 0;
+   int rc;
+
+   rc = ima_get_kexec_buffer(&kexec_buffer, &kexec_buffer_size);
+   switch (rc) {
+   case 0:
+   rc = ima_restore_measurement_list(kexec_buffer_size,
+ kexec_buffer);
+   if (rc != 0)
+   pr_err("Failed to restore the measurement list: %d\n",
+   rc);
+
+   

[PATCH v9 10/10] powerpc: Enable CONFIG_KEXEC_FILE in powerpc server defconfigs.

2016-10-20 Thread Thiago Jung Bauermann
Enable CONFIG_KEXEC_FILE in powernv_defconfig, ppc64_defconfig and
pseries_defconfig.

It depends on CONFIG_CRYPTO_SHA256=y, so add that as well.

Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/configs/powernv_defconfig | 2 ++
 arch/powerpc/configs/ppc64_defconfig   | 2 ++
 arch/powerpc/configs/pseries_defconfig | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/arch/powerpc/configs/powernv_defconfig 
b/arch/powerpc/configs/powernv_defconfig
index d98b6eb3254f..5a190aa5534b 100644
--- a/arch/powerpc/configs/powernv_defconfig
+++ b/arch/powerpc/configs/powernv_defconfig
@@ -49,6 +49,7 @@ CONFIG_BINFMT_MISC=m
 CONFIG_PPC_TRANSACTIONAL_MEM=y
 CONFIG_HOTPLUG_CPU=y
 CONFIG_KEXEC=y
+CONFIG_KEXEC_FILE=y
 CONFIG_IRQ_ALL_CPUS=y
 CONFIG_NUMA=y
 CONFIG_MEMORY_HOTPLUG=y
@@ -301,6 +302,7 @@ CONFIG_CRYPTO_CCM=m
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_HMAC=y
 CONFIG_CRYPTO_MICHAEL_MIC=m
+CONFIG_CRYPTO_SHA256=y
 CONFIG_CRYPTO_TGR192=m
 CONFIG_CRYPTO_WP512=m
 CONFIG_CRYPTO_ANUBIS=m
diff --git a/arch/powerpc/configs/ppc64_defconfig 
b/arch/powerpc/configs/ppc64_defconfig
index 58a98d40086f..0059d2088b9c 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -46,6 +46,7 @@ CONFIG_HZ_100=y
 CONFIG_BINFMT_MISC=m
 CONFIG_PPC_TRANSACTIONAL_MEM=y
 CONFIG_KEXEC=y
+CONFIG_KEXEC_FILE=y
 CONFIG_CRASH_DUMP=y
 CONFIG_IRQ_ALL_CPUS=y
 CONFIG_MEMORY_HOTREMOVE=y
@@ -336,6 +337,7 @@ CONFIG_CRYPTO_TEST=m
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_HMAC=y
 CONFIG_CRYPTO_MICHAEL_MIC=m
+CONFIG_CRYPTO_SHA256=y
 CONFIG_CRYPTO_TGR192=m
 CONFIG_CRYPTO_WP512=m
 CONFIG_CRYPTO_ANUBIS=m
diff --git a/arch/powerpc/configs/pseries_defconfig 
b/arch/powerpc/configs/pseries_defconfig
index 8a3bc016b732..f022f657a984 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -52,6 +52,7 @@ CONFIG_HZ_100=y
 CONFIG_BINFMT_MISC=m
 CONFIG_PPC_TRANSACTIONAL_MEM=y
 CONFIG_KEXEC=y
+CONFIG_KEXEC_FILE=y
 CONFIG_IRQ_ALL_CPUS=y
 CONFIG_MEMORY_HOTPLUG=y
 CONFIG_MEMORY_HOTREMOVE=y
@@ -303,6 +304,7 @@ CONFIG_CRYPTO_TEST=m
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_HMAC=y
 CONFIG_CRYPTO_MICHAEL_MIC=m
+CONFIG_CRYPTO_SHA256=y
 CONFIG_CRYPTO_TGR192=m
 CONFIG_CRYPTO_WP512=m
 CONFIG_CRYPTO_ANUBIS=m
-- 
2.7.4



[PATCH v9 09/10] powerpc: Add purgatory for kexec_file_load implementation.

2016-10-20 Thread Thiago Jung Bauermann
This purgatory implementation comes from kexec-tools, almost unchanged.

In order to use boot/string.S in ppc64 big endian mode, the functions
defined in it need to have dot symbols so that they can be called from C
code. Therefore, change the file to use a DOTSYM macro if one is defined,
so that the purgatory can add those dot symbols.
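
For big endian ELFv1 builds the purgatory's ppc64_asm.h can define the
macro to prepend the dot, roughly along these lines (a sketch, not
necessarily the exact header contents):

	/* Sketch: emit dot symbols for the ppc64 big endian ELFv1 ABI. */
	#define __GLUE(a, b)	a##b
	#define GLUE(a, b)	__GLUE(a, b)
	#define DOTSYM(a)	GLUE(., a)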

The changes made to the purgatory code relative to the version in
kexec-tools were:

The sha256_regions global variable was renamed to sha_regions to match
what kexec_file_load expects, and to use the sha256.c file from x86's
purgatory (this avoids adding yet another SHA-256 implementation).

The global variables in purgatory.c and purgatory-ppc64.c now use a
__section attribute to put them in the .data section instead of being
initialized to zero. It doesn't matter what their initial value is,
because they will be set by the kernel when preparing the kexec image.

Also, since we don't support loading a crashdump kernel via
kexec_file_load yet, the code related to that functionality has been
removed.

Finally, some checkpatch.pl warnings were fixed.

Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/Makefile|   1 +
 arch/powerpc/boot/string.S   |  67 +++--
 arch/powerpc/purgatory/.gitignore|   2 +
 arch/powerpc/purgatory/Makefile  |  33 +++
 arch/powerpc/purgatory/console-ppc64.c   |  37 +++
 arch/powerpc/purgatory/crtsavres.S   |   5 +
 arch/powerpc/purgatory/hvCall.S  |  27 +
 arch/powerpc/purgatory/hvCall.h  |   8 ++
 arch/powerpc/purgatory/kexec-sha256.h|  11 +++
 arch/powerpc/purgatory/ppc64_asm.h   |  20 
 arch/powerpc/purgatory/printf.c  | 164 +++
 arch/powerpc/purgatory/purgatory-ppc64.c |  36 +++
 arch/powerpc/purgatory/purgatory.c   |  62 
 arch/powerpc/purgatory/purgatory.h   |  14 +++
 arch/powerpc/purgatory/sha256.c  |   6 ++
 arch/powerpc/purgatory/sha256.h  |   1 +
 arch/powerpc/purgatory/string.S  |   2 +
 arch/powerpc/purgatory/v2wrap.S  | 134 +
 18 files changed, 601 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 617dece67924..5e7dcdaf93f5 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -249,6 +249,7 @@ core-y  += arch/powerpc/kernel/ 
\
 core-$(CONFIG_XMON)+= arch/powerpc/xmon/
 core-$(CONFIG_KVM) += arch/powerpc/kvm/
 core-$(CONFIG_PERF_EVENTS) += arch/powerpc/perf/
+core-$(CONFIG_KEXEC_FILE)  += arch/powerpc/purgatory/
 
 drivers-$(CONFIG_OPROFILE) += arch/powerpc/oprofile/
 
diff --git a/arch/powerpc/boot/string.S b/arch/powerpc/boot/string.S
index acc9428f2789..b54bbad5f83d 100644
--- a/arch/powerpc/boot/string.S
+++ b/arch/powerpc/boot/string.S
@@ -11,9 +11,18 @@
 
 #include "ppc_asm.h"
 
+/*
+ * The ppc64 kexec purgatory uses this file and packages it in ELF64,
+ * so it needs dot symbols for the ppc64 big endian ABI. This macro
+ * allows it to create those symbols.
+ */
+#ifndef DOTSYM
+#define DOTSYM(a)  a
+#endif
+
.text
-   .globl  strcpy
-strcpy:
+   .globl  DOTSYM(strcpy)
+DOTSYM(strcpy):
addir5,r3,-1
addir4,r4,-1
 1: lbzur0,1(r4)
@@ -22,8 +31,8 @@ strcpy:
bne 1b
blr
 
-   .globl  strncpy
-strncpy:
+   .globl  DOTSYM(strncpy)
+DOTSYM(strncpy):
cmpwi   0,r5,0
beqlr
mtctr   r5
@@ -35,8 +44,8 @@ strncpy:
bdnzf   2,1b/* dec ctr, branch if ctr != 0 && !cr0.eq */
blr
 
-   .globl  strcat
-strcat:
+   .globl  DOTSYM(strcat)
+DOTSYM(strcat):
addir5,r3,-1
addir4,r4,-1
 1: lbzur0,1(r5)
@@ -49,8 +58,8 @@ strcat:
bne 1b
blr
 
-   .globl  strchr
-strchr:
+   .globl  DOTSYM(strchr)
+DOTSYM(strchr):
addir3,r3,-1
 1: lbzur0,1(r3)
cmpw0,r0,r4
@@ -60,8 +69,8 @@ strchr:
li  r3,0
blr
 
-   .globl  strcmp
-strcmp:
+   .globl  DOTSYM(strcmp)
+DOTSYM(strcmp):
addir5,r3,-1
addir4,r4,-1
 1: lbzur3,1(r5)
@@ -72,8 +81,8 @@ strcmp:
beq 1b
blr
 
-   .globl  strncmp
-strncmp:
+   .globl  DOTSYM(strncmp)
+DOTSYM(strncmp):
mtctr   r5
addir5,r3,-1
addir4,r4,-1
@@ -85,8 +94,8 @@ strncmp:
bdnzt   eq,1b
blr
 
-   .globl  strlen
-strlen:
+   .globl  DOTSYM(strlen)
+DOTSYM(strlen):
addir4,r3,-1
 1: lbzur0,1(r4)
cmpwi   0,r0,0
@@ -94,8 +103,8 @@ strlen:
subfr3,r3,r4
blr
 
-   .globl  memset
-memset:
+   .globl  DOTSYM(memset)
+DOTSYM(memset):
rlwimi  r4,r4,8,16,23
rlwimi  r4,r4,16,0,15
addir6,r3,-4
@@ -120,14 +129,14 @@ memset:
bdnz8b
  

[PATCH v9 08/10] powerpc: Add support for loading ELF kernels with kexec_file_load.

2016-10-20 Thread Thiago Jung Bauermann
This uses all the infrastructure built up by the previous patches
in the series to load an ELF vmlinux file and an initrd. It uses the
flattened device tree at initial_boot_params as a base and adjusts memory
reservations and its /chosen node for the next kernel.

[a...@linux-foundation.org: coding-style fixes]
Signed-off-by: Thiago Jung Bauermann 
Signed-off-by: Andrew Morton 
---
 arch/powerpc/include/asm/kexec.h|  12 +
 arch/powerpc/kernel/Makefile|   3 +-
 arch/powerpc/kernel/kexec_elf_64.c  | 280 +++
 arch/powerpc/kernel/machine_kexec_file_64.c | 338 +++-
 4 files changed, 630 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index eca2f975bf44..4497db7555b0 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -91,6 +91,18 @@ static inline bool kdump_in_progress(void)
return crashing_cpu >= 0;
 }
 
+#ifdef CONFIG_KEXEC_FILE
+extern struct kexec_file_ops kexec_elf64_ops;
+
+int setup_purgatory(struct kimage *image, const void *slave_code,
+   const void *fdt, unsigned long kernel_load_addr,
+   unsigned long fdt_load_addr, unsigned long stack_top,
+   int debug);
+int setup_new_fdt(void *fdt, unsigned long initrd_load_addr,
+ unsigned long initrd_len, const char *cmdline);
+bool find_debug_console(const void *fdt);
+#endif /* CONFIG_KEXEC_FILE */
+
 #else /* !CONFIG_KEXEC_CORE */
 static inline void crash_kexec_secondary(struct pt_regs *regs) { }
 
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index de14b7eb11bb..424b13b1b2b0 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -109,7 +109,8 @@ obj-$(CONFIG_PCI)   += pci_$(BITS).o $(pci64-y) \
 obj-$(CONFIG_PCI_MSI)  += msi.o
 obj-$(CONFIG_KEXEC_CORE)   += machine_kexec.o crash.o \
   machine_kexec_$(BITS).o
-obj-$(CONFIG_KEXEC_FILE)   += machine_kexec_file_$(BITS).o elf_util.o
+obj-$(CONFIG_KEXEC_FILE)   += machine_kexec_file_$(BITS).o elf_util.o \
+  kexec_elf_$(BITS).o
 obj-$(CONFIG_AUDIT)+= audit.o
 obj64-$(CONFIG_AUDIT)  += compat_audit.o
 
diff --git a/arch/powerpc/kernel/kexec_elf_64.c 
b/arch/powerpc/kernel/kexec_elf_64.c
new file mode 100644
index ..dc29e0131b76
--- /dev/null
+++ b/arch/powerpc/kernel/kexec_elf_64.c
@@ -0,0 +1,280 @@
+/*
+ * Load ELF vmlinux file for the kexec_file_load syscall.
+ *
+ * Copyright (C) 2004  Adam Litke (a...@us.ibm.com)
+ * Copyright (C) 2004  IBM Corp.
+ * Copyright (C) 2005  R Sharada (shar...@in.ibm.com)
+ * Copyright (C) 2006  Mohan Kumar M (mo...@in.ibm.com)
+ * Copyright (C) 2016  IBM Corporation
+ *
+ * Based on kexec-tools' kexec-elf-exec.c and kexec-elf-ppc64.c.
+ * Heavily modified for the kernel by
+ * Thiago Jung Bauermann .
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation (version 2 of the License).
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt)"kexec_elf: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define PURGATORY_STACK_SIZE   (16 * 1024)
+
+/**
+ * build_elf_exec_info - read ELF executable and check that we can use it
+ */
+static int build_elf_exec_info(const char *buf, size_t len, struct elfhdr 
*ehdr,
+  struct elf_info *elf_info)
+{
+   int i;
+   int ret;
+
+   ret = elf_read_from_buffer(buf, len, ehdr, elf_info);
+   if (ret)
+   return ret;
+
+   /* Big endian vmlinux has type ET_DYN. */
+   if (ehdr->e_type != ET_EXEC && ehdr->e_type != ET_DYN) {
+   pr_err("Not an ELF executable.\n");
+   goto error;
+   } else if (!elf_info->proghdrs) {
+   pr_err("No ELF program header.\n");
+   goto error;
+   }
+
+   for (i = 0; i < ehdr->e_phnum; i++) {
+   /*
+* Kexec does not support loading interpreters.
+* In addition this check keeps us from attempting
+* to kexec ordinary executables.
+*/
+   if (elf_info->proghdrs[i].p_type == PT_INTERP) {
+   pr_err("Requires an ELF interpreter.\n");
+   goto error;
+   }
+   }
+
+   return 0;
+error:
+   elf_free_info(elf_info);
+   return -ENOEXEC;
+}
+
+static int 

[PATCH v9 07/10] powerpc: Add functions to read ELF files of any endianness.

2016-10-20 Thread Thiago Jung Bauermann
A little endian kernel might need to kexec a big endian kernel (the
opposite is less likely but could happen as well), so we can't just cast
the buffer with the binary to ELF structs and use them as is done
elsewhere.

This patch adds functions which do byte-swapping as necessary when
populating the ELF structs. These functions will be used in the next
patch in the series.

Signed-off-by: Thiago Jung Bauermann 
---
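Typical use looks like this (a sketch, not code from this patch; buf and len
stand for the raw kernel image handed to kexec_file_load):

	struct elfhdr ehdr;
	struct elf_info elf_info;
	int i, ret;

	ret = elf_read_from_buffer(buf, len, &ehdr, &elf_info);
	if (ret)
		return ret;

	/*
	 * ehdr and elf_info.proghdrs / elf_info.sechdrs are now in host
	 * byte order, whatever the endianness of the file itself.
	 */
	for (i = 0; i < ehdr.e_phnum; i++)
		pr_debug("segment %d loads at 0x%llx\n", i,
			 (unsigned long long) elf_info.proghdrs[i].p_paddr);

	elf_free_info(&elf_info);
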
 arch/powerpc/include/asm/elf_util.h |  21 ++
 arch/powerpc/kernel/elf_util.c  | 418 
 2 files changed, 439 insertions(+)

diff --git a/arch/powerpc/include/asm/elf_util.h 
b/arch/powerpc/include/asm/elf_util.h
index 1df232f65ec8..3dbad8cc7179 100644
--- a/arch/powerpc/include/asm/elf_util.h
+++ b/arch/powerpc/include/asm/elf_util.h
@@ -19,6 +19,18 @@
 
 #include 
 
+struct elf_info {
+   /*
+* Where the ELF binary contents are kept.
+* Memory managed by the user of the struct.
+*/
+   const char *buffer;
+
+   const struct elfhdr *ehdr;
+   const struct elf_phdr *proghdrs;
+   struct elf_shdr *sechdrs;
+};
+
 /*
  * r2 is the TOC pointer: it actually points 0x8000 into the TOC (this
  * gives the value maximum span in an instruction which uses a signed
@@ -40,4 +52,13 @@ int elf64_apply_relocate_add_item(const Elf64_Shdr *sechdrs, 
const char *strtab,
  unsigned long my_r2, const char *obj_name,
  struct module *me);
 
+static inline bool elf_is_elf_file(const struct elfhdr *ehdr)
+{
+   return memcmp(ehdr->e_ident, ELFMAG, SELFMAG) == 0;
+}
+
+int elf_read_from_buffer(const char *buf, size_t len, struct elfhdr *ehdr,
+struct elf_info *elf_info);
+void elf_free_info(struct elf_info *elf_info);
+
 #endif /* _ASM_POWERPC_ELF_UTIL_H */
diff --git a/arch/powerpc/kernel/elf_util.c b/arch/powerpc/kernel/elf_util.c
index ffa68cd6fb99..e57e7397f65c 100644
--- a/arch/powerpc/kernel/elf_util.c
+++ b/arch/powerpc/kernel/elf_util.c
@@ -16,7 +16,24 @@
  * GNU General Public License for more details.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
 #include 
+#include 
+
+#if ELF_CLASS == ELFCLASS32
+#define elf_addr_to_cpuelf32_to_cpu
+
+#ifndef Elf_Rel
+#define Elf_RelElf32_Rel
+#endif /* Elf_Rel */
+#else /* ELF_CLASS == ELFCLASS32 */
+#define elf_addr_to_cpuelf64_to_cpu
+
+#ifndef Elf_Rel
+#define Elf_RelElf64_Rel
+#endif /* Elf_Rel */
 
 /**
  * elf_toc_section - find the toc section in the file with the given ELF 
headers
@@ -44,3 +61,404 @@ unsigned int elf_toc_section(const struct elfhdr *ehdr,
 
return 0;
 }
+
+static uint64_t elf64_to_cpu(const struct elfhdr *ehdr, uint64_t value)
+{
+   if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
+   value = le64_to_cpu(value);
+   else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
+   value = be64_to_cpu(value);
+
+   return value;
+}
+#endif /* ELF_CLASS == ELFCLASS32 */
+
+static uint16_t elf16_to_cpu(const struct elfhdr *ehdr, uint16_t value)
+{
+   if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
+   value = le16_to_cpu(value);
+   else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
+   value = be16_to_cpu(value);
+
+   return value;
+}
+
+static uint32_t elf32_to_cpu(const struct elfhdr *ehdr, uint32_t value)
+{
+   if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
+   value = le32_to_cpu(value);
+   else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
+   value = be32_to_cpu(value);
+
+   return value;
+}
+
+/**
+ * elf_is_ehdr_sane - check that it is safe to use the ELF header
+ * @buf_len:   size of the buffer in which the ELF file is loaded.
+ */
+static bool elf_is_ehdr_sane(const struct elfhdr *ehdr, size_t buf_len)
+{
+   if (ehdr->e_phnum > 0 && ehdr->e_phentsize != sizeof(struct elf_phdr)) {
+   pr_debug("Bad program header size.\n");
+   return false;
+   } else if (ehdr->e_shnum > 0 &&
+  ehdr->e_shentsize != sizeof(struct elf_shdr)) {
+   pr_debug("Bad section header size.\n");
+   return false;
+   } else if (ehdr->e_ident[EI_VERSION] != EV_CURRENT ||
+  ehdr->e_version != EV_CURRENT) {
+   pr_debug("Unknown ELF version.\n");
+   return false;
+   }
+
+   if (ehdr->e_phoff > 0 && ehdr->e_phnum > 0) {
+   size_t phdr_size;
+
+   /*
+* e_phnum is at most 65535 so calculating the size of the
+* program header cannot overflow.
+*/
+   phdr_size = sizeof(struct elf_phdr) * ehdr->e_phnum;
+
+   /* Sanity check the program header table location. */
+   if (ehdr->e_phoff + phdr_size < ehdr->e_phoff) {
+   pr_debug("Program headers at invalid 

[PATCH v9 06/10] powerpc: Implement kexec_file_load.

2016-10-20 Thread Thiago Jung Bauermann
Add arch-specific functions needed by generic kexec_file code.

Also, module_64.c's apply_relocate_add and kexec_file's
arch_kexec_apply_relocations_add have slightly different needs, so
elf64_apply_relocate_add_item needs to be adapted to accommodate both:

When apply_relocate_add is called, the module is already loaded at its
final location in memory so the place where the relocation needs to be
applied and its address in the module's memory are the same.

This is not the case for kexec's purgatory, because it is stored in a
buffer and will only be copied to its final location in memory right
before being executed. Therefore, it needs to be relocated while still
in its buffer. In this case, the place where the relocation needs to
be applied is different from its address in the purgatory's memory.

So we add an address argument to elf64_apply_relocate_add_item
to specify the final address of the relocation in memory. We also add
more relocation types that are used by the purgatory.

Signed-off-by: Josh Sklar 
Signed-off-by: Thiago Jung Bauermann 
---
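The two addresses can be sketched as follows (illustrative only;
purgatory_buf, section and the loop index are placeholder names, not code
from this patch):

	/*
	 * 'location' points into the buffer being edited right now, while
	 * 'address' is where the same word will live once the purgatory is
	 * copied into its final segment.
	 */
	location = purgatory_buf + sechdrs[section].sh_offset +
		   rela[i].r_offset;
	address = sechdrs[section].sh_addr + rela[i].r_offset;

	ret = elf64_apply_relocate_add_item(sechdrs, strtab, &rela[i], sym,
					    (unsigned long *) location,
					    address, value, my_r2,
					    "kexec purgatory", NULL);
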
 arch/powerpc/Kconfig|  13 ++
 arch/powerpc/include/asm/elf_util.h |  43 +
 arch/powerpc/include/asm/systbl.h   |   1 +
 arch/powerpc/include/asm/unistd.h   |   2 +-
 arch/powerpc/include/uapi/asm/unistd.h  |   1 +
 arch/powerpc/kernel/Makefile|   1 +
 arch/powerpc/kernel/elf_util.c  |  46 ++
 arch/powerpc/kernel/machine_kexec_file_64.c | 245 
 arch/powerpc/kernel/module_64.c |  71 ++--
 9 files changed, 406 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6cb59c6e5ba4..897d0f14447d 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -455,6 +455,19 @@ config KEXEC
  interface is strongly in flux, so no good recommendation can be
  made.
 
+config KEXEC_FILE
+   bool "kexec file based system call"
+   select KEXEC_CORE
+   select BUILD_BIN2C
+   depends on PPC64
+   depends on CRYPTO=y
+   depends on CRYPTO_SHA256=y
+   help
+ This is a new version of the kexec system call. This call is
+ file based and takes in file descriptors as system call arguments
+ for kernel and initramfs as opposed to a list of segments as is the
+ case for the older kexec call.
+
 config RELOCATABLE
bool "Build a relocatable kernel"
depends on (PPC64 && !COMPILE_TEST) || (FLATMEM && (44x || FSL_BOOKE))
diff --git a/arch/powerpc/include/asm/elf_util.h 
b/arch/powerpc/include/asm/elf_util.h
new file mode 100644
index ..1df232f65ec8
--- /dev/null
+++ b/arch/powerpc/include/asm/elf_util.h
@@ -0,0 +1,43 @@
+/*
+ * Utility functions to work with ELF files.
+ *
+ * Copyright (C) 2016, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef _ASM_POWERPC_ELF_UTIL_H
+#define _ASM_POWERPC_ELF_UTIL_H
+
+#include 
+
+/*
+ * r2 is the TOC pointer: it actually points 0x8000 into the TOC (this
+ * gives the value maximum span in an instruction which uses a signed
+ * offset)
+ */
+static inline unsigned long elf_my_r2(const struct elf_shdr *sechdrs,
+ unsigned int toc_section)
+{
+   return sechdrs[toc_section].sh_addr + 0x8000;
+}
+
+unsigned int elf_toc_section(const struct elfhdr *ehdr,
+const struct elf_shdr *sechdrs);
+
+int elf64_apply_relocate_add_item(const Elf64_Shdr *sechdrs, const char 
*strtab,
+ const Elf64_Rela *rela, const Elf64_Sym *sym,
+ unsigned long *location,
+ unsigned long address, unsigned long value,
+ unsigned long my_r2, const char *obj_name,
+ struct module *me);
+
+#endif /* _ASM_POWERPC_ELF_UTIL_H */
diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 2fc5d4db503c..4b369d83fe9c 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -386,3 +386,4 @@ SYSCALL(mlock2)
 SYSCALL(copy_file_range)
 COMPAT_SYS_SPU(preadv2)
 COMPAT_SYS_SPU(pwritev2)
+SYSCALL(kexec_file_load)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index cf12c580f6b2..a01e97d3f305 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h

[PATCH v9 05/10] powerpc: Factor out relocation code in module_64.c

2016-10-20 Thread Thiago Jung Bauermann
The kexec_file_load system call needs to relocate the purgatory, so
factor out the module relocation code so that it can be shared.

This patch's purpose is to move the ELF relocation logic from
apply_relocate_add to the new function elf64_apply_relocate_add_item
with as few changes as possible. The following changes were needed:

elf64_apply_relocate_add_item takes a my_r2 argument because the kexec
code can't use the my_r2 function since it doesn't have a struct module
to pass to it. For the same reason, it also takes an obj_name argument to
use in error messages. It still takes a pointer to struct module argument,
but kexec code can just pass NULL because except for the TOC symbol, the
purgatory doesn't have undefined symbols so the module pointer isn't used.

Apart from what is described in the paragraph above, the code has no
functional changes.

Suggested-by: Michael Ellerman 
Signed-off-by: Thiago Jung Bauermann 
---
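After the factoring, the per-relocation body of apply_relocate_add() reduces
to a call along these lines (a sketch using names already present in that
function, not a literal quote of the patch):

	ret = elf64_apply_relocate_add_item(sechdrs, strtab, &rela[i], sym,
					    location, value,
					    my_r2(sechdrs, me), me->name, me);
	if (ret)
		return ret;
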
 arch/powerpc/kernel/module_64.c | 344 +---
 1 file changed, 182 insertions(+), 162 deletions(-)

diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
index 183368e008cf..61baad036639 100644
--- a/arch/powerpc/kernel/module_64.c
+++ b/arch/powerpc/kernel/module_64.c
@@ -507,6 +507,181 @@ static int restore_r2(u32 *instruction, struct module *me)
return 1;
 }
 
+static int elf64_apply_relocate_add_item(const Elf64_Shdr *sechdrs,
+const char *strtab,
+const Elf64_Rela *rela,
+const Elf64_Sym *sym,
+unsigned long *location,
+unsigned long value,
+unsigned long my_r2,
+const char *obj_name,
+struct module *me)
+{
+   switch (ELF64_R_TYPE(rela->r_info)) {
+   case R_PPC64_ADDR32:
+   /* Simply set it */
+   *(u32 *)location = value;
+   break;
+
+   case R_PPC64_ADDR64:
+   /* Simply set it */
+   *(unsigned long *)location = value;
+   break;
+
+   case R_PPC64_TOC:
+   *(unsigned long *)location = my_r2;
+   break;
+
+   case R_PPC64_TOC16:
+   /* Subtract TOC pointer */
+   value -= my_r2;
+   if (value + 0x8000 > 0xffff) {
+   pr_err("%s: bad TOC16 relocation (0x%lx)\n",
+  obj_name, value);
+   return -ENOEXEC;
+   }
+   *((uint16_t *) location)
+   = (*((uint16_t *) location) & ~0xffff)
+   | (value & 0xffff);
+   break;
+
+   case R_PPC64_TOC16_LO:
+   /* Subtract TOC pointer */
+   value -= my_r2;
+   *((uint16_t *) location)
+   = (*((uint16_t *) location) & ~0xffff)
+   | (value & 0xffff);
+   break;
+
+   case R_PPC64_TOC16_DS:
+   /* Subtract TOC pointer */
+   value -= my_r2;
+   if ((value & 3) != 0 || value + 0x8000 > 0xffff) {
+   pr_err("%s: bad TOC16_DS relocation (0x%lx)\n",
+  obj_name, value);
+   return -ENOEXEC;
+   }
+   *((uint16_t *) location)
+   = (*((uint16_t *) location) & ~0xfffc)
+   | (value & 0xfffc);
+   break;
+
+   case R_PPC64_TOC16_LO_DS:
+   /* Subtract TOC pointer */
+   value -= my_r2;
+   if ((value & 3) != 0) {
+   pr_err("%s: bad TOC16_LO_DS relocation (0x%lx)\n",
+  obj_name, value);
+   return -ENOEXEC;
+   }
+   *((uint16_t *) location)
+   = (*((uint16_t *) location) & ~0xfffc)
+   | (value & 0xfffc);
+   break;
+
+   case R_PPC64_TOC16_HA:
+   /* Subtract TOC pointer */
+   value -= my_r2;
+   value = ((value + 0x8000) >> 16);
+   *((uint16_t *) location)
+   = (*((uint16_t *) location) & ~0xffff)
+   | (value & 0xffff);
+   break;
+
+   case R_PPC_REL24:
+   /* FIXME: Handle weak symbols here --RR */
+   if (sym->st_shndx == SHN_UNDEF) {
+   /* External: go via stub */
+   value = stub_for_addr(sechdrs, value, me);
+   if (!value)
+   return -ENOENT;
+   if (!restore_r2((u32 *)location + 1, me))
+   

[PATCH v9 04/10] powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE instead.

2016-10-20 Thread Thiago Jung Bauermann
Commit 2965faa5e03d ("kexec: split kexec_load syscall from kexec core
code") introduced CONFIG_KEXEC_CORE so that CONFIG_KEXEC means whether
the kexec_load system call should be compiled-in and CONFIG_KEXEC_FILE
means whether the kexec_file_load system call should be compiled-in.
These options can be set independently from each other.

Since until now powerpc only supported kexec_load, CONFIG_KEXEC and
CONFIG_KEXEC_CORE were synonyms. That is not the case anymore, so we
need to make a distinction. Almost all places where CONFIG_KEXEC was
being used should be using CONFIG_KEXEC_CORE instead, since
kexec_file_load also needs that code compiled in.

Signed-off-by: Thiago Jung Bauermann 
---
 arch/powerpc/Kconfig  | 2 +-
 arch/powerpc/include/asm/debug.h  | 2 +-
 arch/powerpc/include/asm/kexec.h  | 6 +++---
 arch/powerpc/include/asm/machdep.h| 4 ++--
 arch/powerpc/include/asm/smp.h| 2 +-
 arch/powerpc/kernel/Makefile  | 4 ++--
 arch/powerpc/kernel/head_64.S | 2 +-
 arch/powerpc/kernel/misc_32.S | 2 +-
 arch/powerpc/kernel/misc_64.S | 6 +++---
 arch/powerpc/kernel/prom.c| 2 +-
 arch/powerpc/kernel/setup_64.c| 4 ++--
 arch/powerpc/kernel/smp.c | 6 +++---
 arch/powerpc/kernel/traps.c   | 2 +-
 arch/powerpc/platforms/85xx/corenet_generic.c | 2 +-
 arch/powerpc/platforms/85xx/smp.c | 8 
 arch/powerpc/platforms/cell/spu_base.c| 2 +-
 arch/powerpc/platforms/powernv/setup.c| 6 +++---
 arch/powerpc/platforms/ps3/setup.c| 4 ++--
 arch/powerpc/platforms/pseries/Makefile   | 2 +-
 arch/powerpc/platforms/pseries/setup.c| 4 ++--
 20 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 65fba4c34cd7..6cb59c6e5ba4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -489,7 +489,7 @@ config CRASH_DUMP
 
 config FA_DUMP
bool "Firmware-assisted dump"
-   depends on PPC64 && PPC_RTAS && CRASH_DUMP && KEXEC
+   depends on PPC64 && PPC_RTAS && CRASH_DUMP && KEXEC_CORE
help
  A robust mechanism to get reliable kernel crash dump with
  assistance from firmware. This approach does not use kexec,
diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index a954e4975049..86308f177f2d 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -10,7 +10,7 @@ struct pt_regs;
 
 extern struct dentry *powerpc_debugfs_root;
 
-#if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
+#if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC_CORE)
 
 extern int (*__debugger)(struct pt_regs *regs);
 extern int (*__debugger_ipi)(struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index a46f5f45570c..eca2f975bf44 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -53,7 +53,7 @@
 
 typedef void (*crash_shutdown_t)(void);
 
-#ifdef CONFIG_KEXEC
+#ifdef CONFIG_KEXEC_CORE
 
 /*
  * This function is responsible for capturing register states if coming
@@ -91,7 +91,7 @@ static inline bool kdump_in_progress(void)
return crashing_cpu >= 0;
 }
 
-#else /* !CONFIG_KEXEC */
+#else /* !CONFIG_KEXEC_CORE */
 static inline void crash_kexec_secondary(struct pt_regs *regs) { }
 
 static inline int overlaps_crashkernel(unsigned long start, unsigned long size)
@@ -116,7 +116,7 @@ static inline bool kdump_in_progress(void)
return false;
 }
 
-#endif /* CONFIG_KEXEC */
+#endif /* CONFIG_KEXEC_CORE */
 #endif /* ! __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_KEXEC_H */
diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index e02cbc6a6c70..5011b69107a7 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -183,7 +183,7 @@ struct machdep_calls {
 */
void (*machine_shutdown)(void);
 
-#ifdef CONFIG_KEXEC
+#ifdef CONFIG_KEXEC_CORE
void (*kexec_cpu_down)(int crash_shutdown, int secondary);
 
/* Called to do what every setup is needed on image and the
@@ -198,7 +198,7 @@ struct machdep_calls {
 * no return.
 */
void (*machine_kexec)(struct kimage *image);
-#endif /* CONFIG_KEXEC */
+#endif /* CONFIG_KEXEC_CORE */
 
 #ifdef CONFIG_SUSPEND
/* These are called to disable and enable, respectively, IRQs when
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 0d02c11dc331..32db16d2e7ad 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -176,7 +176,7 @@ static inline void set_hard_smp_processor_id(int cpu, int 
phys)
 #endif /* !CONFIG_SMP */
 #endif /* !CONFIG_PPC64 */
 
-#if defined(CONFIG_PPC64) && 

[PATCH v9 03/10] kexec_file: Factor out kexec_locate_mem_hole from kexec_add_buffer.

2016-10-20 Thread Thiago Jung Bauermann
kexec_locate_mem_hole will be used by the PowerPC kexec_file_load
implementation to find free memory for the purgatory stack.

Signed-off-by: Thiago Jung Bauermann 
Acked-by: Dave Young 
---
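A rough sketch of that intended use (not part of this patch; the stack size
and stack_top are example names only). A free region is located without
adding a segment, since the purgatory stack has no contents to copy:

	struct kexec_buf kbuf = { .image = image, .buf_min = 0,
				  .buf_max = ULONG_MAX, .top_down = true };
	int ret;

	/* Only a hole is needed; nothing is copied into it. */
	kbuf.memsz = 16 * 1024;
	kbuf.buf_align = PAGE_SIZE;

	ret = kexec_locate_mem_hole(&kbuf);
	if (ret)
		return ret;

	/* The purgatory stack grows down from the top of the hole. */
	stack_top = kbuf.mem + kbuf.memsz;
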
 include/linux/kexec.h |  1 +
 kernel/kexec_file.c   | 25 -
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 437ef1b47428..a33f63351f86 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -176,6 +176,7 @@ struct kexec_buf {
 int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
   int (*func)(u64, u64, void *));
 extern int kexec_add_buffer(struct kexec_buf *kbuf);
+int kexec_locate_mem_hole(struct kexec_buf *kbuf);
 #endif /* CONFIG_KEXEC_FILE */
 
 struct kimage {
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index efd2c094af7e..0c2df7f73792 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -450,6 +450,23 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
 }
 
 /**
+ * kexec_locate_mem_hole - find free memory for the purgatory or the next 
kernel
+ * @kbuf:  Parameters for the memory search.
+ *
+ * On success, kbuf->mem will have the start address of the memory region 
found.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int kexec_locate_mem_hole(struct kexec_buf *kbuf)
+{
+   int ret;
+
+   ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
+
+   return ret == 1 ? 0 : -EADDRNOTAVAIL;
+}
+
+/**
  * kexec_add_buffer - place a buffer in a kexec segment
  * @kbuf:  Buffer contents and memory parameters.
  *
@@ -489,11 +506,9 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE);
 
/* Walk the RAM ranges and allocate a suitable range for the buffer */
-   ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
-   if (ret != 1) {
-   /* A suitable memory range could not be found for buffer */
-   return -EADDRNOTAVAIL;
-   }
+   ret = kexec_locate_mem_hole(kbuf);
+   if (ret)
+   return ret;
 
/* Found a suitable memory range */
ksegment = &kbuf->image->segment[kbuf->image->nr_segments];
-- 
2.7.4



[PATCH v9 02/10] kexec_file: Change kexec_add_buffer to take kexec_buf as argument.

2016-10-20 Thread Thiago Jung Bauermann
This is done to simplify the kexec_add_buffer argument list.
Adapt all callers to set up a kexec_buf to pass to kexec_add_buffer.

In addition, change the type of kexec_buf.buffer from char * to void *.
There is no particular reason for it to be a char *, and the change
allows us to get rid of 3 existing casts to char * in the code.

Signed-off-by: Thiago Jung Bauermann 
Acked-by: Dave Young 
Acked-by: Balbir Singh 
---
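An illustrative caller under the new convention (a sketch, not part of this
patch; image, payload and payload_size are placeholder names):

	struct kexec_buf kbuf = { .image = image, .buf_min = 0,
				  .buf_max = ULONG_MAX, .top_down = false };
	int ret;

	/* Describe the buffer that should become a kexec segment. */
	kbuf.buffer = payload;
	kbuf.bufsz = payload_size;
	kbuf.memsz = payload_size;
	kbuf.buf_align = PAGE_SIZE;

	ret = kexec_add_buffer(&kbuf);
	if (ret)
		return ret;

	/* On success, kbuf.mem holds the address chosen for the segment. */
	pr_debug("segment placed at 0x%lx\n", kbuf.mem);
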
 arch/x86/kernel/crash.c   | 37 
 arch/x86/kernel/kexec-bzimage64.c | 48 +++--
 include/linux/kexec.h |  8 +---
 kernel/kexec_file.c   | 88 ++-
 4 files changed, 87 insertions(+), 94 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 650830e39e3a..3741461c63a0 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -631,9 +631,9 @@ static int determine_backup_region(u64 start, u64 end, void 
*arg)
 
 int crash_load_segments(struct kimage *image)
 {
-   unsigned long src_start, src_sz, elf_sz;
-   void *elf_addr;
int ret;
+   struct kexec_buf kbuf = { .image = image, .buf_min = 0,
+ .buf_max = ULONG_MAX, .top_down = false };
 
/*
 * Determine and load a segment for backup area. First 640K RAM
@@ -647,43 +647,44 @@ int crash_load_segments(struct kimage *image)
if (ret < 0)
return ret;
 
-   src_start = image->arch.backup_src_start;
-   src_sz = image->arch.backup_src_sz;
-
/* Add backup segment. */
-   if (src_sz) {
+   if (image->arch.backup_src_sz) {
+   kbuf.buffer = &crash_zero_bytes;
+   kbuf.bufsz = sizeof(crash_zero_bytes);
+   kbuf.memsz = image->arch.backup_src_sz;
+   kbuf.buf_align = PAGE_SIZE;
/*
 * Ideally there is no source for backup segment. This is
 * copied in purgatory after crash. Just add a zero filled
 * segment for now to make sure checksum logic works fine.
 */
-   ret = kexec_add_buffer(image, (char *)&crash_zero_bytes,
-  sizeof(crash_zero_bytes), src_sz,
-  PAGE_SIZE, 0, -1, 0,
-  &image->arch.backup_load_addr);
+   ret = kexec_add_buffer(&kbuf);
if (ret)
return ret;
+   image->arch.backup_load_addr = kbuf.mem;
pr_debug("Loaded backup region at 0x%lx backup_start=0x%lx 
memsz=0x%lx\n",
-image->arch.backup_load_addr, src_start, src_sz);
+image->arch.backup_load_addr,
+image->arch.backup_src_start, kbuf.memsz);
}
 
/* Prepare elf headers and add a segment */
-   ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
+   ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz);
if (ret)
return ret;
 
-   image->arch.elf_headers = elf_addr;
-   image->arch.elf_headers_sz = elf_sz;
+   image->arch.elf_headers = kbuf.buffer;
+   image->arch.elf_headers_sz = kbuf.bufsz;
 
-   ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,
-   ELF_CORE_HEADER_ALIGN, 0, -1, 0,
-   &image->arch.elf_load_addr);
+   kbuf.memsz = kbuf.bufsz;
+   kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
+   ret = kexec_add_buffer(&kbuf);
if (ret) {
vfree((void *)image->arch.elf_headers);
return ret;
}
+   image->arch.elf_load_addr = kbuf.mem;
pr_debug("Loaded ELF headers at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
-image->arch.elf_load_addr, elf_sz, elf_sz);
+image->arch.elf_load_addr, kbuf.bufsz, kbuf.bufsz);
 
return ret;
 }
diff --git a/arch/x86/kernel/kexec-bzimage64.c 
b/arch/x86/kernel/kexec-bzimage64.c
index 3407b148c240..d0a814a9d96a 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -331,17 +331,17 @@ static void *bzImage64_load(struct kimage *image, char 
*kernel,
 
struct setup_header *header;
int setup_sects, kern16_size, ret = 0;
-   unsigned long setup_header_size, params_cmdline_sz, params_misc_sz;
+   unsigned long setup_header_size, params_cmdline_sz;
struct boot_params *params;
unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
unsigned long purgatory_load_addr;
-   unsigned long kernel_bufsz, kernel_memsz, kernel_align;
-   char *kernel_buf;
struct bzimage64_data *ldata;
struct kexec_entry64_regs regs64;
void *stack;
unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
+   struct 

[PATCH v9 01/10] kexec_file: Allow arch-specific memory walking for kexec_add_buffer

2016-10-20 Thread Thiago Jung Bauermann
Allow architectures to specify a different memory walking function for
kexec_add_buffer. x86 uses iomem to track reserved memory ranges, but
PowerPC uses the memblock subsystem.

Signed-off-by: Thiago Jung Bauermann 
Acked-by: Dave Young 
Acked-by: Balbir Singh 
---
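For illustration only, an architecture override based on memblock could look
roughly like the sketch below. It is simplified (it ignores top-down
placement and the crash-kernel case) and is not the code added later in this
series:

	int arch_kexec_walk_mem(struct kexec_buf *kbuf,
				int (*func)(u64, u64, void *))
	{
		phys_addr_t start, end;
		u64 i;
		int ret = 0;

		/* Walk free memory as tracked by memblock. */
		for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
					&start, &end, NULL) {
			ret = func(start, end, kbuf);
			if (ret)
				break;
		}

		return ret;
	}
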
 include/linux/kexec.h   | 29 -
 kernel/kexec_file.c | 30 ++
 kernel/kexec_internal.h | 16 
 3 files changed, 50 insertions(+), 25 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 406c33dcae13..5e320ddaaa82 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -148,7 +148,34 @@ struct kexec_file_ops {
kexec_verify_sig_t *verify_sig;
 #endif
 };
-#endif
+
+/**
+ * struct kexec_buf - parameters for finding a place for a buffer in memory
+ * @image: kexec image in which memory to search.
+ * @buffer:Contents which will be copied to the allocated memory.
+ * @bufsz: Size of @buffer.
+ * @mem:   On return will have address of the buffer in memory.
+ * @memsz: Size for the buffer in memory.
+ * @buf_align: Minimum alignment needed.
+ * @buf_min:   The buffer can't be placed below this address.
+ * @buf_max:   The buffer can't be placed above this address.
+ * @top_down:  Allocate from top of memory.
+ */
+struct kexec_buf {
+   struct kimage *image;
+   char *buffer;
+   unsigned long bufsz;
+   unsigned long mem;
+   unsigned long memsz;
+   unsigned long buf_align;
+   unsigned long buf_min;
+   unsigned long buf_max;
+   bool top_down;
+};
+
+int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
+  int (*func)(u64, u64, void *));
+#endif /* CONFIG_KEXEC_FILE */
 
 struct kimage {
kimage_entry_t head;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 037c321c5618..f865674bff51 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -428,6 +428,27 @@ static int locate_mem_hole_callback(u64 start, u64 end, 
void *arg)
return locate_mem_hole_bottom_up(start, end, kbuf);
 }
 
+/**
+ * arch_kexec_walk_mem - call func(data) on free memory regions
+ * @kbuf:  Context info for the search. Also passed to @func.
+ * @func:  Function to call for each memory region.
+ *
+ * Return: The memory walk will stop when func returns a non-zero value
+ * and that value will be returned. If all free regions are visited without
+ * func returning non-zero, then zero will be returned.
+ */
+int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
+  int (*func)(u64, u64, void *))
+{
+   if (kbuf->image->type == KEXEC_TYPE_CRASH)
+   return walk_iomem_res_desc(crashk_res.desc,
+  IORESOURCE_SYSTEM_RAM | 
IORESOURCE_BUSY,
+  crashk_res.start, crashk_res.end,
+  kbuf, func);
+   else
+   return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
+}
+
 /*
  * Helper function for placing a buffer in a kexec segment. This assumes
  * that kexec_mutex is held.
@@ -474,14 +495,7 @@ int kexec_add_buffer(struct kimage *image, char *buffer, 
unsigned long bufsz,
kbuf->top_down = top_down;
 
/* Walk the RAM ranges and allocate a suitable range for the buffer */
-   if (image->type == KEXEC_TYPE_CRASH)
-   ret = walk_iomem_res_desc(crashk_res.desc,
-   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
-   crashk_res.start, crashk_res.end, kbuf,
-   locate_mem_hole_callback);
-   else
-   ret = walk_system_ram_res(0, -1, kbuf,
- locate_mem_hole_callback);
+   ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
if (ret != 1) {
/* A suitable memory range could not be found for buffer */
return -EADDRNOTAVAIL;
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 0a52315d9c62..4cef7e4706b0 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -20,22 +20,6 @@ struct kexec_sha_region {
unsigned long len;
 };
 
-/*
- * Keeps track of buffer parameters as provided by caller for requesting
- * memory placement of buffer.
- */
-struct kexec_buf {
-   struct kimage *image;
-   char *buffer;
-   unsigned long bufsz;
-   unsigned long mem;
-   unsigned long memsz;
-   unsigned long buf_align;
-   unsigned long buf_min;
-   unsigned long buf_max;
-   bool top_down;  /* allocate from top of memory hole */
-};
-
 void kimage_file_post_load_cleanup(struct kimage *image);
 #else /* CONFIG_KEXEC_FILE */
 static inline void kimage_file_post_load_cleanup(struct kimage *image) { }
-- 
2.7.4



[PATCH v9 00/10] kexec_file_load implementation for PowerPC

2016-10-20 Thread Thiago Jung Bauermann
Hello,

This version has the following changes:

1. Rebased on v4.9-rc1. This should fix the conflicts in next and -mm
   that these patches were having with the s/CONFIG_WORD_SIZE/BITS/ change.

2. Changed patch "powerpc: Factor out relocation code in module_64.c" to
   make as little changes as possible to arch/powerpc/module_64.c and
   . This meant factoring out only the switch statement from
   apply_relocate_add into its own function, and keeping it in module_64.c.
   arch_kexec_apply_relocations_add and apply_relocate_add share a bit less
   code now. This addresses a concern expressed by Michael Ellerman that
   there were too many changes being made to the module loading code.

3. Reduced number of patches in the series by squashing one patch and
   redistributing the code in two other patches. They were smaller after
   the change above and didn't make that much sense on their own anymore.

4. Moved code implementing kexec_file_load to a new file instead of adding
   it to the existing arch/powerpc/machine_kexec_64.c.

5. Cleaned up the purgatory code a bit to fix checkpatch warnings: removed
   unused code related to crashdump support and added __section(".data")
   to global variables instead of initializing them to zero to avoid them
   going to .bss.

The changes in 3. and 4. made the patches somewhat different from v8 but
it's just code moving from one patch to another, or to a new file. Apart
from what is described in 2. and 5., there's very little actual code
changed. The patches are all bisectable, each one was tested with
several kernel configs.

The gory details are in the detailed changelog at the end of this email.

Original cover letter:

This patch series implements the kexec_file_load system call on PowerPC.

This system call moves the reading of the kernel, initrd and the device tree
from the userspace kexec tool to the kernel. This is needed if you want to
do one or both of the following:

1. only allow loading of signed kernels.
2. "measure" (i.e., record the hashes of) the kernel, initrd, kernel
   command line and other boot inputs for the Integrity Measurement
   Architecture subsystem.

The above are the functions kexec already has built into kexec_file_load.
Yesterday I posted a set of patches which allows a third feature:

3. have IMA pass on its event log (where integrity measurements are
   registered) across kexec to the second kernel, so that the event
   history is preserved.

Because OpenPower uses an intermediary Linux instance as a boot loader
(skiroot), feature 1 is needed to implement secure boot for the platform,
while features 2 and 3 are needed to implement trusted boot.

This patch series starts by removing an x86 assumption from kexec_file:
kexec_add_buffer uses iomem to find reserved memory ranges, but PowerPC
uses the memblock subsystem.  A hook is added so that each arch can
specify how memory ranges can be found.

Also, the memory-walking logic in kexec_add_buffer is useful in this
implementation to find a free area for the purgatory's stack, so the
next patch moves that logic to kexec_locate_mem_hole.

The kexec_file_load system call needs to apply relocations to the
purgatory but adding code for that would duplicate functionality with
the module loading mechanism, which also needs to apply relocations to
the kernel modules.  Therefore, this patch series factors out the module
relocation code so that it can be shared.

One thing that is still missing is crashkernel support, which I intend
to submit shortly. For now, arch_kexec_kernel_image_probe rejects crash
kernels.

This code is based on kexec-tools, but with many modifications to adapt
it to the kernel environment and facilities. Except the purgatory,
which only has minimal changes.

Changes for v9:
- Rebased on top of v4.9-rc1
- Patch "powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE 
instead."
  - Fixed conflict with patch renaming CONFIG_WORD_SIZE to BITS.
- Patch "powerpc: Factor out relocation code from module_64.c to elf_util_64.c."
  - Retitled to "powerpc: Factor out relocation code in module_64.c"
  - Fixed conflict with patch renaming CONFIG_WORD_SIZE to BITS.
  - Put relocation code in function elf64_apply_relocate_add_item in
module_64.c itself instead of moving it to elf_util_64.c.
  - Only factored out the switch statement to apply the relocations
instead of the whole logic to iterate through all of the relocations.
  - There's no elf_util_64.c anymore.
  - Doesn't change  anymore.
  - Doesn't change arch/powerpc/kernel/Makefile anymore.
  - Moved creation of  to patch "Implement kexec_file_load."
- Patch "powerpc: Generalize elf64_apply_relocate_add."
  - Dropped patch from series.
  - Moved changes adding address variable to patch "powerpc: Implement
kexec_file_load."
- Patch "powerpc: Adapt elf64_apply_relocate_add for kexec_file_load."
  - Dropped patch from series.
  - Moved new relocs needed by the purgatory to patch "powerpc: Implement

Re: [PATCH kernel v3 3/4] vfio/spapr: Cache mm in tce_container

2016-10-20 Thread Nicholas Piggin
On Fri, 21 Oct 2016 11:21:34 +1100
David Gibson  wrote:

> On Thu, Oct 20, 2016 at 06:31:21PM +1100, Nicholas Piggin wrote:
> > On Thu, 20 Oct 2016 14:03:49 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> > > In some situations the userspace memory context may live longer than
> > > the userspace process itself so if we need to do proper memory context
> > > cleanup, we better cache @mm and use it later when the process is gone
> > > (@current or @current->mm is NULL).
> > > 
> > > This references mm and stores the pointer in the container; this is done
> > > when a container is just created so checking for !current->mm in other
> > > places becomes pointless.
> > > 
> > > This replaces current->mm with container->mm everywhere except debug
> > > prints.
> > > 
> > > This adds a check that current->mm is the same as the one stored in
> > > the container to prevent userspace from registering memory in other
> > > processes.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy 
> > > ---
> > >  drivers/vfio/vfio_iommu_spapr_tce.c | 127 
> > > 
> > >  1 file changed, 71 insertions(+), 56 deletions(-)
> > > 
> > > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> > > b/drivers/vfio/vfio_iommu_spapr_tce.c
> > > index d0c38b2..6b0b121 100644
> > > --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > > @@ -31,49 +31,46 @@  
> > 
> > Does it make sense to move the rest of these hunks into patch 2?
> > I think they're similarly just moving the mm reference into callers.
> > 
> >   
> > >  static void tce_iommu_detach_group(void *iommu_data,
> > >   struct iommu_group *iommu_group);
> > >  
> > > -static long try_increment_locked_vm(long npages)
> > > +static long try_increment_locked_vm(struct mm_struct *mm, long npages)
> > >  {
> > >   long ret = 0, locked, lock_limit;
> > >  
> > > - if (!current || !current->mm)
> > > - return -ESRCH; /* process exited */
> > > -
> > >   if (!npages)
> > >   return 0;
> > >  
> > > - down_write(&current->mm->mmap_sem);
> > > - locked = current->mm->locked_vm + npages;
> > > + down_write(&mm->mmap_sem);
> > > + locked = mm->locked_vm + npages;
> > >   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > >   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> > >   ret = -ENOMEM;
> > >   else
> > > - current->mm->locked_vm += npages;
> > > + mm->locked_vm += npages;
> > >  
> > >   pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
> > >   npages << PAGE_SHIFT,
> > > - current->mm->locked_vm << PAGE_SHIFT,
> > > + mm->locked_vm << PAGE_SHIFT,
> > >   rlimit(RLIMIT_MEMLOCK),
> > >   ret ? " - exceeded" : "");
> > >  
> > > - up_write(&current->mm->mmap_sem);
> > > + up_write(&mm->mmap_sem);
> > >  
> > >   return ret;
> > >  }
> > >  
> > > -static void decrement_locked_vm(long npages)
> > > +static void decrement_locked_vm(struct mm_struct *mm, long npages)
> > >  {
> > > - if (!current || !current->mm || !npages)
> > > + if (!mm || !npages)
> > >   return; /* process exited */  
> > 
> > I know you're trying to be defensive and change as little logic as possible,
> > but some cases should be an error, and I think some of the "process exited"
> > comments were wrong anyway.
> > 
> > Maybe pull the !mm test into the caller and make it WARN_ON?
> > 
> >   
> > > @@ -317,6 +311,9 @@ static void *tce_iommu_open(unsigned long arg)
> > >   return ERR_PTR(-EINVAL);
> > >   }
> > >  
> > > + if (!current->mm)
> > > + return ERR_PTR(-ESRCH); /* process exited */  
> > 
> > A userspace thread in the kernel can't have its mm disappear, unless you
> > are actually in the exit code. !current->mm is more like a test for a kernel
> > thread.
> > 
> >   
> > > +
> > >   container = kzalloc(sizeof(*container), GFP_KERNEL);
> > >   if (!container)
> > >   return ERR_PTR(-ENOMEM);
> > > @@ -326,13 +323,17 @@ static void *tce_iommu_open(unsigned long arg)
> > >  
> > >   container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
> > >  
> > > + container->mm = current->mm;
> > > + atomic_inc(&container->mm->mm_count);
> > > +
> > >   return container;  
> > 
> > It's a nitpick if you respin the patch, but I guess it would better be
> > described as a reference than a cache of the object. "have tce_container
> > take a reference to mm_struct".
> > 
> >   
> > > @@ -515,13 +526,16 @@ static long tce_iommu_build_v2(struct tce_container 
> > > *container,
> > >   unsigned long hpa;
> > >   enum dma_data_direction dirtmp;
> > >  
> > > + if (container->mm != current->mm)
> > > + return -ESRCH;  
> > 
> > Good, is this condition now enforced on all entrypoints that use
> > container->mm (except the final teardown)? (The mlock/rlimit stuff,
> > as we talked about before, doesn't make sense if not).  
> 
> Right.  I don't know that it's 

Re: [PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check

2016-10-20 Thread Pan Xinhui



在 2016/10/21 09:23, Boqun Feng 写道:

On Thu, Oct 20, 2016 at 05:27:54PM -0400, Pan Xinhui wrote:

Commit ("x86, kvm: support vcpu preempted check") add one field "__u8
preempted" into struct kvm_steal_time. This field tells if one vcpu is
running or not.

It is zero if 1) some old KVM deos not support this filed. 2) the vcpu is
preempted. Other values means the vcpu has been preempted.

  ^
s/preempted/not preempted


Yes, the missing *not* definitely should be avoided.

And better to fix other typos in the commit log ;-)
Maybe you can try aspell? That works for me.


I will try it. :)


Regards,
Boqun



Signed-off-by: Pan Xinhui 
---
 Documentation/virtual/kvm/msr.txt | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt 
b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..3376f13 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,8 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
__u64 steal;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u8  preempted;
+   __u32 pad[11];
}

whose data will be filled in by the hypervisor periodically. Only one
@@ -232,6 +233,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
nanoseconds. Time during which the vcpu is idle, will not be
reported as steal time.

+   preempted: indicate the VCPU who owns this struct is running or
+   not. Non-zero values mean the VCPU has been preempted. Zero
+   means the VCPU is not preempted. NOTE, it is always zero if the
+   the hypervisor doesn't support this field.
+
 MSR_KVM_EOI_EN: 0x4b564d04
data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
when disabled.  Bit 1 is reserved and must be zero.  When PV end of
--
2.4.11





Re: [PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check

2016-10-20 Thread Boqun Feng
On Thu, Oct 20, 2016 at 05:27:54PM -0400, Pan Xinhui wrote:
> Commit ("x86, kvm: support vcpu preempted check") add one field "__u8
> preempted" into struct kvm_steal_time. This field tells if one vcpu is
> running or not.
> 
> It is zero if 1) some old KVM deos not support this filed. 2) the vcpu is
> preempted. Other values means the vcpu has been preempted.
  ^
s/preempted/not preempted

And better to fix other typos in the commit log ;-)
Maybe you can try aspell? That works for me.

Regards,
Boqun

> 
> Signed-off-by: Pan Xinhui 
> ---
>  Documentation/virtual/kvm/msr.txt | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virtual/kvm/msr.txt 
> b/Documentation/virtual/kvm/msr.txt
> index 2a71c8f..3376f13 100644
> --- a/Documentation/virtual/kvm/msr.txt
> +++ b/Documentation/virtual/kvm/msr.txt
> @@ -208,7 +208,8 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
>   __u64 steal;
>   __u32 version;
>   __u32 flags;
> - __u32 pad[12];
> + __u8  preempted;
> + __u32 pad[11];
>   }
>  
>   whose data will be filled in by the hypervisor periodically. Only one
> @@ -232,6 +233,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
>   nanoseconds. Time during which the vcpu is idle, will not be
>   reported as steal time.
>  
> + preempted: indicate the VCPU who owns this struct is running or
> + not. Non-zero values mean the VCPU has been preempted. Zero
> + means the VCPU is not preempted. NOTE, it is always zero if the
> + the hypervisor doesn't support this field.
> +
>  MSR_KVM_EOI_EN: 0x4b564d04
>   data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
>   when disabled.  Bit 1 is reserved and must be zero.  When PV end of
> -- 
> 2.4.11
> 


signature.asc
Description: PGP signature


Re: [PATCH kernel v3 3/4] vfio/spapr: Cache mm in tce_container

2016-10-20 Thread David Gibson
On Thu, Oct 20, 2016 at 02:03:49PM +1100, Alexey Kardashevskiy wrote:
> In some situations the userspace memory context may live longer than
> the userspace process itself so if we need to do proper memory context
> cleanup, we better cache @mm and use it later when the process is gone
> (@current or @current->mm is NULL).
> 
> This references mm and stores the pointer in the container; this is done
> when a container is just created so checking for !current->mm in other
> places becomes pointless.
> 
> This replaces current->mm with container->mm everywhere except debug
> prints.
> 
> This adds a check that current->mm is the same as the one stored in
> the container to prevent userspace from registering memory in other
> processes.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 127 
> 
>  1 file changed, 71 insertions(+), 56 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index d0c38b2..6b0b121 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -31,49 +31,46 @@
>  static void tce_iommu_detach_group(void *iommu_data,
>   struct iommu_group *iommu_group);
>  
> -static long try_increment_locked_vm(long npages)
> +static long try_increment_locked_vm(struct mm_struct *mm, long npages)
>  {
>   long ret = 0, locked, lock_limit;
>  
> - if (!current || !current->mm)
> - return -ESRCH; /* process exited */
> -
>   if (!npages)
>   return 0;
>  
> - down_write(&current->mm->mmap_sem);
> - locked = current->mm->locked_vm + npages;
> + down_write(&mm->mmap_sem);
> + locked = mm->locked_vm + npages;
>   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
>   ret = -ENOMEM;
>   else
> - current->mm->locked_vm += npages;
> + mm->locked_vm += npages;
>  
>   pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
>   npages << PAGE_SHIFT,
> - current->mm->locked_vm << PAGE_SHIFT,
> + mm->locked_vm << PAGE_SHIFT,
>   rlimit(RLIMIT_MEMLOCK),
>   ret ? " - exceeded" : "");
>  
> - up_write(&current->mm->mmap_sem);
> + up_write(&mm->mmap_sem);
>  
>   return ret;
>  }
>  
> -static void decrement_locked_vm(long npages)
> +static void decrement_locked_vm(struct mm_struct *mm, long npages)
>  {
> - if (!current || !current->mm || !npages)
> + if (!mm || !npages)
>   return; /* process exited */
>  
> - down_write(&current->mm->mmap_sem);
> - if (WARN_ON_ONCE(npages > current->mm->locked_vm))
> - npages = current->mm->locked_vm;
> - current->mm->locked_vm -= npages;
> + down_write(&mm->mmap_sem);
> + if (WARN_ON_ONCE(npages > mm->locked_vm))
> + npages = mm->locked_vm;
> + mm->locked_vm -= npages;
>   pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
>   npages << PAGE_SHIFT,
> - current->mm->locked_vm << PAGE_SHIFT,
> + mm->locked_vm << PAGE_SHIFT,
>   rlimit(RLIMIT_MEMLOCK));
> - up_write(&current->mm->mmap_sem);
> + up_write(&mm->mmap_sem);
>  }
>  
>  /*
> @@ -98,6 +95,7 @@ struct tce_container {
>   bool enabled;
>   bool v2;
>   unsigned long locked_pages;
> + struct mm_struct *mm;
>   struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>   struct list_head group_list;
>  };
> @@ -113,11 +111,11 @@ static long tce_iommu_unregister_pages(struct 
> tce_container *container,
>   if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK))
>   return -EINVAL;
>  
> - mem = mm_iommu_find(current->mm, vaddr, size >> PAGE_SHIFT);
> + mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT);
>   if (!mem)
>   return -ENOENT;
>  
> - return mm_iommu_put(current->mm, mem);
> + return mm_iommu_put(container->mm, mem);
>  }
>  
>  static long tce_iommu_register_pages(struct tce_container *container,
> @@ -134,7 +132,7 @@ static long tce_iommu_register_pages(struct tce_container 
> *container,
>   ((vaddr + size) < vaddr))
>   return -EINVAL;
>  
> - ret = mm_iommu_get(current->mm, vaddr, entries, &mem);
> + ret = mm_iommu_get(container->mm, vaddr, entries, &mem);
>   if (ret)
>   return ret;
>  
> @@ -143,7 +141,8 @@ static long tce_iommu_register_pages(struct tce_container 
> *container,
>   return 0;
>  }
>  
> -static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl)
> +static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl,
> + struct mm_struct *mm)
>  {
>   unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) *
>   tbl->it_size, PAGE_SIZE);

Re: [PATCH kernel v3 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

2016-10-20 Thread David Gibson
On Thu, Oct 20, 2016 at 02:03:50PM +1100, Alexey Kardashevskiy wrote:
> At the moment the userspace tool is expected to request pinning of
> the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
> When the userspace process finishes, all the pinned pages need to
> be put; this is done as a part of the userspace memory context (MM)
> destruction which happens on the very last mmdrop().
> 
> This approach has a problem that a MM of the userspace process
> may live longer than the userspace process itself as kernel threads
> use userspace process MMs which was runnning on a CPU where
> the kernel thread was scheduled to. If this happened, the MM remains
> referenced until this exact kernel thread wakes up again
> and releases the very last reference to the MM, on an idle system this
> can take even hours.
> 
> This moves preregistered regions tracking from MM to VFIO; insteads of
> using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
> added so each container releases regions which it has pre-registered.
> 
> This changes the userspace interface to return EBUSY if a memory
> region is already registered in a container. However it should not
> have any practical effect as the only userspace tool available now
> does register memory region once per container anyway.
> 
> As tce_iommu_register_pages/tce_iommu_unregister_pages are called
> under container->lock, this does not need additional locking.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: Nicholas Piggin 
> ---
> Changes:
> v3:
> * moved tce_iommu_prereg_free() call out of list_for_each_entry()
> 
> v2:
> * updated commit log
> ---
>  arch/powerpc/mm/mmu_context_book3s64.c |  4 ---
>  arch/powerpc/mm/mmu_context_iommu.c| 11 
>  drivers/vfio/vfio_iommu_spapr_tce.c| 49 
> +-
>  3 files changed, 48 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
> b/arch/powerpc/mm/mmu_context_book3s64.c
> index ad82735..1a07969 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct 
> mm_struct *mm)
>  
>  void destroy_context(struct mm_struct *mm)
>  {
> -#ifdef CONFIG_SPAPR_TCE_IOMMU
> - mm_iommu_cleanup(mm);
> -#endif
> -
>  #ifdef CONFIG_PPC_ICSWX
>   drop_cop(mm->context.acop, mm);
>   kfree(mm->context.cop_lockp);
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
> b/arch/powerpc/mm/mmu_context_iommu.c
> index 4c6db09..104bad0 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -365,14 +365,3 @@ void mm_iommu_init(struct mm_struct *mm)
>  {
>   INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
>  }
> -
> -void mm_iommu_cleanup(struct mm_struct *mm)
> -{
> - struct mm_iommu_table_group_mem_t *mem, *tmp;
> -
> - list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
> - next) {
> - list_del_rcu(&mem->next);
> - mm_iommu_do_free(mem);
> - }
> -}
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 6b0b121..3e2f757 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -86,6 +86,15 @@ struct tce_iommu_group {
>  };
>  
>  /*
> + * A container needs to remember which preregistered region  it has
> + * referenced to do proper cleanup at the userspace process exit.
> + */
> +struct tce_iommu_prereg {
> + struct list_head next;
> + struct mm_iommu_table_group_mem_t *mem;
> +};
> +
> +/*
>   * The container descriptor supports only a single group per container.
>   * Required by the API as the container is not supplied with the IOMMU group
>   * at the moment of initialization.
> @@ -98,12 +107,27 @@ struct tce_container {
>   struct mm_struct *mm;
>   struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>   struct list_head group_list;
> + struct list_head prereg_list;
>  };
>  
> +static long tce_iommu_prereg_free(struct tce_container *container,
> + struct tce_iommu_prereg *tcemem)
> +{
> + long ret;
> +
> + list_del(&tcemem->next);
> + ret = mm_iommu_put(container->mm, tcemem->mem);
> + kfree(tcemem);
> +
> + return ret;
> +}
> +
>  static long tce_iommu_unregister_pages(struct tce_container *container,
>   __u64 vaddr, __u64 size)
>  {
>   struct mm_iommu_table_group_mem_t *mem;
> + struct tce_iommu_prereg *tcemem;
> + bool found = false;
>  
>   if (!current || !current->mm)
>   return -ESRCH; /* process exited */
> @@ -115,7 +139,17 @@ static long tce_iommu_unregister_pages(struct 
> tce_container *container,
>   if (!mem)
>   return -ENOENT;
>  
> - return mm_iommu_put(container->mm, mem);
> + list_for_each_entry(tcemem, &container->prereg_list, next) {
> + if 

Re: [PATCH kernel v3 3/4] vfio/spapr: Cache mm in tce_container

2016-10-20 Thread David Gibson
On Thu, Oct 20, 2016 at 06:31:21PM +1100, Nicholas Piggin wrote:
> On Thu, 20 Oct 2016 14:03:49 +1100
> Alexey Kardashevskiy  wrote:
> 
> > In some situations the userspace memory context may live longer than
> > the userspace process itself so if we need to do proper memory context
> > cleanup, we better cache @mm and use it later when the process is gone
> > (@current or @current->mm is NULL).
> > 
> > This references mm and stores the pointer in the container; this is done
> > when a container is just created so checking for !current->mm in other
> > places becomes pointless.
> > 
> > This replaces current->mm with container->mm everywhere except debug
> > prints.
> > 
> > This adds a check that current->mm is the same as the one stored in
> > the container to prevent userspace from registering memory in other
> > processes.
> > 
> > Signed-off-by: Alexey Kardashevskiy 
> > ---
> >  drivers/vfio/vfio_iommu_spapr_tce.c | 127 
> > 
> >  1 file changed, 71 insertions(+), 56 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> > b/drivers/vfio/vfio_iommu_spapr_tce.c
> > index d0c38b2..6b0b121 100644
> > --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > @@ -31,49 +31,46 @@
> 
> Does it make sense to move the rest of these hunks into patch 2?
> I think they're similarly just moving the mm reference into callers.
> 
> 
> >  static void tce_iommu_detach_group(void *iommu_data,
> > struct iommu_group *iommu_group);
> >  
> > -static long try_increment_locked_vm(long npages)
> > +static long try_increment_locked_vm(struct mm_struct *mm, long npages)
> >  {
> > long ret = 0, locked, lock_limit;
> >  
> > -   if (!current || !current->mm)
> > -   return -ESRCH; /* process exited */
> > -
> > if (!npages)
> > return 0;
> >  
> > -   down_write(&current->mm->mmap_sem);
> > -   locked = current->mm->locked_vm + npages;
> > +   down_write(&mm->mmap_sem);
> > +   locked = mm->locked_vm + npages;
> > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> > ret = -ENOMEM;
> > else
> > -   current->mm->locked_vm += npages;
> > +   mm->locked_vm += npages;
> >  
> > pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
> > npages << PAGE_SHIFT,
> > -   current->mm->locked_vm << PAGE_SHIFT,
> > +   mm->locked_vm << PAGE_SHIFT,
> > rlimit(RLIMIT_MEMLOCK),
> > ret ? " - exceeded" : "");
> >  
> > -   up_write(&current->mm->mmap_sem);
> > +   up_write(&mm->mmap_sem);
> >  
> > return ret;
> >  }
> >  
> > -static void decrement_locked_vm(long npages)
> > +static void decrement_locked_vm(struct mm_struct *mm, long npages)
> >  {
> > -   if (!current || !current->mm || !npages)
> > +   if (!mm || !npages)
> > return; /* process exited */
> 
> I know you're trying to be defensive and change as little logic as possible,
> but some cases should be an error, and I think some of the "process exited"
> comments were wrong anyway.
> 
> Maybe pull the !mm test into the caller and make it WARN_ON?
> 
> 
> > @@ -317,6 +311,9 @@ static void *tce_iommu_open(unsigned long arg)
> > return ERR_PTR(-EINVAL);
> > }
> >  
> > +   if (!current->mm)
> > +   return ERR_PTR(-ESRCH); /* process exited */
> 
> A userspace thread in the kernel can't have its mm disappear, unless you
> are actually in the exit code. !current->mm is more like a test for a kernel
> thread.
> 
> 
> > +
> > container = kzalloc(sizeof(*container), GFP_KERNEL);
> > if (!container)
> > return ERR_PTR(-ENOMEM);
> > @@ -326,13 +323,17 @@ static void *tce_iommu_open(unsigned long arg)
> >  
> > container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
> >  
> > +   container->mm = current->mm;
> > +   atomic_inc(&container->mm->mm_count);
> > +
> > return container;
> 
> It's a nitpick if you respin the patch, but I guess it would better be
> described as a reference than a cache of the object. "have tce_container
> take a reference to mm_struct".
> 
> 
> > @@ -515,13 +526,16 @@ static long tce_iommu_build_v2(struct tce_container 
> > *container,
> > unsigned long hpa;
> > enum dma_data_direction dirtmp;
> >  
> > +   if (container->mm != current->mm)
> > +   return -ESRCH;
> 
> Good, is this condition now enforced on all entrypoints that use
> container->mm (except the final teardown)? (The mlock/rlimit stuff,
> as we talked about before, doesn't make sense if not).

Right.  I don't know that it's actually dangerous, but I think it
would be needlessly weird for one process to be able to manipulate
another process's mm via the container fd.  So all the entry points
that are directly called from userspace (basically, the ioctl()s)
should verify that current->mm matches container->mm.
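For illustration, a minimal sketch of that ownership check at an ioctl
entry point (hypothetical helper, not part of the posted patch):

static long tce_iommu_check_owner(struct tce_container *container)
{
	/* only the process that created the container may drive it */
	if (container->mm != current->mm)
		return -ESRCH;
	return 0;
}

Each userspace-facing ioctl handler would call this before touching
container->mm.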

Re: [PATCH] kernel: irq: fix build failure

2016-10-20 Thread Stephen Rothwell
Hi Thomas,

On Thu, 20 Oct 2016 14:55:45 +0200 (CEST) Thomas Gleixner  
wrote:
>
> On Mon, 10 Oct 2016, Sudip Mukherjee wrote:
> 
> > On Thursday 06 October 2016 11:06 PM, Sudip Mukherjee wrote:  
> > > The allmodconfig build of powerpc is failing with the error:
> > > ERROR: ".irq_set_parent" [drivers/mfd/tps65217.ko] undefined!
> > > 
> > > export the symbol to fix the failure.  
> > 
> > Hi Thomas,
> > powerpc and arm allmodconfig builds still fails with the same error.
> > Build logs of next-20161010 are at:
> > arm at https://travis-ci.org/sudipm-mukherjee/parport/jobs/166321467
> > powerpc at https://travis-ci.org/sudipm-mukherjee/parport/jobs/166321473  
> 
> I know. This is under discussion with the driver folks as we are not going
> to blindly export stuff just because someone slapped a irq_set_parent()
> into the code w/o knowing why.

Do we have any idea if a resolution is close?  This was first reported
in linux-next on September 14/15.  :-(

-- 
Cheers,
Stephen Rothwell


[PATCH] powerpc/book3s64: Always build for power4 or later

2016-10-20 Thread Michael Ellerman
When we're not compiling for a specific CPU, ie. none of the
CONFIG_POWERx_CPU options are set, and CONFIG_GENERIC_CPU *is* set, we
currently don't pass any -mcpu option to the compiler. This means the
compiler builds for a "generic" Power CPU.

But back in 2014 we dropped support for pre power4 CPUs in commit
468a33028edd ("powerpc: Drop support for pre-POWER4 cpus").

Given that, there's no point in building the kernel to run on pre power4
cpus. So update the flags we pass to the compiler when
CONFIG_GENERIC_CPU is set, to specify -mcpu=power4.
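
For example, with CONFIG_GENERIC_CPU=y a Book3S 64 build is then compiled
with -mcpu=power4 together with -mtune=power7 (or -mtune=power4 on older
compilers), rather than the compiler's generic default.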

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 617dece67924..041fda1e2a5d 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -121,6 +121,7 @@ CFLAGS-$(CONFIG_PPC32)  := -ffixed-r2 $(MULTIPLEWORD)
 
 ifeq ($(CONFIG_PPC_BOOK3S_64),y)
 CFLAGS-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=power7,-mtune=power4)
+CFLAGS-$(CONFIG_GENERIC_CPU) += -mcpu=power4
 else
 CFLAGS-$(CONFIG_GENERIC_CPU) += -mcpu=powerpc64
 endif
-- 
2.7.4



Re: [PATCH kernel v3 2/4] powerpc/iommu: Stop using @current in mm_iommu_xxx

2016-10-20 Thread David Gibson
On Thu, Oct 20, 2016 at 02:03:48PM +1100, Alexey Kardashevskiy wrote:
> This changes mm_iommu_xxx helpers to take mm_struct as a parameter
> instead of getting it from @current which in some situations may
> not have a valid reference to mm.
> 
> This changes helpers to receive @mm and moves all references to @current
> to the caller, including checks for !current and !current->mm;
> checks in mm_iommu_preregistered() are removed as there is no caller
> yet.
> 
> This moves the mm_iommu_adjust_locked_vm() call to the caller as
> it receives mm_iommu_table_group_mem_t but it needs mm.
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/include/asm/mmu_context.h | 16 ++--
>  arch/powerpc/mm/mmu_context_iommu.c| 46 
> +-
>  drivers/vfio/vfio_iommu_spapr_tce.c| 14 ---
>  3 files changed, 36 insertions(+), 40 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h 
> b/arch/powerpc/include/asm/mmu_context.h
> index 424844b..b9e3f0a 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -19,16 +19,18 @@ extern void destroy_context(struct mm_struct *mm);
>  struct mm_iommu_table_group_mem_t;
>  
>  extern int isolate_lru_page(struct page *page);  /* from internal.h */
> -extern bool mm_iommu_preregistered(void);
> -extern long mm_iommu_get(unsigned long ua, unsigned long entries,
> +extern bool mm_iommu_preregistered(struct mm_struct *mm);
> +extern long mm_iommu_get(struct mm_struct *mm,
> + unsigned long ua, unsigned long entries,
>   struct mm_iommu_table_group_mem_t **pmem);
> -extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
> +extern long mm_iommu_put(struct mm_struct *mm,
> + struct mm_iommu_table_group_mem_t *mem);
>  extern void mm_iommu_init(struct mm_struct *mm);
>  extern void mm_iommu_cleanup(struct mm_struct *mm);
> -extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
> - unsigned long size);
> -extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
> - unsigned long entries);
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct 
> *mm,
> + unsigned long ua, unsigned long size);
> +extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> + unsigned long ua, unsigned long entries);
>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
>   unsigned long ua, unsigned long *hpa);
>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
> b/arch/powerpc/mm/mmu_context_iommu.c
> index ad2e575..4c6db09 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -56,7 +56,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
>   }
>  
>   pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
> - current->pid,
> + current ? current->pid : 0,
>   incr ? '+' : '-',
>   npages << PAGE_SHIFT,
>   mm->locked_vm << PAGE_SHIFT,
> @@ -66,12 +66,9 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
>   return ret;
>  }
>  
> -bool mm_iommu_preregistered(void)
> +bool mm_iommu_preregistered(struct mm_struct *mm)
>  {
> - if (!current || !current->mm)
> - return false;
> -
> - return !list_empty(&current->mm->context.iommu_group_mem_list);
> + return !list_empty(&mm->context.iommu_group_mem_list);
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
>  
> @@ -124,19 +121,16 @@ static int mm_iommu_move_page_from_cma(struct page 
> *page)
>   return 0;
>  }
>  
> -long mm_iommu_get(unsigned long ua, unsigned long entries,
> +long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long 
> entries,
>   struct mm_iommu_table_group_mem_t **pmem)
>  {
>   struct mm_iommu_table_group_mem_t *mem;
>   long i, j, ret = 0, locked_entries = 0;
>   struct page *page = NULL;
>  
> - if (!current || !current->mm)
> - return -ESRCH; /* process exited */
> -
>   mutex_lock(_list_mutex);
>  
> - list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
> + list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
>   next) {
>   if ((mem->ua == ua) && (mem->entries == entries)) {
>   ++mem->used;
> @@ -154,7 +148,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
>  
>   }
>  
> - ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
> + ret = mm_iommu_adjust_locked_vm(mm, entries, true);
>   if (ret)
>   goto unlock_exit;
>  
> @@ -215,11 

Re: [PATCH kernel v3 1/4] powerpc/iommu: Pass mm_struct to init/cleanup helpers

2016-10-20 Thread David Gibson
On Thu, Oct 20, 2016 at 02:03:47PM +1100, Alexey Kardashevskiy wrote:
> We are going to get rid of @current references in mmu_context_boos3s64.c
> and cache mm_struct in the VFIO container. Since mm_context_t does not
> have reference counting, we will be using mm_struct which does have
> the reference counter.
> 
> This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather
> than mm_context_t (which is embedded into mm).
> 
> This should not cause any behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/include/asm/mmu_context.h | 4 ++--
>  arch/powerpc/kernel/setup-common.c | 2 +-
>  arch/powerpc/mm/mmu_context_book3s64.c | 4 ++--
>  arch/powerpc/mm/mmu_context_iommu.c| 9 +
>  4 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h 
> b/arch/powerpc/include/asm/mmu_context.h
> index 5c45114..424844b 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -23,8 +23,8 @@ extern bool mm_iommu_preregistered(void);
>  extern long mm_iommu_get(unsigned long ua, unsigned long entries,
>   struct mm_iommu_table_group_mem_t **pmem);
>  extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
> -extern void mm_iommu_init(mm_context_t *ctx);
> -extern void mm_iommu_cleanup(mm_context_t *ctx);
> +extern void mm_iommu_init(struct mm_struct *mm);
> +extern void mm_iommu_cleanup(struct mm_struct *mm);
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
>   unsigned long size);
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
> diff --git a/arch/powerpc/kernel/setup-common.c 
> b/arch/powerpc/kernel/setup-common.c
> index 270ee30..f516ac5 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -915,7 +915,7 @@ void __init setup_arch(char **cmdline_p)
>   init_mm.context.pte_frag = NULL;
>  #endif
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> - mm_iommu_init(_mm.context);
> + mm_iommu_init(_mm);
>  #endif
>   irqstack_early_init();
>   exc_lvl_early_init();
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
> b/arch/powerpc/mm/mmu_context_book3s64.c
> index b114f8b..ad82735 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct 
> mm_struct *mm)
>   mm->context.pte_frag = NULL;
>  #endif
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> - mm_iommu_init(>context);
> + mm_iommu_init(mm);
>  #endif
>   return 0;
>  }
> @@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct 
> mm_struct *mm)
>  void destroy_context(struct mm_struct *mm)
>  {
>  #ifdef CONFIG_SPAPR_TCE_IOMMU
> - mm_iommu_cleanup(>context);
> + mm_iommu_cleanup(mm);
>  #endif
>  
>  #ifdef CONFIG_PPC_ICSWX
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
> b/arch/powerpc/mm/mmu_context_iommu.c
> index e0f1c33..ad2e575 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -373,16 +373,17 @@ void mm_iommu_mapped_dec(struct 
> mm_iommu_table_group_mem_t *mem)
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec);
>  
> -void mm_iommu_init(mm_context_t *ctx)
> +void mm_iommu_init(struct mm_struct *mm)
>  {
> - INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
> + INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
>  }
>  
> -void mm_iommu_cleanup(mm_context_t *ctx)
> +void mm_iommu_cleanup(struct mm_struct *mm)
>  {
>   struct mm_iommu_table_group_mem_t *mem, *tmp;
>  
> - list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
> + list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
> + next) {
>   list_del_rcu(&mem->next);
>   mm_iommu_do_free(mem);
>   }

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v6] powerpc: Do not make the entire heap executable

2016-10-20 Thread Jason Gunthorpe
On Tue, Oct 04, 2016 at 09:54:12AM -0700, Kees Cook wrote:
> On Mon, Oct 3, 2016 at 5:18 PM, Michael Ellerman  wrote:
> > Kees Cook  writes:
> >
> >> On Mon, Oct 3, 2016 at 9:13 AM, Denys Vlasenko  wrote:
> >>> On 32-bit powerpc the ELF PLT sections of binaries (built with --bss-plt,
> >>> or with a toolchain which defaults to it) look like this:
> > ...
> >>>
> >>> Signed-off-by: Jason Gunthorpe 
> >>> Signed-off-by: Denys Vlasenko 
> >>> Acked-by: Kees Cook 
> >>> Acked-by: Michael Ellerman 
> >>> CC: Benjamin Herrenschmidt 
> >>> CC: Paul Mackerras 
> >>> CC: "Aneesh Kumar K.V" 
> >>> CC: Kees Cook 
> >>> CC: Oleg Nesterov 
> >>> CC: Michael Ellerman 
> >>> CC: Florian Weimer 
> >>> CC: linux...@kvack.org
> >>> CC: linuxppc-dev@lists.ozlabs.org
> >>> CC: linux-ker...@vger.kernel.org
> >>> Changes since v5:
> >>> * made do_brk_flags() error out if any bits other than VM_EXEC are set.
> >>>   (Kees Cook: "With this, I'd be happy to Ack.")
> >>>   See https://patchwork.ozlabs.org/patch/661595/
> >>
> >> Excellent, thanks for the v6! Should this go via the ppc tree or the -mm 
> >> tree?
> >
> > -mm would be best, given the diffstat I think it's less likely to
> >  conflict if it goes via -mm.
> 
> Okay, excellent. Andrew, do you have this already in email? I think
> you weren't on the explicit CC from the v6...

FWIW (and ping),

Tested-by: Jason Gunthorpe 

On ARM32 (kirkwood) and PPC32 (405)

For reference, here is the patchwork URL:

https://patchwork.ozlabs.org/patch/677753/

Jason


Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags

2016-10-20 Thread Michal Hocko
On Wed 19-10-16 10:23:55, Dave Hansen wrote:
> On 10/19/2016 10:01 AM, Michal Hocko wrote:
> > The question I had earlier was whether this has to be an explicit FOLL
> > flag used by g-u-p users or we can just use it internally when mm !=
> > current->mm
> 
> The reason I chose not to do that was that deferred work gets run under
> a basically random 'current'.  If we just use 'mm != current->mm', then
> the deferred work will sometimes have pkeys enforced and sometimes not,
> basically randomly.

OK, I see (async_pf_execute and ksm). It makes more sense to me. Thanks
for the clarification.

-- 
Michal Hocko
SUSE Labs


[PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check

2016-10-20 Thread Pan Xinhui
Commit ("x86, kvm: support vcpu preempted check") adds one field "__u8
preempted" into struct kvm_steal_time. This field tells whether a vcpu is
running or not.

It is zero if 1) some old KVM does not support this field, or 2) the vcpu
is not preempted. A non-zero value means the vcpu has been preempted.

Signed-off-by: Pan Xinhui 
---
 Documentation/virtual/kvm/msr.txt | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt 
b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..3376f13 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,8 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
__u64 steal;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u8  preempted;
+   __u32 pad[11];
}
 
whose data will be filled in by the hypervisor periodically. Only one
@@ -232,6 +233,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
nanoseconds. Time during which the vcpu is idle, will not be
reported as steal time.
 
+   preempted: indicate the VCPU who owns this struct is running or
+   not. Non-zero values mean the VCPU has been preempted. Zero
+   means the VCPU is not preempted. NOTE, it is always zero if the
+   hypervisor doesn't support this field.
+
 MSR_KVM_EOI_EN: 0x4b564d04
data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
when disabled.  Bit 1 is reserved and must be zero.  When PV end of
-- 
2.4.11



[PATCH v5 8/9] s390/spinlock: Provide vcpu_is_preempted

2016-10-20 Thread Pan Xinhui
From: Christian Borntraeger 

this implements the s390 backend for commit
"kernel/sched: introduce vcpu preempted check interface"
by reworking the existing smp_vcpu_scheduled into
arch_vcpu_is_preempted. We can then also get rid of the
local cpu_is_preempted function by moving the
CIF_ENABLED_WAIT test into arch_vcpu_is_preempted.

Signed-off-by: Christian Borntraeger 
Acked-by: Heiko Carstens 
---
 arch/s390/include/asm/spinlock.h |  8 
 arch/s390/kernel/smp.c   |  9 +++--
 arch/s390/lib/spinlock.c | 25 -
 3 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h
index 7e9e09f..7ecd890 100644
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -23,6 +23,14 @@ _raw_compare_and_swap(unsigned int *lock, unsigned int old, 
unsigned int new)
return __sync_bool_compare_and_swap(lock, old, new);
 }
 
+#ifndef CONFIG_SMP
+static inline bool arch_vcpu_is_preempted(int cpu) { return false; }
+#else
+bool arch_vcpu_is_preempted(int cpu);
+#endif
+
+#define vcpu_is_preempted arch_vcpu_is_preempted
+
 /*
  * Simple spin lock operations.  There are two variants, one clears IRQ's
  * on the local processor, one does not.
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 35531fe..b988ed1 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -368,10 +368,15 @@ int smp_find_processor_id(u16 address)
return -1;
 }
 
-int smp_vcpu_scheduled(int cpu)
+bool arch_vcpu_is_preempted(int cpu)
 {
-   return pcpu_running(pcpu_devices + cpu);
+   if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
+   return false;
+   if (pcpu_running(pcpu_devices + cpu))
+   return false;
+   return true;
 }
+EXPORT_SYMBOL(arch_vcpu_is_preempted);
 
 void smp_yield_cpu(int cpu)
 {
diff --git a/arch/s390/lib/spinlock.c b/arch/s390/lib/spinlock.c
index e5f50a7..e48a48e 100644
--- a/arch/s390/lib/spinlock.c
+++ b/arch/s390/lib/spinlock.c
@@ -37,15 +37,6 @@ static inline void _raw_compare_and_delay(unsigned int 
*lock, unsigned int old)
asm(".insn rsy,0xeb22,%0,0,%1" : : "d" (old), "Q" (*lock));
 }
 
-static inline int cpu_is_preempted(int cpu)
-{
-   if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
-   return 0;
-   if (smp_vcpu_scheduled(cpu))
-   return 0;
-   return 1;
-}
-
 void arch_spin_lock_wait(arch_spinlock_t *lp)
 {
unsigned int cpu = SPINLOCK_LOCKVAL;
@@ -62,7 +53,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
continue;
}
/* First iteration: check if the lock owner is running. */
-   if (first_diag && cpu_is_preempted(~owner)) {
+   if (first_diag && arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
continue;
@@ -81,7 +72,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
 * yield the CPU unconditionally. For LPAR rely on the
 * sense running status.
 */
-   if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+   if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
}
@@ -108,7 +99,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned 
long flags)
continue;
}
/* Check if the lock owner is running. */
-   if (first_diag && cpu_is_preempted(~owner)) {
+   if (first_diag && arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
continue;
@@ -127,7 +118,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, 
unsigned long flags)
 * yield the CPU unconditionally. For LPAR rely on the
 * sense running status.
 */
-   if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+   if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
}
@@ -165,7 +156,7 @@ void _raw_read_lock_wait(arch_rwlock_t *rw)
owner = 0;
while (1) {
if (count-- <= 0) {
-   if (owner && cpu_is_preempted(~owner))
+   if (owner && arch_vcpu_is_preempted(~owner))
smp_yield_cpu(~owner);
count = spin_retry;
}
@@ -211,7 +202,7 @@ void _raw_write_lock_wait(arch_rwlock_t *rw, unsigned int 
prev)
owner = 0;
while (1) {
if (count-- <= 0) {
- 

[PATCH v5 7/9] x86, xen: support vcpu preempted check

2016-10-20 Thread Pan Xinhui
From: Juergen Gross 

Support the vcpu_is_preempted() functionality under Xen. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

A quick test (4 vcpus on 1 physical cpu doing a parallel build job
with "make -j 8") reduced system time by about 5% with this patch.

Signed-off-by: Juergen Gross 
Signed-off-by: Pan Xinhui 
---
 arch/x86/xen/spinlock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 3d6e006..74756bb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
per_cpu(irq_name, cpu) = NULL;
 }
 
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
pv_lock_ops.queued_spin_unlock = 
PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_lock_ops.wait = xen_qlock_wait;
pv_lock_ops.kick = xen_qlock_kick;
+
+   pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
 }
 
 /*
-- 
2.4.11



[PATCH v5 6/9] x86, kvm: support vcpu preempted check

2016-10-20 Thread Pan Xinhui
Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

Use one field of struct kvm_steal_time to indicate whether a vcpu is
running or not. The host sets steal_time.preempted in kvm_arch_vcpu_put()
when the vcpu is scheduled out and clears it again in record_steal_time()
when the vcpu runs; the guest-side kvm_vcpu_is_preempted() simply reads
its per-cpu copy of the field.

unix benchmark result:
host:  kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

test-case   after-patch   before-patch
Execl Throughput   |18307.9 lps  |11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput| 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps
Process Creation   |29881.2 lps  |28572.8 lps
Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm
Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm
System Call Overhead   | 10385653.0 lps  | 10419979.0 lps

Signed-off-by: Pan Xinhui 
---
 arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
 arch/x86/kernel/kvm.c| 12 
 arch/x86/kvm/x86.c   | 18 ++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..b3fec56 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,8 @@ struct kvm_steal_time {
__u64 steal;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u8 preempted;
+   __u32 pad[11];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0b48dd2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+static bool kvm_vcpu_is_preempted(int cpu)
+{
+   struct kvm_steal_time *src;
+
+   src = &per_cpu(steal_time, cpu);
+
+   return !!src->preempted;
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -471,6 +480,9 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
has_steal_clock = 1;
pv_time_ops.steal_clock = kvm_steal_clock;
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+   pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+#endif
}
 
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c633de..a627537 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
return;
 
+   vcpu->arch.st.steal.preempted = 0;
+
if (vcpu->arch.st.steal.version & 1)
vcpu->arch.st.steal.version += 1;  /* first time write, random 
junk */
 
@@ -2810,8 +2812,24 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 }
 
+static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
+{
+   if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+   return;
+
+   vcpu->arch.st.steal.preempted = 1;
+
+   kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+}
+
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+   kvm_steal_time_set_preempted(vcpu);
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = rdtsc();
-- 
2.4.11



[PATCH v5 5/9] x86, paravirt: Add interface to support kvm/xen vcpu preempted check

2016-10-20 Thread Pan Xinhui
This is to fix some lock holder preemption issues. Some lock
implementations do a spin loop before acquiring the lock itself.
The kernel now has an interface, bool vcpu_is_preempted(int cpu), which
takes the cpu as a parameter and returns true if that cpu is preempted,
so the spin loops can be broken based on its return value.

As the kernel already uses this interface, let's support it.

To deal with kernel and kvm/xen, add vcpu_is_preempted into struct
pv_lock_ops.

Then kvm or xen could provide their own implementation to support
vcpu_is_preempted.

Signed-off-by: Pan Xinhui 
---
 arch/x86/include/asm/paravirt_types.h | 2 ++
 arch/x86/include/asm/spinlock.h   | 8 
 arch/x86/kernel/paravirt-spinlocks.c  | 6 ++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 0f400c0..38c3bb7 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -310,6 +310,8 @@ struct pv_lock_ops {
 
void (*wait)(u8 *ptr, u8 val);
void (*kick)(int cpu);
+
+   bool (*vcpu_is_preempted)(int cpu);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..0526f59 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,6 +26,14 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+   return pv_lock_ops.vcpu_is_preempted(cpu);
+}
+#endif
+
 #include 
 
 /*
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 2c55a00..2f204dd 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -21,12 +21,18 @@ bool pv_is_native_spin_unlock(void)
__raw_callee_save___native_queued_spin_unlock;
 }
 
+static bool native_vcpu_is_preempted(int cpu)
+{
+   return 0;
+}
+
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
.queued_spin_lock_slowpath = native_queued_spin_lock_slowpath,
.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
.wait = paravirt_nop,
.kick = paravirt_nop,
+   .vcpu_is_preempted = native_vcpu_is_preempted,
 #endif /* SMP */
 };
 EXPORT_SYMBOL(pv_lock_ops);
-- 
2.4.11



[PATCH v5 4/9] powerpc/spinlock: support vcpu preempted check

2016-10-20 Thread Pan Xinhui
This is to fix some lock holder preemption issues. Some lock
implementations do a spin loop before acquiring the lock itself.
The kernel now has an interface, bool vcpu_is_preempted(int cpu), which
takes the cpu as a parameter and returns true if that cpu is preempted,
so the spin loops can be broken based on its return value.

As the kernel already uses this interface, let's support it.

Only pSeries needs to support it. However, powerNV is built into the
same kernel image as pSeries, so we would need to return false when
running on powerNV. The other fact is that lppaca->yield_count stays
zero on powerNV, so we can just skip the machine type check.
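
For reference, the check added below works because the hypervisor bumps
lppaca->yield_count each time it preempts the vcpu and again each time it
dispatches it, so an odd value (low bit set) means the vcpu is currently
preempted; on powerNV the field stays zero (even), which reads as "not
preempted".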

Suggested-by: Boqun Feng 
Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Pan Xinhui 
---
 arch/powerpc/include/asm/spinlock.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index abb6b0f..f4a9524 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
 #define SYNC_IO
 #endif
 
+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+   return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
 #define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
-- 
2.4.11



[PATCH v5 3/9] kernel/locking: Drop the overload of {mutex, rwsem}_spin_on_owner

2016-10-20 Thread Pan Xinhui
An over-committed guest with more vCPUs than pCPUs shows heavy overhead in
the two *_spin_on_owner() functions. This is caused by the lock holder
preemption issue.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see if a
vCPU is currently running or not, so break the spin loops when it returns
true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown] [H] 0xc00768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

Signed-off-by: Pan Xinhui 
Acked-by: Christian Borntraeger 
Tested-by: Juergen Gross 
---
 kernel/locking/mutex.c  | 15 +--
 kernel/locking/rwsem-xadd.c | 16 +---
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..82108f5 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,13 @@ bool mutex_spin_on_owner(struct mutex *lock, struct 
task_struct *owner)
 */
barrier();
 
-   if (!owner->on_cpu || need_resched()) {
+   /*
+* Use vcpu_is_preempted to detect lock holder preemption issue
+* and break. vcpu_is_preempted is a macro defined by false if
+* arch does not support vcpu preempted check,
+*/
+   if (!owner->on_cpu || need_resched() ||
+   vcpu_is_preempted(task_cpu(owner))) {
ret = false;
break;
}
@@ -261,8 +267,13 @@ static inline int mutex_can_spin_on_owner(struct mutex 
*lock)
 
rcu_read_lock();
owner = READ_ONCE(lock->owner);
+
+   /*
+* As lock holder preemption issue, we both skip spinning if task is not
+* on cpu or its cpu is preempted
+*/
if (owner)
-   retval = owner->on_cpu;
+   retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
rcu_read_unlock();
/*
 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..0897179 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct 
rw_semaphore *sem)
goto done;
}
 
-   ret = owner->on_cpu;
+   /*
+* As lock holder preemption issue, we both skip spinning if task is not
+* on cpu or its cpu is preempted
+*/
+   ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
rcu_read_unlock();
return ret;
@@ -362,8 +366,14 @@ static noinline bool rwsem_spin_on_owner(struct 
rw_semaphore *sem)
 */
barrier();
 
-   /* abort spinning when need_resched or owner is not running */
-   if (!owner->on_cpu || need_resched()) {
+   /*
+* abort spinning when need_resched or owner is not running or
+* owner's cpu is preempted. vcpu_is_preempted is a macro
+* defined by false if arch does not support vcpu preempted
+* check
+*/
+   if (!owner->on_cpu || need_resched() ||
+   vcpu_is_preempted(task_cpu(owner))) {
rcu_read_unlock();
return false;
}
-- 
2.4.11



[PATCH v5 2/9] locking/osq: Drop the overload of osq_lock()

2016-10-20 Thread Pan Xinhui
An over-committed guest with more vCPUs than pCPUs shows heavy overhead in
osq_lock().

This is because vCPU A holds the osq lock and yields out, while vCPU B
waits for the per-cpu node->locked to be set. IOW, vCPU B waits for vCPU A
to run and unlock the osq lock.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see if a
vCPU is currently running or not, so break the spin loop when it returns
true.

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

Suggested-by: Boqun Feng 
Signed-off-by: Pan Xinhui 
Acked-by: Christian Borntraeger 
Tested-by: Juergen Gross 
---
 kernel/locking/osq_lock.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..39d1385 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
 }
 
+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+   return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
while (!READ_ONCE(node->locked)) {
/*
 * If we need to reschedule bail... so we can block.
+* Use vcpu_is_preempted to detect lock holder preemption issue
+* and break. vcpu_is_preempted is a macro defined by false if
+* arch does not support vcpu preempted check,
 */
-   if (need_resched())
+   if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
goto unqueue;
 
cpu_relax_lowlatency();
-- 
2.4.11



[PATCH v5 1/9] kernel/sched: introduce vcpu preempted check interface

2016-10-20 Thread Pan Xinhui
This patch supports fixing the lock holder preemption issue.

For kernel users, bool vcpu_is_preempted(int cpu) can be used to detect
whether a vcpu is preempted or not.

The default implementation is a macro defined as false, so the compiler
can optimize it out if the arch does not support such a vcpu preempted
check.
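
As an illustration only (try_acquire, lock and owner_cpu below are
hypothetical, not part of this patch), a spin loop would consume the
interface like this:

	while (!try_acquire(lock)) {
		/* stop burning cycles if the holder's vcpu is not running */
		if (need_resched() || vcpu_is_preempted(owner_cpu))
			break;
		cpu_relax();
	}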

Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Pan Xinhui 
Acked-by: Christian Borntraeger 
Tested-by: Juergen Gross 
---
 include/linux/sched.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, 
unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with a various lock holder preemption issues provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu) false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
-- 
2.4.11



[PATCH v5 0/9] implement vcpu preempted check

2016-10-20 Thread Pan Xinhui
change from v4:
split x86 kvm vcpu preempted check into two patches.
add documentation patch.
add x86 vcpu preempted check patch under xen
add s390 vcpu preempted check patch
change from v3:
add x86 vcpu preempted check patch
change from v2:
no code change, fix typos, update some comments
change from v1:
a simpler definition of default vcpu_is_preempted
skip machine type check on ppc, and add config. remove dedicated macro.
add one patch to drop overload of rwsem_spin_on_owner and
mutex_spin_on_owner.
add more comments
thanks to Boqun's and Peter's suggestions.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce interface bool vcpu_is_preempted(int cpu) and use it in some spin
loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused RCU stalls before this patch set
was applied.

We have also observed some performance improvements in unix benchmark tests.

PPC test result:
1 copy - 0.94%
2 copy - 7.17%
4 copy - 11.9%
8 copy -  3.04%
16 copy - 15.11%

details below:
Without patch:

1 copy - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 1 
samples)
2 copy - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 1 
samples)
4 copy - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 1 
samples)
8 copy - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 1 
samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 s, 
1 samples)

With patch: 

1 copy - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 1 
samples)
2 copy - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 1 
samples)
4 copy - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 1 
samples)
8 copy - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 1 
samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 s, 
1 samples)

X86 test result:
test-case   after-patch   before-patch
Execl Throughput   |18307.9 lps  |11701.6 lps 
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks|   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput| 11872208.7 lps  | 11855628.9 lps 
Pipe-based Context Switching   |  1495126.5 lps  |  1490533.9 lps 
Process Creation   |29881.2 lps  |28572.8 lps 
Shell Scripts (1 concurrent)   |23224.3 lpm  |22607.4 lpm 
Shell Scripts (8 concurrent)   | 3531.4 lpm  | 3211.9 lpm 
System Call Overhead   | 10385653.0 lps  | 10419979.0 lps 

Christian Borntraeger (1):
  s390/spinlock: Provide vcpu_is_preempted

Juergen Gross (1):
  x86, xen: support vcpu preempted check

Pan Xinhui (7):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, paravirt: Add interface to support kvm/xen vcpu preempted check
  x86, kvm: support vcpu preempted check
  Documentation: virtual: kvm: Support vcpu preempted check

 Documentation/virtual/kvm/msr.txt |  8 +++-
 arch/powerpc/include/asm/spinlock.h   |  8 
 arch/s390/include/asm/spinlock.h  |  8 
 arch/s390/kernel/smp.c|  9 +++--
 arch/s390/lib/spinlock.c  | 25 -
 arch/x86/include/asm/paravirt_types.h |  2 ++
 arch/x86/include/asm/spinlock.h   |  8 
 arch/x86/include/uapi/asm/kvm_para.h  |  3 ++-
 arch/x86/kernel/kvm.c | 12 
 arch/x86/kernel/paravirt-spinlocks.c  |  6 ++
 arch/x86/kvm/x86.c| 18 ++
 arch/x86/xen/spinlock.c   |  3 ++-
 include/linux/sched.h | 12 
 kernel/locking/mutex.c| 15 +--
 kernel/locking/osq_lock.c | 10 +-
 kernel/locking/rwsem-xadd.c   | 16 +---
 16 files changed, 135 insertions(+), 28 deletions(-)

-- 
2.4.11



Re: [PATCH v4 3/5] powerpc/mm: allow memory hotplug into a memoryless node

2016-10-20 Thread Reza Arbab

On Thu, Oct 20, 2016 at 02:30:42PM +1100, Balbir Singh wrote:

> FYI, these checks were temporary to begin with
>
> I found this in git history
>
> b226e462124522f2f23153daff31c311729dfa2f (powerpc: don't add memory to empty
> node/zone)


Nice find! I spent some time digging, but this had eluded me.

--
Reza Arbab



Re: [PATCH] kernel: irq: fix build failure

2016-10-20 Thread Thomas Gleixner
On Mon, 10 Oct 2016, Sudip Mukherjee wrote:

> On Thursday 06 October 2016 11:06 PM, Sudip Mukherjee wrote:
> > The allmodconfig build of powerpc is failing with the error:
> > ERROR: ".irq_set_parent" [drivers/mfd/tps65217.ko] undefined!
> > 
> > export the symbol to fix the failure.
> 
> Hi Thomas,
> powerpc and arm allmodconfig builds still fails with the same error.
> Build logs of next-20161010 are at:
> arm at https://travis-ci.org/sudipm-mukherjee/parport/jobs/166321467
> powerpc at https://travis-ci.org/sudipm-mukherjee/parport/jobs/166321473

I know. This is under discussion with the driver folks as we are not going
to blindly export stuff just because someone slapped a irq_set_parent()
into the code w/o knowing why.

Thanks,

tglx


[RFC][PATCH] powerpc/pseries: implement NMI IPIs with H_SIGNAL_SYS_RESET hcall

2016-10-20 Thread Nicholas Piggin
Add a .cause_nmi_ipi() to smp_ops to support it. It's also possible to
raise a system reset exception on this CPU, which may (or may not) be
useful to bring ourselves into a known state. So perhaps it's better to
make this a platform operation?

Thanks,
Nick

---
 arch/powerpc/include/asm/hvcall.h | 8 +++-
 arch/powerpc/include/asm/plpar_wrappers.h | 5 +
 arch/powerpc/include/asm/smp.h| 4 
 arch/powerpc/kernel/smp.c | 3 +++
 arch/powerpc/platforms/powermac/smp.c | 1 +
 arch/powerpc/platforms/powernv/smp.c  | 1 +
 arch/powerpc/platforms/pseries/smp.c  | 8 
 7 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 708edeb..a38171c 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -275,7 +275,8 @@
 #define H_COP  0x304
 #define H_GET_MPP_X0x314
 #define H_SET_MODE 0x31C
-#define MAX_HCALL_OPCODE   H_SET_MODE
+#define H_SIGNAL_SYS_RESET 0x380
+#define MAX_HCALL_OPCODE   H_SIGNAL_SYS_RESET
 
 /* H_VIOCTL functions */
 #define H_GET_VIOA_DUMP_SIZE   0x01
@@ -306,6 +307,11 @@
 #define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE3
 #define H_SET_MODE_RESOURCE_LE 4
 
+/* Values for argument to H_SIGNAL_SYS_RESET */
+#define H_SIGNAL_SYS_RESET_ALL -1
+#define H_SIGNAL_SYS_RESET_ALLBUTSELF  -2
+#define H_SIGNAL_SYS_RESET_CPU(x)  (x)
+
 #ifndef __ASSEMBLY__
 
 /**
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index 1b39424..7fe5983 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -340,4 +340,9 @@ static inline long plapr_set_watchpoint0(unsigned long 
dawr0, unsigned long dawr
return plpar_set_mode(0, H_SET_MODE_RESOURCE_SET_DAWR, dawr0, dawrx0);
 }
 
+static inline long plapr_signal_system_reset(long cpu)
+{
+   return plpar_hcall_norets(H_SIGNAL_SYS_RESET, cpu);
+}
+
 #endif /* _ASM_POWERPC_PLPAR_WRAPPERS_H */
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 0d02c11..15eb615 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -37,11 +37,15 @@ extern int cpu_to_chip_id(int cpu);
 
 #ifdef CONFIG_SMP
 
+#define SMP_OP_NMI_ALL -1
+#define SMP_OP_NMI_ALLBUTSELF  -2
+
 struct smp_ops_t {
void  (*message_pass)(int cpu, int msg);
 #ifdef CONFIG_PPC_SMP_MUXED_IPI
void  (*cause_ipi)(int cpu, unsigned long data);
 #endif
+   int   (*cause_nmi_ipi)(int cpu);
void  (*probe)(void);
int   (*kick_cpu)(int nr);
void  (*setup_cpu)(int nr);
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 9c6f3fd..4a1161e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -334,6 +334,9 @@ void smp_send_debugger_break(void)
if (unlikely(!smp_ops))
return;
 
+   if (smp_ops->cause_nmi_ipi && 
smp_ops->cause_nmi_ipi(SMP_OP_NMI_ALLBUTSELF))
+   return;
+
for_each_online_cpu(cpu)
if (cpu != me)
do_message_pass(cpu, PPC_MSG_DEBUGGER_BREAK);
diff --git a/arch/powerpc/platforms/powermac/smp.c 
b/arch/powerpc/platforms/powermac/smp.c
index c9eb7d6..1d76e15 100644
--- a/arch/powerpc/platforms/powermac/smp.c
+++ b/arch/powerpc/platforms/powermac/smp.c
@@ -446,6 +446,7 @@ void __init smp_psurge_give_timebase(void)
 struct smp_ops_t psurge_smp_ops = {
.message_pass   = NULL, /* Use smp_muxed_ipi_message_pass */
.cause_ipi  = smp_psurge_cause_ipi,
+   .cause_nmi_ipi  = NULL,
.probe  = smp_psurge_probe,
.kick_cpu   = smp_psurge_kick_cpu,
.setup_cpu  = smp_psurge_setup_cpu,
diff --git a/arch/powerpc/platforms/powernv/smp.c 
b/arch/powerpc/platforms/powernv/smp.c
index c789258..092ec1f 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -244,6 +244,7 @@ static int pnv_cpu_bootable(unsigned int nr)
 static struct smp_ops_t pnv_smp_ops = {
.message_pass   = smp_muxed_ipi_message_pass,
.cause_ipi  = NULL, /* Filled at runtime by xics_smp_probe() */
+   .cause_nmi_ipi  = NULL,
.probe  = xics_smp_probe,
.kick_cpu   = pnv_smp_kick_cpu,
.setup_cpu  = pnv_smp_setup_cpu,
diff --git a/arch/powerpc/platforms/pseries/smp.c 
b/arch/powerpc/platforms/pseries/smp.c
index f6f83ae..187b981 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -196,6 +196,13 @@ static void pSeries_cause_ipi_mux(int cpu, unsigned long 
data)
xics_cause_ipi(cpu, data);
 }
 
+static int smp_pSeries_cause_nmi_ipi(int cpu)
+{
+   if (plapr_signal_system_reset(cpu) == H_SUCCESS)
+   return 1;
+   return 0;

Re: [RFC] ppc64le: Enable emulation support for simple Load/Store instructions

2016-10-20 Thread Greg KH
On Thu, Oct 20, 2016 at 12:01:55PM +0530, Ravi Bangoria wrote:
> emulate_step() uses a number of underlying kernel functions that were
> initially not enabled for LE. This has been rectified since. So, fix
> emulate_step() for LE for the corresponding instructions.
> 
> Reported-by: Anton Blanchard 
> Signed-off-by: Ravi Bangoria 
> ---
> Note: This patch only enables LOAD, STORE, LARX and STCX instructions.
>   I'll send a subsequent patch for other types like LOAD_FP,
>   LOAD_VMX etc.
> 
>  arch/powerpc/lib/sstep.c | 8 
>  1 file changed, 8 deletions(-)



This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read Documentation/stable_kernel_rules.txt
for how to do this properly.




Re: [PATCH kernel v3 3/4] vfio/spapr: Cache mm in tce_container

2016-10-20 Thread Nicholas Piggin
On Thu, 20 Oct 2016 14:03:49 +1100
Alexey Kardashevskiy  wrote:

> In some situations the userspace memory context may live longer than
> the userspace process itself so if we need to do proper memory context
> cleanup, we better cache @mm and use it later when the process is gone
> (@current or @current->mm is NULL).
> 
> This references mm and stores the pointer in the container; this is done
> when a container is just created so checking for !current->mm in other
> places becomes pointless.
> 
> This replaces current->mm with container->mm everywhere except debug
> prints.
> 
> This adds a check that current->mm is the same as the one stored in
> the container to prevent userspace from registering memory in other
> processes.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 127 
> 
>  1 file changed, 71 insertions(+), 56 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index d0c38b2..6b0b121 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -31,49 +31,46 @@

Does it make sense to move the rest of these hunks into patch 2?
I think they're similarly just moving the mm reference into callers.


>  static void tce_iommu_detach_group(void *iommu_data,
>   struct iommu_group *iommu_group);
>  
> -static long try_increment_locked_vm(long npages)
> +static long try_increment_locked_vm(struct mm_struct *mm, long npages)
>  {
>   long ret = 0, locked, lock_limit;
>  
> - if (!current || !current->mm)
> - return -ESRCH; /* process exited */
> -
>   if (!npages)
>   return 0;
>  
> - down_write(&current->mm->mmap_sem);
> - locked = current->mm->locked_vm + npages;
> + down_write(&mm->mmap_sem);
> + locked = mm->locked_vm + npages;
>   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
>   ret = -ENOMEM;
>   else
> - current->mm->locked_vm += npages;
> + mm->locked_vm += npages;
>  
>   pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
>   npages << PAGE_SHIFT,
> - current->mm->locked_vm << PAGE_SHIFT,
> + mm->locked_vm << PAGE_SHIFT,
>   rlimit(RLIMIT_MEMLOCK),
>   ret ? " - exceeded" : "");
>  
> - up_write(&current->mm->mmap_sem);
> + up_write(&mm->mmap_sem);
>  
>   return ret;
>  }
>  
> -static void decrement_locked_vm(long npages)
> +static void decrement_locked_vm(struct mm_struct *mm, long npages)
>  {
> - if (!current || !current->mm || !npages)
> + if (!mm || !npages)
>   return; /* process exited */

I know you're trying to be defensive and change as little logic as possible,
but some cases should be an error, and I think some of the "process exited"
comments were wrong anyway.

Maybe pull the !mm test into the caller and make it WARN_ON?
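
For example, a caller could do something like (sketch only, reusing names
from this patch):

	if (WARN_ON(!container->mm))
		return;
	decrement_locked_vm(container->mm, pages);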


> @@ -317,6 +311,9 @@ static void *tce_iommu_open(unsigned long arg)
>   return ERR_PTR(-EINVAL);
>   }
>  
> + if (!current->mm)
> + return ERR_PTR(-ESRCH); /* process exited */

A userspace thread in the kernel can't have its mm disappear, unless you
are actually in the exit code. !current->mm is more like a test for a kernel
thread.


> +
>   container = kzalloc(sizeof(*container), GFP_KERNEL);
>   if (!container)
>   return ERR_PTR(-ENOMEM);
> @@ -326,13 +323,17 @@ static void *tce_iommu_open(unsigned long arg)
>  
>   container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
>  
> + container->mm = current->mm;
> + atomic_inc(&container->mm->mm_count);
> +
>   return container;

It's a nitpick if you respin the patch, but I guess it would better be
described as a reference than a cache of the object. "have tce_container
take a reference to mm_struct".


> @@ -515,13 +526,16 @@ static long tce_iommu_build_v2(struct tce_container 
> *container,
>   unsigned long hpa;
>   enum dma_data_direction dirtmp;
>  
> + if (container->mm != current->mm)
> + return -ESRCH;

Good, is this condition now enforced on all entrypoints that use
container->mm (except the final teardown)? (The mlock/rlimit stuff,
as we talked about before, doesn't make sense if not).

Thanks,
Nick



Re: [PATCH] cpufreq: powernv: Use node ID in init_chip_info

2016-10-20 Thread Shilpasri G Bhat
Hi,

On 10/20/2016 09:29 AM, Viresh Kumar wrote:
> + few IBM guys who have been working on this.
> 
> On 19-10-16, 15:02, Emily Shaffer wrote:
>> Fixed assumption that node_id==chip_id in powernv-cpufreq.c:init_chip_info;
>> explicitly use node ID where necessary.

Thanks for the bug fix. I agree that node-ids should not be assumed to
always be equal to chip-ids. But I think it would be better to get rid of
cpumask_of_node() as it has problems when the powernv-cpufreq driver is
initialized with offline cpus, like reported in the post below.
https://patchwork.kernel.org/patch/8887591/
(^^ This should also solve the node_id=chip_id problem)

Since throttle stats are common to all cpus in a chip, we are better off
not using cpumask_of_node(); instead, define something like
cpumask_of_chip() so the driver doesn't have to compute the chip cpumask
itself, as sketched below.
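
A rough sketch of that idea (illustrative only; cpumask_of_chip() does not
exist today, and the helper below simply rebuilds the mask from
cpu_to_chip_id()):

static void chip_cpumask(unsigned int chip_id, struct cpumask *mask)
{
	unsigned int cpu;

	cpumask_clear(mask);
	for_each_possible_cpu(cpu)
		if (cpu_to_chip_id(cpu) == chip_id)
			cpumask_set_cpu(cpu, mask);
}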

Thanks and Regards,
Shilpa

>>
>> Tested: All CPUs report in /sys/devices/system/cpu*/cpufreq/throttle_stats
>>
>> Effort: platforms/arch-powerpc
>> Google-Bug-Id: 26979978
> 
> Is this relevant upstream?
> 
>>
>> Signed-off-by: Emily Shaffer 
>> Change-Id: I22eb626b32fbb8053b3bbb9c75e677c700d0c2fb
> 
> Gerrit id isn't required for upstream..
> 
>> ---
>>  drivers/cpufreq/powernv-cpufreq.c | 27 +--
>>  1 file changed, 21 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/cpufreq/powernv-cpufreq.c
>> b/drivers/cpufreq/powernv-cpufreq.c
>> index d3ffde8..3750b58 100644
>> --- a/drivers/cpufreq/powernv-cpufreq.c
>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>> @@ -911,32 +911,47 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>>
>>  static int init_chip_info(void)
>>  {
>> -   unsigned int chip[256];
>> +   int rc = 0;
>> unsigned int cpu, i;
>> unsigned int prev_chip_id = UINT_MAX;
>> +   unsigned int *chip, *node;
>> +
>> +   chip = kcalloc(num_possible_cpus(), sizeof(unsigned int), 
>> GFP_KERNEL);
>> +   node = kcalloc(num_possible_cpus(), sizeof(unsigned int), 
>> GFP_KERNEL);
>> +   if (!chip || !node) {
>> +   rc = -ENOMEM;
>> +   goto out;
>> +   }
>>
>> for_each_possible_cpu(cpu) {
>> unsigned int id = cpu_to_chip_id(cpu);
>>
>> if (prev_chip_id != id) {
>> prev_chip_id = id;
>> -   chip[nr_chips++] = id;
>> +   node[nr_chips] = cpu_to_node(cpu);
>> +   chip[nr_chips] = id;
>> +   nr_chips++;
>> }
>> }
>>
>> chips = kcalloc(nr_chips, sizeof(struct chip), GFP_KERNEL);
>> -   if (!chips)
>> -   return -ENOMEM;
>> +   if (!chips) {
>> +   rc = -ENOMEM;
>> +   goto out;
>> +   }
>>
>> for (i = 0; i < nr_chips; i++) {
>> chips[i].id = chip[i];
>> -   cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
>> +   cpumask_copy(&chips[i].mask, cpumask_of_node(node[i]));
>> INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
>> for_each_cpu(cpu, &chips[i].mask)
>> per_cpu(chip_info, cpu) = &chips[i];
>> }
>>
>> -   return 0;
>> +out:
>> +   kfree(node);
>> +   kfree(chip);
>> +   return rc;
>>  }
>>
>>  static inline void clean_chip_info(void)
>> -- 
>> 2.8.0.rc3.226.g39d4020
> 



[RFC] ppc64le: Enable emulation support for simple Load/Store instructions

2016-10-20 Thread Ravi Bangoria
emulate_step() uses a number of underlying kernel functions that were
initially not enabled for LE. This has been rectified since. So, fix
emulate_step() for LE for the corresponding instructions.

Reported-by: Anton Blanchard 
Signed-off-by: Ravi Bangoria 
---
Note: This patch only enables LOAD, STORE, LARX and STCX instructions.
  I'll send a subsequent patch for other types like LOAD_FP,
  LOAD_VMX etc.

 arch/powerpc/lib/sstep.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 3362299..82323ef 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -1807,8 +1807,6 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned 
int instr)
goto instr_done;
 
case LARX:
-   if (regs->msr & MSR_LE)
-   return 0;
if (op.ea & (size - 1))
break;  /* can't handle misaligned */
err = -EFAULT;
@@ -1832,8 +1830,6 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned 
int instr)
goto ldst_done;
 
case STCX:
-   if (regs->msr & MSR_LE)
-   return 0;
if (op.ea & (size - 1))
break;  /* can't handle misaligned */
err = -EFAULT;
@@ -1859,8 +1855,6 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned 
int instr)
goto ldst_done;
 
case LOAD:
-   if (regs->msr & MSR_LE)
-   return 0;
err = read_mem(&regs->gpr[op.reg], op.ea, size, regs);
if (!err) {
if (op.type & SIGNEXT)
@@ -1913,8 +1907,6 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned 
int instr)
goto instr_done;
 
case STORE:
-   if (regs->msr & MSR_LE)
-   return 0;
if ((op.type & UPDATE) && size == sizeof(long) &&
op.reg == 1 && op.update_reg == 1 &&
!(regs->msr & MSR_PR) &&
-- 
1.8.3.1