[PATCH] cxl: Fix leaking pid refs in some error paths
In some error paths in functions cxl_start_context and
afu_ioctl_start_work pid references to the current & group-leader tasks
can leak after they are taken. This patch fixes these error paths to
release these pid references before exiting the error path.

This patch is based on earlier patch "cxl: Prevent adapter reset if an
active context exists" at https://patchwork.ozlabs.org/patch/682187/

Fixes: 7b8ad495("cxl: Fix DSI misses when the context owning task exits")
Reported-by: Frederic Barrat
Signed-off-by: Vaibhav Jain
---
 drivers/misc/cxl/api.c  |  2 ++
 drivers/misc/cxl/file.c | 22 +++++++++++++---------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index af23d7d..2e5233b 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -247,7 +247,9 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
 	cxl_ctx_get();
 
 	if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) {
+		put_pid(ctx->glpid);
 		put_pid(ctx->pid);
+		ctx->glpid = ctx->pid = NULL;
 		cxl_adapter_context_put(ctx->afu->adapter);
 		cxl_ctx_put();
 		goto out;
diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
index d0b421f..77080cc 100644
--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -194,6 +194,16 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
 	ctx->mmio_err_ff = !!(work.flags & CXL_START_WORK_ERR_FF);
 
 	/*
+	 * Increment the mapped context count for adapter. This also checks
+	 * if adapter_context_lock is taken.
+	 */
+	rc = cxl_adapter_context_get(ctx->afu->adapter);
+	if (rc) {
+		afu_release_irqs(ctx, ctx);
+		goto out;
+	}
+
+	/*
 	 * We grab the PID here and not in the file open to allow for the case
 	 * where a process (master, some daemon, etc) has opened the chardev on
 	 * behalf of another process, so the AFU's mm gets bound to the process
@@ -205,15 +215,6 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
 	ctx->pid = get_task_pid(current, PIDTYPE_PID);
 	ctx->glpid = get_task_pid(current->group_leader, PIDTYPE_PID);
 
-	/*
-	 * Increment the mapped context count for adapter. This also checks
-	 * if adapter_context_lock is taken.
-	 */
-	rc = cxl_adapter_context_get(ctx->afu->adapter);
-	if (rc) {
-		afu_release_irqs(ctx, ctx);
-		goto out;
-	}
 
 	trace_cxl_attach(ctx, work.work_element_descriptor, work.num_interrupts, amr);
 
@@ -221,6 +222,9 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
 							amr))) {
 		afu_release_irqs(ctx, ctx);
 		cxl_adapter_context_put(ctx->afu->adapter);
+		put_pid(ctx->glpid);
+		put_pid(ctx->pid);
+		ctx->glpid = ctx->pid = NULL;
 		goto out;
 	}
 
--
2.7.4
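The error-path rule the patch above applies (every reference taken before a failing step is dropped, and the pointers are cleared, before returning) can be shown in a small userspace sketch. This is not the cxl code: `struct ref`, `ref_get()`, `ref_put()`, `struct context` and `start_context()` are hypothetical stand-ins for `struct pid`, `get_task_pid()`, `put_pid()` and the driver's context, and `attach_ok` merely simulates the result of the attach step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pid: only a reference counter. */
struct ref {
	int count;
};

static struct ref *ref_get(struct ref *r)
{
	r->count++;
	return r;
}

static void ref_put(struct ref *r)
{
	if (r)
		r->count--;
}

/* Hypothetical stand-in for the driver context holding both references. */
struct context {
	struct ref *pid;
	struct ref *glpid;
};

/* attach_ok simulates whether the attach step succeeds. */
static int start_context(struct context *ctx, struct ref *task,
			 struct ref *leader, int attach_ok)
{
	assert(ctx && task && leader);

	ctx->pid = ref_get(task);
	ctx->glpid = ref_get(leader);

	if (!attach_ok) {
		/* Error path: drop both references and clear the pointers. */
		ref_put(ctx->glpid);
		ref_put(ctx->pid);
		ctx->glpid = ctx->pid = NULL;
		return -1;
	}
	return 0;
}
```

Clearing the pointers after the put means a later release of the context cannot drop the same reference a second time.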
Re: [PATCH v5 0/9] implement vcpu preempted check
On Thu, Oct 20, 2016 at 05:27:45PM -0400, Pan Xinhui wrote:
>
> This patch set aims to fix lock holder preemption issues.

Thanks, this looks very good. I'll wait for ACKs from at least the KVM
people, since that was I think the most contentious patch.
Re: [PATCH v5 7/9] x86, xen: support vcpu preempted check
Corrected xen-devel mailing list address, added other Xen maintainers

On 20/10/16 23:27, Pan Xinhui wrote:
> From: Juergen Gross
>
> Support the vcpu_is_preempted() functionality under Xen. This will
> enhance lock performance on overcommitted hosts (more runnable vcpus
> than physical cpus in the system) as doing busy waits for preempted
> vcpus will hurt system performance far worse than early yielding.
>
> A quick test (4 vcpus on 1 physical cpu doing a parallel build job
> with "make -j 8") reduced system time by about 5% with this patch.
>
> Signed-off-by: Juergen Gross
> Signed-off-by: Pan Xinhui
> ---
>  arch/x86/xen/spinlock.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
> index 3d6e006..74756bb 100644
> --- a/arch/x86/xen/spinlock.c
> +++ b/arch/x86/xen/spinlock.c
> @@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
>  	per_cpu(irq_name, cpu) = NULL;
>  }
>
> -
>  /*
>   * Our init of PV spinlocks is split in two init functions due to us
>   * using paravirt patching and jump labels patching and having to do
> @@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
>  	pv_lock_ops.queued_spin_unlock =
>  		PV_CALLEE_SAVE(__pv_queued_spin_unlock);
>  	pv_lock_ops.wait = xen_qlock_wait;
>  	pv_lock_ops.kick = xen_qlock_kick;
> +
> +	pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
>  }
>
>  /*
>
Re: [PATCH] powerpc/book3s64: Always build for power4 or later
On 21/10/16 11:01, Michael Ellerman wrote:
> When we're not compiling for a specific CPU, ie. none of the
> CONFIG_POWERx_CPU options are set, and CONFIG_GENERIC_CPU *is* set, we
> currently don't pass any -mcpu option to the compiler. This means the
> compiler builds for a "generic" Power CPU.
>
> But back in 2014 we dropped support for pre power4 CPUs in commit
> 468a33028edd ("powerpc: Drop support for pre-POWER4 cpus").
>
> Given that, there's no point in building the kernel to run on pre power4
> cpus. So update the flags we pass to the compiler when
> CONFIG_GENERIC_CPU is set, to specify -mcpu=power4.
>
> Signed-off-by: Michael Ellerman
> ---
>  arch/powerpc/Makefile | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index 617dece67924..041fda1e2a5d 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -121,6 +121,7 @@ CFLAGS-$(CONFIG_PPC32):= -ffixed-r2 $(MULTIPLEWORD)
>
>  ifeq ($(CONFIG_PPC_BOOK3S_64),y)
>  CFLAGS-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=power7,-mtune=power4)
> +CFLAGS-$(CONFIG_GENERIC_CPU) += -mcpu=power4
>  else
>  CFLAGS-$(CONFIG_GENERIC_CPU) += -mcpu=powerpc64
>  endif
>

Acked-by: Balbir Singh
[PATCH v6 10/10] ima: platform-independent hash value
From: Andreas Steffen

For remote attestation it is important for the ima measurement values
to be platform-independent. Therefore integer fields to be hashed
must be converted to canonical format.

Changelog:
- Define canonical format as little endian (Mimi)

Signed-off-by: Andreas Steffen
Signed-off-by: Mimi Zohar
---
 security/integrity/ima/ima_crypto.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_crypto.c b/security/integrity/ima/ima_crypto.c
index 38f2ed830dd6..802d5d20f36f 100644
--- a/security/integrity/ima/ima_crypto.c
+++ b/security/integrity/ima/ima_crypto.c
@@ -477,11 +477,13 @@ static int ima_calc_field_array_hash_tfm(struct ima_field_data *field_data,
 		u8 buffer[IMA_EVENT_NAME_LEN_MAX + 1] = { 0 };
 		u8 *data_to_hash = field_data[i].data;
 		u32 datalen = field_data[i].len;
+		u32 datalen_to_hash =
+		    !ima_canonical_fmt ? datalen : cpu_to_le32(datalen);
 
 		if (strcmp(td->name, IMA_TEMPLATE_IMA_NAME) != 0) {
 			rc = crypto_shash_update(shash,
-						(const u8 *) &field_data[i].len,
-						sizeof(field_data[i].len));
+						(const u8 *) &datalen_to_hash,
+						sizeof(datalen_to_hash));
 			if (rc)
 				break;
 		} else if (strcmp(td->fields[i]->field_id, "n") == 0) {
--
2.7.4
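Why the length has to be converted before it is hashed can be seen in a minimal userspace sketch. It assumes the canonical format is always in effect (as if ima_canonical_fmt were set); `put_le32()`, `fnv1a_update()` and `hash_field()` are illustrative names, and the toy FNV-1a step only stands in for crypto_shash_update(), it is not what IMA uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store v as 4 little-endian bytes, independent of host byte order. */
static void put_le32(uint32_t v, uint8_t out[4])
{
	assert(out != NULL);
	out[0] = (uint8_t)(v & 0xff);
	out[1] = (uint8_t)((v >> 8) & 0xff);
	out[2] = (uint8_t)((v >> 16) & 0xff);
	out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Toy 32-bit FNV-1a update; a placeholder for a real hash update call. */
static uint32_t fnv1a_update(uint32_t h, const uint8_t *p, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Hash one field as <little-endian length><data>. */
static uint32_t hash_field(const uint8_t *data, uint32_t len)
{
	uint8_t len_le[4];
	uint32_t h = 2166136261u;	/* FNV-1a 32-bit offset basis */

	put_le32(len, len_le);
	h = fnv1a_update(h, len_le, sizeof(len_le));
	return fnv1a_update(h, data, len);
}
```

Because the length bytes fed to the hash are fixed by `put_le32()` rather than by the in-memory layout of a `u32`, a big-endian and a little-endian host produce the same digest for the same field.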
[PATCH v6 09/10] ima: define a canonical binary_runtime_measurements list format
From: Mimi Zohar

The IMA binary_runtime_measurements list is currently in platform native
format.

To allow restoring a measurement list carried across kexec with a
different endianness than the targeted kernel, this patch defines
little-endian as the canonical format. For big endian systems wanting
to save/restore the measurement list from a system with a different
endianness, a new boot command line parameter named "ima_canonical_fmt"
is defined.

Considerations: use of the "ima_canonical_fmt" boot command line option
will break existing userspace applications on big endian systems
expecting the binary_runtime_measurements list to be in platform native
format.

Changelog v3:
- restore PCR value properly

Signed-off-by: Mimi Zohar
---
 Documentation/kernel-parameters.txt       |  4
 security/integrity/ima/ima.h              |  6 ++
 security/integrity/ima/ima_fs.c           | 28 +---
 security/integrity/ima/ima_kexec.c        | 11 +--
 security/integrity/ima/ima_template.c     | 24 ++--
 security/integrity/ima/ima_template_lib.c |  7 +--
 6 files changed, 67 insertions(+), 13 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 37babf91f2cb..3ee81afad7e9 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1641,6 +1641,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			The builtin appraise policy appraises all files
 			owned by uid=0.
 
+	ima_canonical_fmt [IMA]
+			Use the canonical format for the binary runtime
+			measurements, instead of host native format.
+
 	ima_hash=	[IMA]
 			Format: { md5 | sha1 | rmd160 | sha256 | sha384
 				   | sha512 | ... }
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 6b0540ad189f..5e6180a4da7d 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -122,6 +122,12 @@ void ima_load_kexec_buffer(void);
 static inline void ima_load_kexec_buffer(void) {}
 #endif /* CONFIG_HAVE_IMA_KEXEC */
 
+/*
+ * The default binary_runtime_measurements list format is defined as the
+ * platform native format. The canonical format is defined as little-endian.
+ */
+extern bool ima_canonical_fmt;
+
 /* Internal IMA function definitions */
 int ima_init(void);
 int ima_fs_init(void);
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index 66e5dd5e226f..2bcad99d434e 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -28,6 +28,16 @@
 
 static DEFINE_MUTEX(ima_write_mutex);
 
+bool ima_canonical_fmt;
+static int __init default_canonical_fmt_setup(char *str)
+{
+#ifdef __BIG_ENDIAN
+	ima_canonical_fmt = 1;
+#endif
+	return 1;
+}
+__setup("ima_canonical_fmt", default_canonical_fmt_setup);
+
 static int valid_policy = 1;
 #define TMPBUFLEN 12
 static ssize_t ima_show_htable_value(char __user *buf, size_t count,
@@ -122,7 +132,7 @@ int ima_measurements_show(struct seq_file *m, void *v)
 	struct ima_queue_entry *qe = v;
 	struct ima_template_entry *e;
 	char *template_name;
-	int namelen;
+	u32 pcr, namelen, template_data_len; /* temporary fields */
 	bool is_ima_template = false;
 	int i;
 
@@ -139,25 +149,29 @@ int ima_measurements_show(struct seq_file *m, void *v)
 	 * PCR used defaults to the same (config option) in
 	 * little-endian format, unless set in policy
 	 */
-	ima_putc(m, &e->pcr, sizeof(e->pcr));
+	pcr = !ima_canonical_fmt ? e->pcr : cpu_to_le32(e->pcr);
+	ima_putc(m, &pcr, sizeof(e->pcr));
 
 	/* 2nd: template digest */
 	ima_putc(m, e->digest, TPM_DIGEST_SIZE);
 
 	/* 3rd: template name size */
-	namelen = strlen(template_name);
+	namelen = !ima_canonical_fmt ? strlen(template_name) :
+		cpu_to_le32(strlen(template_name));
 	ima_putc(m, &namelen, sizeof(namelen));
 
 	/* 4th: template name */
-	ima_putc(m, template_name, namelen);
+	ima_putc(m, template_name, strlen(template_name));
 
 	/* 5th: template length (except for 'ima' template) */
 	if (strcmp(template_name, IMA_TEMPLATE_IMA_NAME) == 0)
 		is_ima_template = true;
 
-	if (!is_ima_template)
-		ima_putc(m, &e->template_data_len,
-			 sizeof(e->template_data_len));
+	if (!is_ima_template) {
+		template_data_len = !ima_canonical_fmt ? e->template_data_len :
+			cpu_to_le32(e->template_data_len);
+		ima_putc(m, &template_data_len, sizeof(e->template_data_len));
+	}
 
 	/* 6th: template specific data */
 	for (i = 0; i < e->template_desc->num_fields; i++) {
[PATCH v6 08/10] ima: support restoring multiple template formats
From: Mimi Zohar

The configured IMA measurement list template format can be replaced at
runtime on the boot command line, including a custom template format.
This patch adds support for restoring a measurement list containing
multiple builtin/custom template formats.

Signed-off-by: Mimi Zohar
---
 security/integrity/ima/ima_template.c | 53 +--
 1 file changed, 50 insertions(+), 3 deletions(-)

diff --git a/security/integrity/ima/ima_template.c b/security/integrity/ima/ima_template.c
index c0d808c20c40..e57b4682ff93 100644
--- a/security/integrity/ima/ima_template.c
+++ b/security/integrity/ima/ima_template.c
@@ -155,9 +155,14 @@ static int template_desc_init_fields(const char *template_fmt,
 {
 	const char *template_fmt_ptr;
 	struct ima_template_field *found_fields[IMA_TEMPLATE_NUM_FIELDS_MAX];
-	int template_num_fields = template_fmt_size(template_fmt);
+	int template_num_fields;
 	int i, len;
 
+	if (num_fields && *num_fields > 0) /* already initialized? */
+		return 0;
+
+	template_num_fields = template_fmt_size(template_fmt);
+
 	if (template_num_fields > IMA_TEMPLATE_NUM_FIELDS_MAX) {
 		pr_err("format string '%s' contains too many fields\n",
 		       template_fmt);
@@ -237,6 +242,35 @@ int __init ima_init_template(void)
 	return result;
 }
 
+static struct ima_template_desc *restore_template_fmt(char *template_name)
+{
+	struct ima_template_desc *template_desc = NULL;
+	int ret;
+
+	ret = template_desc_init_fields(template_name, NULL, NULL);
+	if (ret < 0) {
+		pr_err("attempting to initialize the template \"%s\" failed\n",
+			template_name);
+		goto out;
+	}
+
+	template_desc = kzalloc(sizeof(*template_desc), GFP_KERNEL);
+	if (!template_desc)
+		goto out;
+
+	template_desc->name = "";
+	template_desc->fmt = kstrdup(template_name, GFP_KERNEL);
+	if (!template_desc->fmt)
+		goto out;
+
+	spin_lock(&template_list);
+	list_add_tail_rcu(&template_desc->list, &defined_templates);
+	spin_unlock(&template_list);
+	synchronize_rcu();
+out:
+	return template_desc;
+}
+
 static int ima_restore_template_data(struct ima_template_desc *template_desc,
 				     void *template_data,
 				     int template_data_size,
@@ -367,10 +401,23 @@ int ima_restore_measurement_list(loff_t size, void *buf)
 		}
 		data_v1 = bufp += (u_int8_t)hdr_v1->template_name_len;
 
-		/* get template format */
 		template_desc = lookup_template_desc(template_name);
 		if (!template_desc) {
-			pr_err("template \"%s\" not found\n", template_name);
+			template_desc = restore_template_fmt(template_name);
+			if (!template_desc)
+				break;
+		}
+
+		/*
+		 * Only the running system's template format is initialized
+		 * on boot. As needed, initialize the other template formats.
+		 */
+		ret = template_desc_init_fields(template_desc->fmt,
+						&(template_desc->fields),
+						&(template_desc->num_fields));
+		if (ret < 0) {
+			pr_err("attempting to restore the template fmt \"%s\" \
+				failed\n", template_desc->fmt);
 			ret = -EINVAL;
 			break;
 		}
--
2.7.4
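The "look the template up by name or format, and add it when it is missing" flow of the patch above can be sketched in plain userspace C. This is only an illustration under simplifying assumptions: a singly linked list replaces the kernel's RCU list, there is no locking and no format validation, and `struct tmpl`, `tmpl_lookup()` and `tmpl_lookup_or_add()` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct ima_template_desc. */
struct tmpl {
	struct tmpl *next;
	const char *name;
	char *fmt;
};

/* Match on either the template name or its format string. */
static struct tmpl *tmpl_lookup(struct tmpl *head, const char *key)
{
	struct tmpl *t;

	for (t = head; t; t = t->next)
		if (strcmp(t->name, key) == 0 || strcmp(t->fmt, key) == 0)
			return t;
	return NULL;
}

/* Return the existing descriptor, or append an unnamed one for key. */
static struct tmpl *tmpl_lookup_or_add(struct tmpl **head, const char *key)
{
	struct tmpl *t, **tail;
	size_t n;

	assert(head && key);

	t = tmpl_lookup(*head, key);
	if (t)
		return t;

	t = calloc(1, sizeof(*t));
	if (!t)
		return NULL;

	n = strlen(key) + 1;
	t->fmt = malloc(n);
	if (!t->fmt) {
		free(t);	/* do not hand back a half-built descriptor */
		return NULL;
	}
	memcpy(t->fmt, key, n);
	t->name = "";		/* restored custom formats have no name */

	for (tail = head; *tail; tail = &(*tail)->next)
		;
	*tail = t;
	return t;
}
```

One deliberate difference: the sketch frees the descriptor and returns NULL when copying the format fails, whereas restore_template_fmt() as posted returns the allocated descriptor in that case.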
[PATCH v6 07/10] ima: store the builtin/custom template definitions in a list
From: Mimi Zohar

The builtin and single custom templates are currently stored in an
array. In preparation for being able to restore a measurement list
containing multiple builtin/custom templates, this patch stores the
builtin and custom templates as a linked list. This will permit
defining more than one custom template per boot.

Changelog v4:
- fix "spinlock bad magic" BUG - reported by Dmitry Vyukov

Changelog v3:
- initialize template format list in ima_template_desc_current(), as it
  might be called during __setup before normal initialization. (kernel
  test robot)
- remove __init annotation of ima_init_template_list()

Changelog v2:
- fix lookup_template_desc() preemption imbalance (kernel test robot)

Signed-off-by: Mimi Zohar
---
 security/integrity/ima/ima.h          |  2 ++
 security/integrity/ima/ima_main.c     |  1 +
 security/integrity/ima/ima_template.c | 52 +++
 3 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 139dec67dcbf..6b0540ad189f 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -85,6 +85,7 @@ struct ima_template_field {
 
 /* IMA template descriptor definition */
 struct ima_template_desc {
+	struct list_head list;
 	char *name;
 	char *fmt;
 	int num_fields;
@@ -146,6 +147,7 @@ int ima_restore_measurement_list(loff_t bufsize, void *buf);
 int ima_measurements_show(struct seq_file *m, void *v);
 unsigned long ima_get_binary_runtime_size(void);
 int ima_init_template(void);
+void ima_init_template_list(void);
 
 /*
  * used to protect h_table and sha_table
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 423d111b3b94..50818c60538b 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -418,6 +418,7 @@ static int __init init_ima(void)
 {
 	int error;
 
+	ima_init_template_list();
 	hash_setup(CONFIG_IMA_DEFAULT_HASH);
 	error = ima_init();
 	if (!error) {
diff --git a/security/integrity/ima/ima_template.c b/security/integrity/ima/ima_template.c
index 37f972cb05fe..c0d808c20c40 100644
--- a/security/integrity/ima/ima_template.c
+++ b/security/integrity/ima/ima_template.c
@@ -15,16 +15,20 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include
 #include "ima.h"
 #include "ima_template_lib.h"
 
-static struct ima_template_desc defined_templates[] = {
+static struct ima_template_desc builtin_templates[] = {
 	{.name = IMA_TEMPLATE_IMA_NAME, .fmt = IMA_TEMPLATE_IMA_FMT},
 	{.name = "ima-ng", .fmt = "d-ng|n-ng"},
 	{.name = "ima-sig", .fmt = "d-ng|n-ng|sig"},
 	{.name = "", .fmt = ""},	/* placeholder for a custom format */
 };
 
+static LIST_HEAD(defined_templates);
+static DEFINE_SPINLOCK(template_list);
+
 static struct ima_template_field supported_fields[] = {
 	{.field_id = "d", .field_init = ima_eventdigest_init,
 	 .field_show = ima_show_template_digest},
@@ -53,6 +57,8 @@ static int __init ima_template_setup(char *str)
 	if (ima_template)
 		return 1;
 
+	ima_init_template_list();
+
 	/*
 	 * Verify that a template with the supplied name exists.
 	 * If not, use CONFIG_IMA_DEFAULT_TEMPLATE.
@@ -81,7 +87,7 @@ __setup("ima_template=", ima_template_setup);
 
 static int __init ima_template_fmt_setup(char *str)
 {
-	int num_templates = ARRAY_SIZE(defined_templates);
+	int num_templates = ARRAY_SIZE(builtin_templates);
 
 	if (ima_template)
 		return 1;
@@ -92,22 +98,28 @@ static int __init ima_template_fmt_setup(char *str)
 		return 1;
 	}
 
-	defined_templates[num_templates - 1].fmt = str;
-	ima_template = defined_templates + num_templates - 1;
+	builtin_templates[num_templates - 1].fmt = str;
+	ima_template = builtin_templates + num_templates - 1;
+
 	return 1;
 }
 __setup("ima_template_fmt=", ima_template_fmt_setup);
 
 static struct ima_template_desc *lookup_template_desc(const char *name)
 {
-	int i;
+	struct ima_template_desc *template_desc;
+	int found = 0;
 
-	for (i = 0; i < ARRAY_SIZE(defined_templates); i++) {
-		if (strcmp(defined_templates[i].name, name) == 0)
-			return defined_templates + i;
+	rcu_read_lock();
+	list_for_each_entry_rcu(template_desc, &defined_templates, list) {
+		if ((strcmp(template_desc->name, name) == 0) ||
+		    (strcmp(template_desc->fmt, name) == 0)) {
+			found = 1;
+			break;
+		}
 	}
-
-	return NULL;
+	rcu_read_unlock();
+	return found ? template_desc : NULL;
 }
 
 static struct ima_template_field *lookup_template_field(const char *field_id)
@@ -183,11 +195,29 @@ static int
[PATCH v6 06/10] ima: on soft reboot, save the measurement list
From: Mimi Zohar

The TPM PCRs are only reset on a hard reboot. In order to validate a
TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
of the running kernel must be saved and restored on boot.

This patch uses the kexec buffer passing mechanism to pass the
serialized IMA binary_runtime_measurements to the next kernel.

Changelog v5:
- move writing the IMA measurement list to kexec load and remove
  from kexec execute.
- remove registering notifier to call update on kexec execute
- add includes needed by code in this patch to ima_kexec.c (Thiago)
- fold patch "ima: serialize the binary_runtime_measurements"
  into this patch.

Changelog v4:
- Revert the skip_checksum change. Instead calculate the checksum
  with the measurement list segment, on update validate the existing
  checksum before re-calculating a new checksum with the updated
  measurement list.

Changelog v3:
- Request a kexec segment for storing the measurement list a half page,
  not a full page, more than needed for additional measurements.
- Added binary_runtime_size overflow test
- Limit maximum number of pages needed for kexec_segment_size to
  half of totalram_pages. (Dave Young)

Changelog v2:
- Fix build issue by defining a stub ima_add_kexec_buffer and stub
  struct kimage when CONFIG_IMA=n and CONFIG_IMA_KEXEC=n. (Fenguang Wu)
- removed kexec_add_handover_buffer() checksum argument.
- added skip_checksum member to kexec_buf
- only register reboot notifier once

Changelog v1:
- updated to call IMA functions (Mimi)
- move code from ima_template.c to ima_kexec.c (Mimi)

Signed-off-by: Thiago Jung Bauermann
Signed-off-by: Mimi Zohar
Acked-by: "Eric W. Biederman"
---
 include/linux/ima.h                |  12
 kernel/kexec_file.c                |   4 ++
 security/integrity/ima/ima.h       |   1 +
 security/integrity/ima/ima_fs.c    |   2 +-
 security/integrity/ima/ima_kexec.c | 117 +
 5 files changed, 135 insertions(+), 1 deletion(-)

diff --git a/include/linux/ima.h b/include/linux/ima.h
index 0eb7c2e7f0d6..7f6952f8d6aa 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -11,6 +11,7 @@
 #define _LINUX_IMA_H
 
 #include
+#include
 struct linux_binprm;
 
 #ifdef CONFIG_IMA
@@ -23,6 +24,10 @@ extern int ima_post_read_file(struct file *file, void *buf, loff_t size,
 			      enum kernel_read_file_id id);
 extern void ima_post_path_mknod(struct dentry *dentry);
 
+#ifdef CONFIG_IMA_KEXEC
+extern void ima_add_kexec_buffer(struct kimage *image);
+#endif
+
 #else
 static inline int ima_bprm_check(struct linux_binprm *bprm)
 {
@@ -62,6 +67,13 @@ static inline void ima_post_path_mknod(struct dentry *dentry)
 
 #endif /* CONFIG_IMA */
 
+#ifndef CONFIG_IMA_KEXEC
+struct kimage;
+
+static inline void ima_add_kexec_buffer(struct kimage *image)
+{}
+#endif
+
 #ifdef CONFIG_IMA_APPRAISE
 extern void ima_inode_post_setattr(struct dentry *dentry);
 extern int ima_inode_setxattr(struct dentry *dentry, const char *xattr_name,
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 0c2df7f73792..b56a558e406d 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -132,6 +133,9 @@ kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
 		return ret;
 	image->kernel_buf_len = size;
 
+	/* IMA needs to pass the measurement list to the next kernel. */
+	ima_add_kexec_buffer(image);
+
 	/* Call arch image probe handlers */
 	ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
 					    image->kernel_buf_len);
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index ea1dcc452911..139dec67dcbf 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -143,6 +143,7 @@ void ima_print_digest(struct seq_file *m, u8 *digest, u32 size);
 struct ima_template_desc *ima_template_desc_current(void);
 int ima_restore_measurement_entry(struct ima_template_entry *entry);
 int ima_restore_measurement_list(loff_t bufsize, void *buf);
+int ima_measurements_show(struct seq_file *m, void *v);
 unsigned long ima_get_binary_runtime_size(void);
 int ima_init_template(void);
 
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index c07a3844ea0a..66e5dd5e226f 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -116,7 +116,7 @@ void ima_putc(struct seq_file *m, void *data, int datalen)
  *       [eventdata length]
  *       eventdata[n]=template specific data
  */
-static int ima_measurements_show(struct seq_file *m, void *v)
+int ima_measurements_show(struct seq_file *m, void *v)
 {
 	/* the list never shrinks, so we don't need a lock here */
 	struct
[PATCH v6 05/10] powerpc: ima: Send the kexec buffer to the next kernel
The IMA kexec buffer allows the currently running kernel to pass
the measurement list via a kexec segment to the kernel that will be
kexec'd.

This is the architecture-specific part of setting up the IMA kexec
buffer for the next kernel. It will be used in the next patch.

Changelog v5:
- New patch in this version. This code was previously in the kexec buffer
  handover patch series.

Changelog relative to kexec handover patches v5:
- Moved code to arch/powerpc/kernel/ima_kexec.c.
- Renamed functions and struct members to variations of ima_kexec_buffer
  instead of variations of kexec_handover_buffer.
- Use a single property /chosen/linux,ima-kexec-buffer containing
  the buffer address and length, instead of
  /chosen/linux,kexec-handover-buffer-{start,end}.
- Use #address-cells and #size-cells to write the DT property.
- Use size_t instead of unsigned long for size arguments.
- Use CONFIG_IMA_KEXEC to build this code only when necessary.

Signed-off-by: Thiago Jung Bauermann
Acked-by: "Eric W. Biederman"
---
 arch/powerpc/include/asm/ima.h              | 16 +
 arch/powerpc/include/asm/kexec.h            | 14 -
 arch/powerpc/kernel/ima_kexec.c             | 91 +
 arch/powerpc/kernel/kexec_elf_64.c          |  2 +-
 arch/powerpc/kernel/machine_kexec_file_64.c | 12 +++-
 5 files changed, 129 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/ima.h b/arch/powerpc/include/asm/ima.h
index d5a72dd9b499..2313bdface34 100644
--- a/arch/powerpc/include/asm/ima.h
+++ b/arch/powerpc/include/asm/ima.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_POWERPC_IMA_H
 #define _ASM_POWERPC_IMA_H
 
+struct kimage;
+
 int ima_get_kexec_buffer(void **addr, size_t *size);
 int ima_free_kexec_buffer(void);
 
@@ -10,4 +12,18 @@ void remove_ima_buffer(void *fdt, int chosen_node);
 static inline void remove_ima_buffer(void *fdt, int chosen_node) {}
 #endif
 
+#ifdef CONFIG_IMA_KEXEC
+int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr,
+			      size_t size);
+
+int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node);
+#else
+static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
+				   int chosen_node)
+{
+	remove_ima_buffer(fdt, chosen_node);
+	return 0;
+}
+#endif /* CONFIG_IMA_KEXEC */
+
 #endif /* _ASM_POWERPC_IMA_H */
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 23056d2dc330..a49cab287acb 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -94,12 +94,22 @@ static inline bool kdump_in_progress(void)
 #ifdef CONFIG_KEXEC_FILE
 extern struct kexec_file_ops kexec_elf64_ops;
 
+#ifdef CONFIG_IMA_KEXEC
+#define ARCH_HAS_KIMAGE_ARCH
+
+struct kimage_arch {
+	phys_addr_t ima_buffer_addr;
+	size_t ima_buffer_size;
+};
+#endif
+
 int setup_purgatory(struct kimage *image, const void *slave_code,
 		    const void *fdt, unsigned long kernel_load_addr,
 		    unsigned long fdt_load_addr, unsigned long stack_top,
 		    int debug);
-int setup_new_fdt(void *fdt, unsigned long initrd_load_addr,
-		  unsigned long initrd_len, const char *cmdline);
+int setup_new_fdt(const struct kimage *image, void *fdt,
+		  unsigned long initrd_load_addr, unsigned long initrd_len,
+		  const char *cmdline);
 bool find_debug_console(const void *fdt);
 int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
 #endif /* CONFIG_KEXEC_FILE */
diff --git a/arch/powerpc/kernel/ima_kexec.c b/arch/powerpc/kernel/ima_kexec.c
index 36e5a5df3804..5ea42c937ca9 100644
--- a/arch/powerpc/kernel/ima_kexec.c
+++ b/arch/powerpc/kernel/ima_kexec.c
@@ -130,3 +130,94 @@ void remove_ima_buffer(void *fdt, int chosen_node)
 	if (!ret)
 		pr_debug("Removed old IMA buffer reservation.\n");
 }
+
+#ifdef CONFIG_IMA_KEXEC
+/**
+ * arch_ima_add_kexec_buffer - do arch-specific steps to add the IMA buffer
+ *
+ * Architectures should use this function to pass on the IMA buffer
+ * information to the next kernel.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr,
+			      size_t size)
+{
+	image->arch.ima_buffer_addr = load_addr;
+	image->arch.ima_buffer_size = size;
+
+	return 0;
+}
+
+static int write_number(void *p, u64 value, int cells)
+{
+	if (cells == 1) {
+		u32 tmp;
+
+		if (value > U32_MAX)
+			return -EINVAL;
+
+		tmp = cpu_to_be32(value);
+		memcpy(p, &tmp, sizeof(tmp));
+	} else if (cells == 2) {
+		u64 tmp;
+
+		tmp = cpu_to_be64(value);
+		memcpy(p, &tmp, sizeof(tmp));
+	} else
+		return -EINVAL;
+
+
[PATCH v6 01/10] powerpc: ima: Get the kexec buffer passed by the previous kernel
The IMA kexec buffer allows the currently running kernel to pass
the measurement list via a kexec segment to the kernel that will be
kexec'd. The second kernel can check whether the previous kernel sent
the buffer and retrieve it.

This is the architecture-specific part which enables IMA to receive the
measurement list passed by the previous kernel. It will be used in the
next patch.

The change in machine_kexec_64.c is to factor out the logic of removing
an FDT memory reservation so that it can be used by remove_ima_buffer.

Changelog v6:
- The kexec_file_load patches v9 already define delete_fdt_mem_rsv,
  so now we just need to export it.

Changelog v5:
- New patch in this version. This code was previously in the kexec buffer
  handover patch series.

Changelog relative to kexec handover patches v5:
- Added CONFIG_HAVE_IMA_KEXEC.
- Added arch/powerpc/include/asm/ima.h.
- Moved code to arch/powerpc/kernel/ima_kexec.c.
- Renamed functions to variations of ima_kexec_buffer instead of
  variations of kexec_handover_buffer.
- Use a single property /chosen/linux,ima-kexec-buffer containing
  the buffer address and length, instead of
  /chosen/linux,kexec-handover-buffer-{start,end}.
- Use #address-cells and #size-cells to read the DT property.
- Use size_t instead of unsigned long for size arguments.
- Always remove linux,ima-kexec-buffer and its memory reservation
  when preparing a device tree for kexec_file_load.

Signed-off-by: Thiago Jung Bauermann
Acked-by: "Eric W. Biederman"
---
 arch/Kconfig                                |   3 +
 arch/powerpc/Kconfig                        |   1 +
 arch/powerpc/include/asm/ima.h              |  13 +++
 arch/powerpc/include/asm/kexec.h            |   1 +
 arch/powerpc/kernel/Makefile                |   4 +
 arch/powerpc/kernel/ima_kexec.c             | 132
 arch/powerpc/kernel/machine_kexec_file_64.c |   5 +-
 7 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 659bdd079277..e1605ff286a1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -5,6 +5,9 @@ config KEXEC_CORE
 	bool
 
+config HAVE_IMA_KEXEC
+	bool
+
 config OPROFILE
 	tristate "OProfile system profiling"
 	depends on PROFILING
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 897d0f14447d..40ee044f1915 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -458,6 +458,7 @@ config KEXEC
 config KEXEC_FILE
 	bool "kexec file based system call"
 	select KEXEC_CORE
+	select HAVE_IMA_KEXEC
 	select BUILD_BIN2C
 	depends on PPC64
 	depends on CRYPTO=y
diff --git a/arch/powerpc/include/asm/ima.h b/arch/powerpc/include/asm/ima.h
new file mode 100644
index ..d5a72dd9b499
--- /dev/null
+++ b/arch/powerpc/include/asm/ima.h
@@ -0,0 +1,13 @@
+#ifndef _ASM_POWERPC_IMA_H
+#define _ASM_POWERPC_IMA_H
+
+int ima_get_kexec_buffer(void **addr, size_t *size);
+int ima_free_kexec_buffer(void);
+
+#ifdef CONFIG_IMA
+void remove_ima_buffer(void *fdt, int chosen_node);
+#else
+static inline void remove_ima_buffer(void *fdt, int chosen_node) {}
+#endif
+
+#endif /* _ASM_POWERPC_IMA_H */
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 4497db7555b0..23056d2dc330 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -101,6 +101,7 @@ int setup_purgatory(struct kimage *image, const void *slave_code,
 int setup_new_fdt(void *fdt, unsigned long initrd_load_addr,
 		  unsigned long initrd_len, const char *cmdline);
 bool find_debug_console(const void *fdt);
+int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
 #endif /* CONFIG_KEXEC_FILE */
 
 #else /* !CONFIG_KEXEC_CORE */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 424b13b1b2b0..c3b37171168c 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -111,6 +111,10 @@ obj-$(CONFIG_KEXEC_CORE)	+= machine_kexec.o crash.o \
 				   machine_kexec_$(BITS).o
 obj-$(CONFIG_KEXEC_FILE)	+= machine_kexec_file_$(BITS).o elf_util.o \
 				   kexec_elf_$(BITS).o
+ifeq ($(CONFIG_HAVE_IMA_KEXEC)$(CONFIG_IMA),yy)
+obj-y				+= ima_kexec.o
+endif
+
 obj-$(CONFIG_AUDIT)		+= audit.o
 obj64-$(CONFIG_AUDIT)		+= compat_audit.o
 
diff --git a/arch/powerpc/kernel/ima_kexec.c b/arch/powerpc/kernel/ima_kexec.c
new file mode 100644
index ..36e5a5df3804
--- /dev/null
+++ b/arch/powerpc/kernel/ima_kexec.c
@@ -0,0 +1,132 @@
+/*
+ * Copyright (C) 2016 IBM Corporation
+ *
+ * Authors:
+ * Thiago Jung Bauermann
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either
[PATCH v6 04/10] ima: maintain memory size needed for serializing the measurement list
From: Mimi Zohar

In preparation for serializing the binary_runtime_measurements, this patch
maintains the amount of memory required.

Changelog v5:
- replace CONFIG_KEXEC_FILE with architecture CONFIG_HAVE_IMA_KEXEC (Thiago)

Changelog v3:
- include the ima_kexec_hdr size in the binary_runtime_measurement size.

Signed-off-by: Mimi Zohar
---
 security/integrity/ima/Kconfig     | 12 +
 security/integrity/ima/ima.h       |  1 +
 security/integrity/ima/ima_queue.c | 53 --
 3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
index 5487827fa86c..370eb2f4dd37 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -27,6 +27,18 @@ config IMA
 	  to learn more about IMA.
 	  If unsure, say N.
 
+config IMA_KEXEC
+	bool "Enable carrying the IMA measurement list across a soft boot"
+	depends on IMA && TCG_TPM && HAVE_IMA_KEXEC
+	default n
+	help
+	   TPM PCRs are only reset on a hard reboot.  In order to validate
+	   a TPM's quote after a soft boot, the IMA measurement list of the
+	   running kernel must be saved and restored on boot.
+
+	   Depending on the IMA policy, the measurement list can grow to
+	   be very large.
+
 config IMA_MEASURE_PCR_IDX
 	int
 	depends on IMA
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 51dc8d57d64d..ea1dcc452911 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -143,6 +143,7 @@ void ima_print_digest(struct seq_file *m, u8 *digest, u32 size);
 struct ima_template_desc *ima_template_desc_current(void);
 int ima_restore_measurement_entry(struct ima_template_entry *entry);
 int ima_restore_measurement_list(loff_t bufsize, void *buf);
+unsigned long ima_get_binary_runtime_size(void);
 int ima_init_template(void);
 
 /*
diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
index 12d1b040bca9..3a3cc2a45645 100644
--- a/security/integrity/ima/ima_queue.c
+++ b/security/integrity/ima/ima_queue.c
@@ -29,6 +29,11 @@
 #define AUDIT_CAUSE_LEN_MAX 32
 
 LIST_HEAD(ima_measurements);	/* list of all measurements */
+#ifdef CONFIG_IMA_KEXEC
+static unsigned long binary_runtime_size;
+#else
+static unsigned long binary_runtime_size = ULONG_MAX;
+#endif
 
 /* key: inode (before secure-hashing a file) */
 struct ima_h_table ima_htable = {
@@ -64,6 +69,24 @@ static struct ima_queue_entry *ima_lookup_digest_entry(u8 *digest_value,
 	return ret;
 }
 
+/*
+ * Calculate the memory required for serializing a single
+ * binary_runtime_measurement list entry, which contains a
+ * couple of variable length fields (e.g template name and data).
+ */
+static int get_binary_runtime_size(struct ima_template_entry *entry)
+{
+	int size = 0;
+
+	size += sizeof(u32);	/* pcr */
+	size += sizeof(entry->digest);
+	size += sizeof(int);	/* template name size field */
+	size += strlen(entry->template_desc->name);
+	size += sizeof(entry->template_data_len);
+	size += entry->template_data_len;
+	return size;
+}
+
 /* ima_add_template_entry helper function:
  * - Add template entry to the measurement list and hash table, for
  *   all entries except those carried across kexec.
@@ -90,9 +113,30 @@ static int ima_add_digest_entry(struct ima_template_entry *entry, int flags)
 		key = ima_hash_key(entry->digest);
 		hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
 	}
+
+	if (binary_runtime_size != ULONG_MAX) {
+		int size;
+
+		size = get_binary_runtime_size(entry);
+		binary_runtime_size = (binary_runtime_size < ULONG_MAX - size) ?
+		     binary_runtime_size + size : ULONG_MAX;
+	}
 	return 0;
 }
 
+/*
+ * Return the amount of memory required for serializing the
+ * entire binary_runtime_measurement list, including the ima_kexec_hdr
+ * structure.
+ */
+unsigned long ima_get_binary_runtime_size(void)
+{
+	if (binary_runtime_size >= (ULONG_MAX - sizeof(struct ima_kexec_hdr)))
+		return ULONG_MAX;
+	else
+		return binary_runtime_size + sizeof(struct ima_kexec_hdr);
+};
+
 static int ima_pcr_extend(const u8 *hash, int pcr)
 {
 	int result = 0;
@@ -106,8 +150,13 @@ static int ima_pcr_extend(const u8 *hash, int pcr)
 	return result;
 }
 
-/* Add template entry to the measurement list and hash table,
- * and extend the pcr.
+/*
+ * Add template entry to the measurement list and hash table, and
+ * extend the pcr.
+ *
+ * On systems which support carrying the IMA measurement list across
+ * kexec, maintain the total memory size required for serializing the
+ * binary_runtime_measurements.
  */
 int ima_add_template_entry(struct ima_template_entry
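The size bookkeeping in the patch above can be tried outside the kernel. What follows is only a userspace sketch, not the kernel code: the names demo_entry_size() and demo_sat_add() are hypothetical, the digest is assumed to be 20 bytes (SHA-1 sized), and the two length fields are assumed to be 32 bits wide. It shows the per-entry size sum and the saturating accumulation that pins the running total at ULONG_MAX instead of letting it wrap.

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: a 20-byte (SHA-1 sized) digest field. */
#define DEMO_DIGEST_SIZE 20

/* Bytes needed to serialize one entry: fixed fields plus two variable ones. */
static unsigned long demo_entry_size(const char *template_name,
				     uint32_t template_data_len)
{
	unsigned long size = 0;

	size += sizeof(uint32_t);	/* pcr */
	size += DEMO_DIGEST_SIZE;	/* digest */
	size += sizeof(uint32_t);	/* template name length field */
	size += strlen(template_name);	/* template name, without the NUL */
	size += sizeof(uint32_t);	/* template data length field */
	size += template_data_len;	/* template data */
	return size;
}

/*
 * Saturating add with the same comparison as the patch: once the total
 * would reach or pass ULONG_MAX it stays at ULONG_MAX, which the patch
 * also uses as the "not tracking" value.
 */
static unsigned long demo_sat_add(unsigned long total, unsigned long size)
{
	return (total < ULONG_MAX - size) ? total + size : ULONG_MAX;
}
```

Using ULONG_MAX both as the overflow result and as the disabled marker means a caller only has to test for one special value before sizing a buffer.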
[PATCH v6 00/10] ima: carry the measurement list across kexec
Hello,

This is just a rebase on top of kexec_file_load patches v9 which I just
posted. The previous version of this series has some conflicts with it.

Original cover letter:

The TPM PCRs are only reset on a hard reboot. In order to validate a
TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list
of the running kernel must be saved and then restored on the subsequent
boot, possibly of a different architecture.

The existing securityfs binary_runtime_measurements file conveniently
provides a serialized format of the IMA measurement list. This patch
set serializes the measurement list in this format and restores it.

Up to now, the binary_runtime_measurements was defined as architecture
native format. The assumption being that userspace could and would
handle any architecture conversions. With the ability of carrying the
measurement list across kexec, possibly from one architecture to a
different one, the per boot architecture information is lost and with it
the ability of recalculating the template digest hash. To resolve this
problem, without breaking the existing ABI, this patch set introduces
the boot command line option "ima_canonical_fmt", which is arbitrarily
defined as little endian.

The need for this boot command line option will be limited to the
existing version 1 format of the binary_runtime_measurements.
Subsequent formats will be defined as canonical format (eg. TPM 2.0
support for larger digests).

A simplified method of Thiago Bauermann's "kexec buffer handover" patch
series for carrying the IMA measurement list across kexec is included
in this patch set. The simplified method requires all file measurements
be taken prior to executing the kexec load, as subsequent measurements
will not be carried across the kexec and restored.

Changelog v6:
- Rebased on top of "kexec_file_load implementation for PowerPC"
  patches v9.

Changelog v5:
- Included patches from Thiago Bauermann's "kexec buffer handover"
  patch series for carrying the IMA measurement list across kexec.
- Added CONFIG_HAVE_IMA_KEXEC
- Renamed functions to variations of ima_kexec_buffer instead of
  variations of kexec_handover_buffer

Changelog v4:
- Fixed "spinlock bad magic" BUG - reported by Dmitry Vyukov
- Rebased on Thiago Bauermann's v5 patch set
- Removed the skip_checksum initialization

Changelog v3:
- Cleaned up the code for calculating the requested kexec segment size
  needed for the IMA measurement list, limiting the segment size to half
  of the totalram_pages.
- Fixed kernel test robot reports as enumerated in the respective
  patch changelog.

Changelog v2:
- Canonical measurement list support added
- Redefined the ima_kexec_hdr struct to use well defined sizes

Andreas Steffen (1):
  ima: platform-independent hash value

Mimi Zohar (7):
  ima: on soft reboot, restore the measurement list
  ima: permit duplicate measurement list entries
  ima: maintain memory size needed for serializing the measurement list
  ima: on soft reboot, save the measurement list
  ima: store the builtin/custom template definitions in a list
  ima: support restoring multiple template formats
  ima: define a canonical binary_runtime_measurements list format

Thiago Jung Bauermann (2):
  powerpc: ima: Get the kexec buffer passed by the previous kernel
  powerpc: ima: Send the kexec buffer to the next kernel

 Documentation/kernel-parameters.txt         |   4 +
 arch/Kconfig                                |   3 +
 arch/powerpc/Kconfig                        |   1 +
 arch/powerpc/include/asm/ima.h              |  29 +++
 arch/powerpc/include/asm/kexec.h            |  15 +-
 arch/powerpc/kernel/Makefile                |   4 +
 arch/powerpc/kernel/ima_kexec.c             | 223 +
 arch/powerpc/kernel/kexec_elf_64.c          |   2 +-
 arch/powerpc/kernel/machine_kexec_file_64.c |  15 +-
 include/linux/ima.h                         |  12 ++
 kernel/kexec_file.c                         |   4 +
 security/integrity/ima/Kconfig              |  12 ++
 security/integrity/ima/Makefile             |   1 +
 security/integrity/ima/ima.h                |  31 +++
 security/integrity/ima/ima_crypto.c         |   6 +-
 security/integrity/ima/ima_fs.c             |  30 ++-
 security/integrity/ima/ima_init.c           |   2 +
 security/integrity/ima/ima_kexec.c          | 168
 security/integrity/ima/ima_main.c           |   1 +
 security/integrity/ima/ima_queue.c          |  76 +++-
 security/integrity/ima/ima_template.c       | 293 ++--
 security/integrity/ima/ima_template_lib.c   |   7 +-
 22 files changed, 901 insertions(+), 38 deletions(-)
 create mode 100644 arch/powerpc/include/asm/ima.h
 create mode 100644 arch/powerpc/kernel/ima_kexec.c
 create mode 100644 security/integrity/ima/ima_kexec.c

-- 
2.7.4
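The cover letter defines the canonical format as little endian. As an illustration only, here is one way to write and read a 32-bit field in a fixed little-endian byte order regardless of the host's endianness. The helpers demo_put_le32() and demo_get_le32() are hypothetical names for this sketch; kernel code would normally use the existing byte-order helpers rather than open-coded shifts.

```c
#include <stdint.h>

/* Store a 32-bit value as four bytes, least significant byte first. */
static void demo_put_le32(uint8_t out[4], uint32_t value)
{
	out[0] = (uint8_t)(value & 0xff);
	out[1] = (uint8_t)((value >> 8) & 0xff);
	out[2] = (uint8_t)((value >> 16) & 0xff);
	out[3] = (uint8_t)((value >> 24) & 0xff);
}

/* Read back a 32-bit value stored least significant byte first. */
static uint32_t demo_get_le32(const uint8_t in[4])
{
	return (uint32_t)in[0] |
	       ((uint32_t)in[1] << 8) |
	       ((uint32_t)in[2] << 16) |
	       ((uint32_t)in[3] << 24);
}
```

Because the byte order is fixed by the format instead of by the CPU, a list written by a big-endian kernel can be parsed by a little-endian one, which is the situation the cover letter describes for kexec across architectures.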
[PATCH v6 03/10] ima: permit duplicate measurement list entries
From: Mimi Zohar

Measurements carried across kexec need to be added to the IMA
measurement list, but should not prevent measurements of the newly
booted kernel from being added to the measurement list. This patch
adds support for allowing duplicate measurements.

The "boot_aggregate" measurement entry is the delimiter between soft
boots.

Signed-off-by: Mimi Zohar
---
 security/integrity/ima/ima_queue.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
index 4b1bb7787839..12d1b040bca9 100644
--- a/security/integrity/ima/ima_queue.c
+++ b/security/integrity/ima/ima_queue.c
@@ -65,11 +65,12 @@ static struct ima_queue_entry *ima_lookup_digest_entry(u8 *digest_value,
 }
 
 /* ima_add_template_entry helper function:
- * - Add template entry to measurement list and hash table.
+ * - Add template entry to the measurement list and hash table, for
+ *   all entries except those carried across kexec.
  *
  * (Called with ima_extend_list_mutex held.)
  */
-static int ima_add_digest_entry(struct ima_template_entry *entry)
+static int ima_add_digest_entry(struct ima_template_entry *entry, int flags)
 {
 	struct ima_queue_entry *qe;
 	unsigned int key;
@@ -85,8 +86,10 @@ static int ima_add_digest_entry(struct ima_template_entry *entry)
 	list_add_tail_rcu(&qe->later, &ima_measurements);
 
 	atomic_long_inc(&ima_htable.len);
-	key = ima_hash_key(entry->digest);
-	hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
+	if (flags) {
+		key = ima_hash_key(entry->digest);
+		hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
+	}
 	return 0;
 }
 
@@ -126,7 +129,7 @@ int ima_add_template_entry(struct ima_template_entry *entry, int violation,
 		}
 	}
 
-	result = ima_add_digest_entry(entry);
+	result = ima_add_digest_entry(entry, 1);
 	if (result < 0) {
 		audit_cause = "ENOMEM";
 		audit_info = 0;
@@ -155,7 +158,7 @@ int ima_restore_measurement_entry(struct ima_template_entry *entry)
 	int result = 0;
 
 	mutex_lock(&ima_extend_list_mutex);
-	result = ima_add_digest_entry(entry);
+	result = ima_add_digest_entry(entry, 0);
 	mutex_unlock(&ima_extend_list_mutex);
 	return result;
 }
-- 
2.7.4
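The effect of the new flags argument can be modelled with a toy structure. This is a simplified sketch under stated assumptions, not the IMA data structures: struct demo_log, demo_add() and demo_is_indexed() are hypothetical, digests are plain ints, and the "hash table" is a linear array. Every entry goes on the list, but only entries added with a non-zero flag become visible to the duplicate lookup.

```c
#include <stddef.h>

#define DEMO_MAX 16

/* Toy model: an append-only list plus a separate duplicate-lookup index. */
struct demo_log {
	int list[DEMO_MAX];
	size_t list_len;
	int index[DEMO_MAX];
	size_t index_len;
};

/* Return 1 if the digest is visible to duplicate detection, else 0. */
static int demo_is_indexed(const struct demo_log *log, int digest)
{
	size_t i;

	for (i = 0; i < log->index_len; i++)
		if (log->index[i] == digest)
			return 1;
	return 0;
}

/*
 * Always append to the list; add to the lookup index only when
 * update_index is non-zero (the patch passes 1 for new measurements
 * and 0 for entries restored from the previous kernel).
 */
static int demo_add(struct demo_log *log, int digest, int update_index)
{
	if (log->list_len == DEMO_MAX)
		return -1;
	log->list[log->list_len++] = digest;
	if (update_index && log->index_len < DEMO_MAX)
		log->index[log->index_len++] = digest;
	return 0;
}
```

In this model a restored digest never shadows a fresh measurement of the same file, so the same digest can appear once per soft boot in the list.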
[PATCH v6 02/10] ima: on soft reboot, restore the measurement list
From: Mimi ZoharThe TPM PCRs are only reset on a hard reboot. In order to validate a TPM's quote after a soft reboot (eg. kexec -e), the IMA measurement list of the running kernel must be saved and restored on boot. This patch restores the measurement list. Changelog v5: - replace CONFIG_KEXEC_FILE with architecture CONFIG_HAVE_IMA_KEXEC (Thiago) - replace kexec_get_handover_buffer() with ima_get_kexec_buffer() (Thiago) - replace kexec_free_handover_buffer() with ima_free_kexec_buffer() (Thiago) - remove unnecessary includes from ima_kexec.c (Thiago) - fix off-by-one error when checking hdr_v1->template_name_len (Colin King) Changelog v2: - redefined ima_kexec_hdr to use types with well defined sizes (M. Ellerman) - defined missing ima_load_kexec_buffer() stub function Changelog v1: - call ima_load_kexec_buffer() (Thiago) Signed-off-by: Mimi Zohar --- security/integrity/ima/Makefile | 1 + security/integrity/ima/ima.h | 21 + security/integrity/ima/ima_init.c | 2 + security/integrity/ima/ima_kexec.c| 44 + security/integrity/ima/ima_queue.c| 10 ++ security/integrity/ima/ima_template.c | 170 ++ 6 files changed, 248 insertions(+) diff --git a/security/integrity/ima/Makefile b/security/integrity/ima/Makefile index 9aeaedad1e2b..29f198bde02b 100644 --- a/security/integrity/ima/Makefile +++ b/security/integrity/ima/Makefile @@ -8,4 +8,5 @@ obj-$(CONFIG_IMA) += ima.o ima-y := ima_fs.o ima_queue.o ima_init.o ima_main.o ima_crypto.o ima_api.o \ ima_policy.o ima_template.o ima_template_lib.o ima-$(CONFIG_IMA_APPRAISE) += ima_appraise.o +ima-$(CONFIG_HAVE_IMA_KEXEC) += ima_kexec.o obj-$(CONFIG_IMA_BLACKLIST_KEYRING) += ima_mok.o diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h index db25f54a04fe..51dc8d57d64d 100644 --- a/security/integrity/ima/ima.h +++ b/security/integrity/ima/ima.h @@ -28,6 +28,10 @@ #include "../integrity.h" +#ifdef CONFIG_HAVE_IMA_KEXEC +#include +#endif + enum ima_show_type { IMA_SHOW_BINARY, IMA_SHOW_BINARY_NO_FIELD_LEN, 
IMA_SHOW_BINARY_OLD_STRING_FMT, IMA_SHOW_ASCII }; enum tpm_pcrs { TPM_PCR0 = 0, TPM_PCR8 = 8 }; @@ -102,6 +106,21 @@ struct ima_queue_entry { }; extern struct list_head ima_measurements; /* list of all measurements */ +/* Some details preceding the binary serialized measurement list */ +struct ima_kexec_hdr { + u16 version; + u16 _reserved0; + u32 _reserved1; + u64 buffer_size; + u64 count; +}; + +#ifdef CONFIG_HAVE_IMA_KEXEC +void ima_load_kexec_buffer(void); +#else +static inline void ima_load_kexec_buffer(void) {} +#endif /* CONFIG_HAVE_IMA_KEXEC */ + /* Internal IMA function definitions */ int ima_init(void); int ima_fs_init(void); @@ -122,6 +141,8 @@ int ima_init_crypto(void); void ima_putc(struct seq_file *m, void *data, int datalen); void ima_print_digest(struct seq_file *m, u8 *digest, u32 size); struct ima_template_desc *ima_template_desc_current(void); +int ima_restore_measurement_entry(struct ima_template_entry *entry); +int ima_restore_measurement_list(loff_t bufsize, void *buf); int ima_init_template(void); /* diff --git a/security/integrity/ima/ima_init.c b/security/integrity/ima/ima_init.c index 32912bd54ead..3ba0ca49cba6 100644 --- a/security/integrity/ima/ima_init.c +++ b/security/integrity/ima/ima_init.c @@ -128,6 +128,8 @@ int __init ima_init(void) if (rc != 0) return rc; + ima_load_kexec_buffer(); + rc = ima_add_boot_aggregate(); /* boot aggregate must be first entry */ if (rc != 0) return rc; diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c new file mode 100644 index ..36afd0fe9747 --- /dev/null +++ b/security/integrity/ima/ima_kexec.c @@ -0,0 +1,44 @@ +/* + * Copyright (C) 2016 IBM Corporation + * + * Authors: + * Thiago Jung Bauermann + * Mimi Zohar + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ */ +#include "ima.h" + +/* + * Restore the measurement list from the previous kernel. + */ +void ima_load_kexec_buffer(void) +{ + void *kexec_buffer = NULL; + size_t kexec_buffer_size = 0; + int rc; + + rc = ima_get_kexec_buffer(_buffer, _buffer_size); + switch (rc) { + case 0: + rc = ima_restore_measurement_list(kexec_buffer_size, + kexec_buffer); + if (rc != 0) + pr_err("Failed to restore the measurement list: %d\n", + rc); + +
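The ima_kexec_hdr definition in this patch uses only fixed-width fields, so its layout does not depend on the architecture's native int or long. The sketch below mirrors that field list in userspace to show the resulting size. The bounds check demo_hdr_fits() is purely hypothetical: the patch text quoted here does not show how the restore code validates the header, so treat it as one plausible sanity test, with buffer_size assumed to describe a byte count that must not exceed what was handed over.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the field list in struct ima_kexec_hdr. */
struct demo_kexec_hdr {
	uint16_t version;
	uint16_t reserved0;
	uint32_t reserved1;
	uint64_t buffer_size;
	uint64_t count;
};

/*
 * Hypothetical sanity check: the handed-over region must at least hold
 * the header, and the size the header claims must not exceed the bytes
 * that are actually available.
 */
static int demo_hdr_fits(const struct demo_kexec_hdr *hdr, size_t avail)
{
	if (avail < sizeof(*hdr))
		return 0;
	return hdr->buffer_size <= avail;
}
```

The two reserved fields pad the 16-bit version out to 8 bytes, so the 64-bit members are naturally aligned and the compiler inserts no hidden padding.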
[PATCH v9 10/10] powerpc: Enable CONFIG_KEXEC_FILE in powerpc server defconfigs.
Enable CONFIG_KEXEC_FILE in powernv_defconfig, ppc64_defconfig and
pseries_defconfig.

It depends on CONFIG_CRYPTO_SHA256=y, so add that as well.

Signed-off-by: Thiago Jung Bauermann
---
 arch/powerpc/configs/powernv_defconfig | 2 ++
 arch/powerpc/configs/ppc64_defconfig   | 2 ++
 arch/powerpc/configs/pseries_defconfig | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/arch/powerpc/configs/powernv_defconfig b/arch/powerpc/configs/powernv_defconfig
index d98b6eb3254f..5a190aa5534b 100644
--- a/arch/powerpc/configs/powernv_defconfig
+++ b/arch/powerpc/configs/powernv_defconfig
@@ -49,6 +49,7 @@ CONFIG_BINFMT_MISC=m
 CONFIG_PPC_TRANSACTIONAL_MEM=y
 CONFIG_HOTPLUG_CPU=y
 CONFIG_KEXEC=y
+CONFIG_KEXEC_FILE=y
 CONFIG_IRQ_ALL_CPUS=y
 CONFIG_NUMA=y
 CONFIG_MEMORY_HOTPLUG=y
@@ -301,6 +302,7 @@ CONFIG_CRYPTO_CCM=m
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_HMAC=y
 CONFIG_CRYPTO_MICHAEL_MIC=m
+CONFIG_CRYPTO_SHA256=y
 CONFIG_CRYPTO_TGR192=m
 CONFIG_CRYPTO_WP512=m
 CONFIG_CRYPTO_ANUBIS=m
diff --git a/arch/powerpc/configs/ppc64_defconfig b/arch/powerpc/configs/ppc64_defconfig
index 58a98d40086f..0059d2088b9c 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -46,6 +46,7 @@ CONFIG_HZ_100=y
 CONFIG_BINFMT_MISC=m
 CONFIG_PPC_TRANSACTIONAL_MEM=y
 CONFIG_KEXEC=y
+CONFIG_KEXEC_FILE=y
 CONFIG_CRASH_DUMP=y
 CONFIG_IRQ_ALL_CPUS=y
 CONFIG_MEMORY_HOTREMOVE=y
@@ -336,6 +337,7 @@ CONFIG_CRYPTO_TEST=m
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_HMAC=y
 CONFIG_CRYPTO_MICHAEL_MIC=m
+CONFIG_CRYPTO_SHA256=y
 CONFIG_CRYPTO_TGR192=m
 CONFIG_CRYPTO_WP512=m
 CONFIG_CRYPTO_ANUBIS=m
diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig
index 8a3bc016b732..f022f657a984 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -52,6 +52,7 @@ CONFIG_HZ_100=y
 CONFIG_BINFMT_MISC=m
 CONFIG_PPC_TRANSACTIONAL_MEM=y
 CONFIG_KEXEC=y
+CONFIG_KEXEC_FILE=y
 CONFIG_IRQ_ALL_CPUS=y
 CONFIG_MEMORY_HOTPLUG=y
 CONFIG_MEMORY_HOTREMOVE=y
@@ -303,6 +304,7 @@ CONFIG_CRYPTO_TEST=m
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_HMAC=y
 CONFIG_CRYPTO_MICHAEL_MIC=m
+CONFIG_CRYPTO_SHA256=y
 CONFIG_CRYPTO_TGR192=m
 CONFIG_CRYPTO_WP512=m
 CONFIG_CRYPTO_ANUBIS=m
-- 
2.7.4
[PATCH v9 09/10] powerpc: Add purgatory for kexec_file_load implementation.
This purgatory implementation comes from kexec-tools, almost unchanged.

In order to use boot/string.S in ppc64 big endian mode, the functions
defined in it need to have dot symbols so that they can be called from C
code. Therefore, change the file to use a DOTSYM macro if one is defined,
so that the purgatory can add those dot symbols.

The changes made to the purgatory code relative to the version in
kexec-tools were:

The sha256_regions global variable was renamed to sha_regions to match
what kexec_file_load expects, and to use the sha256.c file from x86's
purgatory (this avoids adding yet another SHA-256 implementation).

The global variables in purgatory.c and purgatory-ppc64.c now use a
__section attribute to put them in the .data section instead of being
initialized to zero. It doesn't matter what their initial value is,
because they will be set by the kernel when preparing the kexec image.

Also, since we don't support loading a crashdump kernel via
kexec_file_load yet, the code related to that functionality has been
removed.

Finally, some checkpatch.pl warnings were fixed.

Signed-off-by: Thiago Jung Bauermann--- arch/powerpc/Makefile| 1 + arch/powerpc/boot/string.S | 67 +++-- arch/powerpc/purgatory/.gitignore| 2 + arch/powerpc/purgatory/Makefile | 33 +++ arch/powerpc/purgatory/console-ppc64.c | 37 +++ arch/powerpc/purgatory/crtsavres.S | 5 + arch/powerpc/purgatory/hvCall.S | 27 + arch/powerpc/purgatory/hvCall.h | 8 ++ arch/powerpc/purgatory/kexec-sha256.h| 11 +++ arch/powerpc/purgatory/ppc64_asm.h | 20 arch/powerpc/purgatory/printf.c | 164 +++ arch/powerpc/purgatory/purgatory-ppc64.c | 36 +++ arch/powerpc/purgatory/purgatory.c | 62 arch/powerpc/purgatory/purgatory.h | 14 +++ arch/powerpc/purgatory/sha256.c | 6 ++ arch/powerpc/purgatory/sha256.h | 1 + arch/powerpc/purgatory/string.S | 2 + arch/powerpc/purgatory/v2wrap.S | 134 + 18 files changed, 601 insertions(+), 29 deletions(-) diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index 617dece67924..5e7dcdaf93f5 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -249,6 +249,7 @@ core-y += arch/powerpc/kernel/ \ core-$(CONFIG_XMON)+= arch/powerpc/xmon/ core-$(CONFIG_KVM) += arch/powerpc/kvm/ core-$(CONFIG_PERF_EVENTS) += arch/powerpc/perf/ +core-$(CONFIG_KEXEC_FILE) += arch/powerpc/purgatory/ drivers-$(CONFIG_OPROFILE) += arch/powerpc/oprofile/ diff --git a/arch/powerpc/boot/string.S b/arch/powerpc/boot/string.S index acc9428f2789..b54bbad5f83d 100644 --- a/arch/powerpc/boot/string.S +++ b/arch/powerpc/boot/string.S @@ -11,9 +11,18 @@ #include "ppc_asm.h" +/* + * The ppc64 kexec purgatory uses this file and packages it in ELF64, + * so it needs dot symbols for the ppc64 big endian ABI. This macro + * allows it to create those symbols. 
+ */ +#ifndef DOTSYM +#define DOTSYM(a) a +#endif + .text - .globl strcpy -strcpy: + .globl DOTSYM(strcpy) +DOTSYM(strcpy): addir5,r3,-1 addir4,r4,-1 1: lbzur0,1(r4) @@ -22,8 +31,8 @@ strcpy: bne 1b blr - .globl strncpy -strncpy: + .globl DOTSYM(strncpy) +DOTSYM(strncpy): cmpwi 0,r5,0 beqlr mtctr r5 @@ -35,8 +44,8 @@ strncpy: bdnzf 2,1b/* dec ctr, branch if ctr != 0 && !cr0.eq */ blr - .globl strcat -strcat: + .globl DOTSYM(strcat) +DOTSYM(strcat): addir5,r3,-1 addir4,r4,-1 1: lbzur0,1(r5) @@ -49,8 +58,8 @@ strcat: bne 1b blr - .globl strchr -strchr: + .globl DOTSYM(strchr) +DOTSYM(strchr): addir3,r3,-1 1: lbzur0,1(r3) cmpw0,r0,r4 @@ -60,8 +69,8 @@ strchr: li r3,0 blr - .globl strcmp -strcmp: + .globl DOTSYM(strcmp) +DOTSYM(strcmp): addir5,r3,-1 addir4,r4,-1 1: lbzur3,1(r5) @@ -72,8 +81,8 @@ strcmp: beq 1b blr - .globl strncmp -strncmp: + .globl DOTSYM(strncmp) +DOTSYM(strncmp): mtctr r5 addir5,r3,-1 addir4,r4,-1 @@ -85,8 +94,8 @@ strncmp: bdnzt eq,1b blr - .globl strlen -strlen: + .globl DOTSYM(strlen) +DOTSYM(strlen): addir4,r3,-1 1: lbzur0,1(r4) cmpwi 0,r0,0 @@ -94,8 +103,8 @@ strlen: subfr3,r3,r4 blr - .globl memset -memset: + .globl DOTSYM(memset) +DOTSYM(memset): rlwimi r4,r4,8,16,23 rlwimi r4,r4,16,0,15 addir6,r3,-4 @@ -120,14 +129,14 @@ memset: bdnz8b
[PATCH v9 08/10] powerpc: Add support for loading ELF kernels with kexec_file_load.
This uses all the infrastructure built up by the previous patches in the series to load an ELF vmlinux file and an initrd. It uses the flattened device tree at initial_boot_params as a base and adjusts memory reservations and its /chosen node for the next kernel. [a...@linux-foundation.org: coding-style fixes] Signed-off-by: Thiago Jung BauermannSigned-off-by: Andrew Morton --- arch/powerpc/include/asm/kexec.h| 12 + arch/powerpc/kernel/Makefile| 3 +- arch/powerpc/kernel/kexec_elf_64.c | 280 +++ arch/powerpc/kernel/machine_kexec_file_64.c | 338 +++- 4 files changed, 630 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index eca2f975bf44..4497db7555b0 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -91,6 +91,18 @@ static inline bool kdump_in_progress(void) return crashing_cpu >= 0; } +#ifdef CONFIG_KEXEC_FILE +extern struct kexec_file_ops kexec_elf64_ops; + +int setup_purgatory(struct kimage *image, const void *slave_code, + const void *fdt, unsigned long kernel_load_addr, + unsigned long fdt_load_addr, unsigned long stack_top, + int debug); +int setup_new_fdt(void *fdt, unsigned long initrd_load_addr, + unsigned long initrd_len, const char *cmdline); +bool find_debug_console(const void *fdt); +#endif /* CONFIG_KEXEC_FILE */ + #else /* !CONFIG_KEXEC_CORE */ static inline void crash_kexec_secondary(struct pt_regs *regs) { } diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index de14b7eb11bb..424b13b1b2b0 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -109,7 +109,8 @@ obj-$(CONFIG_PCI) += pci_$(BITS).o $(pci64-y) \ obj-$(CONFIG_PCI_MSI) += msi.o obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o crash.o \ machine_kexec_$(BITS).o -obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file_$(BITS).o elf_util.o +obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file_$(BITS).o elf_util.o \ + kexec_elf_$(BITS).o obj-$(CONFIG_AUDIT)+= audit.o 
obj64-$(CONFIG_AUDIT) += compat_audit.o diff --git a/arch/powerpc/kernel/kexec_elf_64.c b/arch/powerpc/kernel/kexec_elf_64.c new file mode 100644 index ..dc29e0131b76 --- /dev/null +++ b/arch/powerpc/kernel/kexec_elf_64.c @@ -0,0 +1,280 @@ +/* + * Load ELF vmlinux file for the kexec_file_load syscall. + * + * Copyright (C) 2004 Adam Litke (a...@us.ibm.com) + * Copyright (C) 2004 IBM Corp. + * Copyright (C) 2005 R Sharada (shar...@in.ibm.com) + * Copyright (C) 2006 Mohan Kumar M (mo...@in.ibm.com) + * Copyright (C) 2016 IBM Corporation + * + * Based on kexec-tools' kexec-elf-exec.c and kexec-elf-ppc64.c. + * Heavily modified for the kernel by + * Thiago Jung Bauermann . + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation (version 2 of the License). + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#define pr_fmt(fmt)"kexec_elf: " fmt + +#include +#include +#include +#include +#include +#include +#include + +#define PURGATORY_STACK_SIZE (16 * 1024) + +/** + * build_elf_exec_info - read ELF executable and check that we can use it + */ +static int build_elf_exec_info(const char *buf, size_t len, struct elfhdr *ehdr, + struct elf_info *elf_info) +{ + int i; + int ret; + + ret = elf_read_from_buffer(buf, len, ehdr, elf_info); + if (ret) + return ret; + + /* Big endian vmlinux has type ET_DYN. */ + if (ehdr->e_type != ET_EXEC && ehdr->e_type != ET_DYN) { + pr_err("Not an ELF executable.\n"); + goto error; + } else if (!elf_info->proghdrs) { + pr_err("No ELF program header.\n"); + goto error; + } + + for (i = 0; i < ehdr->e_phnum; i++) { + /* +* Kexec does not support loading interpreters. 
+		 * In addition this check keeps us from attempting
+		 * to kexec ordinary executables.
+		 */
+		if (elf_info->proghdrs[i].p_type == PT_INTERP) {
+			pr_err("Requires an ELF interpreter.\n");
+			goto error;
+		}
+	}
+
+	return 0;
+error:
+	elf_free_info(elf_info);
+	return -ENOEXEC;
+}
+
+static int
[PATCH v9 07/10] powerpc: Add functions to read ELF files of any endianness.
A little endian kernel might need to kexec a big endian kernel (the opposite is less likely but could happen as well), so we can't just cast the buffer with the binary to ELF structs and use them as is done elsewhere. This patch adds functions which do byte-swapping as necessary when populating the ELF structs. These functions will be used in the next patch in the series. Signed-off-by: Thiago Jung Bauermann--- arch/powerpc/include/asm/elf_util.h | 21 ++ arch/powerpc/kernel/elf_util.c | 418 2 files changed, 439 insertions(+) diff --git a/arch/powerpc/include/asm/elf_util.h b/arch/powerpc/include/asm/elf_util.h index 1df232f65ec8..3dbad8cc7179 100644 --- a/arch/powerpc/include/asm/elf_util.h +++ b/arch/powerpc/include/asm/elf_util.h @@ -19,6 +19,18 @@ #include +struct elf_info { + /* +* Where the ELF binary contents are kept. +* Memory managed by the user of the struct. +*/ + const char *buffer; + + const struct elfhdr *ehdr; + const struct elf_phdr *proghdrs; + struct elf_shdr *sechdrs; +}; + /* * r2 is the TOC pointer: it actually points 0x8000 into the TOC (this * gives the value maximum span in an instruction which uses a signed @@ -40,4 +52,13 @@ int elf64_apply_relocate_add_item(const Elf64_Shdr *sechdrs, const char *strtab, unsigned long my_r2, const char *obj_name, struct module *me); +static inline bool elf_is_elf_file(const struct elfhdr *ehdr) +{ + return memcmp(ehdr->e_ident, ELFMAG, SELFMAG) == 0; +} + +int elf_read_from_buffer(const char *buf, size_t len, struct elfhdr *ehdr, +struct elf_info *elf_info); +void elf_free_info(struct elf_info *elf_info); + #endif /* _ASM_POWERPC_ELF_UTIL_H */ diff --git a/arch/powerpc/kernel/elf_util.c b/arch/powerpc/kernel/elf_util.c index ffa68cd6fb99..e57e7397f65c 100644 --- a/arch/powerpc/kernel/elf_util.c +++ b/arch/powerpc/kernel/elf_util.c @@ -16,7 +16,24 @@ * GNU General Public License for more details. 
  */
 
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include
 #include
+#include
+
+#if ELF_CLASS == ELFCLASS32
+#define elf_addr_to_cpu	elf32_to_cpu
+
+#ifndef Elf_Rel
+#define Elf_Rel		Elf32_Rel
+#endif /* Elf_Rel */
+#else /* ELF_CLASS == ELFCLASS32 */
+#define elf_addr_to_cpu	elf64_to_cpu
+
+#ifndef Elf_Rel
+#define Elf_Rel		Elf64_Rel
+#endif /* Elf_Rel */
 
 /**
  * elf_toc_section - find the toc section in the file with the given ELF headers
@@ -44,3 +61,404 @@ unsigned int elf_toc_section(const struct elfhdr *ehdr,
 
 	return 0;
 }
+
+static uint64_t elf64_to_cpu(const struct elfhdr *ehdr, uint64_t value)
+{
+	if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
+		value = le64_to_cpu(value);
+	else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
+		value = be64_to_cpu(value);
+
+	return value;
+}
+#endif /* ELF_CLASS == ELFCLASS32 */
+
+static uint16_t elf16_to_cpu(const struct elfhdr *ehdr, uint16_t value)
+{
+	if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
+		value = le16_to_cpu(value);
+	else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
+		value = be16_to_cpu(value);
+
+	return value;
+}
+
+static uint32_t elf32_to_cpu(const struct elfhdr *ehdr, uint32_t value)
+{
+	if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
+		value = le32_to_cpu(value);
+	else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
+		value = be32_to_cpu(value);
+
+	return value;
+}
+
+/**
+ * elf_is_ehdr_sane - check that it is safe to use the ELF header
+ * @buf_len:	size of the buffer in which the ELF file is loaded.
+ */ +static bool elf_is_ehdr_sane(const struct elfhdr *ehdr, size_t buf_len) +{ + if (ehdr->e_phnum > 0 && ehdr->e_phentsize != sizeof(struct elf_phdr)) { + pr_debug("Bad program header size.\n"); + return false; + } else if (ehdr->e_shnum > 0 && + ehdr->e_shentsize != sizeof(struct elf_shdr)) { + pr_debug("Bad section header size.\n"); + return false; + } else if (ehdr->e_ident[EI_VERSION] != EV_CURRENT || + ehdr->e_version != EV_CURRENT) { + pr_debug("Unknown ELF version.\n"); + return false; + } + + if (ehdr->e_phoff > 0 && ehdr->e_phnum > 0) { + size_t phdr_size; + + /* +* e_phnum is at most 65535 so calculating the size of the +* program header cannot overflow. +*/ + phdr_size = sizeof(struct elf_phdr) * ehdr->e_phnum; + + /* Sanity check the program header table location. */ + if (ehdr->e_phoff + phdr_size < ehdr->e_phoff) { + pr_debug("Program headers at invalid
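Two ideas from this patch can be sketched in portable userspace C: choosing the byte order from the file's EI_DATA byte instead of the host's, and rejecting a program header table whose end offset wraps around. The sketch below is an illustration under stated assumptions, not the patch's code. demo_read16() and demo_phdrs_in_bounds() are hypothetical names, the sketch reads from raw bytes instead of calling le16_to_cpu()/be16_to_cpu(), and the final "end <= buf_len" comparison is an assumed follow-up check, since the quoted hunk is cut off after the wrap-around test.

```c
#include <stddef.h>
#include <stdint.h>

/* EI_DATA values from the ELF specification. */
#define DEMO_ELFDATA2LSB 1
#define DEMO_ELFDATA2MSB 2

/* Decode a 16-bit field using the byte order the ELF file declares. */
static uint16_t demo_read16(unsigned char ei_data, const uint8_t raw[2])
{
	if (ei_data == DEMO_ELFDATA2MSB)
		return (uint16_t)((raw[0] << 8) | raw[1]);
	return (uint16_t)(raw[0] | (raw[1] << 8));
}

/*
 * Check that a program header table lies inside the buffer.
 * phentsize and phnum are 16-bit, so their product cannot overflow
 * 64 bits; only the addition of the offset can wrap.
 */
static int demo_phdrs_in_bounds(uint64_t phoff, uint16_t phentsize,
				uint16_t phnum, uint64_t buf_len)
{
	uint64_t table_size = (uint64_t)phentsize * phnum;
	uint64_t end = phoff + table_size;

	if (end < phoff)	/* offset + size wrapped around */
		return 0;
	return end <= buf_len;	/* assumed bounds check for this sketch */
}
```

Testing "end < phoff" after the addition is the usual way to detect unsigned wrap-around, and it is the same comparison the patch applies to e_phoff.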
[PATCH v9 06/10] powerpc: Implement kexec_file_load.
Add arch-specific functions needed by generic kexec_file code.

Also, module_64.c's apply_relocate_add and kexec_file's
arch_kexec_apply_relocations_add have slightly different needs, so
elf64_apply_relocate_add_item needs to be adapted to accommodate both:

When apply_relocate_add is called, the module is already loaded at its
final location in memory so the place where the relocation needs to be
applied and its address in the module's memory are the same.

This is not the case for kexec's purgatory, because it is stored in a buffer
and will only be copied to its final location in memory right before being
executed. Therefore, it needs to be relocated while still in its buffer. In
this case, the place where the relocation needs to be applied is different
from its address in the purgatory's memory.

So we add an address argument to elf64_apply_relocate_add_item to specify
the final address of the relocation in memory.

We also add more relocation types that are used by the purgatory.

Signed-off-by: Josh Sklar
Signed-off-by: Thiago Jung Bauermann
---
 arch/powerpc/Kconfig                        | 13 ++
 arch/powerpc/include/asm/elf_util.h         | 43 +
 arch/powerpc/include/asm/systbl.h           | 1 +
 arch/powerpc/include/asm/unistd.h           | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h      | 1 +
 arch/powerpc/kernel/Makefile                | 1 +
 arch/powerpc/kernel/elf_util.c              | 46 ++
 arch/powerpc/kernel/machine_kexec_file_64.c | 245
 arch/powerpc/kernel/module_64.c             | 71 ++--
 9 files changed, 406 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6cb59c6e5ba4..897d0f14447d 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -455,6 +455,19 @@ config KEXEC
 	  interface is strongly in flux, so no good recommendation can be
 	  made.

+config KEXEC_FILE
+	bool "kexec file based system call"
+	select KEXEC_CORE
+	select BUILD_BIN2C
+	depends on PPC64
+	depends on CRYPTO=y
+	depends on CRYPTO_SHA256=y
+	help
+	  This is a new version of the kexec system call. This call is
+	  file based and takes in file descriptors as system call arguments
+	  for kernel and initramfs as opposed to a list of segments as is the
+	  case for the older kexec call.
+
 config RELOCATABLE
 	bool "Build a relocatable kernel"
 	depends on (PPC64 && !COMPILE_TEST) || (FLATMEM && (44x || FSL_BOOKE))
diff --git a/arch/powerpc/include/asm/elf_util.h b/arch/powerpc/include/asm/elf_util.h
new file mode 100644
index ..1df232f65ec8
--- /dev/null
+++ b/arch/powerpc/include/asm/elf_util.h
@@ -0,0 +1,43 @@
+/*
+ * Utility functions to work with ELF files.
+ *
+ * Copyright (C) 2016, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef _ASM_POWERPC_ELF_UTIL_H
+#define _ASM_POWERPC_ELF_UTIL_H
+
+#include
+
+/*
+ * r2 is the TOC pointer: it actually points 0x8000 into the TOC (this
+ * gives the value maximum span in an instruction which uses a signed
+ * offset)
+ */
+static inline unsigned long elf_my_r2(const struct elf_shdr *sechdrs,
+				      unsigned int toc_section)
+{
+	return sechdrs[toc_section].sh_addr + 0x8000;
+}
+
+unsigned int elf_toc_section(const struct elfhdr *ehdr,
+			     const struct elf_shdr *sechdrs);
+
+int elf64_apply_relocate_add_item(const Elf64_Shdr *sechdrs, const char *strtab,
+				  const Elf64_Rela *rela, const Elf64_Sym *sym,
+				  unsigned long *location,
+				  unsigned long address, unsigned long value,
+				  unsigned long my_r2, const char *obj_name,
+				  struct module *me);
+
+#endif /* _ASM_POWERPC_ELF_UTIL_H */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 2fc5d4db503c..4b369d83fe9c 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -386,3 +386,4 @@ SYSCALL(mlock2)
 SYSCALL(copy_file_range)
 COMPAT_SYS_SPU(preadv2)
 COMPAT_SYS_SPU(pwritev2)
+SYSCALL(kexec_file_load)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index cf12c580f6b2..a01e97d3f305 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
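The difference between the place where a relocation is written and the address it will have at run time can be modeled in user space. The sketch below is not the kernel's elf64_apply_relocate_add_item: apply_rel32, toc_pointer and read_u32 are hypothetical names, and the PC-relative formula is assumed to be the usual "symbol value minus place" form.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical model of a 32-bit PC-relative relocation.
 * `location` is where the bytes are patched right now (inside a buffer);
 * `address` is where those bytes will live when the code finally runs.
 * For a loaded module the two coincide; for a purgatory image that is
 * still in its staging buffer they differ, hence the extra argument.
 */
static void apply_rel32(void *location, uint64_t address, uint64_t value)
{
	uint32_t rel = (uint32_t)(value - address);

	assert(location != NULL);
	memcpy(location, &rel, sizeof(rel));	/* avoids unaligned stores */
}

/* Mirrors elf_my_r2() from the patch: r2 points 0x8000 into the TOC. */
static uint64_t toc_pointer(uint64_t toc_sh_addr)
{
	return toc_sh_addr + 0x8000;
}

/* Helper for callers: read back a patched 32-bit field. */
static uint32_t read_u32(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```

A caller that patches offset 4 of a staging buffer destined for address 0x10000 would pass the buffer pointer as location and 0x10004 as address; the stored displacement is then relative to the final address, not to wherever the buffer happens to sit.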
[PATCH v9 05/10] powerpc: Factor out relocation code in module_64.c
The kexec_file_load system call needs to relocate the purgatory, so
factor out the module relocation code so that it can be shared.

This patch's purpose is to move the ELF relocation logic from
apply_relocate_add to the new function elf64_apply_relocate_add_item
with as few changes as possible. The following changes were needed:

elf64_apply_relocate_add_item takes a my_r2 argument because the kexec
code can't use the my_r2 function since it doesn't have a struct module
to pass to it. For the same reason, it also takes an obj_name argument to
use in error messages. It still takes a pointer to struct module argument,
but kexec code can just pass NULL because except for the TOC symbol, the
purgatory doesn't have undefined symbols so the module pointer isn't used.

Apart from what is described in the paragraph above, the code has no
functional changes.

Suggested-by: Michael Ellerman
Signed-off-by: Thiago Jung Bauermann
---
 arch/powerpc/kernel/module_64.c | 344 +---
 1 file changed, 182 insertions(+), 162 deletions(-)

diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
index 183368e008cf..61baad036639 100644
--- a/arch/powerpc/kernel/module_64.c
+++ b/arch/powerpc/kernel/module_64.c
@@ -507,6 +507,181 @@ static int restore_r2(u32 *instruction, struct module *me)
 	return 1;
 }

+static int elf64_apply_relocate_add_item(const Elf64_Shdr *sechdrs,
+					 const char *strtab,
+					 const Elf64_Rela *rela,
+					 const Elf64_Sym *sym,
+					 unsigned long *location,
+					 unsigned long value,
+					 unsigned long my_r2,
+					 const char *obj_name,
+					 struct module *me)
+{
+	switch (ELF64_R_TYPE(rela->r_info)) {
+	case R_PPC64_ADDR32:
+		/* Simply set it */
+		*(u32 *)location = value;
+		break;
+
+	case R_PPC64_ADDR64:
+		/* Simply set it */
+		*(unsigned long *)location = value;
+		break;
+
+	case R_PPC64_TOC:
+		*(unsigned long *)location = my_r2;
+		break;
+
+	case R_PPC64_TOC16:
+		/* Subtract TOC pointer */
+		value -= my_r2;
+		if (value + 0x8000 > 0xffff) {
+			pr_err("%s: bad TOC16 relocation (0x%lx)\n",
+			       obj_name, value);
+			return -ENOEXEC;
+		}
+		*((uint16_t *) location)
+			= (*((uint16_t *) location) & ~0xffff)
+			| (value & 0xffff);
+		break;
+
+	case R_PPC64_TOC16_LO:
+		/* Subtract TOC pointer */
+		value -= my_r2;
+		*((uint16_t *) location)
+			= (*((uint16_t *) location) & ~0xffff)
+			| (value & 0xffff);
+		break;
+
+	case R_PPC64_TOC16_DS:
+		/* Subtract TOC pointer */
+		value -= my_r2;
+		if ((value & 3) != 0 || value + 0x8000 > 0xffff) {
+			pr_err("%s: bad TOC16_DS relocation (0x%lx)\n",
+			       obj_name, value);
+			return -ENOEXEC;
+		}
+		*((uint16_t *) location)
+			= (*((uint16_t *) location) & ~0xfffc)
+			| (value & 0xfffc);
+		break;
+
+	case R_PPC64_TOC16_LO_DS:
+		/* Subtract TOC pointer */
+		value -= my_r2;
+		if ((value & 3) != 0) {
+			pr_err("%s: bad TOC16_LO_DS relocation (0x%lx)\n",
+			       obj_name, value);
+			return -ENOEXEC;
+		}
+		*((uint16_t *) location)
+			= (*((uint16_t *) location) & ~0xfffc)
+			| (value & 0xfffc);
+		break;
+
+	case R_PPC64_TOC16_HA:
+		/* Subtract TOC pointer */
+		value -= my_r2;
+		value = ((value + 0x8000) >> 16);
+		*((uint16_t *) location)
+			= (*((uint16_t *) location) & ~0xffff)
+			| (value & 0xffff);
+		break;
+
+	case R_PPC_REL24:
+		/* FIXME: Handle weak symbols here --RR */
+		if (sym->st_shndx == SHN_UNDEF) {
+			/* External: go via stub */
+			value = stub_for_addr(sechdrs, value, me);
+			if (!value)
+				return -ENOENT;
+			if (!restore_r2((u32 *)location + 1, me))
+
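The TOC16_HA and TOC16_LO cases above split one TOC-relative offset into two 16-bit halves. The following user-space sketch works through that arithmetic; the function names are hypothetical, the 16-bit masks are assumed to be 0xffff, and the recombination step models an instruction pair that sign-extends both halves, which is the reason for the 0x8000 adjustment.

```c
#include <assert.h>
#include <stdint.h>

/* Low 16 bits of the TOC-relative offset, as in the TOC16_LO case. */
static uint16_t toc16_lo(uint64_t value, uint64_t my_r2)
{
	return (uint16_t)((value - my_r2) & 0xffff);
}

/*
 * High-adjusted 16 bits, as in the TOC16_HA case. Adding 0x8000 before
 * the shift compensates for the low half being sign-extended later.
 */
static uint16_t toc16_ha(uint64_t value, uint64_t my_r2)
{
	return (uint16_t)(((value - my_r2) + 0x8000) >> 16);
}

/* What a high/low instruction pair computes: both halves sign-extended. */
static int64_t toc16_recombine(uint16_t ha, uint16_t lo)
{
	return (int64_t)(int16_t)ha * 65536 + (int16_t)lo;
}

/* Range check of the plain TOC16 case: offset must fit in signed 16 bits. */
static int toc16_fits(uint64_t value, uint64_t my_r2)
{
	return (value - my_r2) + 0x8000 <= 0xffff;
}

/* Invariant: splitting and recombining yields the original offset. */
static void toc16_check_roundtrip(uint64_t value, uint64_t my_r2)
{
	assert(toc16_recombine(toc16_ha(value, my_r2), toc16_lo(value, my_r2))
	       == (int64_t)(value - my_r2));
}
```

For an offset of 0x18000 the low half is 0x8000, which sign-extends to -32768, so the high half has to be 2 rather than 1 for the sum to come out right.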
[PATCH v9 04/10] powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE instead.
Commit 2965faa5e03d ("kexec: split kexec_load syscall from kexec core
code") introduced CONFIG_KEXEC_CORE so that CONFIG_KEXEC means whether
the kexec_load system call should be compiled-in and CONFIG_KEXEC_FILE
means whether the kexec_file_load system call should be compiled-in.
These options can be set independently from each other.

Since until now powerpc only supported kexec_load, CONFIG_KEXEC and
CONFIG_KEXEC_CORE were synonyms. That is not the case anymore, so we
need to make a distinction. Almost all places where CONFIG_KEXEC was
being used should be using CONFIG_KEXEC_CORE instead, since
kexec_file_load also needs that code compiled in.

Signed-off-by: Thiago Jung Bauermann
---
 arch/powerpc/Kconfig                          | 2 +-
 arch/powerpc/include/asm/debug.h              | 2 +-
 arch/powerpc/include/asm/kexec.h              | 6 +++---
 arch/powerpc/include/asm/machdep.h            | 4 ++--
 arch/powerpc/include/asm/smp.h                | 2 +-
 arch/powerpc/kernel/Makefile                  | 4 ++--
 arch/powerpc/kernel/head_64.S                 | 2 +-
 arch/powerpc/kernel/misc_32.S                 | 2 +-
 arch/powerpc/kernel/misc_64.S                 | 6 +++---
 arch/powerpc/kernel/prom.c                    | 2 +-
 arch/powerpc/kernel/setup_64.c                | 4 ++--
 arch/powerpc/kernel/smp.c                     | 6 +++---
 arch/powerpc/kernel/traps.c                   | 2 +-
 arch/powerpc/platforms/85xx/corenet_generic.c | 2 +-
 arch/powerpc/platforms/85xx/smp.c             | 8
 arch/powerpc/platforms/cell/spu_base.c        | 2 +-
 arch/powerpc/platforms/powernv/setup.c        | 6 +++---
 arch/powerpc/platforms/ps3/setup.c            | 4 ++--
 arch/powerpc/platforms/pseries/Makefile       | 2 +-
 arch/powerpc/platforms/pseries/setup.c        | 4 ++--
 20 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 65fba4c34cd7..6cb59c6e5ba4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -489,7 +489,7 @@ config CRASH_DUMP

 config FA_DUMP
 	bool "Firmware-assisted dump"
-	depends on PPC64 && PPC_RTAS && CRASH_DUMP && KEXEC
+	depends on PPC64 && PPC_RTAS && CRASH_DUMP && KEXEC_CORE
 	help
 	  A robust mechanism to get reliable kernel crash dump with
 	  assistance from firmware. This approach does not use kexec,
diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index a954e4975049..86308f177f2d 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -10,7 +10,7 @@ struct pt_regs;

 extern struct dentry *powerpc_debugfs_root;

-#if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
+#if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC_CORE)

 extern int (*__debugger)(struct pt_regs *regs);
 extern int (*__debugger_ipi)(struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index a46f5f45570c..eca2f975bf44 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -53,7 +53,7 @@

 typedef void (*crash_shutdown_t)(void);

-#ifdef CONFIG_KEXEC
+#ifdef CONFIG_KEXEC_CORE

 /*
  * This function is responsible for capturing register states if coming
@@ -91,7 +91,7 @@ static inline bool kdump_in_progress(void)
 	return crashing_cpu >= 0;
 }

-#else /* !CONFIG_KEXEC */
+#else /* !CONFIG_KEXEC_CORE */
 static inline void crash_kexec_secondary(struct pt_regs *regs) { }

 static inline int overlaps_crashkernel(unsigned long start, unsigned long size)
@@ -116,7 +116,7 @@ static inline bool kdump_in_progress(void)
 	return false;
 }

-#endif /* CONFIG_KEXEC */
+#endif /* CONFIG_KEXEC_CORE */
 #endif /* ! __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_KEXEC_H */
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index e02cbc6a6c70..5011b69107a7 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -183,7 +183,7 @@ struct machdep_calls {
 	 */
 	void (*machine_shutdown)(void);

-#ifdef CONFIG_KEXEC
+#ifdef CONFIG_KEXEC_CORE
 	void (*kexec_cpu_down)(int crash_shutdown, int secondary);

 	/* Called to do what every setup is needed on image and the
@@ -198,7 +198,7 @@ struct machdep_calls {
 	 * no return.
 	 */
 	void (*machine_kexec)(struct kimage *image);
-#endif /* CONFIG_KEXEC */
+#endif /* CONFIG_KEXEC_CORE */

 #ifdef CONFIG_SUSPEND
 	/* These are called to disable and enable, respectively, IRQs when
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 0d02c11dc331..32db16d2e7ad 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -176,7 +176,7 @@ static inline void set_hard_smp_processor_id(int cpu, int phys)
 #endif /* !CONFIG_SMP */
 #endif /* !CONFIG_PPC64 */

-#if defined(CONFIG_PPC64) &&
[PATCH v9 03/10] kexec_file: Factor out kexec_locate_mem_hole from kexec_add_buffer.
kexec_locate_mem_hole will be used by the PowerPC kexec_file_load
implementation to find free memory for the purgatory stack.

Signed-off-by: Thiago Jung Bauermann
Acked-by: Dave Young
---
 include/linux/kexec.h | 1 +
 kernel/kexec_file.c   | 25 -
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 437ef1b47428..a33f63351f86 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -176,6 +176,7 @@ struct kexec_buf {
 int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
 			       int (*func)(u64, u64, void *));
 extern int kexec_add_buffer(struct kexec_buf *kbuf);
+int kexec_locate_mem_hole(struct kexec_buf *kbuf);
 #endif /* CONFIG_KEXEC_FILE */

 struct kimage {
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index efd2c094af7e..0c2df7f73792 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -450,6 +450,23 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
 }

 /**
+ * kexec_locate_mem_hole - find free memory for the purgatory or the next kernel
+ * @kbuf:	Parameters for the memory search.
+ *
+ * On success, kbuf->mem will have the start address of the memory region found.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int kexec_locate_mem_hole(struct kexec_buf *kbuf)
+{
+	int ret;
+
+	ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
+
+	return ret == 1 ? 0 : -EADDRNOTAVAIL;
+}
+
+/**
  * kexec_add_buffer - place a buffer in a kexec segment
  * @kbuf:	Buffer contents and memory parameters.
  *
@@ -489,11 +506,9 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
 	kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE);

 	/* Walk the RAM ranges and allocate a suitable range for the buffer */
-	ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
-	if (ret != 1) {
-		/* A suitable memory range could not be found for buffer */
-		return -EADDRNOTAVAIL;
-	}
+	ret = kexec_locate_mem_hole(kbuf);
+	if (ret)
+		return ret;

 	/* Found a suitable memory range */
 	ksegment = &kbuf->image->segment[kbuf->image->nr_segments];
--
2.7.4
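The return-value mapping in kexec_locate_mem_hole (a walk result of 1 means "found" and becomes 0, anything else becomes -EADDRNOTAVAIL) can be illustrated with a small user-space model. Everything prefixed model_ below is hypothetical, the structure is a cut-down stand-in for struct kexec_buf, and only a bottom-up, alignment-aware search is sketched.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for the kernel's struct kexec_buf. */
struct model_kbuf {
	uint64_t mem;		/* out: start of the hole that was found */
	uint64_t memsz;		/* in: number of bytes needed */
	uint64_t buf_align;	/* in: power-of-two alignment */
};

struct model_range { uint64_t start, end; };	/* inclusive bounds */

/* Callback contract: return 1 when a hole was found, 0 to keep walking. */
static int model_hole_cb(uint64_t start, uint64_t end, void *arg)
{
	struct model_kbuf *kbuf = arg;
	uint64_t base;

	assert(kbuf->buf_align && !(kbuf->buf_align & (kbuf->buf_align - 1)));
	base = (start + kbuf->buf_align - 1) & ~(kbuf->buf_align - 1);
	if (base > end || end - base + 1 < kbuf->memsz)
		return 0;
	kbuf->mem = base;
	return 1;
}

/* Same return mapping as kexec_locate_mem_hole(): 1 -> 0, else error. */
static int model_locate_mem_hole(struct model_kbuf *kbuf,
				 const struct model_range *ranges, size_t n)
{
	int ret = 0;

	for (size_t i = 0; i < n && !ret; i++)
		ret = model_hole_cb(ranges[i].start, ranges[i].end, kbuf);

	return ret == 1 ? 0 : -EADDRNOTAVAIL;
}
```

Because the helper only reports where a hole is and does not create a segment, a caller that needs scratch memory (such as a stack) can use the address in mem directly.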
[PATCH v9 02/10] kexec_file: Change kexec_add_buffer to take kexec_buf as argument.
This is done to simplify the kexec_add_buffer argument list.
Adapt all callers to set up a kexec_buf to pass to kexec_add_buffer.

In addition, change the type of kexec_buf.buffer from char * to void *.
There is no particular reason for it to be a char *, and the change
allows us to get rid of 3 existing casts to char * in the code.

Signed-off-by: Thiago Jung Bauermann
Acked-by: Dave Young
Acked-by: Balbir Singh
---
 arch/x86/kernel/crash.c           | 37
 arch/x86/kernel/kexec-bzimage64.c | 48 +++--
 include/linux/kexec.h             | 8 +---
 kernel/kexec_file.c               | 88 ++-
 4 files changed, 87 insertions(+), 94 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 650830e39e3a..3741461c63a0 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -631,9 +631,9 @@ static int determine_backup_region(u64 start, u64 end, void *arg)

 int crash_load_segments(struct kimage *image)
 {
-	unsigned long src_start, src_sz, elf_sz;
-	void *elf_addr;
 	int ret;
+	struct kexec_buf kbuf = { .image = image, .buf_min = 0,
+				  .buf_max = ULONG_MAX, .top_down = false };

 	/*
 	 * Determine and load a segment for backup area. First 640K RAM
@@ -647,43 +647,44 @@ int crash_load_segments(struct kimage *image)
 	if (ret < 0)
 		return ret;

-	src_start = image->arch.backup_src_start;
-	src_sz = image->arch.backup_src_sz;
-
 	/* Add backup segment. */
-	if (src_sz) {
+	if (image->arch.backup_src_sz) {
+		kbuf.buffer = &crash_zero_bytes;
+		kbuf.bufsz = sizeof(crash_zero_bytes);
+		kbuf.memsz = image->arch.backup_src_sz;
+		kbuf.buf_align = PAGE_SIZE;
 		/*
 		 * Ideally there is no source for backup segment. This is
 		 * copied in purgatory after crash. Just add a zero filled
 		 * segment for now to make sure checksum logic works fine.
 		 */
-		ret = kexec_add_buffer(image, (char *)&crash_zero_bytes,
-				       sizeof(crash_zero_bytes), src_sz,
-				       PAGE_SIZE, 0, -1, 0,
-				       &image->arch.backup_load_addr);
+		ret = kexec_add_buffer(&kbuf);
 		if (ret)
 			return ret;
+		image->arch.backup_load_addr = kbuf.mem;
 		pr_debug("Loaded backup region at 0x%lx backup_start=0x%lx memsz=0x%lx\n",
-			 image->arch.backup_load_addr, src_start, src_sz);
+			 image->arch.backup_load_addr,
+			 image->arch.backup_src_start, kbuf.memsz);
 	}

 	/* Prepare elf headers and add a segment */
-	ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
+	ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz);
 	if (ret)
 		return ret;

-	image->arch.elf_headers = elf_addr;
-	image->arch.elf_headers_sz = elf_sz;
+	image->arch.elf_headers = kbuf.buffer;
+	image->arch.elf_headers_sz = kbuf.bufsz;

-	ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,
-			       ELF_CORE_HEADER_ALIGN, 0, -1, 0,
-			       &image->arch.elf_load_addr);
+	kbuf.memsz = kbuf.bufsz;
+	kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
+	ret = kexec_add_buffer(&kbuf);
 	if (ret) {
 		vfree((void *)image->arch.elf_headers);
 		return ret;
 	}
+	image->arch.elf_load_addr = kbuf.mem;
 	pr_debug("Loaded ELF headers at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
-		 image->arch.elf_load_addr, elf_sz, elf_sz);
+		 image->arch.elf_load_addr, kbuf.bufsz, kbuf.bufsz);

 	return ret;
 }
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 3407b148c240..d0a814a9d96a 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -331,17 +331,17 @@ static void *bzImage64_load(struct kimage *image, char *kernel,

 	struct setup_header *header;
 	int setup_sects, kern16_size, ret = 0;
-	unsigned long setup_header_size, params_cmdline_sz, params_misc_sz;
+	unsigned long setup_header_size, params_cmdline_sz;
 	struct boot_params *params;
 	unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
 	unsigned long purgatory_load_addr;
-	unsigned long kernel_bufsz, kernel_memsz, kernel_align;
-	char *kernel_buf;
 	struct bzimage64_data *ldata;
 	struct kexec_entry64_regs regs64;
 	void *stack;
 	unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
 	unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
+	struct
[PATCH v9 01/10] kexec_file: Allow arch-specific memory walking for kexec_add_buffer
Allow architectures to specify a different memory walking function for
kexec_add_buffer. x86 uses iomem to track reserved memory ranges, but
PowerPC uses the memblock subsystem.

Signed-off-by: Thiago Jung Bauermann
Acked-by: Dave Young
Acked-by: Balbir Singh
---
 include/linux/kexec.h   | 29 -
 kernel/kexec_file.c     | 30 ++
 kernel/kexec_internal.h | 16
 3 files changed, 50 insertions(+), 25 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 406c33dcae13..5e320ddaaa82 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -148,7 +148,34 @@ struct kexec_file_ops {
 	kexec_verify_sig_t *verify_sig;
 #endif
 };
-#endif
+
+/**
+ * struct kexec_buf - parameters for finding a place for a buffer in memory
+ * @image:	kexec image in which memory to search.
+ * @buffer:	Contents which will be copied to the allocated memory.
+ * @bufsz:	Size of @buffer.
+ * @mem:	On return will have address of the buffer in memory.
+ * @memsz:	Size for the buffer in memory.
+ * @buf_align:	Minimum alignment needed.
+ * @buf_min:	The buffer can't be placed below this address.
+ * @buf_max:	The buffer can't be placed above this address.
+ * @top_down:	Allocate from top of memory.
+ */
+struct kexec_buf {
+	struct kimage *image;
+	char *buffer;
+	unsigned long bufsz;
+	unsigned long mem;
+	unsigned long memsz;
+	unsigned long buf_align;
+	unsigned long buf_min;
+	unsigned long buf_max;
+	bool top_down;
+};
+
+int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
+			       int (*func)(u64, u64, void *));
+#endif /* CONFIG_KEXEC_FILE */

 struct kimage {
 	kimage_entry_t head;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 037c321c5618..f865674bff51 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -428,6 +428,27 @@ static int locate_mem_hole_callback(u64 start, u64 end, void *arg)
 	return locate_mem_hole_bottom_up(start, end, kbuf);
 }

+/**
+ * arch_kexec_walk_mem - call func(data) on free memory regions
+ * @kbuf:	Context info for the search. Also passed to @func.
+ * @func:	Function to call for each memory region.
+ *
+ * Return: The memory walk will stop when func returns a non-zero value
+ * and that value will be returned. If all free regions are visited without
+ * func returning non-zero, then zero will be returned.
+ */
+int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
+			       int (*func)(u64, u64, void *))
+{
+	if (kbuf->image->type == KEXEC_TYPE_CRASH)
+		return walk_iomem_res_desc(crashk_res.desc,
+					   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+					   crashk_res.start, crashk_res.end,
+					   kbuf, func);
+	else
+		return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
+}
+
 /*
  * Helper function for placing a buffer in a kexec segment. This assumes
  * that kexec_mutex is held.
@@ -474,14 +495,7 @@ int kexec_add_buffer(struct kimage *image, char *buffer, unsigned long bufsz,
 	kbuf->top_down = top_down;

 	/* Walk the RAM ranges and allocate a suitable range for the buffer */
-	if (image->type == KEXEC_TYPE_CRASH)
-		ret = walk_iomem_res_desc(crashk_res.desc,
-				IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
-				crashk_res.start, crashk_res.end, kbuf,
-				locate_mem_hole_callback);
-	else
-		ret = walk_system_ram_res(0, -1, kbuf,
-					  locate_mem_hole_callback);
+	ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
 	if (ret != 1) {
 		/* A suitable memory range could not be found for buffer */
 		return -EADDRNOTAVAIL;
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 0a52315d9c62..4cef7e4706b0 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -20,22 +20,6 @@ struct kexec_sha_region {
 	unsigned long len;
 };

-/*
- * Keeps track of buffer parameters as provided by caller for requesting
- * memory placement of buffer.
- */
-struct kexec_buf {
-	struct kimage *image;
-	char *buffer;
-	unsigned long bufsz;
-	unsigned long mem;
-	unsigned long memsz;
-	unsigned long buf_align;
-	unsigned long buf_min;
-	unsigned long buf_max;
-	bool top_down; /* allocate from top of memory hole */
-};
-
 void kimage_file_post_load_cleanup(struct kimage *image);
 #else /* CONFIG_KEXEC_FILE */
 static inline void kimage_file_post_load_cleanup(struct kimage *image) { }
--
2.7.4
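The walker contract documented for arch_kexec_walk_mem (stop at the first non-zero return from the callback and propagate it, return zero if every region was visited) does not depend on where the list of regions comes from. A minimal user-space sketch, assuming a plain array of ranges in place of iomem or memblock; all names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_range { uint64_t start, end; };	/* inclusive bounds */

typedef int (*model_walk_fn)(uint64_t start, uint64_t end, void *arg);

/*
 * Visit each range in order. Stop at the first non-zero return from
 * func and propagate it; return 0 if every range was visited.
 */
static int model_walk_mem(const struct model_range *ranges, size_t n,
			  void *arg, model_walk_fn func)
{
	assert(func != NULL);
	for (size_t i = 0; i < n; i++) {
		int ret = func(ranges[i].start, ranges[i].end, arg);

		if (ret)
			return ret;
	}
	return 0;
}

struct model_search {
	uint64_t min_size;	/* in: smallest acceptable region */
	uint64_t found_start;	/* out: start of the first match */
	unsigned int visited;	/* out: how many regions were inspected */
};

/* Example callback: first-fit by size. */
static int model_first_fit(uint64_t start, uint64_t end, void *arg)
{
	struct model_search *s = arg;

	s->visited++;
	if (end - start + 1 < s->min_size)
		return 0;	/* keep walking */
	s->found_start = start;
	return 1;		/* stop the walk */
}
```

An architecture-specific walker only has to honour this contract; the callback and its caller stay unchanged regardless of which subsystem supplies the ranges.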
[PATCH v9 00/10] kexec_file_load implementation for PowerPC
Hello,

This version has the following changes:

1. Rebased on v4.9-rc1. This should fix the conflicts in next and -mm that
   these patches were having with the s/CONFIG_WORD_SIZE/BITS/ change.

2. Changed patch "powerpc: Factor out relocation code in module_64.c" to
   make as few changes as possible to arch/powerpc/module_64.c and . This
   meant factoring out only the switch statement from apply_relocate_add
   into its own function, and keeping it in module_64.c.
   arch_kexec_apply_relocations_add and apply_relocate_add share a bit less
   code now. This addresses a concern expressed by Michael Ellerman that
   there were too many changes being made to the module loading code.

3. Reduced number of patches in the series by squashing one patch and
   redistributing the code in two other patches. They were smaller after
   the change above and didn't make that much sense on their own anymore.

4. Moved code implementing kexec_file_load to a new file instead of adding
   it to the existing arch/powerpc/machine_kexec_64.c.

5. Cleaned up the purgatory code a bit to fix checkpatch warnings: removed
   unused code related to crashdump support and added __section(".data") to
   global variables instead of initializing them to zero to avoid them
   going to .bss.

The changes in 3. and 4. made the patches somewhat different from v8 but
it's just code moving from one patch to another, or to a new file. Apart
from what is described in 2. and 5., there's very little actual code
changed.

The patches are all bisectable, each one was tested with several kernel
configs. The gory details are in the detailed changelog at the end of this
email.

Original cover letter:

This patch series implements the kexec_file_load system call on PowerPC.

This system call moves the reading of the kernel, initrd and the device tree
from the userspace kexec tool to the kernel. This is needed if you want to
do one or both of the following:

1. only allow loading of signed kernels.
2. "measure" (i.e., record the hashes of) the kernel, initrd, kernel
   command line and other boot inputs for the Integrity Measurement
   Architecture subsystem.

The above are the functions kexec already has built into kexec_file_load.
Yesterday I posted a set of patches which allows a third feature:

3. have IMA pass on its event log (where integrity measurements are
   registered) across kexec to the second kernel, so that the event
   history is preserved.

Because OpenPower uses an intermediary Linux instance as a boot loader
(skiroot), feature 1 is needed to implement secure boot for the platform,
while features 2 and 3 are needed to implement trusted boot.

This patch series starts by removing an x86 assumption from kexec_file:
kexec_add_buffer uses iomem to find reserved memory ranges, but PowerPC
uses the memblock subsystem. A hook is added so that each arch can
specify how memory ranges can be found.

Also, the memory-walking logic in kexec_add_buffer is useful in this
implementation to find a free area for the purgatory's stack, so the
next patch moves that logic to kexec_locate_mem_hole.

The kexec_file_load system call needs to apply relocations to the
purgatory but adding code for that would duplicate functionality with
the module loading mechanism, which also needs to apply relocations to
the kernel modules. Therefore, this patch series factors out the module
relocation code so that it can be shared.

One thing that is still missing is crashkernel support, which I intend
to submit shortly. For now, arch_kexec_kernel_image_probe rejects crash
kernels.

This code is based on kexec-tools, but with many modifications to adapt
it to the kernel environment and facilities. The exception is the
purgatory, which only has minimal changes.

Changes for v9:
- Rebased on top of v4.9-rc1
- Patch "powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE instead."
  - Fixed conflict with patch renaming CONFIG_WORD_SIZE to BITS.
- Patch "powerpc: Factor out relocation code from module_64.c to elf_util_64.c."
  - Retitled to "powerpc: Factor out relocation code in module_64.c"
  - Fixed conflict with patch renaming CONFIG_WORD_SIZE to BITS.
  - Put relocation code in function elf64_apply_relocate_add_item in
    module_64.c itself instead of moving it to elf_util_64.c.
  - Only factored out the switch statement to apply the relocations
    instead of the whole logic to iterate through all of the relocations.
  - There's no elf_util_64.c anymore.
  - Doesn't change anymore.
  - Doesn't change arch/powerpc/kernel/Makefile anymore.
  - Moved creation of to patch "Implement kexec_file_load."
- Patch "powerpc: Generalize elf64_apply_relocate_add."
  - Dropped patch from series.
  - Moved changes adding address variable to patch "powerpc: Implement
    kexec_file_load."
- Patch "powerpc: Adapt elf64_apply_relocate_add for kexec_file_load."
  - Dropped patch from series.
  - Moved new relocs needed by the purgatory to patch "powerpc: Implement
Re: [PATCH kernel v3 3/4] vfio/spapr: Cache mm in tce_container
On Fri, 21 Oct 2016 11:21:34 +1100
David Gibson wrote:

> On Thu, Oct 20, 2016 at 06:31:21PM +1100, Nicholas Piggin wrote:
> > On Thu, 20 Oct 2016 14:03:49 +1100
> > Alexey Kardashevskiy wrote:
> >
> > > In some situations the userspace memory context may live longer than
> > > the userspace process itself so if we need to do proper memory context
> > > cleanup, we better cache @mm and use it later when the process is gone
> > > (@current or @current->mm is NULL).
> > >
> > > This references mm and stores the pointer in the container; this is done
> > > when a container is just created so checking for !current->mm in other
> > > places becomes pointless.
> > >
> > > This replaces current->mm with container->mm everywhere except debug
> > > prints.
> > >
> > > This adds a check that current->mm is the same as the one stored in
> > > the container to prevent userspace from registering memory in other
> > > processes.
> > >
> > > Signed-off-by: Alexey Kardashevskiy
> > > ---
> > >  drivers/vfio/vfio_iommu_spapr_tce.c | 127
> > >  1 file changed, 71 insertions(+), 56 deletions(-)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c
> > > b/drivers/vfio/vfio_iommu_spapr_tce.c
> > > index d0c38b2..6b0b121 100644
> > > --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> > > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> > > @@ -31,49 +31,46 @@
> >
> > Does it make sense to move the rest of these hunks into patch 2?
> > I think they're similarly just moving the mm reference into callers.
> >
> >
> > >  static void tce_iommu_detach_group(void *iommu_data,
> > >  		struct iommu_group *iommu_group);
> > >
> > > -static long try_increment_locked_vm(long npages)
> > > +static long try_increment_locked_vm(struct mm_struct *mm, long npages)
> > >  {
> > >  	long ret = 0, locked, lock_limit;
> > >
> > > -	if (!current || !current->mm)
> > > -		return -ESRCH; /* process exited */
> > > -
> > >  	if (!npages)
> > >  		return 0;
> > >
> > > -	down_write(&current->mm->mmap_sem);
> > > -	locked = current->mm->locked_vm + npages;
> > > +	down_write(&mm->mmap_sem);
> > > +	locked = mm->locked_vm + npages;
> > >  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > >  	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> > >  		ret = -ENOMEM;
> > >  	else
> > > -		current->mm->locked_vm += npages;
> > > +		mm->locked_vm += npages;
> > >
> > >  	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
> > >  			npages << PAGE_SHIFT,
> > > -			current->mm->locked_vm << PAGE_SHIFT,
> > > +			mm->locked_vm << PAGE_SHIFT,
> > >  			rlimit(RLIMIT_MEMLOCK),
> > >  			ret ? " - exceeded" : "");
> > >
> > > -	up_write(&current->mm->mmap_sem);
> > > +	up_write(&mm->mmap_sem);
> > >
> > >  	return ret;
> > >  }
> > >
> > > -static void decrement_locked_vm(long npages)
> > > +static void decrement_locked_vm(struct mm_struct *mm, long npages)
> > >  {
> > > -	if (!current || !current->mm || !npages)
> > > +	if (!mm || !npages)
> > >  		return; /* process exited */
> >
> > I know you're trying to be defensive and change as little logic as possible,
> > but some cases should be an error, and I think some of the "process exited"
> > comments were wrong anyway.
> >
> > Maybe pull the !mm test into the caller and make it WARN_ON?
> >
> >
> > > @@ -317,6 +311,9 @@ static void *tce_iommu_open(unsigned long arg)
> > >  		return ERR_PTR(-EINVAL);
> > >  	}
> > >
> > > +	if (!current->mm)
> > > +		return ERR_PTR(-ESRCH); /* process exited */
> >
> > A userspace thread in the kernel can't have its mm disappear, unless you
> > are actually in the exit code. !current->mm is more like a test for a kernel
> > thread.
> >
> >
> > > +
> > >  	container = kzalloc(sizeof(*container), GFP_KERNEL);
> > >  	if (!container)
> > >  		return ERR_PTR(-ENOMEM);
> > > @@ -326,13 +323,17 @@ static void *tce_iommu_open(unsigned long arg)
> > >
> > >  	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
> > >
> > > +	container->mm = current->mm;
> > > +	atomic_inc(&container->mm->mm_count);
> > > +
> > >  	return container;
> >
> > It's a nitpick if you respin the patch, but I guess it would better be
> > described as a reference than a cache of the object. "have tce_container
> > take a reference to mm_struct".
> >
> >
> > > @@ -515,13 +526,16 @@ static long tce_iommu_build_v2(struct tce_container
> > > *container,
> > >  	unsigned long hpa;
> > >  	enum dma_data_direction dirtmp;
> > >
> > > +	if (container->mm != current->mm)
> > > +		return -ESRCH;
> >
> > Good, is this condition now enforced on all entrypoints that use
> > container->mm (except the final teardown)? (The mlock/rlimit stuff,
> > as we talked about before, doesn't make sense if not).
>
> Right. I don't know that it's
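The pattern under discussion (take a reference to the caller's mm when the container is opened, then reject later callers whose mm differs) can be sketched in user space with C11 atomics. The model_ types below are hypothetical stand-ins, not the kernel's mm_struct or tce_container, and the release step simply mirrors the increment rather than reproducing any particular kernel helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for mm_struct and tce_container. */
struct model_mm { atomic_int mm_count; };
struct model_container { struct model_mm *mm; };

/* Open: pin the caller's mm for the lifetime of the container. */
static int model_open(struct model_container *c, struct model_mm *current_mm)
{
	if (!current_mm)
		return -ESRCH;	/* no user mm, e.g. a kernel thread */
	c->mm = current_mm;
	atomic_fetch_add(&c->mm->mm_count, 1);
	return 0;
}

/* Entry points: refuse callers whose mm differs from the pinned one. */
static int model_check_caller(const struct model_container *c,
			      const struct model_mm *current_mm)
{
	return c->mm == current_mm ? 0 : -ESRCH;
}

/* Release: drop the reference taken in model_open(). */
static void model_release(struct model_container *c)
{
	assert(c->mm != NULL);
	atomic_fetch_sub(&c->mm->mm_count, 1);
	c->mm = NULL;
}
```

Holding a counted reference is what makes it safe to keep using the stored pointer after the opening process has gone away, which is why the review suggests describing it as a reference rather than a cache.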
Re: [PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check
On 2016/10/21 09:23, Boqun Feng wrote:
> On Thu, Oct 20, 2016 at 05:27:54PM -0400, Pan Xinhui wrote:
> > Commit ("x86, kvm: support vcpu preempted check") add one field "__u8
> > preempted" into struct kvm_steal_time. This field tells if one vcpu is
> > running or not.
> >
> > It is zero if 1) some old KVM deos not support this filed. 2) the vcpu is
> > preempted. Other values means the vcpu has been preempted.
>   ^
> s/preempted/not preempted

Yes. The missing *not* definitely should be avoided.

> And better to fix other typos in the commit log ;-)
> Maybe you can try aspell? That works for me.

I will try it. :)

> Regards,
> Boqun
>
> > Signed-off-by: Pan Xinhui
> > ---
> >  Documentation/virtual/kvm/msr.txt | 8 +++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/virtual/kvm/msr.txt
> > b/Documentation/virtual/kvm/msr.txt
> > index 2a71c8f..3376f13 100644
> > --- a/Documentation/virtual/kvm/msr.txt
> > +++ b/Documentation/virtual/kvm/msr.txt
> > @@ -208,7 +208,8 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
> >  		__u64 steal;
> >  		__u32 version;
> >  		__u32 flags;
> > -		__u32 pad[12];
> > +		__u8 preempted;
> > +		__u32 pad[11];
> >  	}
> >
> >  	whose data will be filled in by the hypervisor periodically. Only one
> > @@ -232,6 +233,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
> >  		nanoseconds. Time during which the vcpu is idle, will not be
> >  		reported as steal time.
> >
> > +	preempted: indicate the VCPU who owns this struct is running or
> > +		not. Non-zero values mean the VCPU has been preempted. Zero
> > +		means the VCPU is not preempted. NOTE, it is always zero if the
> > +		the hypervisor doesn't support this field.
> > +
> >  MSR_KVM_EOI_EN: 0x4b564d04
> >  	data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
> >  	when disabled. Bit 1 is reserved and must be zero. When PV end of
> > --
> > 2.4.11
Re: [PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check
On Thu, Oct 20, 2016 at 05:27:54PM -0400, Pan Xinhui wrote: > Commit ("x86, kvm: support vcpu preempted check") add one field "__u8 > preempted" into struct kvm_steal_time. This field tells if one vcpu is > running or not. > > It is zero if 1) some old KVM deos not support this filed. 2) the vcpu is > preempted. Other values means the vcpu has been preempted. ^ s/preempted/not preempted And better to fix other typos in the commit log ;-) Maybe you can try aspell? That works for me. Regards, Boqun > > Signed-off-by: Pan Xinhui> --- > Documentation/virtual/kvm/msr.txt | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/Documentation/virtual/kvm/msr.txt > b/Documentation/virtual/kvm/msr.txt > index 2a71c8f..3376f13 100644 > --- a/Documentation/virtual/kvm/msr.txt > +++ b/Documentation/virtual/kvm/msr.txt > @@ -208,7 +208,8 @@ MSR_KVM_STEAL_TIME: 0x4b564d03 > __u64 steal; > __u32 version; > __u32 flags; > - __u32 pad[12]; > + __u8 preempted; > + __u32 pad[11]; > } > > whose data will be filled in by the hypervisor periodically. Only one > @@ -232,6 +233,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03 > nanoseconds. Time during which the vcpu is idle, will not be > reported as steal time. > > + preempted: indicate the VCPU who owns this struct is running or > + not. Non-zero values mean the VCPU has been preempted. Zero > + means the VCPU is not preempted. NOTE, it is always zero if the > + the hypervisor doesn't support this field. > + > MSR_KVM_EOI_EN: 0x4b564d04 > data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0 > when disabled. Bit 1 is reserved and must be zero. When PV end of > -- > 2.4.11 > signature.asc Description: PGP signature
Re: [PATCH kernel v3 3/4] vfio/spapr: Cache mm in tce_container
On Thu, Oct 20, 2016 at 02:03:49PM +1100, Alexey Kardashevskiy wrote: > In some situations the userspace memory context may live longer than > the userspace process itself so if we need to do proper memory context > cleanup, we better cache @mm and use it later when the process is gone > (@current or @current->mm is NULL). > > This references mm and stores the pointer in the container; this is done > when a container is just created so checking for !current->mm in other > places becomes pointless. > > This replaces current->mm with container->mm everywhere except debug > prints. > > This adds a check that current->mm is the same as the one stored in > the container to prevent userspace from registering memory in other > processes. > > Signed-off-by: Alexey Kardashevskiy > --- > drivers/vfio/vfio_iommu_spapr_tce.c | 127 > > 1 file changed, 71 insertions(+), 56 deletions(-) > > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c > b/drivers/vfio/vfio_iommu_spapr_tce.c > index d0c38b2..6b0b121 100644 > --- a/drivers/vfio/vfio_iommu_spapr_tce.c > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c > @@ -31,49 +31,46 @@ > static void tce_iommu_detach_group(void *iommu_data, > struct iommu_group *iommu_group); > > -static long try_increment_locked_vm(long npages) > +static long try_increment_locked_vm(struct mm_struct *mm, long npages) > { > long ret = 0, locked, lock_limit; > > - if (!current || !current->mm) > - return -ESRCH; /* process exited */ > - > if (!npages) > return 0; > > - down_write(&current->mm->mmap_sem); > - locked = current->mm->locked_vm + npages; > + down_write(&mm->mmap_sem); > + locked = mm->locked_vm + npages; > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > if (locked > lock_limit && !capable(CAP_IPC_LOCK)) > ret = -ENOMEM; > else > - current->mm->locked_vm += npages; > + mm->locked_vm += npages; > > pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid, > npages << PAGE_SHIFT, > - current->mm->locked_vm << PAGE_SHIFT, > + mm->locked_vm << PAGE_SHIFT, > 
rlimit(RLIMIT_MEMLOCK), > ret ? " - exceeded" : ""); > > - up_write(&current->mm->mmap_sem); > + up_write(&mm->mmap_sem); > > return ret; > } > > -static void decrement_locked_vm(long npages) > +static void decrement_locked_vm(struct mm_struct *mm, long npages) > { > - if (!current || !current->mm || !npages) > + if (!mm || !npages) > return; /* process exited */ > > - down_write(&current->mm->mmap_sem); > - if (WARN_ON_ONCE(npages > current->mm->locked_vm)) > - npages = current->mm->locked_vm; > - current->mm->locked_vm -= npages; > + down_write(&mm->mmap_sem); > + if (WARN_ON_ONCE(npages > mm->locked_vm)) > + npages = mm->locked_vm; > + mm->locked_vm -= npages; > pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid, > npages << PAGE_SHIFT, > - current->mm->locked_vm << PAGE_SHIFT, > + mm->locked_vm << PAGE_SHIFT, > rlimit(RLIMIT_MEMLOCK)); > - up_write(&current->mm->mmap_sem); > + up_write(&mm->mmap_sem); > } > > /* > @@ -98,6 +95,7 @@ struct tce_container { > bool enabled; > bool v2; > unsigned long locked_pages; > + struct mm_struct *mm; > struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES]; > struct list_head group_list; > }; > @@ -113,11 +111,11 @@ static long tce_iommu_unregister_pages(struct > tce_container *container, > if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK)) > return -EINVAL; > > - mem = mm_iommu_find(current->mm, vaddr, size >> PAGE_SHIFT); > + mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT); > if (!mem) > return -ENOENT; > > - return mm_iommu_put(current->mm, mem); > + return mm_iommu_put(container->mm, mem); > } > > static long tce_iommu_register_pages(struct tce_container *container, > @@ -134,7 +132,7 @@ static long tce_iommu_register_pages(struct tce_container > *container, > ((vaddr + size) < vaddr)) > return -EINVAL; > > - ret = mm_iommu_get(current->mm, vaddr, entries, &mem); > + ret = mm_iommu_get(container->mm, vaddr, entries, &mem); > if (ret) > return ret; > > @@ -143,7 +141,8 @@ static long tce_iommu_register_pages(struct tce_container > *container, > 
return 0; > } > > -static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl) > +static long tce_iommu_userspace_view_alloc(struct iommu_table *tbl, > + struct mm_struct *mm) > { > unsigned long cb = _ALIGN_UP(sizeof(tbl->it_userspace[0]) * > tbl->it_size, PAGE_SIZE);
Re: [PATCH kernel v3 4/4] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
On Thu, Oct 20, 2016 at 02:03:50PM +1100, Alexey Kardashevskiy wrote: > At the moment the userspace tool is expected to request pinning of > the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present. > When the userspace process finishes, all the pinned pages need to > be put; this is done as a part of the userspace memory context (MM) > destruction which happens on the very last mmdrop(). > > This approach has a problem that a MM of the userspace process > may live longer than the userspace process itself as kernel threads > use userspace process MMs which were running on a CPU where > the kernel thread was scheduled to. If this happened, the MM remains > referenced until this exact kernel thread wakes up again > and releases the very last reference to the MM, on an idle system this > can take even hours. > > This moves preregistered regions tracking from MM to VFIO; instead of > using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is > added so each container releases regions which it has pre-registered. > > This changes the userspace interface to return EBUSY if a memory > region is already registered in a container. However it should not > have any practical effect as the only userspace tool available now > does register memory region once per container anyway. > > As tce_iommu_register_pages/tce_iommu_unregister_pages are called > under container->lock, this does not need additional locking. 
> > Signed-off-by: Alexey Kardashevskiy > Reviewed-by: Nicholas Piggin > --- > Changes: > v3: > * moved tce_iommu_prereg_free() call out of list_for_each_entry() > > v2: > * updated commit log > --- > arch/powerpc/mm/mmu_context_book3s64.c | 4 --- > arch/powerpc/mm/mmu_context_iommu.c| 11 > drivers/vfio/vfio_iommu_spapr_tce.c| 49 > +- > 3 files changed, 48 insertions(+), 16 deletions(-) > > diff --git a/arch/powerpc/mm/mmu_context_book3s64.c > b/arch/powerpc/mm/mmu_context_book3s64.c > index ad82735..1a07969 100644 > --- a/arch/powerpc/mm/mmu_context_book3s64.c > +++ b/arch/powerpc/mm/mmu_context_book3s64.c > @@ -159,10 +159,6 @@ static inline void destroy_pagetable_page(struct > mm_struct *mm) > > void destroy_context(struct mm_struct *mm) > { > -#ifdef CONFIG_SPAPR_TCE_IOMMU > - mm_iommu_cleanup(mm); > -#endif > - > #ifdef CONFIG_PPC_ICSWX > drop_cop(mm->context.acop, mm); > kfree(mm->context.cop_lockp); > diff --git a/arch/powerpc/mm/mmu_context_iommu.c > b/arch/powerpc/mm/mmu_context_iommu.c > index 4c6db09..104bad0 100644 > --- a/arch/powerpc/mm/mmu_context_iommu.c > +++ b/arch/powerpc/mm/mmu_context_iommu.c > @@ -365,14 +365,3 @@ void mm_iommu_init(struct mm_struct *mm) > { > INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list); > } > - > -void mm_iommu_cleanup(struct mm_struct *mm) > -{ > - struct mm_iommu_table_group_mem_t *mem, *tmp; > - > - list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list, > - next) { > - list_del_rcu(&mem->next); > - mm_iommu_do_free(mem); > - } > -} > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c > b/drivers/vfio/vfio_iommu_spapr_tce.c > index 6b0b121..3e2f757 100644 > --- a/drivers/vfio/vfio_iommu_spapr_tce.c > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c > @@ -86,6 +86,15 @@ struct tce_iommu_group { > }; > > /* > + * A container needs to remember which preregistered region it has > + * referenced to do proper cleanup at the userspace process exit. 
> + */ > +struct tce_iommu_prereg { > + struct list_head next; > + struct mm_iommu_table_group_mem_t *mem; > +}; > + > +/* > * The container descriptor supports only a single group per container. > * Required by the API as the container is not supplied with the IOMMU group > * at the moment of initialization. > @@ -98,12 +107,27 @@ struct tce_container { > struct mm_struct *mm; > struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES]; > struct list_head group_list; > + struct list_head prereg_list; > }; > > +static long tce_iommu_prereg_free(struct tce_container *container, > + struct tce_iommu_prereg *tcemem) > +{ > + long ret; > + > + list_del(&tcemem->next); > + ret = mm_iommu_put(container->mm, tcemem->mem); > + kfree(tcemem); > + > + return ret; > +} > + > static long tce_iommu_unregister_pages(struct tce_container *container, > __u64 vaddr, __u64 size) > { > struct mm_iommu_table_group_mem_t *mem; > + struct tce_iommu_prereg *tcemem; > + bool found = false; > > if (!current || !current->mm) > return -ESRCH; /* process exited */ > @@ -115,7 +139,17 @@ static long tce_iommu_unregister_pages(struct > tce_container *container, > if (!mem) > return -ENOENT; > > - return mm_iommu_put(container->mm, mem); > + list_for_each_entry(tcemem, &container->prereg_list, next) { > + if
Re: [PATCH kernel v3 3/4] vfio/spapr: Cache mm in tce_container
On Thu, Oct 20, 2016 at 06:31:21PM +1100, Nicholas Piggin wrote: > On Thu, 20 Oct 2016 14:03:49 +1100 > Alexey Kardashevskiy wrote: > > > In some situations the userspace memory context may live longer than > > the userspace process itself so if we need to do proper memory context > > cleanup, we better cache @mm and use it later when the process is gone > > (@current or @current->mm is NULL). > > > > This references mm and stores the pointer in the container; this is done > > when a container is just created so checking for !current->mm in other > > places becomes pointless. > > > > This replaces current->mm with container->mm everywhere except debug > > prints. > > > > This adds a check that current->mm is the same as the one stored in > > the container to prevent userspace from registering memory in other > > processes. > > > > Signed-off-by: Alexey Kardashevskiy > > --- > > drivers/vfio/vfio_iommu_spapr_tce.c | 127 > > > > 1 file changed, 71 insertions(+), 56 deletions(-) > > > > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c > > b/drivers/vfio/vfio_iommu_spapr_tce.c > > index d0c38b2..6b0b121 100644 > > --- a/drivers/vfio/vfio_iommu_spapr_tce.c > > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c > > @@ -31,49 +31,46 @@ > > Does it make sense to move the rest of these hunks into patch 2? > I think they're similarly just moving the mm reference into callers. 
> > > > static void tce_iommu_detach_group(void *iommu_data, > > struct iommu_group *iommu_group); > > > > -static long try_increment_locked_vm(long npages) > > +static long try_increment_locked_vm(struct mm_struct *mm, long npages) > > { > > long ret = 0, locked, lock_limit; > > > > - if (!current || !current->mm) > > - return -ESRCH; /* process exited */ > > - > > if (!npages) > > return 0; > > > > - down_write(&current->mm->mmap_sem); > > - locked = current->mm->locked_vm + npages; > > + down_write(&mm->mmap_sem); > > + locked = mm->locked_vm + npages; > > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > if (locked > lock_limit && !capable(CAP_IPC_LOCK)) > > ret = -ENOMEM; > > else > > - current->mm->locked_vm += npages; > > + mm->locked_vm += npages; > > > > pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid, > > npages << PAGE_SHIFT, > > - current->mm->locked_vm << PAGE_SHIFT, > > + mm->locked_vm << PAGE_SHIFT, > > rlimit(RLIMIT_MEMLOCK), > > ret ? " - exceeded" : ""); > > > > - up_write(&current->mm->mmap_sem); > > + up_write(&mm->mmap_sem); > > > > return ret; > > } > > > > -static void decrement_locked_vm(long npages) > > +static void decrement_locked_vm(struct mm_struct *mm, long npages) > > { > > - if (!current || !current->mm || !npages) > > + if (!mm || !npages) > > return; /* process exited */ > > I know you're trying to be defensive and change as little logic as possible, > but some cases should be an error, and I think some of the "process exited" > comments were wrong anyway. > > Maybe pull the !mm test into the caller and make it WARN_ON? > > > > @@ -317,6 +311,9 @@ static void *tce_iommu_open(unsigned long arg) > > return ERR_PTR(-EINVAL); > > } > > > > + if (!current->mm) > > + return ERR_PTR(-ESRCH); /* process exited */ > > A userspace thread in the kernel can't have its mm disappear, unless you > are actually in the exit code. !current->mm is more like a test for a kernel > thread. 
> > > > + > > container = kzalloc(sizeof(*container), GFP_KERNEL); > > if (!container) > > return ERR_PTR(-ENOMEM); > > @@ -326,13 +323,17 @@ static void *tce_iommu_open(unsigned long arg) > > > > container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU; > > > > + container->mm = current->mm; > > + atomic_inc(&container->mm->mm_count); > > + > > return container; > > It's a nitpick if you respin the patch, but I guess it would better be > described as a reference than a cache of the object. "have tce_container > take a reference to mm_struct". > > > > @@ -515,13 +526,16 @@ static long tce_iommu_build_v2(struct tce_container > > *container, > > unsigned long hpa; > > enum dma_data_direction dirtmp; > > > > + if (container->mm != current->mm) > > + return -ESRCH; > > Good, is this condition now enforced on all entrypoints that use > container->mm (except the final teardown)? (The mlock/rlimit stuff, > as we talked about before, doesn't make sense if not). Right. I don't know that it's actually dangerous, but I think it would be needlessly weird for one process to be able to manipulate another process's mm via the container fd. So all the entry points that are directly called from userspace (basically, the ioctl()s) should verify that current->mm matches container->mm
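The rule discussed above (take a reference to the mm when the container is opened, verify the caller's mm on every userspace entry point, skip the check only on final teardown) can be sketched in plain userspace C. This is only an illustrative model, not the driver code: struct mm_sketch, struct container_sketch and the container_*() helpers are hypothetical names, and a plain int stands in for the kernel's atomic mm_count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for mm_struct and tce_container. */
struct mm_sketch {
	int refcount;            /* models mm->mm_count, non-atomic here */
};

struct container_sketch {
	struct mm_sketch *mm;    /* mm of the process that opened the container */
};

/* Models tce_iommu_open(): remember the caller's mm and pin it. */
static struct container_sketch *container_open(struct mm_sketch *current_mm)
{
	struct container_sketch *c;

	if (!current_mm)         /* no mm: caller looks like a kernel thread */
		return NULL;

	c = calloc(1, sizeof(*c));
	if (!c)
		return NULL;

	c->mm = current_mm;
	c->mm->refcount++;       /* the container now holds a reference */
	return c;
}

/* Check intended for every entry point reachable from userspace. */
static long container_check_mm(const struct container_sketch *c,
			       const struct mm_sketch *current_mm)
{
	return c->mm == current_mm ? 0 : -ESRCH;
}

/* Final teardown: no caller check, only drops the reference. */
static void container_release(struct container_sketch *c)
{
	assert(c->mm->refcount > 0);
	c->mm->refcount--;
	free(c);
}
```

In this model a second process that inherits the container fd fails container_check_mm() with -ESRCH, while container_release() still works from any context, which mirrors the "except the final teardown" exception raised in the review.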
Re: [PATCH] kernel: irq: fix build failure
Hi Thomas, On Thu, 20 Oct 2016 14:55:45 +0200 (CEST) Thomas Gleixner wrote: > > On Mon, 10 Oct 2016, Sudip Mukherjee wrote: > > > On Thursday 06 October 2016 11:06 PM, Sudip Mukherjee wrote: > > > The allmodconfig build of powerpc is failing with the error: > > > ERROR: ".irq_set_parent" [drivers/mfd/tps65217.ko] undefined! > > > > > > export the symbol to fix the failure. > > > > Hi Thomas, > > powerpc and arm allmodconfig builds still fail with the same error. > > Build logs of next-20161010 are at: > > arm at https://travis-ci.org/sudipm-mukherjee/parport/jobs/166321467 > > powerpc at https://travis-ci.org/sudipm-mukherjee/parport/jobs/166321473 > > I know. This is under discussion with the driver folks as we are not going > to blindly export stuff just because someone slapped a irq_set_parent() > into the code w/o knowing why. Do we have any idea if a resolution is close? This was first reported in linux-next on September 14/15. :-( -- Cheers, Stephen Rothwell
[PATCH] powerpc/book3s64: Always build for power4 or later
When we're not compiling for a specific CPU, ie. none of the CONFIG_POWERx_CPU options are set, and CONFIG_GENERIC_CPU *is* set, we currently don't pass any -mcpu option to the compiler. This means the compiler builds for a "generic" Power CPU. But back in 2014 we dropped support for pre power4 CPUs in commit 468a33028edd ("powerpc: Drop support for pre-POWER4 cpus"). Given that, there's no point in building the kernel to run on pre power4 cpus. So update the flags we pass to the compiler when CONFIG_GENERIC_CPU is set, to specify -mcpu=power4. Signed-off-by: Michael Ellerman--- arch/powerpc/Makefile | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index 617dece67924..041fda1e2a5d 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -121,6 +121,7 @@ CFLAGS-$(CONFIG_PPC32) := -ffixed-r2 $(MULTIPLEWORD) ifeq ($(CONFIG_PPC_BOOK3S_64),y) CFLAGS-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=power7,-mtune=power4) +CFLAGS-$(CONFIG_GENERIC_CPU) += -mcpu=power4 else CFLAGS-$(CONFIG_GENERIC_CPU) += -mcpu=powerpc64 endif -- 2.7.4
Re: [PATCH kernel v3 2/4] powerpc/iommu: Stop using @current in mm_iommu_xxx
On Thu, Oct 20, 2016 at 02:03:48PM +1100, Alexey Kardashevskiy wrote: > This changes mm_iommu_xxx helpers to take mm_struct as a parameter > instead of getting it from @current which in some situations may > not have a valid reference to mm. > > This changes helpers to receive @mm and moves all references to @current > to the caller, including checks for !current and !current->mm; > checks in mm_iommu_preregistered() are removed as there is no caller > yet. > > This moves the mm_iommu_adjust_locked_vm() call to the caller as > it receives mm_iommu_table_group_mem_t but it needs mm. > > This should cause no behavioral change. > > Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson > --- > arch/powerpc/include/asm/mmu_context.h | 16 ++-- > arch/powerpc/mm/mmu_context_iommu.c| 46 > +- > drivers/vfio/vfio_iommu_spapr_tce.c| 14 --- > 3 files changed, 36 insertions(+), 40 deletions(-) > > diff --git a/arch/powerpc/include/asm/mmu_context.h > b/arch/powerpc/include/asm/mmu_context.h > index 424844b..b9e3f0a 100644 > --- a/arch/powerpc/include/asm/mmu_context.h > +++ b/arch/powerpc/include/asm/mmu_context.h > @@ -19,16 +19,18 @@ extern void destroy_context(struct mm_struct *mm); > struct mm_iommu_table_group_mem_t; > > extern int isolate_lru_page(struct page *page); /* from internal.h */ > -extern bool mm_iommu_preregistered(void); > -extern long mm_iommu_get(unsigned long ua, unsigned long entries, > +extern bool mm_iommu_preregistered(struct mm_struct *mm); > +extern long mm_iommu_get(struct mm_struct *mm, > + unsigned long ua, unsigned long entries, > struct mm_iommu_table_group_mem_t **pmem); > -extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem); > +extern long mm_iommu_put(struct mm_struct *mm, > + struct mm_iommu_table_group_mem_t *mem); > extern void mm_iommu_init(struct mm_struct *mm); > extern void mm_iommu_cleanup(struct mm_struct *mm); > -extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua, > - unsigned long size); > 
-extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua, > - unsigned long entries); > +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct > *mm, > + unsigned long ua, unsigned long size); > +extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, > + unsigned long ua, unsigned long entries); > extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, > unsigned long ua, unsigned long *hpa); > extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem); > diff --git a/arch/powerpc/mm/mmu_context_iommu.c > b/arch/powerpc/mm/mmu_context_iommu.c > index ad2e575..4c6db09 100644 > --- a/arch/powerpc/mm/mmu_context_iommu.c > +++ b/arch/powerpc/mm/mmu_context_iommu.c > @@ -56,7 +56,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm, > } > > pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n", > - current->pid, > + current ? current->pid : 0, > incr ? '+' : '-', > npages << PAGE_SHIFT, > mm->locked_vm << PAGE_SHIFT, > @@ -66,12 +66,9 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm, > return ret; > } > > -bool mm_iommu_preregistered(void) > +bool mm_iommu_preregistered(struct mm_struct *mm) > { > - if (!current || !current->mm) > - return false; > - > - return !list_empty(&current->mm->context.iommu_group_mem_list); > + return !list_empty(&mm->context.iommu_group_mem_list); > } > EXPORT_SYMBOL_GPL(mm_iommu_preregistered); > > @@ -124,19 +121,16 @@ static int mm_iommu_move_page_from_cma(struct page > *page) > return 0; > } > > -long mm_iommu_get(unsigned long ua, unsigned long entries, > +long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long > entries, > struct mm_iommu_table_group_mem_t **pmem) > { > struct mm_iommu_table_group_mem_t *mem; > long i, j, ret = 0, locked_entries = 0; > struct page *page = NULL; > > - if (!current || !current->mm) > - return -ESRCH; /* process exited */ > - > mutex_lock(&mem_list_mutex); > > - list_for_each_entry_rcu(mem, 
&current->mm->context.iommu_group_mem_list, > + list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, > next) { > if ((mem->ua == ua) && (mem->entries == entries)) { > ++mem->used; > @@ -154,7 +148,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries, > > } > > - ret = mm_iommu_adjust_locked_vm(current->mm, entries, true); > + ret = mm_iommu_adjust_locked_vm(mm, entries, true); > if (ret) > goto unlock_exit; > > @@ -215,11
Re: [PATCH kernel v3 1/4] powerpc/iommu: Pass mm_struct to init/cleanup helpers
On Thu, Oct 20, 2016 at 02:03:47PM +1100, Alexey Kardashevskiy wrote: > We are going to get rid of @current references in mmu_context_book3s64.c > and cache mm_struct in the VFIO container. Since mm_context_t does not > have reference counting, we will be using mm_struct which does have > the reference counter. > > This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather > than mm_context_t (which is embedded into mm). > > This should not cause any behavioral change. > > Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson > --- > arch/powerpc/include/asm/mmu_context.h | 4 ++-- > arch/powerpc/kernel/setup-common.c | 2 +- > arch/powerpc/mm/mmu_context_book3s64.c | 4 ++-- > arch/powerpc/mm/mmu_context_iommu.c| 9 + > 4 files changed, 10 insertions(+), 9 deletions(-) > > diff --git a/arch/powerpc/include/asm/mmu_context.h > b/arch/powerpc/include/asm/mmu_context.h > index 5c45114..424844b 100644 > --- a/arch/powerpc/include/asm/mmu_context.h > +++ b/arch/powerpc/include/asm/mmu_context.h > @@ -23,8 +23,8 @@ extern bool mm_iommu_preregistered(void); > extern long mm_iommu_get(unsigned long ua, unsigned long entries, > struct mm_iommu_table_group_mem_t **pmem); > extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem); > -extern void mm_iommu_init(mm_context_t *ctx); > -extern void mm_iommu_cleanup(mm_context_t *ctx); > +extern void mm_iommu_init(struct mm_struct *mm); > +extern void mm_iommu_cleanup(struct mm_struct *mm); > extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua, > unsigned long size); > extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua, > diff --git a/arch/powerpc/kernel/setup-common.c > b/arch/powerpc/kernel/setup-common.c > index 270ee30..f516ac5 100644 > --- a/arch/powerpc/kernel/setup-common.c > +++ b/arch/powerpc/kernel/setup-common.c > @@ -915,7 +915,7 @@ void __init setup_arch(char **cmdline_p) > init_mm.context.pte_frag = NULL; > #endif > #ifdef CONFIG_SPAPR_TCE_IOMMU > - 
mm_iommu_init(&init_mm.context); > + mm_iommu_init(&init_mm); > #endif > irqstack_early_init(); > exc_lvl_early_init(); > diff --git a/arch/powerpc/mm/mmu_context_book3s64.c > b/arch/powerpc/mm/mmu_context_book3s64.c > index b114f8b..ad82735 100644 > --- a/arch/powerpc/mm/mmu_context_book3s64.c > +++ b/arch/powerpc/mm/mmu_context_book3s64.c > @@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct > mm_struct *mm) > mm->context.pte_frag = NULL; > #endif > #ifdef CONFIG_SPAPR_TCE_IOMMU > - mm_iommu_init(&mm->context); > + mm_iommu_init(mm); > #endif > return 0; > } > @@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct > mm_struct *mm) > void destroy_context(struct mm_struct *mm) > { > #ifdef CONFIG_SPAPR_TCE_IOMMU > - mm_iommu_cleanup(&mm->context); > + mm_iommu_cleanup(mm); > #endif > > #ifdef CONFIG_PPC_ICSWX > diff --git a/arch/powerpc/mm/mmu_context_iommu.c > b/arch/powerpc/mm/mmu_context_iommu.c > index e0f1c33..ad2e575 100644 > --- a/arch/powerpc/mm/mmu_context_iommu.c > +++ b/arch/powerpc/mm/mmu_context_iommu.c > @@ -373,16 +373,17 @@ void mm_iommu_mapped_dec(struct > mm_iommu_table_group_mem_t *mem) > } > EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec); > > -void mm_iommu_init(mm_context_t *ctx) > +void mm_iommu_init(struct mm_struct *mm) > { > - INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list); > + INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list); > } > > -void mm_iommu_cleanup(mm_context_t *ctx) > +void mm_iommu_cleanup(struct mm_struct *mm) > { > struct mm_iommu_table_group_mem_t *mem, *tmp; > > - list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) { > + list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list, > + next) { > list_del_rcu(&mem->next); > mm_iommu_do_free(mem); > } -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
Re: [PATCH v6] powerpc: Do not make the entire heap executable
On Tue, Oct 04, 2016 at 09:54:12AM -0700, Kees Cook wrote: > On Mon, Oct 3, 2016 at 5:18 PM, Michael Ellerman wrote: > > Kees Cook writes: > > > >> On Mon, Oct 3, 2016 at 9:13 AM, Denys Vlasenko wrote: > >>> On 32-bit powerpc the ELF PLT sections of binaries (built with --bss-plt, > >>> or with a toolchain which defaults to it) look like this: > > ... > >>> > >>> Signed-off-by: Jason Gunthorpe > >>> Signed-off-by: Denys Vlasenko > >>> Acked-by: Kees Cook > >>> Acked-by: Michael Ellerman > >>> CC: Benjamin Herrenschmidt > >>> CC: Paul Mackerras > >>> CC: "Aneesh Kumar K.V" > >>> CC: Kees Cook > >>> CC: Oleg Nesterov > >>> CC: Michael Ellerman > >>> CC: Florian Weimer > >>> CC: linux...@kvack.org > >>> CC: linuxppc-dev@lists.ozlabs.org > >>> CC: linux-ker...@vger.kernel.org > >>> Changes since v5: > >>> * made do_brk_flags() error out if any bits other than VM_EXEC are set. > >>> (Kees Cook: "With this, I'd be happy to Ack.") > >>> See https://patchwork.ozlabs.org/patch/661595/ > >> > >> Excellent, thanks for the v6! Should this go via the ppc tree or the -mm > >> tree? > > > > -mm would be best, given the diffstat I think it's less likely to > > conflict if it goes via -mm. > > Okay, excellent. Andrew, do you have this already in email? I think > you weren't on the explicit CC from the v6... FWIW (and ping), Tested-by: Jason Gunthorpe On ARM32 (kirkwood) and PPC32 (405) For reference, here is the patchwork URL: https://patchwork.ozlabs.org/patch/677753/ Jason
Re: [PATCH 00/10] mm: adjust get_user_pages* functions to explicitly pass FOLL_* flags
On Wed 19-10-16 10:23:55, Dave Hansen wrote: > On 10/19/2016 10:01 AM, Michal Hocko wrote: > > The question I had earlier was whether this has to be an explicit FOLL > > flag used by g-u-p users or we can just use it internally when mm != > > current->mm > > The reason I chose not to do that was that deferred work gets run under > a basically random 'current'. If we just use 'mm != current->mm', then > the deferred work will sometimes have pkeys enforced and sometimes not, > basically randomly. OK, I see (async_pf_execute and ksm ). It makes more sense to me. Thanks for the clarification. -- Michal Hocko SUSE Labs
[PATCH v5 9/9] Documentation: virtual: kvm: Support vcpu preempted check
Commit ("x86, kvm: support vcpu preempted check") add one field "__u8 preempted" into struct kvm_steal_time. This field tells if one vcpu is running or not. It is zero if 1) some old KVM deos not support this filed. 2) the vcpu is preempted. Other values means the vcpu has been preempted. Signed-off-by: Pan Xinhui--- Documentation/virtual/kvm/msr.txt | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt index 2a71c8f..3376f13 100644 --- a/Documentation/virtual/kvm/msr.txt +++ b/Documentation/virtual/kvm/msr.txt @@ -208,7 +208,8 @@ MSR_KVM_STEAL_TIME: 0x4b564d03 __u64 steal; __u32 version; __u32 flags; - __u32 pad[12]; + __u8 preempted; + __u32 pad[11]; } whose data will be filled in by the hypervisor periodically. Only one @@ -232,6 +233,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03 nanoseconds. Time during which the vcpu is idle, will not be reported as steal time. + preempted: indicate the VCPU who owns this struct is running or + not. Non-zero values mean the VCPU has been preempted. Zero + means the VCPU is not preempted. NOTE, it is always zero if the + the hypervisor doesn't support this field. + MSR_KVM_EOI_EN: 0x4b564d04 data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0 when disabled. Bit 1 is reserved and must be zero. When PV end of -- 2.4.11
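To illustrate the layout change documented above, here is a small userspace sketch. It mirrors the documented fields with fixed-width types in place of __u64/__u32/__u8; struct steal_time_sketch and sketch_vcpu_preempted() are hypothetical names, not the kernel's definitions. It assumes an ABI with natural alignment, where three implicit padding bytes follow the new one-byte field, so shrinking pad[] from 12 to 11 entries keeps the structure at 64 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the documented steal-time layout (illustrative only). */
struct steal_time_sketch {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;  /* non-zero: the vcpu has been preempted */
	uint32_t pad[11];    /* was pad[12] before 'preempted' was added */
};

/* With natural alignment the overall size is unchanged by the new field. */
static_assert(sizeof(struct steal_time_sketch) == 64,
	      "steal time area is expected to stay 64 bytes");

/*
 * Hypothetical guest-side helper. A zero value means either "not preempted"
 * or "the hypervisor does not support this field", so callers can only
 * treat a non-zero value as definite information.
 */
static bool sketch_vcpu_preempted(const volatile struct steal_time_sketch *st)
{
	return st->preempted != 0;
}
```

The asymmetry is the point of the NOTE in the documentation text: on an old hypervisor the helper always reports "not preempted", which is the safe default for a spin-wait heuristic.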
[PATCH v5 8/9] s390/spinlock: Provide vcpu_is_preempted
From: Christian Borntraeger This implements the s390 backend for commit "kernel/sched: introduce vcpu preempted check interface" by reworking the existing smp_vcpu_scheduled into arch_vcpu_is_preempted. We can then also get rid of the local cpu_is_preempted function by moving the CIF_ENABLED_WAIT test into arch_vcpu_is_preempted. Signed-off-by: Christian Borntraeger Acked-by: Heiko Carstens --- arch/s390/include/asm/spinlock.h | 8 arch/s390/kernel/smp.c | 9 +++-- arch/s390/lib/spinlock.c | 25 - 3 files changed, 23 insertions(+), 19 deletions(-) diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h index 7e9e09f..7ecd890 100644 --- a/arch/s390/include/asm/spinlock.h +++ b/arch/s390/include/asm/spinlock.h @@ -23,6 +23,14 @@ _raw_compare_and_swap(unsigned int *lock, unsigned int old, unsigned int new) return __sync_bool_compare_and_swap(lock, old, new); } +#ifndef CONFIG_SMP +static inline bool arch_vcpu_is_preempted(int cpu) { return false; } +#else +bool arch_vcpu_is_preempted(int cpu); +#endif + +#define vcpu_is_preempted arch_vcpu_is_preempted + /* * Simple spin lock operations. There are two variants, one clears IRQ's * on the local processor, one does not. 
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 35531fe..b988ed1 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -368,10 +368,15 @@ int smp_find_processor_id(u16 address) return -1; } -int smp_vcpu_scheduled(int cpu) +bool arch_vcpu_is_preempted(int cpu) { - return pcpu_running(pcpu_devices + cpu); + if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu)) + return false; + if (pcpu_running(pcpu_devices + cpu)) + return false; + return true; } +EXPORT_SYMBOL(arch_vcpu_is_preempted); void smp_yield_cpu(int cpu) { diff --git a/arch/s390/lib/spinlock.c b/arch/s390/lib/spinlock.c index e5f50a7..e48a48e 100644 --- a/arch/s390/lib/spinlock.c +++ b/arch/s390/lib/spinlock.c @@ -37,15 +37,6 @@ static inline void _raw_compare_and_delay(unsigned int *lock, unsigned int old) asm(".insn rsy,0xeb22,%0,0,%1" : : "d" (old), "Q" (*lock)); } -static inline int cpu_is_preempted(int cpu) -{ - if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu)) - return 0; - if (smp_vcpu_scheduled(cpu)) - return 0; - return 1; -} - void arch_spin_lock_wait(arch_spinlock_t *lp) { unsigned int cpu = SPINLOCK_LOCKVAL; @@ -62,7 +53,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp) continue; } /* First iteration: check if the lock owner is running. */ - if (first_diag && cpu_is_preempted(~owner)) { + if (first_diag && arch_vcpu_is_preempted(~owner)) { smp_yield_cpu(~owner); first_diag = 0; continue; @@ -81,7 +72,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp) * yield the CPU unconditionally. For LPAR rely on the * sense running status. */ - if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) { + if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) { smp_yield_cpu(~owner); first_diag = 0; } @@ -108,7 +99,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags) continue; } /* Check if the lock owner is running. 
*/ - if (first_diag && cpu_is_preempted(~owner)) { + if (first_diag && arch_vcpu_is_preempted(~owner)) { smp_yield_cpu(~owner); first_diag = 0; continue; @@ -127,7 +118,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags) * yield the CPU unconditionally. For LPAR rely on the * sense running status. */ - if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) { + if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) { smp_yield_cpu(~owner); first_diag = 0; } @@ -165,7 +156,7 @@ void _raw_read_lock_wait(arch_rwlock_t *rw) owner = 0; while (1) { if (count-- <= 0) { - if (owner && cpu_is_preempted(~owner)) + if (owner && arch_vcpu_is_preempted(~owner)) smp_yield_cpu(~owner); count = spin_retry; } @@ -211,7 +202,7 @@ void _raw_write_lock_wait(arch_rwlock_t *rw, unsigned int prev) owner = 0; while (1) { if (count-- <= 0) { -
[PATCH v5 7/9] x86, xen: support vcpu preempted check
From: Juergen Gross

Support the vcpu_is_preempted() functionality under Xen. This will enhance lock performance on overcommitted hosts (more runnable vcpus than physical cpus in the system) as doing busy waits for preempted vcpus will hurt system performance far worse than early yielding.

A quick test (4 vcpus on 1 physical cpu doing a parallel build job with "make -j 8") reduced system time by about 5% with this patch.

Signed-off-by: Juergen Gross
Signed-off-by: Pan Xinhui
---
 arch/x86/xen/spinlock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 3d6e006..74756bb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
 	per_cpu(irq_name, cpu) = NULL;
 }
 
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
 	pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
 	pv_lock_ops.wait = xen_qlock_wait;
 	pv_lock_ops.kick = xen_qlock_kick;
+
+	pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
 }
 
 /*
-- 
2.4.11
[PATCH v5 6/9] x86, kvm: support vcpu preempted check
Support the vcpu_is_preempted() functionality under KVM. This will enhance lock performance on overcommitted hosts (more runnable vcpus than physical cpus in the system) as doing busy waits for preempted vcpus will hurt system performance far worse than early yielding.

Use one field of struct kvm_steal_time to indicate whether a vcpu is running or not.

unix benchmark result:
host:  kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

test-case                              | after-patch      | before-patch
Execl Throughput                       |     18307.9 lps  |     11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |   1352407.3 KBps |    790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |    367555.6 KBps |    222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |   3675649.7 KBps |   1780614.4 KBps
Pipe Throughput                        |  11872208.7 lps  |  11855628.9 lps
Pipe-based Context Switching           |   1495126.5 lps  |   1490533.9 lps
Process Creation                       |     29881.2 lps  |     28572.8 lps
Shell Scripts (1 concurrent)           |     23224.3 lpm  |     22607.4 lpm
Shell Scripts (8 concurrent)           |      3531.4 lpm  |      3211.9 lpm
System Call Overhead                   |  10385653.0 lps  |  10419979.0 lps

Signed-off-by: Pan Xinhui
---
 arch/x86/include/uapi/asm/kvm_para.h | 3 ++-
 arch/x86/kernel/kvm.c | 12
 arch/x86/kvm/x86.c | 18 ++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..b3fec56 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,8 @@ struct kvm_steal_time {
 	__u64 steal;
 	__u32 version;
 	__u32 flags;
-	__u32 pad[12];
+	__u8 preempted;
+	__u32 pad[11];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0b48dd2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
 	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+static bool kvm_vcpu_is_preempted(int cpu)
+{
+	struct kvm_steal_time *src;
+
+	src = &per_cpu(steal_time, cpu);
+
+	return !!src->preempted;
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -471,6 +480,9 @@ void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
 		has_steal_clock = 1;
 		pv_time_ops.steal_clock = kvm_steal_clock;
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+		pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+#endif
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c633de..a627537 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
 		return;
 
+	vcpu->arch.st.steal.preempted = 0;
+
 	if (vcpu->arch.st.steal.version & 1)
 		vcpu->arch.st.steal.version += 1; /* first time write, random junk */
 
@@ -2810,8 +2812,24 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 }
 
+static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
+{
+	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+		return;
+
+	if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+		return;
+
+	vcpu->arch.st.steal.preempted = 1;
+
+	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+}
+
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+	kvm_steal_time_set_preempted(vcpu);
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
 	vcpu->arch.last_host_tsc = rdtsc();
-- 
2.4.11
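To make the handshake in this patch easier to follow, here is a minimal single-threaded sketch of the idea: the host marks the shared record when the vCPU is scheduled out and clears it when steal time is recorded on the way back in, and the guest only reads the flag. The demo_ names are hypothetical stand-ins, not KVM symbols, and the sketch deliberately ignores guest-memory access helpers and concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the layout after the patch: one byte is carved
 * out of the padding for the "preempted" flag. On common ABIs the
 * compiler pads the byte to a 4-byte boundary, so the record is expected
 * to keep its original 64-byte size.
 */
struct demo_steal_time {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint32_t pad[11];
};

/* Host side: the vCPU is being scheduled out. */
static void demo_host_vcpu_put(struct demo_steal_time *st)
{
	assert(st != NULL);
	st->preempted = 1;
}

/* Host side: the vCPU is about to run again. */
static void demo_host_record_steal_time(struct demo_steal_time *st)
{
	assert(st != NULL);
	st->preempted = 0;
}

/* Guest side: a read-only check, as kvm_vcpu_is_preempted() does. */
static bool demo_guest_vcpu_is_preempted(const struct demo_steal_time *st)
{
	return st->preempted != 0;
}
```

A caller would run demo_host_vcpu_put() when descheduling the vCPU and demo_host_record_steal_time() before re-entering the guest; the guest-side check then reflects whichever happened last.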
[PATCH v5 5/9] x86, paravirt: Add interface to support kvm/xen vcpu preempted check
This is to fix some lock holder preemption issues. Some other locks implementation do a spin loop before acquiring the lock itself. Currently kernel has an interface of bool vcpu_is_preempted(int cpu). It takes the cpu as parameter and return true if the cpu is preempted. Then kernel can break the spin loops upon on the retval of vcpu_is_preempted. As kernel has used this interface, So lets support it. To deal with kernel and kvm/xen, add vcpu_is_preempted into struct pv_lock_ops. Then kvm or xen could provide their own implementation to support vcpu_is_preempted. Signed-off-by: Pan Xinhui--- arch/x86/include/asm/paravirt_types.h | 2 ++ arch/x86/include/asm/spinlock.h | 8 arch/x86/kernel/paravirt-spinlocks.c | 6 ++ 3 files changed, 16 insertions(+) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 0f400c0..38c3bb7 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -310,6 +310,8 @@ struct pv_lock_ops { void (*wait)(u8 *ptr, u8 val); void (*kick)(int cpu); + + bool (*vcpu_is_preempted)(int cpu); }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 921bea7..0526f59 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -26,6 +26,14 @@ extern struct static_key paravirt_ticketlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define vcpu_is_preempted vcpu_is_preempted +static inline bool vcpu_is_preempted(int cpu) +{ + return pv_lock_ops.vcpu_is_preempted(cpu); +} +#endif + #include /* diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 2c55a00..2f204dd 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -21,12 +21,18 @@ bool pv_is_native_spin_unlock(void) 
__raw_callee_save___native_queued_spin_unlock; } +static bool native_vcpu_is_preempted(int cpu) +{ + return 0; +} + struct pv_lock_ops pv_lock_ops = { #ifdef CONFIG_SMP .queued_spin_lock_slowpath = native_queued_spin_lock_slowpath, .queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock), .wait = paravirt_nop, .kick = paravirt_nop, + .vcpu_is_preempted = native_vcpu_is_preempted, #endif /* SMP */ }; EXPORT_SYMBOL(pv_lock_ops); -- 2.4.11
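The indirection added here can be pictured with a small userspace sketch: an ops table starts with a native callback that always answers "not preempted", and a hypervisor backend may replace it during init. All demo_ names are assumptions for illustration; only the overall shape follows the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down ops table with just the new callback. */
struct demo_pv_lock_ops {
	bool (*vcpu_is_preempted)(int cpu);
};

/* Bare-metal default: a CPU is never a preempted vCPU. */
static bool demo_native_vcpu_is_preempted(int cpu)
{
	(void)cpu;
	return false;
}

static struct demo_pv_lock_ops demo_pv_lock_ops = {
	.vcpu_is_preempted = demo_native_vcpu_is_preempted,
};

/* Made-up backend for the demo: pretends only vCPU 2 is preempted. */
static bool demo_guest_vcpu_is_preempted(int cpu)
{
	return cpu == 2;
}

/* What a KVM or Xen init function would do: install its own callback. */
static void demo_guest_init(void)
{
	demo_pv_lock_ops.vcpu_is_preempted = demo_guest_vcpu_is_preempted;
}

/* Generic entry point: always goes through the table. */
static inline bool demo_vcpu_is_preempted(int cpu)
{
	assert(demo_pv_lock_ops.vcpu_is_preempted != NULL);
	return demo_pv_lock_ops.vcpu_is_preempted(cpu);
}
```

Keeping a non-NULL native default means callers never need to test the pointer before calling through it.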
[PATCH v5 4/9] powerpc/spinlock: support vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock implementations do a spin loop before acquiring the lock itself. Currently the kernel has an interface, bool vcpu_is_preempted(int cpu). It takes the cpu as parameter and returns true if the cpu is preempted. The kernel can then break out of the spin loops based on the return value of vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

Only pSeries needs to support it. powerNV is built into the same kernel image as pSeries, so we need to return false if we are running as powerNV. However, lppaca->yield_count stays zero on powerNV, so we can just skip the machine type check.

Suggested-by: Boqun Feng
Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/spinlock.h | 8
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index abb6b0f..f4a9524 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
 #define SYNC_IO
 #endif
 
+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 #if defined(CONFIG_PPC_SPLPAR)
 /* We only yield to the hypervisor if we are in shared processor mode */
 #define SHARED_PROCESSOR (lppaca_shared_proc(local_paca->lppaca_ptr))
-- 
2.4.11
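The check in this patch boils down to "convert the big-endian counter and test its low bit". A portable sketch of just that arithmetic follows; the demo_ helpers are hypothetical and take the raw bytes directly instead of reading a real lppaca.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for be32_to_cpu(): assemble the value from bytes. */
static uint32_t demo_be32_to_cpu(const uint8_t be[4])
{
	assert(be != NULL);
	return ((uint32_t)be[0] << 24) | ((uint32_t)be[1] << 16) |
	       ((uint32_t)be[2] << 8)  |  (uint32_t)be[3];
}

/*
 * Mirrors the expression in the patch: an odd yield_count is treated as
 * "this vCPU is currently not running". A counter that stays at zero
 * (as the commit message says happens on powerNV) always yields false.
 */
static bool demo_vcpu_is_preempted(const uint8_t yield_count_be[4])
{
	return (demo_be32_to_cpu(yield_count_be) & 1) != 0;
}
```

Note that the low bit lives in the last byte of the big-endian field, which is why the byte-order conversion cannot be skipped on a little-endian kernel.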
[PATCH v5 3/9] kernel/locking: Drop the overload of {mutex, rwsem}_spin_on_owner
An over-committed guest with more vCPUs than pCPUs has a heavy overload in the two spin_on_owner. This blames on the lock holder preemption issue. Kernel has an interface bool vcpu_is_preempted(int cpu) to see if a vCPU is currently running or not. So break the spin loops on true condition. test-case: perf record -a perf bench sched messaging -g 400 -p && perf report before patch: 20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner 8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock 4.12% sched-messaging [kernel.vmlinux] [k] system_call 3.01% sched-messaging [kernel.vmlinux] [k] system_call_common 2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7 2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner 2.00% sched-messaging [kernel.vmlinux] [k] osq_lock after patch: 9.99% sched-messaging [kernel.vmlinux] [k] mutex_unlock 5.28% sched-messaging [unknown] [H] 0xc00768e0 4.27% sched-messaging [kernel.vmlinux] [k] __copy_tofrom_user_power7 3.77% sched-messaging [kernel.vmlinux] [k] copypage_power7 3.24% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq 3.02% sched-messaging [kernel.vmlinux] [k] system_call 2.69% sched-messaging [kernel.vmlinux] [k] wait_consider_task Signed-off-by: Pan XinhuiAcked-by: Christian Borntraeger Tested-by: Juergen Gross --- kernel/locking/mutex.c | 15 +-- kernel/locking/rwsem-xadd.c | 16 +--- 2 files changed, 26 insertions(+), 5 deletions(-) diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index a70b90d..82108f5 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -236,7 +236,13 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner) */ barrier(); - if (!owner->on_cpu || need_resched()) { + /* +* Use vcpu_is_preempted to detech lock holder preemption issue +* and break. 
vcpu_is_preempted is a macro defined by false if +* arch does not support vcpu preempted check, +*/ + if (!owner->on_cpu || need_resched() || + vcpu_is_preempted(task_cpu(owner))) { ret = false; break; } @@ -261,8 +267,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock) rcu_read_lock(); owner = READ_ONCE(lock->owner); + + /* +* As lock holder preemption issue, we both skip spinning if task is not +* on cpu or its cpu is preempted +*/ if (owner) - retval = owner->on_cpu; + retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner)); rcu_read_unlock(); /* * if lock->owner is not set, the mutex owner may have just acquired diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 2337b4b..0897179 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) goto done; } - ret = owner->on_cpu; + /* +* As lock holder preemption issue, we both skip spinning if task is not +* on cpu or its cpu is preempted +*/ + ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner)); done: rcu_read_unlock(); return ret; @@ -362,8 +366,14 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem) */ barrier(); - /* abort spinning when need_resched or owner is not running */ - if (!owner->on_cpu || need_resched()) { + /* +* abort spinning when need_resched or owner is not running or +* owner's cpu is preempted. vcpu_is_preempted is a macro +* defined by false if arch does not support vcpu preempted +* check +*/ + if (!owner->on_cpu || need_resched() || + vcpu_is_preempted(task_cpu(owner))) { rcu_read_unlock(); return false; } -- 2.4.11
[PATCH v5 2/9] locking/osq: Drop the overload of osq_lock()
An over-committed guest with more vCPUs than pCPUs has a heavy overload in osq_lock(). This is because vCPU A hold the osq lock and yield out, vCPU B wait per_cpu node->locked to be set. IOW, vCPU B wait vCPU A to run and unlock the osq lock. Kernel has an interface bool vcpu_is_preempted(int cpu) to see if a vCPU is currently running or not. So break the spin loops on true condition. test case: perf record -a perf bench sched messaging -g 400 -p && perf report before patch: 18.09% sched-messaging [kernel.vmlinux] [k] osq_lock 12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner 5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock 3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task 3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq 3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is 2.49% sched-messaging [kernel.vmlinux] [k] system_call after patch: 20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner 8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock 4.12% sched-messaging [kernel.vmlinux] [k] system_call 3.01% sched-messaging [kernel.vmlinux] [k] system_call_common 2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7 2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner 2.00% sched-messaging [kernel.vmlinux] [k] osq_lock Suggested-by: Boqun FengSigned-off-by: Pan Xinhui Acked-by: Christian Borntraeger Tested-by: Juergen Gross --- kernel/locking/osq_lock.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c index 05a3785..39d1385 100644 --- a/kernel/locking/osq_lock.c +++ b/kernel/locking/osq_lock.c @@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr) return cpu_nr + 1; } +static inline int node_cpu(struct optimistic_spin_node *node) +{ + return node->cpu - 1; +} + static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val) { int cpu_nr = encoded_cpu_val - 1; @@ -118,8 +123,11 @@ bool 
osq_lock(struct optimistic_spin_queue *lock) while (!READ_ONCE(node->locked)) { /* * If we need to reschedule bail... so we can block. +* Use vcpu_is_preempted to detech lock holder preemption issue +* and break. vcpu_is_preempted is a macro defined by false if +* arch does not support vcpu preempted check, */ - if (need_resched()) + if (need_resched() || vcpu_is_preempted(node_cpu(node->prev))) goto unqueue; cpu_relax_lowlatency(); -- 2.4.11
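The node_cpu() helper added by this patch simply undoes the +1 applied by encode_cpu(), so the result can be passed to vcpu_is_preempted(). A tiny sketch of that round trip, using a hypothetical trimmed-down node type:

```c
#include <assert.h>

/* Hypothetical node with only the field used here. */
struct demo_spin_node {
	int cpu;	/* encoded value: CPU number + 1 */
};

/* Same arithmetic as encode_cpu() in the diff context. */
static inline int encode_cpu(int cpu_nr)
{
	assert(cpu_nr >= 0);
	return cpu_nr + 1;
}

/* Same arithmetic as the node_cpu() helper the patch adds. */
static inline int node_cpu(const struct demo_spin_node *node)
{
	return node->cpu - 1;
}
```

In the patched loop the waiter checks node_cpu(node->prev), that is, the CPU of the node queued just ahead of it.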
[PATCH v5 1/9] kernel/sched: introduce vcpu preempted check interface
This patch helps fix the lock holder preemption issue.

Kernel users can call bool vcpu_is_preempted(int cpu) to detect whether a vCPU is preempted or not.

The default implementation is a macro defined as false, so the compiler can optimize it out if the arch does not support such a vCPU preempted check.

Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
Acked-by: Christian Borntraeger
Tested-by: Juergen Gross
---
 include/linux/sched.h | 12
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with a various lock holder preemption issues provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu)	false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
-- 
2.4.11
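As a rough illustration of how the #ifndef fallback composes with an arch override, and how a spin loop would consume the result, here is a minimal userspace sketch. The demo_ names and the fixed-size state array are assumptions made for the example; they are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for hypervisor-provided state (not kernel code). */
#define DEMO_NR_CPUS 4
static bool demo_preempted[DEMO_NR_CPUS];

/*
 * An "arch" override: define the function, then define a macro of the
 * same name so the generic #ifndef below can see that it exists.
 */
static inline bool vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return demo_preempted[cpu];
}
#define vcpu_is_preempted vcpu_is_preempted

/* Generic fallback from the patch; skipped here because of the override. */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu)	false
#endif

/*
 * Spin until *released is set, but give up early if the owner's vCPU is
 * preempted. Returns true only if the release was actually observed.
 */
static bool demo_spin_on_owner(const volatile bool *released, int owner_cpu,
			       int max_spins)
{
	for (int i = 0; i < max_spins; i++) {
		if (*released)
			return true;
		if (vcpu_is_preempted(owner_cpu))
			return false;
	}
	return false;
}
```

Removing the override block leaves the fallback macro in effect, in which case the preemption test is a constant false and the compiler is free to drop it.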
[PATCH v5 0/9] implement vcpu preempted check
change from v4:
	split x86 kvm vcpu preempted check into two patches.
	add documentation patch.
	add x86 vcpu preempted check patch under xen
	add s390 vcpu preempted check patch
change from v3:
	add x86 vcpu preempted check patch
change from v2:
	no code change, fix typos, update some comments
change from v1:
	a simpler definition of default vcpu_is_preempted
	skip machine type check on ppc, and add config. remove dedicated macro.
	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
	add more comments
	thanks to Boqun's and Peter's suggestions.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner. These spin_on_owner variants also caused RCU stalls before we applied this patch set.

We have also observed some performance improvements in unix benchmark tests.

PPC test result:
1 copy - 0.94%
2 copy - 7.17%
4 copy - 11.9%
8 copy - 3.04%
16 copy - 15.11%

details below:
Without patch:
1 copy - File Write 4096 bufsize 8000 maxblocks      2188223.0 KBps  (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks      1804433.0 KBps  (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks      1237257.0 KBps  (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks      1032658.0 KBps  (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks      768000.0 KBps  (30.1 s, 1 samples)

With patch:
1 copy - File Write 4096 bufsize 8000 maxblocks      2209189.0 KBps  (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks      1943816.0 KBps  (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks      1405591.0 KBps  (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks      1065080.0 KBps  (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks      904762.0 KBps  (30.0 s, 1 samples)

X86 test result:
test-case                              | after-patch      | before-patch
Execl Throughput                       |     18307.9 lps  |     11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |   1352407.3 KBps |    790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |    367555.6 KBps |    222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |   3675649.7 KBps |   1780614.4 KBps
Pipe Throughput                        |  11872208.7 lps  |  11855628.9 lps
Pipe-based Context Switching           |   1495126.5 lps  |   1490533.9 lps
Process Creation                       |     29881.2 lps  |     28572.8 lps
Shell Scripts (1 concurrent)           |     23224.3 lpm  |     22607.4 lpm
Shell Scripts (8 concurrent)           |      3531.4 lpm  |      3211.9 lpm
System Call Overhead                   |  10385653.0 lps  |  10419979.0 lps

Christian Borntraeger (1):
  s390/spinlock: Provide vcpu_is_preempted

Juergen Gross (1):
  x86, xen: support vcpu preempted check

Pan Xinhui (7):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, paravirt: Add interface to support kvm/xen vcpu preempted check
  x86, kvm: support vcpu preempted check
  Documentation: virtual: kvm: Support vcpu preempted check

 Documentation/virtual/kvm/msr.txt | 8 +++-
 arch/powerpc/include/asm/spinlock.h | 8
 arch/s390/include/asm/spinlock.h | 8
 arch/s390/kernel/smp.c | 9 +++--
 arch/s390/lib/spinlock.c | 25 -
 arch/x86/include/asm/paravirt_types.h | 2 ++
 arch/x86/include/asm/spinlock.h | 8
 arch/x86/include/uapi/asm/kvm_para.h | 3 ++-
 arch/x86/kernel/kvm.c | 12
 arch/x86/kernel/paravirt-spinlocks.c | 6 ++
 arch/x86/kvm/x86.c | 18 ++
 arch/x86/xen/spinlock.c | 3 ++-
 include/linux/sched.h | 12
 kernel/locking/mutex.c | 15 +--
 kernel/locking/osq_lock.c | 10 +-
 kernel/locking/rwsem-xadd.c | 16 +---
 16 files changed, 135 insertions(+), 28 deletions(-)

-- 
2.4.11
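For readers who want to relate the PPC summary percentages to the raw KBps figures: the numbers appear to line up with (with - without) / with, truncated rather than rounded, and not with a ratio taken against the unpatched value. That reading is inferred from the arithmetic only; the cover letter does not state the formula. A small sketch of the calculation, with a hypothetical helper name:

```c
#include <assert.h>

/*
 * Gain expressed as a percentage of the patched throughput. This formula
 * is an inference from the posted numbers, not something the cover
 * letter spells out.
 */
static double demo_gain_percent(double with_patch, double without_patch)
{
	assert(with_patch > 0.0);
	return (with_patch - without_patch) / with_patch * 100.0;
}
```

For example, the 2 copy row gives (1943816.0 - 1804433.0) / 1943816.0, which is roughly 7.17%, while dividing by the unpatched value instead would give roughly 7.72%.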
Re: [PATCH v4 3/5] powerpc/mm: allow memory hotplug into a memoryless node
On Thu, Oct 20, 2016 at 02:30:42PM +1100, Balbir Singh wrote:
> FYI, these checks were temporary to begin with
>
> I found this in git history
>
> b226e462124522f2f23153daff31c311729dfa2f (powerpc: don't add memory to empty node/zone)

Nice find! I spent some time digging, but this had eluded me.

-- 
Reza Arbab
Re: [PATCH] kernel: irq: fix build failure
On Mon, 10 Oct 2016, Sudip Mukherjee wrote:
> On Thursday 06 October 2016 11:06 PM, Sudip Mukherjee wrote:
> > The allmodconfig build of powerpc is failing with the error:
> > ERROR: ".irq_set_parent" [drivers/mfd/tps65217.ko] undefined!
> >
> > export the symbol to fix the failure.
>
> Hi Thomas,
> powerpc and arm allmodconfig builds still fails with the same error.
> Build logs of next-20161010 are at:
> arm at https://travis-ci.org/sudipm-mukherjee/parport/jobs/166321467
> powerpc at https://travis-ci.org/sudipm-mukherjee/parport/jobs/166321473

I know. This is under discussion with the driver folks as we are not going
to blindly export stuff just because someone slapped an irq_set_parent()
into the code w/o knowing why.

Thanks,

	tglx
[RFC][PATCH] powerpc/pseries: implement NMI IPIs with H_SIGNAL_SYS_RESET hcall
Add a .cause_nmi_ipi() to smp_ops to support it. Although it's possible to raise a system reset exception on this CPU, which may (or may not) be useful to bring ourselves into a known state. So perhaps it's better make this a platform operation? Thanks, Nick --- arch/powerpc/include/asm/hvcall.h | 8 +++- arch/powerpc/include/asm/plpar_wrappers.h | 5 + arch/powerpc/include/asm/smp.h| 4 arch/powerpc/kernel/smp.c | 3 +++ arch/powerpc/platforms/powermac/smp.c | 1 + arch/powerpc/platforms/powernv/smp.c | 1 + arch/powerpc/platforms/pseries/smp.c | 8 7 files changed, 29 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index 708edeb..a38171c 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -275,7 +275,8 @@ #define H_COP 0x304 #define H_GET_MPP_X0x314 #define H_SET_MODE 0x31C -#define MAX_HCALL_OPCODE H_SET_MODE +#define H_SIGNAL_SYS_RESET 0x380 +#define MAX_HCALL_OPCODE H_SIGNAL_SYS_RESET /* H_VIOCTL functions */ #define H_GET_VIOA_DUMP_SIZE 0x01 @@ -306,6 +307,11 @@ #define H_SET_MODE_RESOURCE_ADDR_TRANS_MODE3 #define H_SET_MODE_RESOURCE_LE 4 +/* Values for argument to H_SIGNAL_SYS_RESET */ +#define H_SIGNAL_SYS_RESET_ALL -1 +#define H_SIGNAL_SYS_RESET_ALLBUTSELF -2 +#define H_SIGNAL_SYS_RESET_CPU(x) (x) + #ifndef __ASSEMBLY__ /** diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h index 1b39424..7fe5983 100644 --- a/arch/powerpc/include/asm/plpar_wrappers.h +++ b/arch/powerpc/include/asm/plpar_wrappers.h @@ -340,4 +340,9 @@ static inline long plapr_set_watchpoint0(unsigned long dawr0, unsigned long dawr return plpar_set_mode(0, H_SET_MODE_RESOURCE_SET_DAWR, dawr0, dawrx0); } +static inline long plapr_signal_system_reset(long cpu) +{ + return plpar_hcall_norets(H_SIGNAL_SYS_RESET, cpu); +} + #endif /* _ASM_POWERPC_PLPAR_WRAPPERS_H */ diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 
0d02c11..15eb615 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -37,11 +37,15 @@ extern int cpu_to_chip_id(int cpu); #ifdef CONFIG_SMP +#define SMP_OP_NMI_ALL -1 +#define SMP_OP_NMI_ALLBUTSELF -2 + struct smp_ops_t { void (*message_pass)(int cpu, int msg); #ifdef CONFIG_PPC_SMP_MUXED_IPI void (*cause_ipi)(int cpu, unsigned long data); #endif + int (*cause_nmi_ipi)(int cpu); void (*probe)(void); int (*kick_cpu)(int nr); void (*setup_cpu)(int nr); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 9c6f3fd..4a1161e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -334,6 +334,9 @@ void smp_send_debugger_break(void) if (unlikely(!smp_ops)) return; + if (smp_ops->cause_nmi_ipi && smp_ops->cause_nmi_ipi(SMP_OP_NMI_ALLBUTSELF)) + return; + for_each_online_cpu(cpu) if (cpu != me) do_message_pass(cpu, PPC_MSG_DEBUGGER_BREAK); diff --git a/arch/powerpc/platforms/powermac/smp.c b/arch/powerpc/platforms/powermac/smp.c index c9eb7d6..1d76e15 100644 --- a/arch/powerpc/platforms/powermac/smp.c +++ b/arch/powerpc/platforms/powermac/smp.c @@ -446,6 +446,7 @@ void __init smp_psurge_give_timebase(void) struct smp_ops_t psurge_smp_ops = { .message_pass = NULL, /* Use smp_muxed_ipi_message_pass */ .cause_ipi = smp_psurge_cause_ipi, + .cause_nmi_ipi = NULL, .probe = smp_psurge_probe, .kick_cpu = smp_psurge_kick_cpu, .setup_cpu = smp_psurge_setup_cpu, diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c index c789258..092ec1f 100644 --- a/arch/powerpc/platforms/powernv/smp.c +++ b/arch/powerpc/platforms/powernv/smp.c @@ -244,6 +244,7 @@ static int pnv_cpu_bootable(unsigned int nr) static struct smp_ops_t pnv_smp_ops = { .message_pass = smp_muxed_ipi_message_pass, .cause_ipi = NULL, /* Filled at runtime by xics_smp_probe() */ + .cause_nmi_ipi = NULL, .probe = xics_smp_probe, .kick_cpu = pnv_smp_kick_cpu, .setup_cpu = pnv_smp_setup_cpu, diff --git 
a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c index f6f83ae..187b981 100644 --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -196,6 +196,13 @@ static void pSeries_cause_ipi_mux(int cpu, unsigned long data) xics_cause_ipi(cpu, data); } +static int smp_pSeries_cause_nmi_ipi(int cpu) +{ + if (plapr_signal_system_reset(cpu) == H_SUCCESS) + return 1; + return 0;
Re: [RFC] ppc64le: Enable emulation support for simple Load/Store instructions
On Thu, Oct 20, 2016 at 12:01:55PM +0530, Ravi Bangoria wrote:
> emulate_step() uses a number of underlying kernel functions that were
> initially not enabled for LE. This has been rectified since. So, fix
> emulate_step() for LE for the corresponding instructions.
>
> Reported-by: Anton Blanchard
> Signed-off-by: Ravi Bangoria
> ---
> Note: This patch only enables LOAD, STORE, LARX and STCX instructions.
> I'll send a subsequent patch for other types like LOAD_FP,
> LOAD_VMX etc.
>
>  arch/powerpc/lib/sstep.c | 8
>  1 file changed, 8 deletions(-)

This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read Documentation/stable_kernel_rules.txt
for how to do this properly.
Re: [PATCH kernel v3 3/4] vfio/spapr: Cache mm in tce_container
On Thu, 20 Oct 2016 14:03:49 +1100
Alexey Kardashevskiy wrote:

> In some situations the userspace memory context may live longer than
> the userspace process itself so if we need to do proper memory context
> cleanup, we better cache @mm and use it later when the process is gone
> (@current or @current->mm is NULL).
>
> This references mm and stores the pointer in the container; this is done
> when a container is just created so checking for !current->mm in other
> places becomes pointless.
>
> This replaces current->mm with container->mm everywhere except debug
> prints.
>
> This adds a check that current->mm is the same as the one stored in
> the container to prevent userspace from registering memory in other
> processes.
>
> Signed-off-by: Alexey Kardashevskiy
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 127
>  1 file changed, 71 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index d0c38b2..6b0b121 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -31,49 +31,46 @@

Does it make sense to move the rest of these hunks into patch 2?
I think they're similarly just moving the mm reference into callers.

>  static void tce_iommu_detach_group(void *iommu_data,
>  		struct iommu_group *iommu_group);
>
> -static long try_increment_locked_vm(long npages)
> +static long try_increment_locked_vm(struct mm_struct *mm, long npages)
>  {
>  	long ret = 0, locked, lock_limit;
>
> -	if (!current || !current->mm)
> -		return -ESRCH; /* process exited */
> -
>  	if (!npages)
>  		return 0;
>
> -	down_write(&current->mm->mmap_sem);
> -	locked = current->mm->locked_vm + npages;
> +	down_write(&mm->mmap_sem);
> +	locked = mm->locked_vm + npages;
>  	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	if (locked > lock_limit && !capable(CAP_IPC_LOCK))
>  		ret = -ENOMEM;
>  	else
> -		current->mm->locked_vm += npages;
> +		mm->locked_vm += npages;
>
>  	pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
>  			npages << PAGE_SHIFT,
> -			current->mm->locked_vm << PAGE_SHIFT,
> +			mm->locked_vm << PAGE_SHIFT,
>  			rlimit(RLIMIT_MEMLOCK),
>  			ret ? " - exceeded" : "");
>
> -	up_write(&current->mm->mmap_sem);
> +	up_write(&mm->mmap_sem);
>
>  	return ret;
>  }
>
> -static void decrement_locked_vm(long npages)
> +static void decrement_locked_vm(struct mm_struct *mm, long npages)
>  {
> -	if (!current || !current->mm || !npages)
> +	if (!mm || !npages)
>  		return; /* process exited */

I know you're trying to be defensive and change as little logic as possible,
but some cases should be an error, and I think some of the "process exited"
comments were wrong anyway.

Maybe pull the !mm test into the caller and make it WARN_ON?

> @@ -317,6 +311,9 @@ static void *tce_iommu_open(unsigned long arg)
>  		return ERR_PTR(-EINVAL);
>  	}
>
> +	if (!current->mm)
> +		return ERR_PTR(-ESRCH); /* process exited */

A userspace thread in the kernel can't have its mm disappear, unless you
are actually in the exit code. !current->mm is more like a test for a kernel
thread.

> +
>  	container = kzalloc(sizeof(*container), GFP_KERNEL);
>  	if (!container)
>  		return ERR_PTR(-ENOMEM);
> @@ -326,13 +323,17 @@ static void *tce_iommu_open(unsigned long arg)
>
>  	container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
>
> +	container->mm = current->mm;
> +	atomic_inc(&container->mm->mm_count);
> +
>  	return container;

It's a nitpick if you respin the patch, but I guess it would better be
described as a reference than a cache of the object. "have tce_container
take a reference to mm_struct".

> @@ -515,13 +526,16 @@ static long tce_iommu_build_v2(struct tce_container
> *container,
>  	unsigned long hpa;
>  	enum dma_data_direction dirtmp;
>
> +	if (container->mm != current->mm)
> +		return -ESRCH;

Good, is this condition now enforced on all entrypoints that use
container->mm (except the final teardown)? (The mlock/rlimit stuff,
as we talked about before, doesn't make sense if not).

Thanks,
Nick
Re: [PATCH] cpufreq: powernv: Use node ID in init_chip_info
Hi,

On 10/20/2016 09:29 AM, Viresh Kumar wrote:
> + few IBM guys who have been working on this.
>
> On 19-10-16, 15:02, Emily Shaffer wrote:
>> Fixed assumption that node_id==chip_id in powernv-cpufreq.c:init_chip_info;
>> explicitly use node ID where necessary.

Thanks for the bug fix. I agree that node IDs should not be assumed to
always be equal to chip IDs. But I think it would be better to get rid of
cpumask_of_node(), as it has problems when the powernv-cpufreq driver is
initialized with offline CPUs, as reported in the post below:

https://patchwork.kernel.org/patch/8887591/
(^^ This should also solve the node_id=chip_id problem)

Since throttle stats are common to all CPUs in the chip, we are better off
not using cpumask_of_node() and instead defining something like
cpumask_of_chip(), where the driver doesn't have to compute the chip
cpumask.

Thanks and Regards,
Shilpa

>>
>> Tested: All CPUs report in /sys/devices/system/cpu*/cpufreq/throttle_stats
>>
>> Effort: platforms/arch-powerpc
>> Google-Bug-Id: 26979978
>
> Is this relevant upstream?
>
>>
>> Signed-off-by: Emily Shaffer
>> Change-Id: I22eb626b32fbb8053b3bbb9c75e677c700d0c2fb
>
> Gerrit id isn't required for upstream..
>
>> ---
>>  drivers/cpufreq/powernv-cpufreq.c | 27 +++++++++++++++++++++------
>>  1 file changed, 21 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>> index d3ffde8..3750b58 100644
>> --- a/drivers/cpufreq/powernv-cpufreq.c
>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>> @@ -911,32 +911,47 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
>>
>>  static int init_chip_info(void)
>>  {
>> -	unsigned int chip[256];
>> +	int rc = 0;
>>  	unsigned int cpu, i;
>>  	unsigned int prev_chip_id = UINT_MAX;
>> +	unsigned int *chip, *node;
>> +
>> +	chip = kcalloc(num_possible_cpus(), sizeof(unsigned int), GFP_KERNEL);
>> +	node = kcalloc(num_possible_cpus(), sizeof(unsigned int), GFP_KERNEL);
>> +	if (!chip || !node) {
>> +		rc = -ENOMEM;
>> +		goto out;
>> +	}
>>
>>  	for_each_possible_cpu(cpu) {
>>  		unsigned int id = cpu_to_chip_id(cpu);
>>
>>  		if (prev_chip_id != id) {
>>  			prev_chip_id = id;
>> -			chip[nr_chips++] = id;
>> +			node[nr_chips] = cpu_to_node(cpu);
>> +			chip[nr_chips] = id;
>> +			nr_chips++;
>>  		}
>>  	}
>>
>>  	chips = kcalloc(nr_chips, sizeof(struct chip), GFP_KERNEL);
>> -	if (!chips)
>> -		return -ENOMEM;
>> +	if (!chips) {
>> +		rc = -ENOMEM;
>> +		goto out;
>> +	}
>>
>>  	for (i = 0; i < nr_chips; i++) {
>>  		chips[i].id = chip[i];
>> -		cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
>> +		cpumask_copy(&chips[i].mask, cpumask_of_node(node[i]));
>>  		INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
>>  		for_each_cpu(cpu, &chips[i].mask)
>>  			per_cpu(chip_info, cpu) = &chips[i];
>>  	}
>>
>> -	return 0;
>> +out:
>> +	kfree(node);
>> +	kfree(chip);
>> +	return rc;
>>  }
>>
>>  static inline void clean_chip_info(void)
>> --
>> 2.8.0.rc3.226.g39d4020
>
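The reply suggests building per-chip CPU masks directly instead of going through node IDs. The idea can be illustrated with a small userspace model; this is only a sketch under stated assumptions, not the proposed `cpumask_of_chip()` and not kernel code. `struct chip_mask`, `build_chip_masks()`, `MAX_CHIPS` and the `cpu_to_chip[]` table (standing in for `cpu_to_chip_id()`) are hypothetical, and a 64-bit word stands in for a cpumask.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CHIPS 16

/* Hypothetical per-chip record: chip id plus a bitmask of member CPUs. */
struct chip_mask {
	unsigned int id;
	uint64_t cpus;	/* bit n set means CPU n belongs to this chip */
};

/*
 * Group CPUs by chip id without consulting node ids. cpu_to_chip[] is a
 * hypothetical table standing in for cpu_to_chip_id().
 * Returns the number of distinct chips, or -1 if ncpus exceeds 64 or more
 * than MAX_CHIPS chips are seen.
 */
static int build_chip_masks(const unsigned int *cpu_to_chip, size_t ncpus,
			    struct chip_mask *chips)
{
	int nr_chips = 0;

	if (ncpus > 64)
		return -1;

	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		int i;

		/* Search for an existing entry; CPU ordering is not assumed. */
		for (i = 0; i < nr_chips; i++)
			if (chips[i].id == cpu_to_chip[cpu])
				break;
		if (i == nr_chips) {
			if (nr_chips == MAX_CHIPS)
				return -1;
			chips[i].id = cpu_to_chip[cpu];
			chips[i].cpus = 0;
			nr_chips++;
		}
		chips[i].cpus |= UINT64_C(1) << cpu;
	}
	return nr_chips;
}
```

Because the loop walks every possible CPU and keys only on the chip id, the resulting masks do not depend on node numbering or on which CPUs happen to be online, which are the two concerns raised above.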
[RFC] ppc64le: Enable emulation support for simple Load/Store instructions
emulate_step() uses a number of underlying kernel functions that were
initially not enabled for LE. This has been rectified since. So, fix
emulate_step() for LE for the corresponding instructions.

Reported-by: Anton Blanchard
Signed-off-by: Ravi Bangoria
---
Note: This patch only enables LOAD, STORE, LARX and STCX instructions.
      I'll send a subsequent patch for other types like LOAD_FP,
      LOAD_VMX etc.

 arch/powerpc/lib/sstep.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 3362299..82323ef 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -1807,8 +1807,6 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 		goto instr_done;
 
 	case LARX:
-		if (regs->msr & MSR_LE)
-			return 0;
 		if (op.ea & (size - 1))
 			break;		/* can't handle misaligned */
 		err = -EFAULT;
@@ -1832,8 +1830,6 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 		goto ldst_done;
 
 	case STCX:
-		if (regs->msr & MSR_LE)
-			return 0;
 		if (op.ea & (size - 1))
 			break;		/* can't handle misaligned */
 		err = -EFAULT;
@@ -1859,8 +1855,6 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 		goto ldst_done;
 
 	case LOAD:
-		if (regs->msr & MSR_LE)
-			return 0;
 		err = read_mem(&regs->gpr[op.reg], op.ea, size, regs);
 		if (!err) {
 			if (op.type & SIGNEXT)
@@ -1913,8 +1907,6 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 		goto instr_done;
 
 	case STORE:
-		if (regs->msr & MSR_LE)
-			return 0;
 		if ((op.type & UPDATE) && size == sizeof(long) &&
 		    op.reg == 1 && op.update_reg == 1 &&
 		    !(regs->msr & MSR_PR) &&
-- 
1.8.3.1
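Two ideas in this patch can be shown in isolation: the power-of-two alignment test that the LARX and STCX cases apply to the effective address, and the fact that the same bytes in memory yield different values depending on byte order. The following is a minimal userspace sketch, not the kernel's `read_mem()`; `is_misaligned()` and `load_bytes()` are hypothetical helpers introduced only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when ea is not a multiple of size; size must be a power of two. */
static bool is_misaligned(uint64_t ea, size_t size)
{
	return (ea & (size - 1)) != 0;
}

/*
 * Hypothetical helper: assemble a size-byte value (1, 2, 4 or 8 bytes) from
 * buf, reading the bytes as little-endian when le is true and as big-endian
 * otherwise.
 */
static uint64_t load_bytes(const uint8_t *buf, size_t size, bool le)
{
	uint64_t val = 0;

	for (size_t i = 0; i < size; i++) {
		/* Always consume the most significant byte first. */
		size_t idx = le ? size - 1 - i : i;

		val = (val << 8) | buf[idx];
	}
	return val;
}
```

For example, the bytes `11 22 33 44` read as a 4-byte value give `0x11223344` in big-endian order and `0x44332211` in little-endian order, so an emulated load has to use the same byte order as the mode the instruction would have executed in.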