Re: [PATCH] powerpc/xive: Fix/improve verbose debug output
On Fri, 2017-04-28 at 16:34 +1000, Michael Ellerman wrote:
> > > If there's non-verbose debug that we think would be useful to
> > > differentiate from verbose then those could be pr_debug() - which means
> > > they'll be jump labelled off in most production kernels, but still able
> > > to be enabled.
> >
> > Maybe... I don't like the giant "debug" switch across the whole
> > kernel, though.
>
> Not sure what you mean. You can enable pr_debug()s individually, by
> function, by module, by file, or for the whole kernel.
>
> To enable everything in xive you'd do:
>
>   # echo 'file *xive* +p' > /sys/kernel/debug/dynamic_debug/control
>
> Or boot with: loglevel=8 dyndbg="file *xive* +p"

Ah, that's new goodness I wasn't aware of. Anyway, I can spin that later, not planning on doing any work today ;-)

Cheers,
Ben.
Re: [PATCH v3] cxl: mask slice error interrupts after first occurrence
Hi Alastair,

Thanks for addressing the previous review comments. A few additional and very minor comments below.

Alastair D'Silva writes:
> From: Alastair D'Silva
>
> In some situations, a faulty AFU slice may create an interrupt storm,

Suggest: 'interrupt storm of slice-errors,'

> rendering the machine unusable. Since these interrupts are informational
> only, present the interrupt once, then mask it off to prevent it from
> being retriggered until the card is reset.

s|card|card/afu

> @@ -1226,7 +1237,11 @@ static irqreturn_t native_slice_irq_err(int irq, void *data)
> 	dev_crit(&afu->dev, "AFU_ERR_An: 0x%.16llx\n", afu_error);
> 	dev_crit(&afu->dev, "PSL_DSISR_An: 0x%.16llx\n", dsisr);
>
> +	/* mask off the IRQ so it won't retrigger until the card is reset */
> +	irq_mask = (serr & CXL_PSL_SERR_An_IRQS) >> 32;
> +	serr |= irq_mask;
> 	cxl_p1n_write(afu, CXL_PSL_SERR_An, serr);
> +	dev_info(&afu->dev, "Further interrupts will be masked until the
> AFU is reset\n");

Optional: just to be explicit, since you are only masking a subset of the possible slice errors, I would suggest rephrasing the message as: "Further such interrupts will be masked..."

To be consistent with the patch description: s|AFU|AFU/Card

--
Vaibhav Jain
Linux Technology Center, IBM India Pvt. Ltd.
Re: [PATCH] powerpc/xive: Fix/improve verbose debug output
Benjamin Herrenschmidt writes:
> On Fri, 2017-04-28 at 13:07 +1000, Michael Ellerman wrote:
>> Benjamin Herrenschmidt writes:
>>
>>> The existing verbose debug code doesn't build when enabled.
>>
>> So why don't we convert all the DBG_VERBOSE() to pr_devel()?
>
> pr_devel provides a bunch of debug at init/setup/mask/unmask etc... but
> the system is still usable

OK, so those could be converted to pr_debug().

> DBG_VERBOSE starts spewing stuff on every interrupt and eoi, the system
> is no longer usable.

And those could stay at pr_devel(), requiring a #define DEBUG and recompile to enable.

>> If there's non-verbose debug that we think would be useful to
>> differentiate from verbose then those could be pr_debug() - which means
>> they'll be jump labelled off in most production kernels, but still able
>> to be enabled.
>
> Maybe... I don't like the giant "debug" switch across the whole
> kernel, though.

Not sure what you mean. You can enable pr_debug()s individually, by
function, by module, by file, or for the whole kernel.

To enable everything in xive you'd do:

  # echo 'file *xive* +p' > /sys/kernel/debug/dynamic_debug/control

Or boot with: loglevel=8 dyndbg="file *xive* +p"

cheers
Re: [PATCH kernel v2] powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
On Tue, 11 Apr 2017 18:28:42 +1000 Alexey Kardashevskiy wrote:
> On 27/03/17 19:27, Alexey Kardashevskiy wrote:
>> pnv_pci_table_alloc() ignores possible failure from kzalloc_node();
>> this adds a check. There are 2 callers of pnv_pci_table_alloc():
>> one already checks for tbl!=NULL; this adds WARN_ON() to the other path,
>> which only happens during boot time in IODA1 and is not expected to fail.
>>
>> Signed-off-by: Alexey Kardashevskiy
>> ---
>> Changes:
>> v2:
>> * s/BUG_ON/WARN_ON/
>
> Bad/good? Ping?
>
>> ---
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 3 +++
>>  arch/powerpc/platforms/powernv/pci.c      | 3 +++
>>  2 files changed, 6 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index e36738291c32..04ef03a5201b 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -2128,6 +2128,9 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
>>
>>  found:
>>  	tbl = pnv_pci_table_alloc(phb->hose->node);
>> +	if (WARN_ON(!tbl))
>> +		return;
>> +
>>  	iommu_register_group(&pe->table_group, phb->hose->global_number,
>>  			pe->pe_number);
>>  	pnv_pci_link_table_and_group(phb->hose->node, 0, tbl, &pe->table_group);
>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>> index eb835e977e33..9acdf6889c0d 100644
>> --- a/arch/powerpc/platforms/powernv/pci.c
>> +++ b/arch/powerpc/platforms/powernv/pci.c
>> @@ -766,6 +766,9 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
>>  	struct iommu_table *tbl;
>>
>>  	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
>> +	if (!tbl)
>> +		return NULL;
>> +
>>  	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
>>
>>  	return tbl;

--
Alexey
Re: [PATCH kernel] powerpc/powernv: Fix iommu table size calculation hook for small tables
On Thu, 13 Apr 2017 17:05:27 +1000 Alexey Kardashevskiy wrote:
> When the userspace requests a small TCE table (which takes less than
> the system page size) and more than 1 TCE level, the existing code
> returns a single page size, which is a bug, as each additional TCE level
> requires at least one page and this is what
> pnv_pci_ioda2_table_alloc_pages() does. And we end up seeing
>   WARN_ON(!ret && ((*ptbl)->it_allocated_size != table_size))
> in drivers/vfio/vfio_iommu_spapr_tce.c.
>
> This replaces the incorrect _ALIGN_UP() (which aligns zero up to zero) with
> max_t() to fix the bug.
>
> Besides removing the WARN_ON(), there should be no other changes in
> behaviour.

Ping?

> Signed-off-by: Alexey Kardashevskiy
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 6d0da5dfc955..a0d046adcf45 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2538,7 +2538,8 @@ static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
>
>  		tce_table_size /= direct_table_size;
>  		tce_table_size <<= 3;
> -		tce_table_size = _ALIGN_UP(tce_table_size, direct_table_size);
> +		tce_table_size = max_t(unsigned long,
> +				tce_table_size, direct_table_size);
>  	}
>
>  	return bytes;

--
Alexey
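The fix hinges on `_ALIGN_UP(0, x)` being 0: when the per-level size computed for a small table comes out as zero, rounding up to the level allocation size still yields zero, whereas `max_t()` enforces the one-allocation-per-level floor the allocator actually needs. A minimal user-space sketch of the two behaviours (macro definitions mirror the kernel's `_ALIGN_UP`/`max_t` but are written here for illustration, outside the kernel tree):

```c
#include <assert.h>

/* Kernel-style round-up to a power-of-two boundary */
#define ALIGN_UP(x, a)  (((x) + ((a) - 1)) & ~((unsigned long)(a) - 1))
#define MAX_T(a, b)     ((a) > (b) ? (a) : (b))

/* Per-level size before the fix: aligning 0 up gives 0 */
static unsigned long level_size_buggy(unsigned long tce_table_size,
                                      unsigned long direct_table_size)
{
        return ALIGN_UP(tce_table_size, direct_table_size);
}

/* After the fix: every level costs at least one direct table */
static unsigned long level_size_fixed(unsigned long tce_table_size,
                                      unsigned long direct_table_size)
{
        return MAX_T(tce_table_size, direct_table_size);
}
```

For any non-zero input the two agree, which is why the patch changes nothing besides removing the WARN_ON().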
[PATCH v2] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()
Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every userspace instruction miss") has shown that limiting the read of the faulting instruction to likely cases improves performance.

This patch goes further in this direction by limiting the read of the faulting instruction to the only cases where it is definitely needed.

On an MPC885, with the same benchmark app as in the commit referred to above, we see a reduction of 4000 dTLB misses (approx 3%):

Before the patch:

 Performance counter stats for './fault 500' (10 runs):

         720495838      cpu-cycles         ( +- 0.04% )
            141769      dTLB-load-misses   ( +- 0.02% )
             52722      iTLB-load-misses   ( +- 0.01% )
             19611      faults             ( +- 0.02% )

       5.750535176 seconds time elapsed    ( +- 0.16% )

With the patch:

 Performance counter stats for './fault 500' (10 runs):

         717669123      cpu-cycles         ( +- 0.02% )
            137344      dTLB-load-misses   ( +- 0.03% )
             52731      iTLB-load-misses   ( +- 0.01% )
             19614      faults             ( +- 0.03% )

       5.728423115 seconds time elapsed    ( +- 0.14% )

Signed-off-by: Christophe Leroy
---
v2: Replaced 'if (cond1) if (cond2)' with 'if (cond1 && cond2)'

In case the instruction we read has value 0, store_updates_sp() will return false, so it will bail out.
This patch applies after the series "powerpc/mm: some cleanup of do_page_fault()"

 arch/powerpc/mm/fault.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 400f2d0d42f8..2ec82a279d28 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -280,14 +280,6 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,

 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

-	/*
-	 * We want to do this outside mmap_sem, because reading code around nip
-	 * can result in fault, which will cause a deadlock when called with
-	 * mmap_sem held
-	 */
-	if (is_write && is_user)
-		__get_user(inst, (unsigned int __user *)regs->nip);
-
 	if (is_user)
 		flags |= FAULT_FLAG_USER;

@@ -356,8 +348,18 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 		 * between the last mapped region and the stack will
 		 * expand the stack rather than segfaulting.
 		 */
-		if (address + 2048 < uregs->gpr[1] && !store_updates_sp(inst))
-			goto bad_area;
+		if (address + 2048 < uregs->gpr[1] && !inst) {
+			/*
+			 * We want to do this outside mmap_sem, because reading
+			 * code around nip can result in fault, which will cause
+			 * a deadlock when called with mmap_sem held
+			 */
+			up_read(&mm->mmap_sem);
+			__get_user(inst, (unsigned int __user *)regs->nip);
+			if (!store_updates_sp(inst))
+				goto bad_area_nosemaphore;
+			goto retry;
+		}
 	}
 	if (expand_stack(vma, address))
 		goto bad_area;
--
2.12.0
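The bail-out for a zero instruction word works because store_updates_sp() only returns true for store-with-update forms, which are recognised by their primary opcode. A simplified user-space sketch of that check follows; the in-kernel store_updates_sp() also decodes the opcode-31 X-forms (stwux, stdux, etc.), which this sketch deliberately leaves out:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified check for PowerPC "store with update" instructions that
 * can grow the stack, keyed on the primary opcode (top 6 bits).
 */
static bool store_updates_sp_sketch(uint32_t inst)
{
        switch (inst >> 26) {   /* primary opcode */
        case 37:                /* stwu  */
        case 39:                /* stbu  */
        case 45:                /* sthu  */
        case 53:                /* stfsu */
        case 55:                /* stfdu */
                return true;
        case 62:                /* DS-form std family: update if XO bits == 01 */
                return (inst & 3) == 1;
        }
        return false;
}
```

An all-zero word read via __get_user() hits no case and returns false, which is why the v2 change can treat inst == 0 as "instruction not read yet" without a separate flag.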
[PATCH V2] hwmon: (ibmpowernv) Add min/max attributes and current sensors
Add support for adding min/max values for the inband sensors copied by OCC to main memory. And also add current(mA) sensors to the list. Signed-off-by: Shilpasri G Bhat --- Changes from V1: - Add functions to get min and max attribute strings - Add function 'populate_sensor' to fill in the 'struct sensor_data' for each sensor. drivers/hwmon/ibmpowernv.c | 96 +- 1 file changed, 77 insertions(+), 19 deletions(-) diff --git a/drivers/hwmon/ibmpowernv.c b/drivers/hwmon/ibmpowernv.c index 6d2e660..d59262c 100644 --- a/drivers/hwmon/ibmpowernv.c +++ b/drivers/hwmon/ibmpowernv.c @@ -50,6 +50,7 @@ enum sensors { TEMP, POWER_SUPPLY, POWER_INPUT, + CURRENT, MAX_SENSOR_TYPE, }; @@ -65,7 +66,8 @@ enum sensors { {"fan", "ibm,opal-sensor-cooling-fan"}, {"temp", "ibm,opal-sensor-amb-temp"}, {"in", "ibm,opal-sensor-power-supply"}, - {"power", "ibm,opal-sensor-power"} + {"power", "ibm,opal-sensor-power"}, + {"curr"}, /* Follows newer device tree compatible ibm,opal-sensor */ }; struct sensor_data { @@ -287,6 +289,7 @@ static int populate_attr_groups(struct platform_device *pdev) opal = of_find_node_by_path("/ibm,opal/sensors"); for_each_child_of_node(opal, np) { const char *label; + int len; if (np->name == NULL) continue; @@ -298,10 +301,14 @@ static int populate_attr_groups(struct platform_device *pdev) sensor_groups[type].attr_count++; /* -* add a new attribute for labels +* add attributes for labels, min and max */ if (!of_property_read_string(np, "label", &label)) sensor_groups[type].attr_count++; + if (of_find_property(np, "sensor-data-min", &len)) + sensor_groups[type].attr_count++; + if (of_find_property(np, "sensor-data-max", &len)) + sensor_groups[type].attr_count++; } of_node_put(opal); @@ -337,6 +344,49 @@ static void create_hwmon_attr(struct sensor_data *sdata, const char *attr_name, sdata->dev_attr.show = show; } +static void populate_sensor(struct sensor_data *sdata, int od, int hd, int sid, + const char *attr_name, enum sensors type, + const struct attribute_group 
*pgroup, + ssize_t (*show)(struct device *dev, + struct device_attribute *attr, + char *buf)) +{ + sdata->id = sid; + sdata->type = type; + sdata->opal_index = od; + sdata->hwmon_index = hd; + create_hwmon_attr(sdata, attr_name, show); + pgroup->attrs[sensor_groups[type].attr_count++] = &sdata->dev_attr.attr; +} + +static char *get_max_attr(enum sensors type) +{ + switch (type) { + case POWER_INPUT: + return "input_highest"; + case TEMP: + return "max"; + default: + break; + } + + return "highest"; +} + +static char *get_min_attr(enum sensors type) +{ + switch (type) { + case POWER_INPUT: + return "input_lowest"; + case TEMP: + return "min"; + default: + break; + } + + return "lowest"; +} + /* * Iterate through the device tree for each child of 'sensors' node, create * a sysfs attribute file, the file is named by translating the DT node name @@ -365,6 +415,7 @@ static int create_device_attrs(struct platform_device *pdev) for_each_child_of_node(opal, np) { const char *attr_name; u32 opal_index; + u32 hwmon_index; const char *label; if (np->name == NULL) @@ -386,9 +437,6 @@ static int create_device_attrs(struct platform_device *pdev) continue; } - sdata[count].id = sensor_id; - sdata[count].type = type; - /* * If we can not parse the node name, it means we are * running on a newer device tree. We can just forget @@ -401,14 +449,12 @@ static int create_device_attrs(struct platform_device *pdev) opal_index = INVALID_INDEX; } - sdata[count].opal_index = opal_index; - sdata[count].hwmon_index = - get_sensor_hwmon_index(&sdata[count], sdata, count); - - create_hwmon_attr(&sdata[count], attr_name, show_sensor); - - pgroups[type]->attrs[sensor_groups[type].attr_count++] = - &sdata[count++].dev_attr.attr; + hwmon_index = get_sensor_hwmon_index(&sdata[count], sdata, +
[v5 2/2] raid6/altivec: Add vpermxor implementation for raid6 Q syndrome
The raid6 Q syndrome check has been optimised using the vpermxor instruction. This instruction was made available with POWER8, ISA version 2.07. It allows for both vperm and vxor instructions to be done in a single instruction. This has been tested for correctness on a ppc64le vm with a basic RAID6 setup containing 5 drives. The performance benchmarks are from the raid6test in the /lib/raid6/test directory. These results are from an IBM Firestone machine with ppc64le architecture. The benchmark results show a 35% speed increase over the best existing algorithm for powerpc (altivec). The raid6test has also been run on a big-endian ppc64 vm to ensure it also works for big-endian architectures. Performance benchmarks: raid6: altivecx4 gen() 18773 MB/s raid6: altivecx8 gen() 19438 MB/s raid6: vpermxor4 gen() 25112 MB/s raid6: vpermxor8 gen() 26279 MB/s Signed-off-by: Matt Brown --- Changelog v5 - moved altivec.uc fix into other patch in series --- include/linux/raid/pq.h | 4 ++ lib/raid6/Makefile | 27 - lib/raid6/algos.c | 4 ++ lib/raid6/test/Makefile | 14 ++- lib/raid6/vpermxor.uc | 104 5 files changed, 151 insertions(+), 2 deletions(-) create mode 100644 lib/raid6/vpermxor.uc diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h index 4d57bba..3df9aa6 100644 --- a/include/linux/raid/pq.h +++ b/include/linux/raid/pq.h @@ -107,6 +107,10 @@ extern const struct raid6_calls raid6_avx512x2; extern const struct raid6_calls raid6_avx512x4; extern const struct raid6_calls raid6_tilegx8; extern const struct raid6_calls raid6_s390vx8; +extern const struct raid6_calls raid6_vpermxor1; +extern const struct raid6_calls raid6_vpermxor2; +extern const struct raid6_calls raid6_vpermxor4; +extern const struct raid6_calls raid6_vpermxor8; struct raid6_recov_calls { void (*data2)(int, size_t, int, int, void **); diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile index 3057011..db095a7 100644 --- a/lib/raid6/Makefile +++ b/lib/raid6/Makefile @@ -4,7 +4,8 @@ raid6_pq-y += 
algos.o recov.o tables.o int1.o int2.o int4.o \ int8.o int16.o int32.o raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o avx512.o recov_avx512.o -raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o +raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \ + vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o raid6_pq-$(CONFIG_TILEGX) += tilegx8.o raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o @@ -88,6 +89,30 @@ $(obj)/altivec8.c: UNROLL := 8 $(obj)/altivec8.c: $(src)/altivec.uc $(src)/unroll.awk FORCE $(call if_changed,unroll) +CFLAGS_vpermxor1.o += $(altivec_flags) +targets += vpermxor1.c +$(obj)/vpermxor1.c: UNROLL := 1 +$(obj)/vpermxor1.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE + $(call if_changed,unroll) + +CFLAGS_vpermxor2.o += $(altivec_flags) +targets += vpermxor2.c +$(obj)/vpermxor2.c: UNROLL := 2 +$(obj)/vpermxor2.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE + $(call if_changed,unroll) + +CFLAGS_vpermxor4.o += $(altivec_flags) +targets += vpermxor4.c +$(obj)/vpermxor4.c: UNROLL := 4 +$(obj)/vpermxor4.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE + $(call if_changed,unroll) + +CFLAGS_vpermxor8.o += $(altivec_flags) +targets += vpermxor8.c +$(obj)/vpermxor8.c: UNROLL := 8 +$(obj)/vpermxor8.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE + $(call if_changed,unroll) + CFLAGS_neon1.o += $(NEON_FLAGS) targets += neon1.c $(obj)/neon1.c: UNROLL := 1 diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c index 7857049..edd4f69 100644 --- a/lib/raid6/algos.c +++ b/lib/raid6/algos.c @@ -74,6 +74,10 @@ const struct raid6_calls * const raid6_algos[] = { &raid6_altivec2, &raid6_altivec4, &raid6_altivec8, + &raid6_vpermxor1, + &raid6_vpermxor2, + &raid6_vpermxor4, + &raid6_vpermxor8, #endif #if defined(CONFIG_TILEGX) &raid6_tilegx8, diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile index 2c7b60e..9c333e9 100644 
--- a/lib/raid6/test/Makefile +++ b/lib/raid6/test/Makefile @@ -97,6 +97,18 @@ altivec4.c: altivec.uc ../unroll.awk altivec8.c: altivec.uc ../unroll.awk $(AWK) ../unroll.awk -vN=8 < altivec.uc > $@ +vpermxor1.c: vpermxor.uc ../unroll.awk + $(AWK) ../unroll.awk -vN=1 < vpermxor.uc > $@ + +vpermxor2.c: vpermxor.uc ../unroll.awk + $(AWK) ../unroll.awk -vN=2 < vpermxor.uc > $@ + +vpermxor4.c: vpermxor.uc ../unroll.awk + $(AWK) ../unroll.awk -vN=4 < vpermxor.uc > $@ + +vpermxor8.c: vpermxor.uc ../unroll.awk + $(AWK) ../unroll.awk -vN=8 < vpermxor.uc > $@ + int1.c: int.uc ../unroll.awk $(AWK) ../unroll.awk -vN=1 <
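The vperm+xor trick relies on GF(2^8) multiplication by a constant being linear over XOR: the product splits into two 16-entry lookup tables, one indexed by the low nibble and one by the high nibble, whose results are XORed together - exactly the pair of table lookups that vpermxor collapses into a single instruction. A scalar C demonstration of that decomposition (illustrative only, not the vector code from the patch):

```c
#include <stdint.h>

/* GF(2^8) multiply with the RAID-6 polynomial x^8+x^4+x^3+x^2+1 (0x11d) */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
        uint8_t p = 0;
        while (b) {
                if (b & 1)
                        p ^= a;
                a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
                b >>= 1;
        }
        return p;
}

/* Build the two nibble tables for multiplication by a constant c */
static void build_tables(uint8_t c, uint8_t lo[16], uint8_t hi[16])
{
        for (int i = 0; i < 16; i++) {
                lo[i] = gf_mul(c, (uint8_t)i);         /* c * low nibble  */
                hi[i] = gf_mul(c, (uint8_t)(i << 4));  /* c * high nibble */
        }
}

/* One "vpermxor-style" multiply: two table lookups combined with XOR */
static uint8_t gf_mul_nibbles(const uint8_t lo[16], const uint8_t hi[16],
                              uint8_t x)
{
        return lo[x & 0x0f] ^ hi[x >> 4];
}
```

In the vector code, both lookups run over 16 bytes at once and the XOR with the accumulating Q syndrome is folded into the same instruction, which is where the ~35% speedup over the vperm+vxor altivec implementation comes from.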
[v5 1/2] lib/raid6: Build proper files on corresponding arch
Previously the raid6 test Makefile did not correctly build the files for testing on PowerPC. This patch fixes the bug, so that all appropriate files for PowerPC are built.

This patch also fixes the missing and mismatched ifdef statements to allow the altivec.uc file to be built correctly.

Signed-off-by: Matt Brown
---
Changelog
v5
 - moved altivec.uc fix into this patch
 - updated commit message
---
 lib/raid6/altivec.uc    | 3 +++
 lib/raid6/test/Makefile | 8 +++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/raid6/altivec.uc b/lib/raid6/altivec.uc
index 682aae8..d20ed0d 100644
--- a/lib/raid6/altivec.uc
+++ b/lib/raid6/altivec.uc
@@ -24,10 +24,13 @@

 #include
+#ifdef CONFIG_ALTIVEC
+
 #include
 #ifdef __KERNEL__
 # include
 # include
+#endif /* __KERNEL__ */

 /*
  * This is the C data type to use. We use a vector of
diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 9c333e9..b64a267 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -44,10 +44,12 @@ else ifeq ($(HAS_NEON),yes)
         CFLAGS += -DCONFIG_KERNEL_MODE_NEON=1
 else
         HAS_ALTIVEC := $(shell printf '\#include <altivec.h>\nvector int a;\n' |\
-                         gcc -c -x c - >&/dev/null && \
-                         rm ./-.o && echo yes)
+                         gcc -c -x c - >/dev/null && rm ./-.o && echo yes)
         ifeq ($(HAS_ALTIVEC),yes)
-                OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
+                CFLAGS += -I../../../arch/powerpc/include
+                CFLAGS += -DCONFIG_ALTIVEC
+                OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
+                        vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
         endif
 endif
 ifeq ($(ARCH),tilegx)
--
2.9.3
[PATCH 2/2] v1 powerpc/powernv: Enable removal of memory for in memory tracing
Some powerpc hardware features may want to gain access to a chunk of undisturbed real memory. This update provides a means to unplug said memory from the kernel with a set of debugfs calls. By writing an integer containing the size of memory to be unplugged into /sys/kernel/debug/powerpc/memtrace/enable, the code will remove that much memory from the end of each available chip's memory space (ie each memory node). In addition, the means to read out the contents of the unplugged memory is also provided by reading out the /sys/kernel/debug/powerpc/memtrace//trace file. Signed-off-by: Anton Blanchard Signed-off-by: Rashmica Gupta --- This requires the 'Wire up hpte_removebolted for powernv' patch. RFC -> v1: Added in two missing locks. Replaced the open-coded flush_memory_region() with the existing flush_inval_dcache_range(start, end). memtrace_offline_pages() is open-coded because offline_pages is designed to be called through the sysfs interface - not directly. We could move the offlining of pages to userspace, which removes some of this open-coding. This would then require passing info to the kernel such that it can then remove the memory that has been offlined. This could be done using notifiers, but this isn't simple due to locking (remove_memory needs mem_hotplug_begin() which the sysfs interface already has). This could also be done through the debugfs interface (similar to what is done here). Either way, this would require the process that needs the memory to have open-coded code which it shouldn't really be involved with. As the current remove_memory() function requires the memory to already be offlined, it makes sense to keep the offlining and removal of memory functionality grouped together so that a process can simply make one request to unplug some memory. Ideally there would be a kernel function we could call that would offline the memory and then remove it. 
arch/powerpc/platforms/powernv/memtrace.c | 276 ++ 1 file changed, 276 insertions(+) create mode 100644 arch/powerpc/platforms/powernv/memtrace.c diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c new file mode 100644 index 000..86184b1 --- /dev/null +++ b/arch/powerpc/platforms/powernv/memtrace.c @@ -0,0 +1,276 @@ +/* + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * Copyright (C) IBM Corporation, 2014 + * + * Author: Anton Blanchard + */ + +#define pr_fmt(fmt) "powernv-memtrace: " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct memtrace_entry { + void *mem; + u64 start; + u64 size; + u32 nid; + struct dentry *dir; + char name[16]; +}; + +static struct memtrace_entry *memtrace_array; +static unsigned int memtrace_array_nr; + +static ssize_t memtrace_read(struct file *filp, char __user *ubuf, +size_t count, loff_t *ppos) +{ + struct memtrace_entry *ent = filp->private_data; + + return simple_read_from_buffer(ubuf, count, ppos, ent->mem, ent->size); +} + +static bool valid_memtrace_range(struct memtrace_entry *dev, +unsigned long start, unsigned long size) +{ + if ((dev->start <= start) && + ((start + size) <= (dev->start + dev->size))) + return true; + + return false; +} + +static int memtrace_mmap(struct file *filp, struct vm_area_struct *vma) +{ + unsigned long size = vma->vm_end - vma->vm_start; + struct memtrace_entry *dev = filp->private_data; + + if 
(!valid_memtrace_range(dev, vma->vm_pgoff << PAGE_SHIFT, size)) + return -EINVAL; + + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + + if (io_remap_pfn_range(vma, vma->vm_start, + vma->vm_pgoff + (dev->start >> PAGE_SHIFT), + size, vma->vm_page_prot)) + return -EAGAIN; + + return 0; +} + +static const struct file_operations memtrace_fops = { + .llseek = default_llseek, + .read = memtrace_read, + .mmap = memtrace_mmap, + .open = simple_open, +}; + +static int check_memblock_online(struct memory_block *mem, void *arg) +{ + if (mem->state != MEM_ONLINE) + return -1; + + return 0; +} + +static int change_memblock_state(struct memory_block *mem, void *arg) +{ + unsigned long state = (unsigned lon
[PATCH 1/2] powerpc/powernv: Add config option for removal of memory
Signed-off-by: Rashmica Gupta
---
 arch/powerpc/platforms/powernv/Kconfig  | 4 ++++
 arch/powerpc/platforms/powernv/Makefile | 1 +
 2 files changed, 5 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index 6a6f4ef..1b8b3a8 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -30,3 +30,7 @@ config OPAL_PRD
 	help
 	  This enables the opal-prd driver, a facility to run processor
 	  recovery diagnostics on OpenPower machines
+
+config HARDWARE_TRACING
+	bool 'Enable removal of memory for hardware memory tracing'
+	depends on PPC_POWERNV && MEMORY_HOTPLUG
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index b5d98cb..e61be1b 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -12,3 +12,4 @@ obj-$(CONFIG_PPC_SCOM)		+= opal-xscom.o
 obj-$(CONFIG_MEMORY_FAILURE)	+= opal-memory-errors.o
 obj-$(CONFIG_TRACEPOINTS)	+= opal-tracepoints.o
 obj-$(CONFIG_OPAL_PRD)		+= opal-prd.o
+obj-$(CONFIG_HARDWARE_TRACING)	+= memtrace.o
--
2.9.3
Re: [PATCH] powerpc/xive: Fix/improve verbose debug output
On Fri, 2017-04-28 at 13:07 +1000, Michael Ellerman wrote:
> Benjamin Herrenschmidt writes:
>
>> The existing verbose debug code doesn't build when enabled.
>
> So why don't we convert all the DBG_VERBOSE() to pr_devel()?

pr_devel provides a bunch of debug at init/setup/mask/unmask etc... but
the system is still usable.

DBG_VERBOSE starts spewing stuff on every interrupt and eoi; the system
is no longer usable.

> If there's non-verbose debug that we think would be useful to
> differentiate from verbose then those could be pr_debug() - which means
> they'll be jump labelled off in most production kernels, but still able
> to be enabled.

Maybe... I don't like the giant "debug" switch across the whole
kernel, though.

Ben.
Re: [PATCH v3] cxl: mask slice error interrupts after first occurrence
On 28/04/17 13:20, Alastair D'Silva wrote:
> From: Alastair D'Silva
>
> In some situations, a faulty AFU slice may create an interrupt storm,
> rendering the machine unusable. Since these interrupts are informational
> only, present the interrupt once, then mask it off to prevent it from
> being retriggered until the card is reset.
>
> Signed-off-by: Alastair D'Silva

LGTM

Reviewed-by: Andrew Donnellan

--
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited
linux-next: build failure after merge of the kvm-ppc tree
Hi Paul,

After merging the kvm-ppc tree, today's linux-next build (powerpc ppc64_defconfig) failed like this:

arch/powerpc/kvm/book3s_xive.c: In function 'xive_debugfs_init':
arch/powerpc/kvm/book3s_xive.c:1852:52: error: 'powerpc_debugfs_root' undeclared (first use in this function)
  xive->dentry = debugfs_create_file(name, S_IRUGO, powerpc_debugfs_root,
                                                    ^

Caused by commit 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") interacting with commit 7644d5819cf8 ("powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there") from the powerpc tree.

I have added the following merge fix patch.

From: Stephen Rothwell
Date: Fri, 28 Apr 2017 14:28:17 +1000
Subject: [PATCH] powerpc: merge fix for powerpc_debugfs_root move.

Signed-off-by: Stephen Rothwell
---
 arch/powerpc/kvm/book3s_xive.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 7807ee17af4b..ffe1da95033a 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include <asm/debugfs.h>
 #include
 #include
--
2.11.0

--
Cheers,
Stephen Rothwell
Re: [PATCH] powerpc/pseries hotplug: prevent the reserved mem from removing
On Fri, Apr 28, 2017 at 2:06 AM, Hari Bathini wrote:
> Hi Pingfan,
>
> On Thursday 27 April 2017 01:13 PM, Pingfan Liu wrote:
>> E.g. after fadump reserves mem regions, these regions should not be removed
>> before fadump explicitly frees them.
>>
>> Signed-off-by: Pingfan Liu
>> ---
>>  arch/powerpc/platforms/pseries/hotplug-memory.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index e104c71..201be23 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -346,6 +346,8 @@ static int pseries_remove_memblock(unsigned long base, unsigned int memblock_siz
>>
>>  	if (!pfn_valid(start_pfn))
>>  		goto out;
>> +	if (memblock_is_reserved(base))
>> +		return -EINVAL;
>
> I think memblock reserved regions are not hot removed even without this
> patch. So, can you elaborate on when/why this patch is needed?

I have not found any code that prevents the reserved regions from being freed. Do I miss anything? I will try to get hold of a ppc machine to test.

Thx,
Pingfan
[PATCH v3] cxl: mask slice error interrupts after first occurrence
From: Alastair D'Silva

In some situations, a faulty AFU slice may create an interrupt storm, rendering the machine unusable. Since these interrupts are informational only, present the interrupt once, then mask it off to prevent it from being retriggered until the card is reset.

Signed-off-by: Alastair D'Silva
---
Changelog:
v3
 - Add CXL_PSL_SERR_An_IRQS, CXL_PSL_SERR_An_IRQ_MASKS macros
 - Explicitly reenable masked interrupts after reset
 - Issue an info line that subsequent interrupts will be masked
v2
 - Rebase against linux-next
---
 drivers/misc/cxl/cxl.h    | 18 ++++++++++++++++++
 drivers/misc/cxl/native.c | 19 +++++++++++++++++--
 2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 452e209..6b00952 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -228,6 +228,24 @@ static const cxl_p2n_reg_t CXL_PSL_WED_An = {0x0A0};
 #define CXL_PSL_SERR_An_llcmdto	(1ull << (63-6))
 #define CXL_PSL_SERR_An_afupar	(1ull << (63-7))
 #define CXL_PSL_SERR_An_afudup	(1ull << (63-8))
+#define CXL_PSL_SERR_An_IRQS	( \
+	CXL_PSL_SERR_An_afuto | CXL_PSL_SERR_An_afudup | CXL_PSL_SERR_An_afuov | \
+	CXL_PSL_SERR_An_badsrc | CXL_PSL_SERR_An_badctx | CXL_PSL_SERR_An_llcmdis | \
+	CXL_PSL_SERR_An_llcmdto | CXL_PSL_SERR_An_afupar | CXL_PSL_SERR_An_afudup)
+#define CXL_PSL_SERR_An_afuto_mask	(1ull << (63-32))
+#define CXL_PSL_SERR_An_afudis_mask	(1ull << (63-33))
+#define CXL_PSL_SERR_An_afuov_mask	(1ull << (63-34))
+#define CXL_PSL_SERR_An_badsrc_mask	(1ull << (63-35))
+#define CXL_PSL_SERR_An_badctx_mask	(1ull << (63-36))
+#define CXL_PSL_SERR_An_llcmdis_mask	(1ull << (63-37))
+#define CXL_PSL_SERR_An_llcmdto_mask	(1ull << (63-38))
+#define CXL_PSL_SERR_An_afupar_mask	(1ull << (63-39))
+#define CXL_PSL_SERR_An_afudup_mask	(1ull << (63-40))
+#define CXL_PSL_SERR_An_IRQ_MASKS	( \
+	CXL_PSL_SERR_An_afuto_mask | CXL_PSL_SERR_An_afudup_mask | CXL_PSL_SERR_An_afuov_mask | \
+	CXL_PSL_SERR_An_badsrc_mask | CXL_PSL_SERR_An_badctx_mask | CXL_PSL_SERR_An_llcmdis_mask | \
+	CXL_PSL_SERR_An_llcmdto_mask | CXL_PSL_SERR_An_afupar_mask | CXL_PSL_SERR_An_afudup_mask)
+
 #define CXL_PSL_SERR_An_AE	(1ull << (63-30))

 /****** CXL_PSL_SCNTL_An ****************************************************/
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 194c58e..3e7fc86 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -95,12 +95,23 @@ int cxl_afu_disable(struct cxl_afu *afu)
 /* This will disable as well as reset */
 static int native_afu_reset(struct cxl_afu *afu)
 {
+	int rc;
+	u64 serr;
+
 	pr_devel("AFU reset request\n");

-	return afu_control(afu, CXL_AFU_Cntl_An_RA, 0,
+	rc = afu_control(afu, CXL_AFU_Cntl_An_RA, 0,
 		CXL_AFU_Cntl_An_RS_Complete | CXL_AFU_Cntl_An_ES_Disabled,
 		CXL_AFU_Cntl_An_RS_MASK | CXL_AFU_Cntl_An_ES_MASK,
 		false);
+
+	/* Re-enable any masked interrupts */
+	serr = cxl_p1n_read(afu, CXL_PSL_SERR_An);
+	serr &= ~CXL_PSL_SERR_An_IRQ_MASKS;
+	cxl_p1n_write(afu, CXL_PSL_SERR_An, serr);
+
+
+	return rc;
 }

 static int native_afu_check_and_enable(struct cxl_afu *afu)
@@ -1205,7 +1216,7 @@ static irqreturn_t native_slice_irq_err(int irq, void *data)
 {
 	struct cxl_afu *afu = data;
 	u64 errstat, serr, afu_error, dsisr;
-	u64 fir_slice, afu_debug;
+	u64 fir_slice, afu_debug, irq_mask;

 	/*
 	 * slice err interrupt is only used with full PSL (no XSL)
@@ -1226,7 +1237,11 @@ static irqreturn_t native_slice_irq_err(int irq, void *data)
 	dev_crit(&afu->dev, "AFU_ERR_An: 0x%.16llx\n", afu_error);
 	dev_crit(&afu->dev, "PSL_DSISR_An: 0x%.16llx\n", dsisr);

+	/* mask off the IRQ so it won't retrigger until the card is reset */
+	irq_mask = (serr & CXL_PSL_SERR_An_IRQS) >> 32;
+	serr |= irq_mask;
 	cxl_p1n_write(afu, CXL_PSL_SERR_An, serr);
+	dev_info(&afu->dev, "Further interrupts will be masked until the AFU is reset\n");

 	return IRQ_HANDLED;
 }
--
2.9.3
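The `>> 32` in the handler works because each SERR status bit at position 63-n has its corresponding mask bit exactly 32 positions lower, at 63-(n+32); shifting the latched status right by 32 therefore lands each error on its own mask bit. A standalone sketch of that correspondence (bit positions taken from the patch's macros, reproduced here outside the driver for illustration):

```c
#include <stdint.h>

/* IBM bit numbering within a 64-bit register, as in the cxl macros */
#define SERR_BIT(n)  (1ull << (63 - (n)))

/* A status bit and its mask counterpart, 32 positions apart */
#define SERR_llcmdto       SERR_BIT(6)
#define SERR_llcmdto_mask  SERR_BIT(38)
#define SERR_afudup        SERR_BIT(8)
#define SERR_afudup_mask   SERR_BIT(40)

/* Derive the mask bits to set from the latched status, as the patch does */
static uint64_t irq_mask_from_status(uint64_t serr, uint64_t irqs)
{
        return (serr & irqs) >> 32;
}
```

OR-ing that derived value back into SERR thus masks exactly the errors that just fired, leaving the other slice-error sources enabled.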
Re: [PATCH] powerpc/xive: Fix/improve verbose debug output
Benjamin Herrenschmidt writes: > The existing verbose debug code doesn't build when enabled. So why don't we convert all the DBG_VERBOSE() to pr_devel()? If there's non-verbose debug that we think would be useful to differentiate from verbose then those could be pr_debug() - which means they'll be jump labelled off in most production kernels, but still able to be enabled. cheers
Re: [PATCH] Enabled pstore write for powerpc
Kees Cook writes: > On Thu, Apr 27, 2017 at 4:33 AM, Ankit Kumar wrote: >> After commit c950fd6f201a kernel registers pstore write based on flag set. >> Pstore write for powerpc is broken as flags(PSTORE_FLAGS_DMESG) is not set >> for >> powerpc architecture. On panic, kernel doesn't write message to >> /fs/pstore/dmesg*(Entry doesn't gets created at all). >> >> This patch enables pstore write for powerpc architecture by setting >> PSTORE_FLAGS_DMESG flag. >> >> Fixes:c950fd6f201a pstore: Split pstore fragile flags >> Signed-off-by: Ankit Kumar > > Argh, thanks! I thought I'd caught all of these. I'll include this for > -stable. I see you've picked it up, thanks. cheers
Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI
On (04/27/17 12:14), Steven Rostedt wrote: [..] > I tried this patch. It's better because I get the end of the trace, but > I do lose the beginning of it: > > ** 196358 printk messages dropped ** [ 102.321182] perf-59810 > 12983650us : d_path <-seq_path many thanks! so we now drop messages from logbuf, not from per-CPU buffers. that "queue printk_deferred irq_work on every online CPU when we bypass per-CPU buffers from NMI" idea *probably* might help here - we need someone to emit messages from the logbuf while we printk from NMI. there is still a possibility that we can drop messages, though, since log_store() from NMI CPU can be much-much faster than call_console_drivers() on other CPU. -ss
Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI
On (04/20/17 15:11), Petr Mladek wrote:
[..]
>  void printk_nmi_enter(void)
>  {
> -	this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
> +	/*
> +	 * The size of the extra per-CPU buffer is limited. Use it
> +	 * only when really needed.
> +	 */
> +	if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK ||
> +	    raw_spin_is_locked(&logbuf_lock)) {

can we please have && here?

[..]
> diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
> index 4e8a30d1c22f..0bc0a3535a8a 100644
> --- a/lib/nmi_backtrace.c
> +++ b/lib/nmi_backtrace.c
> @@ -86,9 +86,11 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
> 
>  bool nmi_cpu_backtrace(struct pt_regs *regs)
>  {
> +	static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;
>  	int cpu = smp_processor_id();
> 
>  	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
> +		arch_spin_lock(&lock);
>  		if (regs && cpu_in_idle(instruction_pointer(regs))) {
>  			pr_warn("NMI backtrace for cpu %d skipped: idling at pc %#lx\n",
>  				cpu, instruction_pointer(regs));
> @@ -99,6 +101,7 @@ bool nmi_cpu_backtrace(struct pt_regs *regs)
>  		else
>  			dump_stack();
>  	}
> +	arch_spin_unlock(&lock);
>  	cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
>  	return true;
>  }

can the nmi_backtrace part be a patch on its own?

	-ss
Re: [PATCH 0/8] Fix clean target warnings
On 04/21/2017 05:14 PM, Shuah Khan wrote:
> This patch series consists of changes to lib.mk to allow overriding
> common clean target from Makefiles. This fixes warnings when clean
> overriding and ignoring warnings. Also fixes splice clean target
> removing a script that runs the test from its clean target.
> 
> Shuah Khan (8):
>   selftests: splice: fix clean target to not remove
>     default_file_splice_read.sh
>   selftests: lib.mk: define CLEAN macro to allow Makefiles to override
>     clean

Applied with amended change log and Michael's ack to linux-kselftest next

>   selftests: futex: override clean in lib.mk to fix warnings
>   selftests: gpio: override clean in lib.mk to fix warnings
>   selftests: powerpc: override clean in lib.mk to fix warnings

Applied all of the above to linux-kselftest next

>   selftests: splice: override clean in lib.mk to fix warnings
>   selftests: sync: override clean in lib.mk to fix warnings
>   selftests: x86: override clean in lib.mk to fix warnings

Applied v2s addressing Michael's comments to linux-kselftest next

The x86 fix also addresses not being able to build ldt_gdt:

  make -C tools/testing/selftests/x86 ldt_gdt

thanks,
-- Shuah
Re: [PATCH 3/8] selftests: futex: override clean in lib.mk to fix warnings
On 04/27/2017 03:54 PM, Darren Hart wrote:
> On Fri, Apr 21, 2017 at 05:14:45PM -0600, Shuah Khan wrote:
>> Add override for lib.mk clean to fix the following warnings from clean
>> target run.
>>
>> Makefile:36: warning: overriding recipe for target 'clean'
>> ../lib.mk:55: warning: ignoring old recipe for target 'clean'
>>
>> Signed-off-by: Shuah Khan
>> ---
>>  tools/testing/selftests/futex/Makefile | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/testing/selftests/futex/Makefile b/tools/testing/selftests/futex/Makefile
>> index c8095e6..e2fbb89 100644
>> --- a/tools/testing/selftests/futex/Makefile
>> +++ b/tools/testing/selftests/futex/Makefile
>> @@ -32,9 +32,10 @@ override define EMIT_TESTS
>>  	echo "./run.sh"
>>  endef
>>  
>> -clean:
>> +override define CLEAN
>>  	for DIR in $(SUBDIRS); do	\
>>  		BUILD_TARGET=$(OUTPUT)/$$DIR;	\
>>  		mkdir $$BUILD_TARGET -p;	\
>>  		make OUTPUT=$$BUILD_TARGET -C $$DIR $@;\
>>  	done
>> +endef
> 
> Taking the move of clean into lib.mk as a given,

Yeah I considered undoing that, and chose to fix the missed issues
instead.

> 
> Acked-by: Darren Hart (VMware)
> 

thanks,
-- Shuah
Re: [PATCH 3/8] selftests: futex: override clean in lib.mk to fix warnings
On Fri, Apr 21, 2017 at 05:14:45PM -0600, Shuah Khan wrote:
> Add override for lib.mk clean to fix the following warnings from clean
> target run.
> 
> Makefile:36: warning: overriding recipe for target 'clean'
> ../lib.mk:55: warning: ignoring old recipe for target 'clean'
> 
> Signed-off-by: Shuah Khan
> ---
>  tools/testing/selftests/futex/Makefile | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/futex/Makefile b/tools/testing/selftests/futex/Makefile
> index c8095e6..e2fbb89 100644
> --- a/tools/testing/selftests/futex/Makefile
> +++ b/tools/testing/selftests/futex/Makefile
> @@ -32,9 +32,10 @@ override define EMIT_TESTS
>  	echo "./run.sh"
>  endef
>  
> -clean:
> +override define CLEAN
>  	for DIR in $(SUBDIRS); do	\
>  		BUILD_TARGET=$(OUTPUT)/$$DIR;	\
>  		mkdir $$BUILD_TARGET -p;	\
>  		make OUTPUT=$$BUILD_TARGET -C $$DIR $@;\
>  	done
> +endef

Taking the move of clean into lib.mk as a given,

Acked-by: Darren Hart (VMware)

-- 
Darren Hart
VMware Open Source Technology Center
Re: [PATCH] Enabled pstore write for powerpc
On Thu, Apr 27, 2017 at 4:33 AM, Ankit Kumar wrote:
> After commit c950fd6f201a kernel registers pstore write based on flag set.
> Pstore write for powerpc is broken as flags(PSTORE_FLAGS_DMESG) is not set for
> powerpc architecture. On panic, kernel doesn't write message to
> /fs/pstore/dmesg* (Entry doesn't get created at all).
>
> This patch enables pstore write for powerpc architecture by setting
> PSTORE_FLAGS_DMESG flag.
>
> Fixes: c950fd6f201a ("pstore: Split pstore fragile flags")
> Signed-off-by: Ankit Kumar

Argh, thanks! I thought I'd caught all of these. I'll include this for
-stable.

-Kees

> ---
>
>  arch/powerpc/kernel/nvram_64.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/kernel/nvram_64.c b/arch/powerpc/kernel/nvram_64.c
> index d5e2b83..021db31 100644
> --- a/arch/powerpc/kernel/nvram_64.c
> +++ b/arch/powerpc/kernel/nvram_64.c
> @@ -561,6 +561,7 @@ static ssize_t nvram_pstore_read(u64 *id, enum pstore_type_id *type,
>  static struct pstore_info nvram_pstore_info = {
>  	.owner = THIS_MODULE,
>  	.name = "nvram",
> +	.flags = PSTORE_FLAGS_DMESG,
>  	.open = nvram_pstore_open,
>  	.read = nvram_pstore_read,
>  	.write = nvram_pstore_write,
> --
> 2.7.4
>

-- 
Kees Cook
Pixel Security
Re: [PATCH] Enabled pstore write for powerpc
Hi Ankit,

> After commit c950fd6f201a kernel registers pstore write based on flag
> set. Pstore write for powerpc is broken as flags(PSTORE_FLAGS_DMESG)
> is not set for powerpc architecture. On panic, kernel doesn't write
> message to /fs/pstore/dmesg* (Entry doesn't get created at all).
>
> This patch enables pstore write for powerpc architecture by setting
> PSTORE_FLAGS_DMESG flag.
>
> Fixes: c950fd6f201a ("pstore: Split pstore fragile flags")

Ouch! We've used pstore to shoot customer bugs, so we should also mark
this for stable. Looks like 4.9 onwards?

Anton

> Signed-off-by: Ankit Kumar
> ---
>
>  arch/powerpc/kernel/nvram_64.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/kernel/nvram_64.c b/arch/powerpc/kernel/nvram_64.c
> index d5e2b83..021db31 100644
> --- a/arch/powerpc/kernel/nvram_64.c
> +++ b/arch/powerpc/kernel/nvram_64.c
> @@ -561,6 +561,7 @@ static ssize_t nvram_pstore_read(u64 *id, enum pstore_type_id *type,
>  static struct pstore_info nvram_pstore_info = {
>  	.owner = THIS_MODULE,
>  	.name = "nvram",
> +	.flags = PSTORE_FLAGS_DMESG,
>  	.open = nvram_pstore_open,
>  	.read = nvram_pstore_read,
>  	.write = nvram_pstore_write,
[PATCH 1/1] powerpc/traps : Updated MC for E6500 L1D cache err
This patch updates the machine check handler of the Linux kernel to handle
the e6500 architecture case. In the e6500 core, the L1 Data Cache Write
Shadow Mode (DCWS) register is not implemented, but the L1 data cache
always runs in write shadow mode. So, on L1 data cache parity errors,
hardware will automatically invalidate the data cache but will still log
a machine check interrupt.

Signed-off-by: Ronak Desai
Signed-off-by: Matthew Weber
---
 arch/powerpc/include/asm/reg_booke.h |  1 +
 arch/powerpc/kernel/traps.c          | 12 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg_booke.h b/arch/powerpc/include/asm/reg_booke.h
index 737e012..c811128 100644
--- a/arch/powerpc/include/asm/reg_booke.h
+++ b/arch/powerpc/include/asm/reg_booke.h
@@ -196,6 +196,7 @@
 #define SPRN_DEAR	0x03D	/* Data Error Address Register */
 #define SPRN_ESR	0x03E	/* Exception Syndrome Register */
 #define SPRN_PIR	0x11E	/* Processor Identification Register */
+#define SPRN_PVR	0x11F	/* Processor Version Register */
 #define SPRN_DBSR	0x130	/* Debug Status Register */
 #define SPRN_DBCR0	0x134	/* Debug Control Register 0 */
 #define SPRN_DBCR1	0x135	/* Debug Control Register 1 */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 76f6045..d5bc3ab 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -504,6 +504,7 @@ int machine_check_47x(struct pt_regs *regs)
 int machine_check_e500mc(struct pt_regs *regs)
 {
 	unsigned long mcsr = mfspr(SPRN_MCSR);
+	unsigned long pvr = mfspr(SPRN_PVR);
 	unsigned long reason = mcsr;
 	int recoverable = 1;
 
@@ -545,8 +546,15 @@ int machine_check_e500mc(struct pt_regs *regs)
 		 * may still get logged and cause a machine check. We should
 		 * only treat the non-write shadow case as non-recoverable.
		 */
-		if (!(mfspr(SPRN_L1CSR2) & L1CSR2_DCWS))
-			recoverable = 0;
+		/* On e6500 core, L1 DCWS (Data cache write shadow mode) bit is
+		 * not implemented but L1 data cache is by default configured
+		 * to run in write shadow mode. Hence on data cache parity errors
+		 * HW will automatically invalidate the L1 Data Cache.
+		 */
+		if (PVR_VER(pvr) != PVR_VER_E6500) {
+			if (!(mfspr(SPRN_L1CSR2) & L1CSR2_DCWS))
+				recoverable = 0;
+		}
 	}
 
 	if (reason & MCSR_L2MMU_MHIT) {
-- 
1.9.1
Re: [PATCH] powerpc/pseries hotplug: prevent the reserved mem from removing
Hi Pingfan,

On Thursday 27 April 2017 01:13 PM, Pingfan Liu wrote:
> E.g after fadump reserves mem regions, these regions should not be
> removed before fadump explicitly frees them.
> 
> Signed-off-by: Pingfan Liu
> ---
>  arch/powerpc/platforms/pseries/hotplug-memory.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index e104c71..201be23 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -346,6 +346,8 @@ static int pseries_remove_memblock(unsigned long base, unsigned int memblock_siz
>  	if (!pfn_valid(start_pfn))
>  		goto out;
>  
> +	if (memblock_is_reserved(base))
> +		return -EINVAL;

I think memblock reserved regions are not hot removed even without this
patch. So, can you elaborate on when/why this patch is needed?

Thanks
Hari
Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI
On Thu, 20 Apr 2017 15:11:54 +0200
Petr Mladek wrote:

> >From c530d9dee91c74db5e6a198479e2e63b24cb84a2 Mon Sep 17 00:00:00 2001
> From: Petr Mladek
> Date: Thu, 20 Apr 2017 10:52:31 +0200
> Subject: [PATCH] printk: Use the main logbuf in NMI when logbuf_lock is
>  available

I tried this patch. It's better because I get the end of the trace, but
I do lose the beginning of it:

 ** 196358 printk messages dropped ** [  102.321182] perf-59810  12983650us : d_path <-seq_path

The way I tested it was by adding this:

Index: linux-trace.git/kernel/trace/trace_functions.c
===
--- linux-trace.git.orig/kernel/trace/trace_functions.c
+++ linux-trace.git/kernel/trace/trace_functions.c
@@ -469,8 +469,11 @@ ftrace_cpudump_probe(unsigned long ip, u
 		    struct trace_array *tr, struct ftrace_probe_ops *ops,
 		    void *data)
 {
-	if (update_count(ops, ip, data))
-		ftrace_dump(DUMP_ORIG);
+	char *killer = NULL;
+
+	panic_on_oops = 1;	/* force panic */
+	wmb();
+	*killer = 1;
 }
 
 static int

Then doing the following:

 # echo 1 > /proc/sys/kernel/ftrace_dump_on_oops
 # trace-cmd start -p function
 # echo nmi_handle:cpudump > /debug/tracing/set_ftrace_filter
 # perf record -c 100 -a sleep 1

And that triggers the crash.

-- Steve
[PATCH] crypto: talitos: Extend max key length for SHA384/512-HMAC
The max keysize for both of these is 128, not 96. Before, with keysizes
over 96, the memcpy in ahash_setkey() would overwrite memory beyond the
key field.

Signed-off-by: Martin Hicks
---
 drivers/crypto/talitos.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 0bba6a1..97dc85e 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -816,7 +816,7 @@ static void talitos_unregister_rng(struct device *dev)
  * HMAC_SNOOP_NO_AFEA (HSNA) instead of type IPSEC_ESP
  */
 #define TALITOS_CRA_PRIORITY_AEAD_HSNA	(TALITOS_CRA_PRIORITY - 1)
-#define TALITOS_MAX_KEY_SIZE		96
+#define TALITOS_MAX_KEY_SIZE		SHA512_BLOCK_SIZE /* SHA512 has the largest keysize input */
 #define TALITOS_MAX_IV_LENGTH		16 /* max of AES_BLOCK_SIZE, DES3_EDE_BLOCK_SIZE */
 
 struct talitos_ctx {
-- 
1.7.10.4

-- 
Martin Hicks P.Eng.  | m...@bork.org
Bork Consulting Inc. |  +1 (613) 266-2296
Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI
On Thu, 27 Apr 2017 17:28:07 +0200
Petr Mladek wrote:

> > When I get a chance, I'll see if I can insert a trigger to crash the
> > kernel from NMI on another box and see if this patch helps.
> 
> I actually tested it here using this hack:
> 
> diff --cc lib/nmi_backtrace.c
> index d531f85c0c9b,0bc0a3535a8a..
> --- a/lib/nmi_backtrace.c
> +++ b/lib/nmi_backtrace.c
> @@@ -89,8 -90,7 +90,9 @@@ bool nmi_cpu_backtrace(struct pt_regs *
>   	int cpu = smp_processor_id();
>   
>   	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
> +		if (in_nmi())
> +			panic("Simulating panic in NMI\n");
> +		arch_spin_lock(&lock);

I was going to create a ftrace trigger, to crash on demand, but this
may do as well.

>   		if (regs && cpu_in_idle(instruction_pointer(regs))) {
>   			pr_warn("NMI backtrace for cpu %d skipped: idling at pc %#lx\n",
>   				cpu, instruction_pointer(regs));
> 
> and triggered by:
> 
>    echo l > /proc/sysrq-trigger
> 
> The patch really helped to see much more (all) messages from the ftrace
> buffers in NMI mode.
> 
> But the test is a bit artificial. The patch might not help when there
> is a big printk() activity on the system when the panic() is
> triggered. We might wrongly use the small per-CPU buffer when
> the logbuf_lock is tested and taken on another CPU at the same time.
> It means that it will not always help.
> 
> I personally think that the patch might be good enough. I am not sure
> if a perfect (more complex) solution is worth it.

I wasn't asking for perfect, as the previous solutions never were
either. I just want an optimistic dump if possible.

I'll try to get some time today to test this, and let you know. But it
won't be on the machine that I originally had the issue with.

Thanks,

-- Steve
Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI
On Thu 2017-04-27 10:31:18, Steven Rostedt wrote:
> On Thu, 27 Apr 2017 15:38:19 +0200
> Petr Mladek wrote:
> 
> > > by the way,
> > > does this `nmi_print_seq' bypass even fix anything for Steven?
> > 
> > I think that this is the most important question.
> > 
> > Steven, does the patch from
> > https://lkml.kernel.org/r/20170420131154.gl3...@pathway.suse.cz
> > help you to see the debug messages, please?
> 
> You'll have to wait for a bit. The box that I was debugging takes 45
> minutes to reboot. And I don't have much more time to play on it before
> I have to give it back. I already found the bug I was looking for and
> I'm trying not to crash it again (due to the huge bring up time).

I see.

> When I get a chance, I'll see if I can insert a trigger to crash the
> kernel from NMI on another box and see if this patch helps.

I actually tested it here using this hack:

diff --cc lib/nmi_backtrace.c
index d531f85c0c9b,0bc0a3535a8a..
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@@ -89,8 -90,7 +90,9 @@@ bool nmi_cpu_backtrace(struct pt_regs *
  	int cpu = smp_processor_id();
  
  	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
+		if (in_nmi())
+			panic("Simulating panic in NMI\n");
+		arch_spin_lock(&lock);
  		if (regs && cpu_in_idle(instruction_pointer(regs))) {
  			pr_warn("NMI backtrace for cpu %d skipped: idling at pc %#lx\n",
  				cpu, instruction_pointer(regs));

and triggered by:

   echo l > /proc/sysrq-trigger

The patch really helped to see much more (all) messages from the ftrace
buffers in NMI mode.

But the test is a bit artificial. The patch might not help when there
is big printk() activity on the system when the panic() is triggered.
We might wrongly use the small per-CPU buffer when the logbuf_lock is
tested and taken on another CPU at the same time. It means that it will
not always help.

I personally think that the patch might be good enough. I am not sure
if a perfect (more complex) solution is worth it.

Best Regards,
Petr
Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI
On Thu, 27 Apr 2017 15:38:19 +0200 Petr Mladek wrote: > > by the way, > > does this `nmi_print_seq' bypass even fix anything for Steven? > > I think that this is the most important question. > > Steven, does the patch from > https://lkml.kernel.org/r/20170420131154.gl3...@pathway.suse.cz > help you to see the debug messages, please? You'll have to wait for a bit. The box that I was debugging takes 45 minutes to reboot. And I don't have much more time to play on it before I have to give it back. I already found the bug I was looking for and I'm trying not to crash it again (due to the huge bring up time). When I get a chance, I'll see if I can insert a trigger to crash the kernel from NMI on another box and see if this patch helps. Thanks, -- Steve
[PATCH] powerpc/xive: Fix/improve verbose debug output
The existing verbose debug code doesn't build when enabled. This fixes
it and generally improves the output to make it more useful.

Signed-off-by: Benjamin Herrenschmidt
---
 arch/powerpc/sysdev/xive/common.c | 37 ++---
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 6a98efb..2305aa9 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -143,7 +143,6 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
 		struct xive_q *q;
 
 		prio = ffs(xc->pending_prio) - 1;
-		DBG_VERBOSE("scan_irq: trying prio %d\n", prio);
 
 		/* Try to fetch */
 		irq = xive_read_eq(&xc->queue[prio], just_peek);
@@ -171,12 +170,18 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
 	}
 
 	/* If nothing was found, set CPPR to 0xff */
-	if (irq == 0)
+	if (irq == 0) {
 		prio = 0xff;
+		DBG_VERBOSE("scan_irq(%d): nothing found\n", just_peek);
+	} else {
+		DBG_VERBOSE("scan_irq(%d): found irq %d prio %d\n",
+			    just_peek, irq, prio);
+	}
 
 	/* Update HW CPPR to match if necessary */
 	if (prio != xc->cppr) {
-		DBG_VERBOSE("scan_irq: adjusting CPPR to %d\n", prio);
+		DBG_VERBOSE("scan_irq(%d): adjusting CPPR %d->%d\n",
+			    just_peek, xc->cppr, prio);
 		xc->cppr = prio;
 		out_8(xive_tima + xive_tima_offset + TM_CPPR, prio);
 	}
@@ -260,7 +265,7 @@ static unsigned int xive_get_irq(void)
 	/* Scan our queue(s) for interrupts */
 	irq = xive_scan_interrupts(xc, false);
 
-	DBG_VERBOSE("get_irq: got irq 0x%x, new pending=0x%02x\n",
+	DBG_VERBOSE("get_irq: got irq %d new pending=0x%02x\n",
 		    irq, xc->pending_prio);
 
 	/* Return pending interrupt if any */
@@ -282,7 +287,7 @@ static unsigned int xive_get_irq(void)
 static void xive_do_queue_eoi(struct xive_cpu *xc)
 {
 	if (xive_scan_interrupts(xc, true) != 0) {
-		DBG_VERBOSE("eoi: pending=0x%02x\n", xc->pending_prio);
+		DBG_VERBOSE("eoi_irq: more pending !\n");
 		force_external_irq_replay();
 	}
 }
@@ -327,11 +332,13 @@ void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data *xd)
 			in_be64(xd->eoi_mmio);
 		else {
 			eoi_val = xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
-			DBG_VERBOSE("eoi_val=%x\n", offset, eoi_val);
+			DBG_VERBOSE("hwirq 0x%x eoi_val=%x\n", hw_irq, eoi_val);
 
 			/* Re-trigger if needed */
-			if ((eoi_val & XIVE_ESB_VAL_Q) && xd->trig_mmio)
+			if ((eoi_val & XIVE_ESB_VAL_Q) && xd->trig_mmio) {
+				DBG_VERBOSE(" -> eoi retrigger !\n");
 				out_be64(xd->trig_mmio, 0);
+			}
 		}
 	}
 }
@@ -380,10 +387,15 @@ static void xive_do_source_set_mask(struct xive_irq_data *xd,
 	if (mask) {
 		val = xive_poke_esb(xd, XIVE_ESB_SET_PQ_01);
 		xd->saved_p = !!(val & XIVE_ESB_VAL_P);
-	} else if (xd->saved_p)
-		xive_poke_esb(xd, XIVE_ESB_SET_PQ_10);
-	else
-		xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
+		DBG_VERBOSE("masking val=%llx, sp=%d\n",
+			    val, xd->saved_p);
+	} else {
+		DBG_VERBOSE("unmasking sp=%d\n", xd->saved_p);
+		if (xd->saved_p)
+			xive_poke_esb(xd, XIVE_ESB_SET_PQ_10);
+		else
+			xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
+	}
 }
 
 /*
@@ -526,6 +538,7 @@ static unsigned int xive_irq_startup(struct irq_data *d)
 
 	pr_devel("xive_irq_startup: irq %d [0x%x] data @%p\n",
 		 d->irq, hw_irq, d);
+	pr_devel("  eoi_mmio=%p trig_mmio=%p\n", xd->eoi_mmio, xd->trig_mmio);
 
 #ifdef CONFIG_PCI_MSI
 	/*
@@ -754,6 +767,8 @@ static int xive_irq_retrigger(struct irq_data *d)
 	if (WARN_ON(xd->flags & XIVE_IRQ_FLAG_LSI))
 		return 0;
 
+	DBG_VERBOSE("retrigger irq %d\n", d->irq);
+
 	/*
 	 * To perform a retrigger, we first set the PQ bits to
 	 * 11, then perform an EOI.
Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI
On Mon 2017-04-24 11:17:47, Sergey Senozhatsky wrote:
> On (04/21/17 14:06), Petr Mladek wrote:
> [..]
> > > I agree that this_cpu_read(printk_context) covers slightly more than
> > > logbuf_lock scope, so we may get positive this_cpu_read(printk_context)
> > > with unlocked logbuf_lock, but I don't tend to think that it's a big
> > > problem.
> > 
> > PRINTK_SAFE_CONTEXT is set also in call_console_drivers().
> > It might take rather long and logbuf_lock is available. So, it is
> > a noticeable source of false positives.
> 
> yes, agree.
> 
> probably we need additional printk_safe annotations for
> "logbuf_lock is locked from _this_ CPU"
> 
> false positives there can be very painful.
> 
> [..]
> > 	if (raw_spin_is_locked(&logbuf_lock))
> > 		this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
> > 	else
> > 		this_cpu_or(printk_context, PRINTK_NMI_DEFERRED_CONTEXT_MASK);
> 
> well, if everyone is fine with logbuf_lock access from every CPU from every
> NMI then I won't object either. but may be it makes sense to reduce the
> possibility of false positives. Steven is losing critically important logs,
> after all.
> 
> by the way,
> does this `nmi_print_seq' bypass even fix anything for Steven?

I think that this is the most important question.

Steven, does the patch from
https://lkml.kernel.org/r/20170420131154.gl3...@pathway.suse.cz
help you to see the debug messages, please?

> it sort of
> can, in theory, but just in theory. so may be we need direct message flush
> from NMI handler (printk->console_unlock), which will be a really big problem.

I thought about it a lot and got scared where this might go. We need
to balance the usefulness and the complexity of the solution.

It took one year to discover this regression. Before, it was suggested
to avoid calling printk() in NMI context at all. Now, we are trying to
fix printk() to handle MBs of messages in NMI context.

If my proposed patch solves the problem for Steven, I would still like
to get a similar solution in. It is not that complex and helps to
bypass the limited per-CPU buffer in most cases. I always thought that
8kB might not be enough in some cases.

Note that my patch is very defensive. It uses the main log buffer only
when it is really safe. It has higher potential for unneeded fallback,
but if it works for Steven (a real existing usecase), ...

On the other hand, I would prefer to avoid any much more complex
solution until we have real reports that it is needed.

Also we need to look for alternatives. There is a chance to create a
crashdump and get the ftrace messages from it. Also this might be a
scenario where we might need to suggest the early_printk() patchset
from Peter Zijlstra.

> logbuf might not be big enough for 4890096 messages (Steven's report
> mentions "Lost 4890096 message(s)!"). we are counting on the fact that
> in case of `nmi_print_seq' bypass some other CPU will call console_unlock()
> and print pending logbuf messages, but this is not guaranteed and the
> messages can be dropped even from logbuf.

Yup. I tested the patch here and I needed to increase the main log
buffer size to see all the ftrace messages. Fortunately, it was
possible to use a really huge global buffer. But it is not realistic
to use huge per-CPU ones.

Best Regards,
Petr
Re: [PATCH v2 2/3] powerpc/kprobes: un-blacklist system_call() from kprobes
On 2017/04/27 08:19PM, Michael Ellerman wrote:
> "Naveen N. Rao" writes:
> 
> > It is actually safe to probe system_call() in entry_64.S, but only till
> > .Lsyscall_exit. To allow this, convert .Lsyscall_exit to a non-local
> > symbol __system_call() and blacklist that symbol, rather than
> > system_call().
> 
> I'm not sure I like this. The reason we made it a local symbol in the
> first place is because it made backtraces look odd:
> 
>   commit 4c3b21686111e0ac6018469dacbc5549f9915cf8
>   Author:     Michael Ellerman
>   AuthorDate: Fri Dec 5 21:16:59 2014 +1100
> 
>       powerpc/kernel: Make syscall_exit a local label
> 
>       Currently when we back trace something that is in a syscall we see
>       something like this:
> 
>         [c000] [c000] SyS_read+0x6c/0x110
>         [c000] [c000] syscall_exit+0x0/0x98
> 
>       Although it's entirely correct, seeing syscall_exit at the bottom can be
>       confusing - we were exiting from a syscall and then called SyS_read() ?
> 
>       If we instead change syscall_exit to be a local label we get something
>       more intuitive:
> 
>         [c001fa46fde0] [c026719c] SyS_read+0x6c/0x110
>         [c001fa46fe30] [c0009264] system_call+0x38/0xd0
> 
>       ie. we were handling a system call, and it was SyS_read().
> 
> I think you know that, although you didn't mention it in the change log,
> because you've called the new symbol __system_call. But that is not a
> great name either because that's not what it does.

Yes, you're right. I used __system_call since I felt that it won't cause
confusion like syscall_exit did. I agree it's not a great name, but we
need _some_ label other than system_call if we want to allow probing at
this point. Also, if I'm reading this right, there is no other place to
probe if we want to capture all system call entries. So, I felt this
would be good to have.

> > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> > index 380361c0bb6a..e030ce34dd66 100644
> > --- a/arch/powerpc/kernel/entry_64.S
> > +++ b/arch/powerpc/kernel/entry_64.S
> > @@ -176,7 +176,7 @@ system_call:		/* label this so stack traces look sane */
> >  	mtctr   r12
> >  	bctrl			/* Call handler */
> >  
> > -.Lsyscall_exit:
> > +__system_call:
> >  	std	r3,RESULT(r1)
> >  	CURRENT_THREAD_INFO(r12, r1)
> 
> Why can't we kprobe the std and the rotate to current thread info?
> 
> Is the real no-probe point just here, prior to the clearing of MSR_RI ?
> 
> 	ld	r8,_MSR(r1)
> #ifdef CONFIG_PPC_BOOK3S
> 	/* No MSR:RI on BookE */

We can probe at all those places, just not once MSR_RI is unset. So,
the no-probe point is just *after* the mtmsrd. However, for kprobe
blacklisting, the granularity is at a function level (or ASM labels).
As such, we will have to blacklist all of syscall_exit/__system_call.

Regards,
Naveen
Re: [PATCH v3] KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller
Paul Mackerras writes: > To get this to compile for all my test configs takes this additional > patch. I test-build configs with PR KVM and not HV (both modular and > built-in) and a config with HV enabled but CONFIG_KVM_XICS=n. Please > squash this into your topic branch. Thanks, squashed and pushed as: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") cheers
[RFC PATCH 4.8] powerpc/slb: Force a full SLB flush when we insert for a bad EA
The SLB miss handler calls slb_allocate_realmode() in order to create an
SLB entry for the faulting address. At the very start of that function we
check that the faulting Effective Address (EA) is less than PGTABLE_RANGE
(ignoring the region), ie. is it an address which could possibly fit in
the virtual address space.

For an EA which fails that test, we branch out of line (to label 8), but
we still go on to create an SLB entry for the address. The SLB entry we
create has a VSID of 0, which means it will never match anything in the
hash table and so can't actually translate to a physical address. However
that SLB entry will be inserted in the SLB, and so needs to be managed
properly like any other SLB entry. In particular we need to insert the
SLB entry in the SLB cache, so that it will be flushed when the process
is descheduled.

And that is where the bugs begin. The first bug is that slb_finish_load()
uses cr7 to decide if it should insert the SLB entry into the SLB cache.
When we come from the invalid EA case we don't set cr7, it just has some
junk value from userspace. So we may or may not insert the SLB entry in
the SLB cache. If we fail to insert it, we may then incorrectly leave it
in the SLB when the process is descheduled.

The second bug is that even if we do happen to add the entry to the SLB
cache, we do not have enough bits in the SLB cache to remember the full
ESID value for very large EAs. For example if a process branches to
0x788c545a1800, that results in a 256MB SLB entry with an ESID of
0x788c545a1. But each entry in the SLB cache is only 32-bits, meaning we
truncate the ESID to 0x88c545a1. This has the same effect as the first
bug, we incorrectly leave the SLB entry in the SLB when the process is
descheduled.

When a process accesses an invalid EA it results in a SEGV signal being
sent to the process, which typically results in the process being killed.
Process death isn't instantaneous however, the process may catch the SEGV
signal and continue somehow, or the kernel may start writing a core dump
for the process, either of which means it's possible for the process to
be preempted while it's processing the SEGV but before it's been killed.

If that happens, when the process is scheduled back onto the CPU we will
allocate a new SLB entry for the NIP, which will insert a second entry
into the SLB for the bad EA. Because we never flushed the original entry,
due to either bug one or two, we now have two SLB entries that match the
same EA. If another access is made to that EA, either by the process
continuing after catching the SEGV, or by a second process accessing the
same bad EA on the same CPU, we will trigger an SLB multi-hit machine
check exception. This has been observed happening in the wild.

The fix is when we hit the invalid EA case, we mark the SLB cache as
being full. This causes us to not insert the truncated ESID into the SLB
cache, and means when the process is switched out we will flush the
entire SLB. Note that this works both for the original fault and for a
subsequent call to slb_allocate_realmode() from switch_slb().

Because we mark the SLB cache as full, it doesn't really matter what
value is in cr7, but rather than leaving it as something random we set
it to indicate the address was a kernel address. That also skips the
attempt to insert it in the SLB cache, which is a nice side effect.

Another way to fix the bug would be to make the entries in the SLB cache
wider, so that we don't truncate the ESID. However this would be a more
intrusive change as it alters the size and layout of the paca.

This bug was fixed in upstream by commit f0f558b131db ("powerpc/mm:
Preserve CFAR value on SLB miss caused by access to bogus address"),
which changed the way we handle a bad EA entirely, removing this bug in
the process.
Cc: sta...@vger.kernel.org
Signed-off-by: Michael Ellerman
---
 arch/powerpc/mm/slb_low.S | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/mm/slb_low.S b/arch/powerpc/mm/slb_low.S
index dfdb90cb4403..1348c4862b08 100644
--- a/arch/powerpc/mm/slb_low.S
+++ b/arch/powerpc/mm/slb_low.S
@@ -174,6 +174,16 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
 	b	slb_finish_load
 
 8:	/* invalid EA */
+	/*
+	 * It's possible the bad EA is too large to fit in the SLB cache, which
+	 * would mean we'd fail to invalidate it on context switch. So mark the
+	 * SLB cache as full so we force a full flush. We also set cr7+eq to
+	 * mark the address as a kernel address, so slb_finish_load() skips
+	 * trying to insert it into the SLB cache.
+	 */
+	li	r9,SLB_CACHE_ENTRIES + 1
+	sth	r9,PACASLBCACHEPTR(r13)
+	crset	4*cr7+eq
 	li	r10,0			/* BAD_VSID */
 	li	r9,0			/* BAD_VSID */
 	li	r11,SLB_VSID_USER	/* flags don't much matter */
-- 
2.7.4
[PATCH] Enabled pstore write for powerpc
After commit c950fd6f201a the kernel registers the pstore write callback
based on the flags set. Pstore write for powerpc is broken as the flag
(PSTORE_FLAGS_DMESG) is not set for the powerpc architecture. On panic,
the kernel doesn't write the message to /fs/pstore/dmesg* (the entry
doesn't get created at all).

This patch enables pstore write for the powerpc architecture by setting
the PSTORE_FLAGS_DMESG flag.

Fixes: c950fd6f201a ("pstore: Split pstore fragile flags")
Signed-off-by: Ankit Kumar
---
 arch/powerpc/kernel/nvram_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/nvram_64.c b/arch/powerpc/kernel/nvram_64.c
index d5e2b83..021db31 100644
--- a/arch/powerpc/kernel/nvram_64.c
+++ b/arch/powerpc/kernel/nvram_64.c
@@ -561,6 +561,7 @@ static ssize_t nvram_pstore_read(u64 *id, enum pstore_type_id *type,
 static struct pstore_info nvram_pstore_info = {
 	.owner = THIS_MODULE,
 	.name = "nvram",
+	.flags = PSTORE_FLAGS_DMESG,
 	.open = nvram_pstore_open,
 	.read = nvram_pstore_read,
 	.write = nvram_pstore_write,
-- 
2.7.4
Re: [PATCH] powerpc/kprobes: refactor kprobe_lookup_name for safer string operations
"Naveen N. Rao" writes: > Excerpts from Masami Hiramatsu's message of April 26, 2017 10:11: >> On Tue, 25 Apr 2017 21:37:11 +0530 >> "Naveen N. Rao" wrote: >>> - addr = (kprobe_opcode_t *)kallsyms_lookup_name(dot_name); >>> - if (!addr && dot_appended) { >>> - /* Let's try the original non-dot symbol lookup */ >>> + ret = strscpy(dot_name + len, c, KSYM_NAME_LEN); >>> + if (ret >= 0) >> >> Here, maybe you can skip the case of ret == 0. (Or, would we have >> a symbol which only has "."?) > > Ah, indeed. Good point. We just need the test to be (ret > 0). > > Michael, > If the rest of the patch is fine by you, would you be ok to make the > small change above? If not, please let me know and I'll re-spin. Thanks. I'd rather you change it, test and then resend. cheers
Re: [PATCH v2] cxl: Prevent IRQ storm
Andrew Donnellan writes: > On 27/04/17 11:37, Alastair D'Silva wrote: >> From: Alastair D'Silva >> >> In some situations, a faulty AFU slice may create an interrupt storm, >> rendering the machine unusable. Since these interrupts are informational >> only, present the interrupt once, then mask it off to prevent it from >> being retriggered until the card is reset. >> >> Changelog: >> v2 >> Rebase against linux-next > > The patch changelog shouldn't be part of the commit message - it should > go under a "---" line after the sign-off so it doesn't get included in > the final commit. > > Also now that I've taken a second look, I think the summary line of the > commit message could be more descriptive, something like: > > "cxl: mask slice error interrupt after first occurrence" ^ M :D cheers
Re: [PATCH 2/7] mm/follow_page_mask: Split follow_page_mask to smaller functions.
On Mon, Apr 17, 2017 at 10:41:41PM +0530, Aneesh Kumar K.V wrote: > Makes code reading easy. No functional changes in this patch. In a followup > patch, we will be updating the follow_page_mask to handle hugetlb hugepd > format > so that archs like ppc64 can switch to the generic version. This split helps > in doing that nicely. > > Signed-off-by: Aneesh Kumar K.V Reviewed-by: Naoya Horiguchi
Re: [PATCH 1/7] mm/hugetlb/migration: Use set_huge_pte_at instead of set_pte_at
On Mon, Apr 17, 2017 at 10:41:40PM +0530, Aneesh Kumar K.V wrote: > The right interface to use to set a hugetlb pte entry is set_huge_pte_at. Use > that instead of set_pte_at. > > Signed-off-by: Aneesh Kumar K.V Reviewed-by: Naoya Horiguchi
Re: [PATCH 3/7] mm/hugetlb: export hugetlb_entry_migration helper
On Mon, Apr 17, 2017 at 10:41:42PM +0530, Aneesh Kumar K.V wrote: > We will be using this later from the ppc64 code. Change the return type to > bool. > > Signed-off-by: Aneesh Kumar K.V Reviewed-by: Naoya Horiguchi
Re: powerpc/powernv: Fix missing attr initialisation in opal_export_attrs()
On Thu, 2017-04-27 at 01:37:32 UTC, Michael Ellerman wrote: > In opal_export_attrs() we dynamically allocate some bin_attributes. They're > allocated with kmalloc() and although we initialise most of the fields, we > don't > initialise write() or mmap(), and in particular we don't initialise the > lockdep > related fields in the embedded struct attribute. > > This leads to a lockdep warning at boot: > > BUG: key c000f11906d8 not in .data! > WARNING: CPU: 0 PID: 1 at ../kernel/locking/lockdep.c:3136 > lockdep_init_map+0x28c/0x2a0 > ... > Call Trace: > lockdep_init_map+0x288/0x2a0 (unreliable) > __kernfs_create_file+0x8c/0x170 > sysfs_add_file_mode_ns+0xc8/0x240 > __machine_initcall_powernv_opal_init+0x60c/0x684 > do_one_initcall+0x60/0x1c0 > kernel_init_freeable+0x2f4/0x3d4 > kernel_init+0x24/0x160 > ret_from_kernel_thread+0x5c/0xb0 > > Fix it by kzalloc'ing the attr, which fixes the uninitialised write() and > mmap(), and calling sysfs_bin_attr_init() on it to initialise the lockdep > fields. > > Fixes: 11fe909d2362 ("powerpc/powernv: Add OPAL exports attributes to sysfs") > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/83c4919058459c32138a1ebe35f72b cheers
Re: [v2,1/2] powerpc/mm/radix: Optimise Page Walk Cache flush
On Wed, 2017-04-26 at 13:27:19 UTC, Michael Ellerman wrote: > Currently we implement flushing of the page walk cache (PWC) by calling > _tlbiel_pid() with a RIC (Radix Invalidation Control) value of 1 which says to > only flush the PWC. > > But _tlbiel_pid() loops over each set (congruence class) of the TLB, which is > not necessary when we're just flushing the PWC. > > In fact the set argument is ignored for a PWC flush, so essentially we're just > flushing the PWC 127 extra times for no benefit. > > Fix it by adding tlbiel_pwc() which just does a single flush of the PWC. > > Signed-off-by: Aneesh Kumar K.V > [mpe: Split out of combined patch, drop _ in name, rewrite change log] > Signed-off-by: Michael Ellerman Series applied to powerpc next. https://git.kernel.org/powerpc/c/5a9853946c2e7a5ef9ef5302ecada6 cheers
Re: powerpc/powernv: Fix oops on P9 DD1 in cause_ipi()
On Wed, 2017-04-26 at 10:57:47 UTC, Michael Ellerman wrote: > Recently we merged the native xive support for Power9, and then separately > some > reworks for doorbell IPI support. In isolation both series were OK, but the > merged result had a bug in one case. > > On P9 DD1 we use pnv_p9_dd1_cause_ipi() which tries to use doorbells, and then > falls back to the interrupt controller. However the fallback is implemented by > calling icp_ops->cause_ipi. But now that xive support is merged we might be > using xive, in which case icp_ops is not initialised, it's a xics specific > structure. This leads to an oops such as: > > Unable to handle kernel paging request for data at address 0x0028 > Oops: Kernel access of bad area, sig: 11 [#1] > NIP pnv_p9_dd1_cause_ipi+0x74/0xe0 > LR smp_muxed_ipi_message_pass+0x54/0x70 > > To fix it, rather than using icp_ops which might be NULL, have both xics and > xive set smp_ops->cause_ipi, and then in the powernv code we save that as > ic_cause_ipi before overriding smp_ops->cause_ipi. For paranoia add a > WARN_ON() > to check if somehow smp_ops->cause_ipi is NULL. > > Fixes: b866cc2199d6 ("powerpc: Change the doorbell IPI calling convention") > Signed-off-by: Michael Ellerman Applied to powerpc next. https://git.kernel.org/powerpc/c/45b21cfeb22087795f0b49397fbe52 cheers
Re: [REBASED,v4,1/2] powerpc: split ftrace bits into a separate file
On Tue, 2017-04-25 at 13:55:53 UTC, "Naveen N. Rao" wrote: > entry_*.S now includes a lot more than just kernel entry/exit code. As a > first step at cleaning this up, let's split out the ftrace bits into > separate files. Also move all related tracing code into a new trace/ > subdirectory. > > No functional changes. > > Suggested-by: Michael Ellerman > Signed-off-by: Naveen N. Rao Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/4781f015d35ce2e83632b6a938093b cheers
Re: powerpc/mm: Fix possible out-of-bounds shift in arch_mmap_rnd()
On Tue, 2017-04-25 at 12:09:41 UTC, Michael Ellerman wrote: > The recent patch to add runtime configuration of the ASLR limits added a bug > in > arch_mmap_rnd() where we may shift an integer (32-bits) by up to 33 bits, > leading to undefined behaviour. > > In practice it exhibits as every process seg faulting instantly, presumably > because the rnd value hasn't been restricited by the modulus at all. We didn't > notice because it only happens under certain kernel configurations and if the > number of bits is actually set to a large value. > > Fix it by switching to unsigned long. > > Fixes: 9fea59bd7ca5 ("powerpc/mm: Add support for runtime configuration of > ASLR limits") > Reported-by: Balbir Singh > Signed-off-by: Michael Ellerman > Reviewed-by: Kees Cook Applied to powerpc next. https://git.kernel.org/powerpc/c/b409946b2a3c1ddcde75e5f35a77e0 cheers
Re: [v2] powerpc/mm: Fix page table dump build on PPC32
On Tue, 2017-04-18 at 06:20:13 UTC, Christophe Leroy wrote: > On PPC32 (ex: mpc885_ads_defconfig), page table dump compilation > fails as follows. This is because the memory layout is slightly > different on PPC32. This patch adapts it. > > CC arch/powerpc/mm/dump_linuxpagetables.o > arch/powerpc/mm/dump_linuxpagetables.c: In function 'walk_pagetables': > arch/powerpc/mm/dump_linuxpagetables.c:369:10: error: 'KERN_VIRT_START' > undeclared (first use in this function) > arch/powerpc/mm/dump_linuxpagetables.c:369:10: note: each undeclared > identifier is reported only once for each function it appears in > arch/powerpc/mm/dump_linuxpagetables.c: In function 'populate_markers': > arch/powerpc/mm/dump_linuxpagetables.c:383:37: error: 'ISA_IO_BASE' > undeclared (first use in this function) > arch/powerpc/mm/dump_linuxpagetables.c:384:37: error: 'ISA_IO_END' undeclared > (first use in this function) > arch/powerpc/mm/dump_linuxpagetables.c:385:37: error: 'PHB_IO_BASE' > undeclared (first use in this function) > arch/powerpc/mm/dump_linuxpagetables.c:386:37: error: 'PHB_IO_END' undeclared > (first use in this function) > arch/powerpc/mm/dump_linuxpagetables.c:387:37: error: 'IOREMAP_BASE' > undeclared (first use in this function) > arch/powerpc/mm/dump_linuxpagetables.c:388:37: error: 'IOREMAP_END' > undeclared (first use in this function) > arch/powerpc/mm/dump_linuxpagetables.c:392:38: error: 'VMEMMAP_BASE' > undeclared (first use in this function) > arch/powerpc/mm/dump_linuxpagetables.c: In function 'ptdump_show': > arch/powerpc/mm/dump_linuxpagetables.c:400:20: error: 'KERN_VIRT_START' > undeclared (first use in this function) > make[1]: *** [arch/powerpc/mm/dump_linuxpagetables.o] Error 1 > make: *** [arch/powerpc/mm] Error 2 > > Fixes: 8eb07b187000d ("powerpc/mm: Dump linux pagetables") > Signed-off-by: Christophe Leroy Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/2fab9fe1f9ff6836a82bf8bdb26e67 cheers
Re: powerpc/mm: Rename table dump file name
On Tue, 2017-04-18 at 06:20:15 UTC, Christophe Leroy wrote:
> Page table dump debugfs file is named 'kernel_page_tables' on
> all other architectures implementing it, while it is named
> 'kernel_pagetables' on powerpc. This patch renames it.
>
> Signed-off-by: Christophe Leroy

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/ec95e15862e31f8dfb6218ca111548

cheers
Re: [v2] powerpc/mm: Fix missing page attributes in page table dump
On Fri, 2017-04-14 at 05:45:16 UTC, Christophe Leroy wrote: > On some targets, _PAGE_RW is 0 and this is _PAGE_RO which is used. > There is also _PAGE_SHARED that is missing. > > Signed-off-by: Christophe Leroy Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/c99317953323d6251245022cb3af54 cheers
Re: powerpc/mm: On PPC32, display 32 bits addresses in page table dump
On Thu, 2017-04-13 at 12:41:40 UTC, Christophe Leroy wrote: > Signed-off-by: Christophe Leroy Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/e1f2c9d97d932812d17509e86246c6 cheers
Re: [PATCH v2 2/3] powerpc/kprobes: un-blacklist system_call() from kprobes
"Naveen N. Rao" writes:
> It is actually safe to probe system_call() in entry_64.S, but only till
> .Lsyscall_exit. To allow this, convert .Lsyscall_exit to a non-local
> symbol __system_call() and blacklist that symbol, rather than
> system_call().

I'm not sure I like this. The reason we made it a local symbol in the
first place is because it made backtraces look odd:

  commit 4c3b21686111e0ac6018469dacbc5549f9915cf8
  Author:     Michael Ellerman
  AuthorDate: Fri Dec 5 21:16:59 2014 +1100

      powerpc/kernel: Make syscall_exit a local label

      Currently when we back trace something that is in a syscall we
      see something like this:

        [c000] [c000] SyS_read+0x6c/0x110
        [c000] [c000] syscall_exit+0x0/0x98

      Although it's entirely correct, seeing syscall_exit at the bottom
      can be confusing - we were exiting from a syscall and then called
      SyS_read() ?

      If we instead change syscall_exit to be a local label we get
      something more intuitive:

        [c001fa46fde0] [c026719c] SyS_read+0x6c/0x110
        [c001fa46fe30] [c0009264] system_call+0x38/0xd0

      ie. we were handling a system call, and it was SyS_read().

I think you know that, although you didn't mention it in the change log,
because you've called the new symbol __system_call. But that is not a
great name either because that's not what it does.

> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 380361c0bb6a..e030ce34dd66 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -176,7 +176,7 @@ system_call:		/* label this so stack traces look sane */
>  	mtctr   r12
>  	bctrl			/* Call handler */
>  
> -.Lsyscall_exit:
> +__system_call:
>  	std	r3,RESULT(r1)
>  	CURRENT_THREAD_INFO(r12, r1)

Why can't we kprobe the std and the rotate to current thread info?

Is the real no-probe point just here, prior to the clearing of MSR_RI?

	ld	r8,_MSR(r1)
#ifdef CONFIG_PPC_BOOK3S
	/* No MSR:RI on BookE */

cheers
Re: [PATCH v3] KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller
To get this to compile for all my test configs takes this additional patch. I test-build configs with PR KVM and not HV (both modular and built-in) and a config with HV enabled but CONFIG_KVM_XICS=n. Please squash this into your topic branch. Paul. diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index c56939ecc554..24de532c1736 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -200,7 +200,7 @@ config KVM_XICS config KVM_XIVE bool default y - depends on KVM_XICS && PPC_XIVE_NATIVE + depends on KVM_XICS && PPC_XIVE_NATIVE && KVM_BOOK3S_HV_POSSIBLE source drivers/vhost/Kconfig diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c index 5c00813e1e0e..846b40cb3a62 100644 --- a/arch/powerpc/kvm/book3s_hv_builtin.c +++ b/arch/powerpc/kvm/book3s_hv_builtin.c @@ -513,6 +513,7 @@ static long kvmppc_read_one_intr(bool *again) return kvmppc_check_passthru(xisr, xirr, again); } +#ifdef CONFIG_KVM_XICS static inline bool is_rm(void) { return !(mfmsr() & MSR_DR); @@ -591,3 +592,4 @@ int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr) } else return xics_rm_h_eoi(vcpu, xirr); } +#endif /* CONFIG_KVM_XICS */ diff --git a/arch/powerpc/kvm/book3s_xics.h b/arch/powerpc/kvm/book3s_xics.h index 5016676847c9..453c9e518c19 100644 --- a/arch/powerpc/kvm/book3s_xics.h +++ b/arch/powerpc/kvm/book3s_xics.h @@ -10,6 +10,7 @@ #ifndef _KVM_PPC_BOOK3S_XICS_H #define _KVM_PPC_BOOK3S_XICS_H +#ifdef CONFIG_KVM_XICS /* * We use a two-level tree to store interrupt source information. 
* There are up to 1024 ICS nodes, each of which can represent @@ -150,4 +151,5 @@ extern int xics_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long server, extern int xics_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); extern int xics_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr); +#endif /* CONFIG_KVM_XICS */ #endif /* _KVM_PPC_BOOK3S_XICS_H */ diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index fcccfbc2c4f4..5938f7644dc1 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -9,6 +9,7 @@ #ifndef _KVM_PPC_BOOK3S_XIVE_H #define _KVM_PPC_BOOK3S_XIVE_H +#ifdef CONFIG_KVM_XICS #include "book3s_xics.h" /* @@ -251,4 +252,5 @@ extern int (*__xive_vm_h_ipi)(struct kvm_vcpu *vcpu, unsigned long server, extern int (*__xive_vm_h_cppr)(struct kvm_vcpu *vcpu, unsigned long cppr); extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr); +#endif /* CONFIG_KVM_XICS */ #endif /* _KVM_PPC_BOOK3S_XICS_H */ diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c index 9d312c96a897..6feac0a758e1 100644 --- a/arch/powerpc/sysdev/xive/native.c +++ b/arch/powerpc/sysdev/xive/native.c @@ -267,6 +267,7 @@ static int xive_native_get_ipi(unsigned int cpu, struct xive_cpu *xc) } return 0; } +#endif /* CONFIG_SMP */ u32 xive_native_alloc_irq(void) { @@ -295,6 +296,7 @@ void xive_native_free_irq(u32 irq) } EXPORT_SYMBOL_GPL(xive_native_free_irq); +#ifdef CONFIG_SMP static void xive_native_put_ipi(unsigned int cpu, struct xive_cpu *xc) { s64 rc;
[PATCH v5 3/3] kdump: Protect vmcoreinfo data under the crash memory
Currently vmcoreinfo data is updated at boot time in subsys_initcall();
it has the risk of being modified by some wrong code while the system is
running. As a result, the vmcore dumped may contain the wrong
vmcoreinfo. Later on, when using "crash", "makedumpfile", etc. utilities
to parse this vmcore, we probably will get "Segmentation fault" or other
unexpected errors.

E.g. 1) wrong code overwrites vmcoreinfo_data; 2) something further
crashes the system; 3) kdump is triggered, then we obviously will fail
to recognize the crash context correctly due to the corrupted
vmcoreinfo.

Now except for vmcoreinfo, all the crash data is well protected
(including the cpu note which is fully updated in the crash path, thus
its correctness is guaranteed). Given that vmcoreinfo data is a large
chunk prepared for kdump, we had better protect it as well.

To solve this, we relocate and copy vmcoreinfo_data to the crash memory
when kdump is loading via the kexec syscalls. Because the whole crash
memory will be protected by the existing arch_kexec_protect_crashkres()
mechanism, we naturally protect vmcoreinfo_data from write (even read)
access under the kernel direct mapping after kdump is loaded.

Since kdump is usually loaded at a very early stage after boot, we can
trust the correctness of the vmcoreinfo data copied.

On the other hand, we still need to operate on the vmcoreinfo safe copy
when a crash happens, to generate vmcoreinfo_note again; we rely on
vmap() to map out a new kernel virtual address and update to use this
new one instead in the following crash_save_vmcoreinfo().

BTW, we do not touch vmcoreinfo_note, because it will be fully updated
using the protected vmcoreinfo_data after crash, which is surely
correct, just like the cpu crash note.

Tested-by: Michael Holzheu
Signed-off-by: Xunlei Pang
---
v4->v5:
- Moved vunmap(image->vmcoreinfo_data_copy) above to avoid confusion.
- No functional change.
v3->v4: -Rebased on the latest linux-next -Copy vmcoreinfo after machine_kexec_prepare() include/linux/crash_core.h | 2 +- include/linux/kexec.h | 2 ++ kernel/crash_core.c| 17 - kernel/kexec.c | 8 kernel/kexec_core.c| 39 +++ kernel/kexec_file.c| 8 6 files changed, 74 insertions(+), 2 deletions(-) diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index 4555c09..e9de6b4 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -23,6 +23,7 @@ typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4]; +void crash_update_vmcoreinfo_safecopy(void *ptr); void crash_save_vmcoreinfo(void); void arch_crash_save_vmcoreinfo(void); __printf(1, 2) @@ -54,7 +55,6 @@ vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long)value) extern u32 *vmcoreinfo_note; -extern size_t vmcoreinfo_size; Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type, void *data, size_t data_len); diff --git a/include/linux/kexec.h b/include/linux/kexec.h index c9481eb..3ea8275 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -181,6 +181,7 @@ struct kimage { unsigned long start; struct page *control_code_page; struct page *swap_page; + void *vmcoreinfo_data_copy; /* locates in the crash memory */ unsigned long nr_segments; struct kexec_segment segment[KEXEC_SEGMENT_MAX]; @@ -250,6 +251,7 @@ extern void *kexec_purgatory_get_symbol_addr(struct kimage *image, int kexec_should_crash(struct task_struct *); int kexec_crash_loaded(void); void crash_save_cpu(struct pt_regs *regs, int cpu); +extern int kimage_crash_copy_vmcoreinfo(struct kimage *image); extern struct kimage *kexec_image; extern struct kimage *kexec_crash_image; diff --git a/kernel/crash_core.c b/kernel/crash_core.c index c2fd0d2..4a4a4ba 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -15,9 +15,12 @@ /* vmcoreinfo stuff */ static unsigned char *vmcoreinfo_data; -size_t vmcoreinfo_size; +static size_t vmcoreinfo_size; u32 *vmcoreinfo_note; +/* trusted vmcoreinfo, e.g. 
we can make a copy in the crash memory */ +static unsigned char *vmcoreinfo_data_safecopy; + /* * parsing the "crashkernel" commandline * @@ -323,11 +326,23 @@ static void update_vmcoreinfo_note(void) final_note(buf); } +void crash_update_vmcoreinfo_safecopy(void *ptr) +{ + if (ptr) + memcpy(ptr, vmcoreinfo_data, vmcoreinfo_size); + + vmcoreinfo_data_safecopy = ptr; +} + void crash_save_vmcoreinfo(void) { if (!vmcoreinfo_note) return; + /* Use the safe copy to generate vmcoreinfo note if have */ + if (vmcoreinfo_data_safecopy) + vmcoreinfo_data = vmcoreinfo_data_safecopy; + vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds()); update_vmcoreinfo_note(); } diff --git a/kernel/kexec.c b/kernel/kexec.c index 980936a..e62ec4d 100644 --- a/kernel/kexec.c +++ b/kernel/k
[PATCH v5 2/3] powerpc/fadump: Use the correct VMCOREINFO_NOTE_SIZE for phdr
vmcoreinfo_max_size stands for the vmcoreinfo_data, the correct one we should use is vmcoreinfo_note whose total size is VMCOREINFO_NOTE_SIZE. Like explained in commit 77019967f06b ("kdump: fix exported size of vmcoreinfo note"), it should not affect the actual function, but we better fix it, also this change should be safe and backward compatible. After this, we can get rid of variable vmcoreinfo_max_size, let's use the corresponding macros directly, fewer variables means more safety for vmcoreinfo operation. Cc: Hari Bathini Reviewed-by: Mahesh Salgaonkar Reviewed-by: Dave Young Signed-off-by: Xunlei Pang --- v4->v5: No change. v3->v4: -Rebased on the latest linux-next arch/powerpc/kernel/fadump.c | 3 +-- include/linux/crash_core.h | 1 - kernel/crash_core.c | 3 +-- 3 files changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c index 466569e..7bd6cd0 100644 --- a/arch/powerpc/kernel/fadump.c +++ b/arch/powerpc/kernel/fadump.c @@ -893,8 +893,7 @@ static int fadump_create_elfcore_headers(char *bufp) phdr->p_paddr = fadump_relocate(paddr_vmcoreinfo_note()); phdr->p_offset = phdr->p_paddr; - phdr->p_memsz = vmcoreinfo_max_size; - phdr->p_filesz = vmcoreinfo_max_size; + phdr->p_memsz = phdr->p_filesz = VMCOREINFO_NOTE_SIZE; /* Increment number of program headers. 
*/ (elf->e_phnum)++; diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index ec9d415..4555c09 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -55,7 +55,6 @@ extern u32 *vmcoreinfo_note; extern size_t vmcoreinfo_size; -extern size_t vmcoreinfo_max_size; Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type, void *data, size_t data_len); diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 2837d61..c2fd0d2 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -16,7 +16,6 @@ /* vmcoreinfo stuff */ static unsigned char *vmcoreinfo_data; size_t vmcoreinfo_size; -size_t vmcoreinfo_max_size = VMCOREINFO_BYTES; u32 *vmcoreinfo_note; /* @@ -343,7 +342,7 @@ void vmcoreinfo_append_str(const char *fmt, ...) r = vscnprintf(buf, sizeof(buf), fmt, args); va_end(args); - r = min(r, vmcoreinfo_max_size - vmcoreinfo_size); + r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size); memcpy(&vmcoreinfo_data[vmcoreinfo_size], buf, r); -- 1.8.3.1
[PATCH v5 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section
As Eric said, "what we need to do is move the variable vmcoreinfo_note out of the kernel's .bss section. And modify the code to regenerate and keep this information in something like the control page. Definitely something like this needs a page all to itself, and ideally far away from any other kernel data structures. I clearly was not watching closely the data someone decided to keep this silly thing in the kernel's .bss section." This patch allocates extra pages for these vmcoreinfo_XXX variables, one advantage is that it enhances some safety of vmcoreinfo, because vmcoreinfo now is kept far away from other kernel data structures. Cc: Juergen Gross Suggested-by: Eric Biederman Tested-by: Michael Holzheu Reviewed-by: Juergen Gross Signed-off-by: Xunlei Pang --- v4->v5: Changed VMCOREINFO_BYTES definition to PAGE_SIZE according to Dave's comment v3->v4: -Rebased on the latest linux-next -Handle S390 vmcoreinfo_note properly -Handle the newly-added xen/mmu_pv.c arch/ia64/kernel/machine_kexec.c | 5 - arch/s390/kernel/machine_kexec.c | 1 + arch/s390/kernel/setup.c | 6 -- arch/x86/kernel/crash.c | 2 +- arch/x86/xen/mmu_pv.c| 4 ++-- include/linux/crash_core.h | 4 ++-- kernel/crash_core.c | 26 ++ kernel/ksysfs.c | 2 +- 8 files changed, 29 insertions(+), 21 deletions(-) diff --git a/arch/ia64/kernel/machine_kexec.c b/arch/ia64/kernel/machine_kexec.c index 599507b..c14815d 100644 --- a/arch/ia64/kernel/machine_kexec.c +++ b/arch/ia64/kernel/machine_kexec.c @@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void) #endif } -phys_addr_t paddr_vmcoreinfo_note(void) -{ - return ia64_tpa((unsigned long)(char *)&vmcoreinfo_note); -} - diff --git a/arch/s390/kernel/machine_kexec.c b/arch/s390/kernel/machine_kexec.c index 49a6bd4..3d0b14a 100644 --- a/arch/s390/kernel/machine_kexec.c +++ b/arch/s390/kernel/machine_kexec.c @@ -246,6 +246,7 @@ void arch_crash_save_vmcoreinfo(void) VMCOREINFO_SYMBOL(lowcore_ptr); VMCOREINFO_SYMBOL(high_memory); VMCOREINFO_LENGTH(lowcore_ptr, 
NR_CPUS); + mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note()); } void machine_shutdown(void) diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c index 3ae756c..3d1d808 100644 --- a/arch/s390/kernel/setup.c +++ b/arch/s390/kernel/setup.c @@ -496,11 +496,6 @@ static void __init setup_memory_end(void) pr_notice("The maximum memory size is %luMB\n", memory_end >> 20); } -static void __init setup_vmcoreinfo(void) -{ - mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note()); -} - #ifdef CONFIG_CRASH_DUMP /* @@ -939,7 +934,6 @@ void __init setup_arch(char **cmdline_p) #endif setup_resources(); - setup_vmcoreinfo(); setup_lowcore(); smp_fill_possible_mask(); cpu_detect_mhz_feature(); diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index 22217ec..44404e2 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -457,7 +457,7 @@ static int prepare_elf64_headers(struct crash_elf_data *ced, bufp += sizeof(Elf64_Phdr); phdr->p_type = PT_NOTE; phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note(); - phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note); + phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE; (ehdr->e_phnum)++; #ifdef CONFIG_X86_64 diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 9d9ae66..35543fa 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -2723,8 +2723,8 @@ void xen_destroy_contiguous_region(phys_addr_t pstart, unsigned int order) phys_addr_t paddr_vmcoreinfo_note(void) { if (xen_pv_domain()) - return virt_to_machine(&vmcoreinfo_note).maddr; + return virt_to_machine(vmcoreinfo_note).maddr; else - return __pa_symbol(&vmcoreinfo_note); + return __pa(vmcoreinfo_note); } #endif /* CONFIG_KEXEC_CORE */ diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index eb71a70..ec9d415 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -14,7 +14,7 @@ CRASH_CORE_NOTE_NAME_BYTES + \ CRASH_CORE_NOTE_DESC_BYTES) -#define 
VMCOREINFO_BYTES (4096) +#define VMCOREINFO_BYTES PAGE_SIZE #define VMCOREINFO_NOTE_NAME "VMCOREINFO" #define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4) #define VMCOREINFO_NOTE_SIZE ((CRASH_CORE_NOTE_HEAD_BYTES * 2) + \ @@ -53,7 +53,7 @@ #define VMCOREINFO_PHYS_BASE(value) \ vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long)value) -extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4]; +extern u32 *vmcoreinfo_note; extern size_t vmcoreinfo_size; extern size_t vmcoreinfo_max_size; diff --git a
[PATCH] powerpc/pseries hotplug: prevent reserved memory from being removed
E.g. after fadump reserves memory regions, these regions should not be
removed before fadump explicitly frees them.

Signed-off-by: Pingfan Liu
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index e104c71..201be23 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -346,6 +346,8 @@ static int pseries_remove_memblock(unsigned long base, unsigned int memblock_siz
 	if (!pfn_valid(start_pfn))
 		goto out;
 
+	if (memblock_is_reserved(base))
+		return -EINVAL;
 	block_sz = pseries_memory_block_size();
 	sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
@@ -388,8 +390,7 @@ static int pseries_remove_mem_node(struct device_node *np)
 	base = be64_to_cpu(*(unsigned long *)regs);
 	lmb_size = be32_to_cpu(regs[3]);
 
-	pseries_remove_memblock(base, lmb_size);
-	return 0;
+	return pseries_remove_memblock(base, lmb_size);
 }
 
 static bool lmb_is_removable(struct of_drconf_cell *lmb)
-- 
2.7.4
[PATCH v2 2/3] powerpc/kprobes: un-blacklist system_call() from kprobes
It is actually safe to probe system_call() in entry_64.S, but only up to
.Lsyscall_exit. To allow this, convert .Lsyscall_exit to a non-local symbol
__system_call() and blacklist that symbol, rather than system_call().

Reviewed-by: Masami Hiramatsu
Signed-off-by: Naveen N. Rao
---
 arch/powerpc/kernel/entry_64.S | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 380361c0bb6a..e030ce34dd66 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -176,7 +176,7 @@ system_call:		/* label this so stack traces look sane */
 	mtctr	r12
 	bctrl			/* Call handler */

-.Lsyscall_exit:
+__system_call:
 	std	r3,RESULT(r1)
 	CURRENT_THREAD_INFO(r12, r1)

@@ -294,12 +294,12 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	blt+	system_call

 	/* Return code is already in r3 thanks to do_syscall_trace_enter() */
-	b	.Lsyscall_exit
+	b	__system_call

 .Lsyscall_enosys:
 	li	r3,-ENOSYS
-	b	.Lsyscall_exit
+	b	__system_call

 .Lsyscall_exit_work:
 #ifdef CONFIG_PPC_BOOK3S
@@ -388,7 +388,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	b	.	/* prevent speculative execution */
 #endif
 _ASM_NOKPROBE_SYMBOL(system_call_common);
-_ASM_NOKPROBE_SYMBOL(system_call);
+_ASM_NOKPROBE_SYMBOL(__system_call);

 /* Save non-volatile GPRs, if not already saved. */
 _GLOBAL(save_nvgprs)
@@ -413,38 +413,38 @@ _GLOBAL(save_nvgprs)
 _GLOBAL(ppc_fork)
 	bl	save_nvgprs
 	bl	sys_fork
-	b	.Lsyscall_exit
+	b	__system_call

 _GLOBAL(ppc_vfork)
 	bl	save_nvgprs
 	bl	sys_vfork
-	b	.Lsyscall_exit
+	b	__system_call

 _GLOBAL(ppc_clone)
 	bl	save_nvgprs
 	bl	sys_clone
-	b	.Lsyscall_exit
+	b	__system_call

 _GLOBAL(ppc32_swapcontext)
 	bl	save_nvgprs
 	bl	compat_sys_swapcontext
-	b	.Lsyscall_exit
+	b	__system_call

 _GLOBAL(ppc64_swapcontext)
 	bl	save_nvgprs
 	bl	sys_swapcontext
-	b	.Lsyscall_exit
+	b	__system_call

 _GLOBAL(ppc_switch_endian)
 	bl	save_nvgprs
 	bl	sys_switch_endian
-	b	.Lsyscall_exit
+	b	__system_call

 _GLOBAL(ret_from_fork)
 	bl	schedule_tail
 	REST_NVGPRS(r1)
 	li	r3,0
-	b	.Lsyscall_exit
+	b	__system_call

 _GLOBAL(ret_from_kernel_thread)
 	bl	schedule_tail
@@ -456,7 +456,7 @@ _GLOBAL(ret_from_kernel_thread)
 #endif
 	blrl
 	li	r3,0
-	b	.Lsyscall_exit
+	b	__system_call

 /*
  * This routine switches between two different tasks.  The process
--
2.12.2
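For context on what blacklisting a symbol buys: each `_ASM_NOKPROBE_SYMBOL()` / `NOKPROBE_SYMBOL()` use records a function's address range, and probe registration rejects addresses inside any recorded range. The sketch below models that lookup; it is not the kernel's code, and the addresses and the names `blacklist_entry` / `within_blacklist` are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of the kprobes blacklist: one entry per
 * blacklisted function.  Address values here are made up. */
struct blacklist_entry { unsigned long long start_addr, end_addr; };

static const struct blacklist_entry kprobe_blacklist[] = {
	{ 0xc000000000001000ULL, 0xc0000000000010c0ULL },	/* e.g. system_call_common */
	{ 0xc000000000002000ULL, 0xc000000000002040ULL },	/* e.g. __system_call */
};

/* Roughly what a blacklist check does at probe-registration time:
 * refuse any probe address that lands inside a blacklisted function. */
static bool within_blacklist(unsigned long long addr)
{
	size_t n = sizeof(kprobe_blacklist) / sizeof(kprobe_blacklist[0]);

	for (size_t i = 0; i < n; i++)
		if (addr >= kprobe_blacklist[i].start_addr &&
		    addr < kprobe_blacklist[i].end_addr)
			return true;
	return false;
}
```

This is why converting `.Lsyscall_exit` to the non-local `__system_call` matters: only a symbol visible in the symbol table gets its own blacklist entry and its own range.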
[PATCH v2 3/3] powerpc/kprobes: blacklist functions invoked on a trap
Blacklist all functions involved while handling a trap. We:
- convert some of the labels into private labels,
- remove the duplicate 'restore' label, and
- blacklist most functions involved while handling a trap.

Reviewed-by: Masami Hiramatsu
Signed-off-by: Naveen N. Rao
---
 arch/powerpc/kernel/entry_64.S       | 47 +---
 arch/powerpc/kernel/exceptions-64s.S |  1 +
 arch/powerpc/kernel/traps.c          |  3 +++
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index e030ce34dd66..e7e05eb590a5 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -184,7 +184,7 @@ __system_call:
 #ifdef CONFIG_PPC_BOOK3S
 	/* No MSR:RI on BookE */
 	andi.	r10,r8,MSR_RI
-	beq-	unrecov_restore
+	beq-	.Lunrecov_restore
 #endif
 	/*
 	 * Disable interrupts so current_thread_info()->flags can't change,
@@ -399,6 +399,7 @@ _GLOBAL(save_nvgprs)
 	clrrdi	r0,r11,1
 	std	r0,_TRAP(r1)
 	blr
+_ASM_NOKPROBE_SYMBOL(save_nvgprs);

 /*
@@ -642,18 +643,18 @@ _GLOBAL(ret_from_except_lite)
 	 * Use the internal debug mode bit to do this.
 	 */
 	andis.	r0,r3,DBCR0_IDM@h
-	beq	restore
+	beq	fast_exc_return_irq
 	mfmsr	r0
 	rlwinm	r0,r0,0,~MSR_DE	/* Clear MSR.DE */
 	mtmsr	r0
 	mtspr	SPRN_DBCR0,r3
 	li	r10, -1
 	mtspr	SPRN_DBSR,r10
-	b	restore
+	b	fast_exc_return_irq
 #else
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	bl	restore_math
-	b	restore
+	b	fast_exc_return_irq
 #endif
 1:	andi.	r0,r4,_TIF_NEED_RESCHED
 	beq	2f
@@ -666,7 +667,7 @@ _GLOBAL(ret_from_except_lite)
 	bne	3f		/* only restore TM if nothing else to do */
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	bl	restore_tm_state
-	b	restore
+	b	fast_exc_return_irq
 3:
 #endif
 	bl	save_nvgprs
@@ -718,14 +719,14 @@ resume_kernel:
 #ifdef CONFIG_PREEMPT
 	/* Check if we need to preempt */
 	andi.	r0,r4,_TIF_NEED_RESCHED
-	beq+	restore
+	beq+	fast_exc_return_irq
 	/* Check that preempt_count() == 0 and interrupts are enabled */
 	lwz	r8,TI_PREEMPT(r9)
 	cmpwi	cr1,r8,0
 	ld	r0,SOFTE(r1)
 	cmpdi	r0,0
 	crandc	eq,cr1*4+eq,eq
-	bne	restore
+	bne	fast_exc_return_irq

 	/*
 	 * Here we are preempting the current task. We want to make
@@ -756,7 +757,6 @@ resume_kernel:

 	.globl	fast_exc_return_irq
 fast_exc_return_irq:
-restore:
 	/*
 	 * This is the main kernel exit path. First we check if we
 	 * are about to re-enable interrupts
 	 */
@@ -764,11 +764,11 @@ restore:
 	ld	r5,SOFTE(r1)
 	lbz	r6,PACASOFTIRQEN(r13)
 	cmpwi	cr0,r5,0
-	beq	restore_irq_off
+	beq	.Lrestore_irq_off

 	/* We are enabling, were we already enabled ? Yes, just return */
 	cmpwi	cr0,r6,1
-	beq	cr0,do_restore
+	beq	cr0,.Ldo_restore

 	/*
 	 * We are about to soft-enable interrupts (we are hard disabled
@@ -777,14 +777,14 @@
 	 */
 	lbz	r0,PACAIRQHAPPENED(r13)
 	cmpwi	cr0,r0,0
-	bne-	restore_check_irq_replay
+	bne-	.Lrestore_check_irq_replay

 	/*
 	 * Get here when nothing happened while soft-disabled, just
 	 * soft-enable and move-on. We will hard-enable as a side
 	 * effect of rfi
 	 */
-restore_no_replay:
+.Lrestore_no_replay:
 	TRACE_ENABLE_INTS
 	li	r0,1
 	stb	r0,PACASOFTIRQEN(r13);
@@ -792,7 +792,7 @@ restore_no_replay:
 	/*
 	 * Final return path. BookE is handled in a different file
 	 */
-do_restore:
+.Ldo_restore:
 #ifdef CONFIG_PPC_BOOK3E
 	b	exception_return_book3e
 #else
@@ -826,7 +826,7 @@ fast_exception_return:
 	REST_8GPRS(5, r1)

 	andi.	r0,r3,MSR_RI
-	beq-	unrecov_restore
+	beq-	.Lunrecov_restore

 	/* Load PPR from thread struct before we clear MSR:RI */
 BEGIN_FTR_SECTION
@@ -884,7 +884,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	 * make sure that in this case, we also clear PACA_IRQ_HARD_DIS
 	 * or that bit can get out of sync and bad things will happen
 	 */
-restore_irq_off:
+.Lrestore_irq_off:
 	ld	r3,_MSR(r1)
 	lbz	r7,PACAIRQHAPPENED(r13)
 	andi.	r0,r3,MSR_EE
@@ -894,13 +894,13 @@ restore_irq_off:
 1:	li	r0,0
 	stb	r0,PACASOFTIRQEN(r13);
 	TRACE_DISABLE_INTS
-	b	do_restore
+	b	.Ldo_restore

 	/*
 	 * Something did happen, check if a re-emit is needed
 	 * (this also clears paca->irq_happened)
 	 */
-restore_check_irq_replay:
+.Lre
[PATCH v2 1/3] powerpc/kprobes: cleanup system_call_common and blacklist it from kprobes
Convert some of the labels into private labels and blacklist
system_call_common() and system_call() from kprobes. We can't take a trap
at parts of these functions as either MSR_RI is unset or the kernel stack
pointer is not yet set up.

Reviewed-by: Masami Hiramatsu
Signed-off-by: Naveen N. Rao
---
 arch/powerpc/kernel/entry_64.S | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 9b541d22595a..380361c0bb6a 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -52,12 +52,11 @@ exception_marker:
 	.section	".text"
 	.align 7

-	.globl system_call_common
-system_call_common:
+_GLOBAL(system_call_common)
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 BEGIN_FTR_SECTION
 	extrdi.	r10, r12, 1, (63-MSR_TS_T_LG) /* transaction active? */
-	bne	tabort_syscall
+	bne	.Ltabort_syscall
 END_FTR_SECTION_IFSET(CPU_FTR_TM)
 #endif
 	andi.	r10,r12,MSR_PR
@@ -152,9 +151,9 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
 	CURRENT_THREAD_INFO(r11, r1)
 	ld	r10,TI_FLAGS(r11)
 	andi.	r11,r10,_TIF_SYSCALL_DOTRACE
-	bne	syscall_dotrace		/* does not return */
+	bne	.Lsyscall_dotrace	/* does not return */
 	cmpldi	0,r0,NR_syscalls
-	bge-	syscall_enosys
+	bge-	.Lsyscall_enosys

 system_call:			/* label this so stack traces look sane */
 	/*
@@ -208,7 +207,7 @@ system_call:		/* label this so stack traces look sane */
 	ld	r9,TI_FLAGS(r12)
 	li	r11,-MAX_ERRNO
 	andi.	r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
-	bne-	syscall_exit_work
+	bne-	.Lsyscall_exit_work

 	andi.	r0,r8,MSR_FP
 	beq	2f
@@ -232,7 +231,7 @@ system_call:		/* label this so stack traces look sane */
 3:	cmpld	r3,r11
 	ld	r5,_CCR(r1)
-	bge-	syscall_error
+	bge-	.Lsyscall_error
 .Lsyscall_error_cont:
 	ld	r7,_NIP(r1)
 BEGIN_FTR_SECTION
@@ -258,14 +257,14 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	RFI
 	b	.	/* prevent speculative execution */

-syscall_error:
+.Lsyscall_error:
 	oris	r5,r5,0x1000	/* Set SO bit in CR */
 	neg	r3,r3
 	std	r5,_CCR(r1)
 	b	.Lsyscall_error_cont

 /* Traced system call support */
-syscall_dotrace:
+.Lsyscall_dotrace:
 	bl	save_nvgprs
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	bl	do_syscall_trace_enter
@@ -298,11 +297,11 @@ syscall_dotrace:

 	b	.Lsyscall_exit

-syscall_enosys:
+.Lsyscall_enosys:
 	li	r3,-ENOSYS
 	b	.Lsyscall_exit

-syscall_exit_work:
+.Lsyscall_exit_work:
 #ifdef CONFIG_PPC_BOOK3S
 	li	r10,MSR_RI
 	mtmsrd	r10,1		/* Restore RI */
@@ -362,7 +361,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	b	ret_from_except

 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-tabort_syscall:
+.Ltabort_syscall:
 	/* Firstly we need to enable TM in the kernel */
 	mfmsr	r10
 	li	r9, 1
@@ -388,6 +387,8 @@ tabort_syscall:
 	rfid
 	b	.	/* prevent speculative execution */
 #endif
+_ASM_NOKPROBE_SYMBOL(system_call_common);
+_ASM_NOKPROBE_SYMBOL(system_call);

 /* Save non-volatile GPRs, if not already saved. */
 _GLOBAL(save_nvgprs)
--
2.12.2
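The `.Lsyscall_error` path above encodes the powerpc syscall return convention: a handler return value in [-MAX_ERRNO, -1] is reported to userspace as a positive errno with the SO bit set in CR0. A userspace sketch of that logic (not the kernel's code; `sc_result` and `syscall_exit` are names made up for this model):

```c
#include <stdbool.h>

#define MAX_ERRNO	4095L

struct sc_result {
	long r3;	/* value userspace sees in r3 */
	bool so;	/* CR0.SO: the "syscall failed" flag on powerpc */
};

/* Models the exit sequence: li r11,-MAX_ERRNO; cmpld r3,r11 (unsigned
 * compare); bge- .Lsyscall_error, which does oris r5,r5,0x1000 (set SO
 * in the saved CR) and neg r3,r3. */
static struct sc_result syscall_exit(long ret)
{
	struct sc_result res = { ret, false };

	if ((unsigned long)ret >= (unsigned long)-MAX_ERRNO) {
		res.so = true;	/* error: flag via SO bit */
		res.r3 = -ret;	/* and hand back a positive errno */
	}
	return res;
}
```

The unsigned comparison is what makes this cheap: only the top 4095 values of the unsigned range are treated as errors, so large valid pointers returned by e.g. mmap pass through untouched.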
[PATCH v2 0/3] powerpc: build out kprobes blacklist
v2 changes:
- Patches 3 and 4 from the previous series have been merged.
- Updated to no longer blacklist functions involved with stolen time accounting.

v1: https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg117514.html

--
This is the second in the series of patches to build out an appropriate
kprobes blacklist. This series blacklists system_call() and functions
invoked while handling the trap itself. Not everything is covered, but this
is the first set of functions that I have tested with. More patches to
follow once I expand my tests.

I have converted many labels into private ones -- these are labels that I
felt are not necessary for reading stack traces. If any of those are
important to have, please let me know.

- Naveen

Naveen N. Rao (3):
  powerpc/kprobes: cleanup system_call_common and blacklist it from kprobes
  powerpc/kprobes: un-blacklist system_call() from kprobes
  powerpc/kprobes: blacklist functions invoked on a trap

 arch/powerpc/kernel/entry_64.S       | 94 +++-
 arch/powerpc/kernel/exceptions-64s.S |  1 +
 arch/powerpc/kernel/traps.c          |  3 ++
 3 files changed, 55 insertions(+), 43 deletions(-)

--
2.12.2
Re: [PATCH v4 2/3] powerpc/fadump: Use the correct VMCOREINFO_NOTE_SIZE for phdr
On 04/26/2017 12:41 PM, Dave Young wrote:
> Ccing ppc list
> On 04/20/17 at 07:39pm, Xunlei Pang wrote:
>> vmcoreinfo_max_size stands for the vmcoreinfo_data, the
>> correct one we should use is vmcoreinfo_note whose total
>> size is VMCOREINFO_NOTE_SIZE.
>>
>> Like explained in commit 77019967f06b ("kdump: fix exported
>> size of vmcoreinfo note"), it should not affect the actual
>> function, but we better fix it, also this change should be
>> safe and backward compatible.
>>
>> After this, we can get rid of variable vmcoreinfo_max_size,
>> let's use the corresponding macros directly, fewer variables
>> means more safety for vmcoreinfo operation.
>>
>> Cc: Mahesh Salgaonkar
>> Cc: Hari Bathini
>> Signed-off-by: Xunlei Pang

Reviewed-by: Mahesh Salgaonkar

Thanks,
-Mahesh.

>> ---
>> v3->v4:
>> - Rebased on the latest linux-next
>>
>>  arch/powerpc/kernel/fadump.c | 3 +--
>>  include/linux/crash_core.h   | 1 -
>>  kernel/crash_core.c          | 3 +--
>>  3 files changed, 2 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index 466569e..7bd6cd0 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -893,8 +893,7 @@ static int fadump_create_elfcore_headers(char *bufp)
>>
>>  	phdr->p_paddr	= fadump_relocate(paddr_vmcoreinfo_note());
>>  	phdr->p_offset	= phdr->p_paddr;
>> -	phdr->p_memsz	= vmcoreinfo_max_size;
>> -	phdr->p_filesz	= vmcoreinfo_max_size;
>> +	phdr->p_memsz	= phdr->p_filesz = VMCOREINFO_NOTE_SIZE;
>>
>>  	/* Increment number of program headers. */
>>  	(elf->e_phnum)++;
>> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
>> index ba283a2..7d6bc7b 100644
>> --- a/include/linux/crash_core.h
>> +++ b/include/linux/crash_core.h
>> @@ -55,7 +55,6 @@
>>
>>  extern u32 *vmcoreinfo_note;
>>  extern size_t vmcoreinfo_size;
>> -extern size_t vmcoreinfo_max_size;
>>
>>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>>  			  void *data, size_t data_len);
>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>> index 0321f04..43cdb00 100644
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -16,7 +16,6 @@
>>  /* vmcoreinfo stuff */
>>  static unsigned char *vmcoreinfo_data;
>>  size_t vmcoreinfo_size;
>> -size_t vmcoreinfo_max_size = VMCOREINFO_BYTES;
>>  u32 *vmcoreinfo_note;
>>
>>  /*
>> @@ -343,7 +342,7 @@ void vmcoreinfo_append_str(const char *fmt, ...)
>>  	r = vscnprintf(buf, sizeof(buf), fmt, args);
>>  	va_end(args);
>>
>> -	r = min(r, vmcoreinfo_max_size - vmcoreinfo_size);
>> +	r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size);
>>
>>  	memcpy(&vmcoreinfo_data[vmcoreinfo_size], buf, r);
>>
>> --
>> 1.8.3.1
>>
>>
>> ___
>> kexec mailing list
>> ke...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec
>
> Reviewed-by: Dave Young
>
> Thanks
> Dave
>
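The format-then-clamp pattern in the quoted vmcoreinfo_append_str() hunk can be sketched in userspace C. This is a simplified model, not the kernel's code: `VMCOREINFO_BYTES_MODEL` stands in for the kernel's VMCOREINFO_BYTES, and `vsnprintf` plus a manual cap stands in for `vscnprintf` and `min()`.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define VMCOREINFO_BYTES_MODEL	4096	/* stand-in for VMCOREINFO_BYTES */

static char vmcoreinfo_data[VMCOREINFO_BYTES_MODEL];
static size_t vmcoreinfo_size;

static void vmcoreinfo_append_str(const char *fmt, ...)
{
	char buf[0x50];
	va_list args;
	size_t r;

	va_start(args, fmt);
	r = (size_t)vsnprintf(buf, sizeof(buf), fmt, args);
	va_end(args);
	if (r >= sizeof(buf))	/* vscnprintf() returns chars actually written */
		r = sizeof(buf) - 1;

	/* The point of the patch: clamp against the buffer's own size
	 * macro, so the limit can never drift out of sync with the real
	 * capacity of vmcoreinfo_data[] the way a separate
	 * "max size" variable could. */
	if (r > VMCOREINFO_BYTES_MODEL - vmcoreinfo_size)
		r = VMCOREINFO_BYTES_MODEL - vmcoreinfo_size;

	memcpy(&vmcoreinfo_data[vmcoreinfo_size], buf, r);
	vmcoreinfo_size += r;
}
```

Using the macro directly is the "fewer variables means more safety" point of the commit message: there is one source of truth for the capacity.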