Re: [PATCH] powerpc/xive: Fix/improve verbose debug output

2017-04-27 Thread Benjamin Herrenschmidt
On Fri, 2017-04-28 at 16:34 +1000, Michael Ellerman wrote:
> > > If there's non-verbose debug that we think would be useful to
> > > differentiate from verbose then those could be pr_debug() - which means
> > > they'll be jump labelled off in most production kernels, but still able
> > > to be enabled.
> > 
> > Maybe... I don't like the giant "debug" switch across the whole
> > kernel, though.
> 
> Not sure what you mean. You can enable pr_debug()s individually, by
> function, by module, by file, or for the whole kernel.
> 
> To enable everything in xive you'd do:
> 
> # echo 'file *xive* +p' > /sys/kernel/debug/dynamic_debug/control
> 
> Or boot with: loglevel=8 dyndbg="file *xive* +p"

Ah that's new goodness I wasn't aware of. Anyway, I can spin that
later, not planning on doing any work today ;-)

Cheers,
Ben.



Re: [PATCH v3] cxl: mask slice error interrupts after first occurrence

2017-04-27 Thread Vaibhav Jain
Hi Alastair,

Thanks for addressing the previous review comments. A few additional and
very minor comments below.

Alastair D'Silva  writes:
> From: Alastair D'Silva 
>
> In some situations, a faulty AFU slice may create an interrupt storm,
'interrupt storm of slice-errors,'
> rendering the machine unusable. Since these interrupts are informational
> only, present the interrupt once, then mask it off to prevent it from
> being retriggered until the card is reset.
s|card|card/afu

> @@ -1226,7 +1237,11 @@ static irqreturn_t native_slice_irq_err(int irq, void *data)
>   dev_crit(&afu->dev, "AFU_ERR_An: 0x%.16llx\n", afu_error);
>   dev_crit(&afu->dev, "PSL_DSISR_An: 0x%.16llx\n", dsisr);
>
> + /* mask off the IRQ so it won't retrigger until the card is reset */
> + irq_mask = (serr & CXL_PSL_SERR_An_IRQS) >> 32;
> + serr |= irq_mask;
>   cxl_p1n_write(afu, CXL_PSL_SERR_An, serr);
> + dev_info(&afu->dev, "Further interrupts will be masked until the
Optional: just to be explicit, since you are only masking a subset of the
possible slice errors, I would suggest rephrasing the message as:
"Further such interrupts

> AFU is reset\n");
To be consistent with the patch description  s|AFU|AFU/Card

-- 
Vaibhav Jain 
Linux Technology Center, IBM India Pvt. Ltd.



Re: [PATCH] powerpc/xive: Fix/improve verbose debug output

2017-04-27 Thread Michael Ellerman
Benjamin Herrenschmidt  writes:

> On Fri, 2017-04-28 at 13:07 +1000, Michael Ellerman wrote:
>> Benjamin Herrenschmidt  writes:
>> 
>> > The existing verbose debug code doesn't build when enabled.
>> 
>> So why don't we convert all the DBG_VERBOSE() to pr_devel()? 
>
> pr_devel provides a bunch of debug at init/setup/mask/unmask etc... but
> the system is still usable

OK so those could be converted to pr_debug().

> DBG_VERBOSE starts spewing stuff on every interrupt and eoi, the system
> is no longer usable.

And those could stay at pr_devel(), requiring a #define DEBUG and
recompile to enable.

>> If there's non-verbose debug that we think would be useful to
>> differentiate from verbose then those could be pr_debug() - which means
>> they'll be jump labelled off in most production kernels, but still able
>> to be enabled.
>
> Maybe... I don't like the giant "debug" switch across the whole
> kernel, though.

Not sure what you mean. You can enable pr_debug()s individually, by
function, by module, by file, or for the whole kernel.

To enable everything in xive you'd do:

# echo 'file *xive* +p' > /sys/kernel/debug/dynamic_debug/control

Or boot with: loglevel=8 dyndbg="file *xive* +p"

cheers


Re: [PATCH kernel v2] powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc

2017-04-27 Thread Alexey Kardashevskiy
On Tue, 11 Apr 2017 18:28:42 +1000
Alexey Kardashevskiy  wrote:

> On 27/03/17 19:27, Alexey Kardashevskiy wrote:
> > pnv_pci_table_alloc() ignores a possible failure from kzalloc_node();
> > this adds a check. There are 2 callers of pnv_pci_table_alloc():
> > one already checks for tbl!=NULL; this adds WARN_ON() to the other path,
> > which only happens during boot time on IODA1 and is not expected to fail.
> > 
> > Signed-off-by: Alexey Kardashevskiy 
> > ---
> > Changes:
> > v2:
> > * s/BUG_ON/WARN_ON/  
> 
> Bad/good?

Ping?


> 
> 
> > ---
> >  arch/powerpc/platforms/powernv/pci-ioda.c | 3 +++
> >  arch/powerpc/platforms/powernv/pci.c  | 3 +++
> >  2 files changed, 6 insertions(+)
> > 
> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> > index e36738291c32..04ef03a5201b 100644
> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > @@ -2128,6 +2128,9 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
> >  
> >  found:
> > tbl = pnv_pci_table_alloc(phb->hose->node);
> > +   if (WARN_ON(!tbl))
> > +   return;
> > +
> > iommu_register_group(&pe->table_group, phb->hose->global_number,
> > pe->pe_number);
> > pnv_pci_link_table_and_group(phb->hose->node, 0, tbl, &pe->table_group);
> > diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> > index eb835e977e33..9acdf6889c0d 100644
> > --- a/arch/powerpc/platforms/powernv/pci.c
> > +++ b/arch/powerpc/platforms/powernv/pci.c
> > @@ -766,6 +766,9 @@ struct iommu_table *pnv_pci_table_alloc(int nid)
> > struct iommu_table *tbl;
> >  
> > tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, nid);
> > +   if (!tbl)
> > +   return NULL;
> > +
> > INIT_LIST_HEAD_RCU(&tbl->it_group_list);
> >  
> > return tbl;
> >   
> 
> 



--
Alexey


Re: [PATCH kernel] powerpc/powernv: Fix iommu table size calculation hook for small tables

2017-04-27 Thread Alexey Kardashevskiy
On Thu, 13 Apr 2017 17:05:27 +1000
Alexey Kardashevskiy  wrote:

> When userspace requests a small TCE table (one that takes less than
> the system page size) and more than one TCE level, the existing code
> returns a single page size. This is a bug, as each additional TCE level
> requires at least one page, which is what
> pnv_pci_ioda2_table_alloc_pages() allocates, and we end up seeing
> WARN_ON(!ret && ((*ptbl)->it_allocated_size != table_size))
> in drivers/vfio/vfio_iommu_spapr_tce.c.
> 
> This replaces the incorrect _ALIGN_UP() (which aligns zero up to zero)
> with max_t() to fix the bug.
> 
> Besides removing the WARN_ON(), there should be no other change in
> behaviour.

Ping?


> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 6d0da5dfc955..a0d046adcf45 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2538,7 +2538,8 @@ static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
>  
>   tce_table_size /= direct_table_size;
>   tce_table_size <<= 3;
> - tce_table_size = _ALIGN_UP(tce_table_size, direct_table_size);
> + tce_table_size = max_t(unsigned long,
> + tce_table_size, direct_table_size);
>   }
>  
>   return bytes;



--
Alexey


[PATCH v2] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()

2017-04-27 Thread Christophe Leroy
Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every
userspace instruction miss") has shown that limiting the read of the
faulting instruction to likely cases improves performance.

This patch goes further in this direction by limiting the read of
the faulting instruction to the cases where it is definitely needed.

On an MPC885, with the same benchmark app as in the commit referred
above, we see a reduction of 4000 dTLB misses (approx 3%):

Before the patch:
 Performance counter stats for './fault 500' (10 runs):

      720495838  cpu-cycles                 ( +-  0.04% )
         141769  dTLB-load-misses           ( +-  0.02% )
          52722  iTLB-load-misses           ( +-  0.01% )
          19611  faults                     ( +-  0.02% )

    5.750535176 seconds time elapsed        ( +-  0.16% )

With the patch:
 Performance counter stats for './fault 500' (10 runs):

      717669123  cpu-cycles                 ( +-  0.02% )
         137344  dTLB-load-misses           ( +-  0.03% )
          52731  iTLB-load-misses           ( +-  0.01% )
          19614  faults                     ( +-  0.03% )

    5.728423115 seconds time elapsed        ( +-  0.14% )

Signed-off-by: Christophe Leroy 
---
 v2: Replaced 'if (cond1) if (cond2)' with 'if (cond1 && cond2)'

 In case the instruction we read has value 0, store_updates_sp() will
 return false, so it will bail out.

 This patch applies after the series "powerpc/mm: some cleanup of
do_page_fault()"

 arch/powerpc/mm/fault.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 400f2d0d42f8..2ec82a279d28 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -280,14 +280,6 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
-   /*
-* We want to do this outside mmap_sem, because reading code around nip
-* can result in fault, which will cause a deadlock when called with
-* mmap_sem held
-*/
-   if (is_write && is_user)
-   __get_user(inst, (unsigned int __user *)regs->nip);
-
if (is_user)
flags |= FAULT_FLAG_USER;
 
@@ -356,8 +348,18 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 * between the last mapped region and the stack will
 * expand the stack rather than segfaulting.
 */
-   if (address + 2048 < uregs->gpr[1] && !store_updates_sp(inst))
-   goto bad_area;
+   if (address + 2048 < uregs->gpr[1] && !inst) {
+   /*
+* We want to do this outside mmap_sem, because reading
+* code around nip can result in fault, which will cause
+* a deadlock when called with mmap_sem held
+*/
+   up_read(&mm->mmap_sem);
+   __get_user(inst, (unsigned int __user *)regs->nip);
+   if (!store_updates_sp(inst))
+   goto bad_area_nosemaphore;
+   goto retry;
+   }
}
if (expand_stack(vma, address))
goto bad_area;
-- 
2.12.0



[PATCH V2] hwmon: (ibmpowernv) Add min/max attributes and current sensors

2017-04-27 Thread Shilpasri G Bhat
Add support for min/max values for the inband sensors copied by the
OCC to main memory, and also add current (mA) sensors to the list.

Signed-off-by: Shilpasri G Bhat 
---
Changes from V1:
- Add functions to get min and max attribute strings
- Add function 'populate_sensor' to fill in the 'struct sensor_data'
  for each sensor.

 drivers/hwmon/ibmpowernv.c | 96 +-
 1 file changed, 77 insertions(+), 19 deletions(-)

diff --git a/drivers/hwmon/ibmpowernv.c b/drivers/hwmon/ibmpowernv.c
index 6d2e660..d59262c 100644
--- a/drivers/hwmon/ibmpowernv.c
+++ b/drivers/hwmon/ibmpowernv.c
@@ -50,6 +50,7 @@ enum sensors {
TEMP,
POWER_SUPPLY,
POWER_INPUT,
+   CURRENT,
MAX_SENSOR_TYPE,
 };
 
@@ -65,7 +66,8 @@ enum sensors {
{"fan", "ibm,opal-sensor-cooling-fan"},
{"temp", "ibm,opal-sensor-amb-temp"},
{"in", "ibm,opal-sensor-power-supply"},
-   {"power", "ibm,opal-sensor-power"}
+   {"power", "ibm,opal-sensor-power"},
+   {"curr"}, /* Follows newer device tree compatible ibm,opal-sensor */
 };
 
 struct sensor_data {
@@ -287,6 +289,7 @@ static int populate_attr_groups(struct platform_device *pdev)
opal = of_find_node_by_path("/ibm,opal/sensors");
for_each_child_of_node(opal, np) {
const char *label;
+   int len;
 
if (np->name == NULL)
continue;
@@ -298,10 +301,14 @@ static int populate_attr_groups(struct platform_device *pdev)
sensor_groups[type].attr_count++;
 
/*
-* add a new attribute for labels
+* add attributes for labels, min and max
 */
if (!of_property_read_string(np, "label", &label))
sensor_groups[type].attr_count++;
+   if (of_find_property(np, "sensor-data-min", &len))
+   sensor_groups[type].attr_count++;
+   if (of_find_property(np, "sensor-data-max", &len))
+   sensor_groups[type].attr_count++;
}
 
of_node_put(opal);
@@ -337,6 +344,49 @@ static void create_hwmon_attr(struct sensor_data *sdata, const char *attr_name,
sdata->dev_attr.show = show;
 }
 
+static void populate_sensor(struct sensor_data *sdata, int od, int hd, int sid,
+   const char *attr_name, enum sensors type,
+   const struct attribute_group *pgroup,
+   ssize_t (*show)(struct device *dev,
+   struct device_attribute *attr,
+   char *buf))
+{
+   sdata->id = sid;
+   sdata->type = type;
+   sdata->opal_index = od;
+   sdata->hwmon_index = hd;
+   create_hwmon_attr(sdata, attr_name, show);
+   pgroup->attrs[sensor_groups[type].attr_count++] = &sdata->dev_attr.attr;
+}
+
+static char *get_max_attr(enum sensors type)
+{
+   switch (type) {
+   case POWER_INPUT:
+   return "input_highest";
+   case TEMP:
+   return "max";
+   default:
+   break;
+   }
+
+   return "highest";
+}
+
+static char *get_min_attr(enum sensors type)
+{
+   switch (type) {
+   case POWER_INPUT:
+   return "input_lowest";
+   case TEMP:
+   return "min";
+   default:
+   break;
+   }
+
+   return "lowest";
+}
+
 /*
  * Iterate through the device tree for each child of 'sensors' node, create
  * a sysfs attribute file, the file is named by translating the DT node name
@@ -365,6 +415,7 @@ static int create_device_attrs(struct platform_device *pdev)
for_each_child_of_node(opal, np) {
const char *attr_name;
u32 opal_index;
+   u32 hwmon_index;
const char *label;
 
if (np->name == NULL)
@@ -386,9 +437,6 @@ static int create_device_attrs(struct platform_device *pdev)
continue;
}
 
-   sdata[count].id = sensor_id;
-   sdata[count].type = type;
-
/*
 * If we can not parse the node name, it means we are
 * running on a newer device tree. We can just forget
@@ -401,14 +449,12 @@ static int create_device_attrs(struct platform_device *pdev)
opal_index = INVALID_INDEX;
}
 
-   sdata[count].opal_index = opal_index;
-   sdata[count].hwmon_index =
-   get_sensor_hwmon_index(&sdata[count], sdata, count);
-
-   create_hwmon_attr(&sdata[count], attr_name, show_sensor);
-
-   pgroups[type]->attrs[sensor_groups[type].attr_count++] =
-   &sdata[count++].dev_attr.attr;
+   hwmon_index = get_sensor_hwmon_index(&sdata[count], sdata,
+

[v5 2/2] raid6/altivec: Add vpermxor implementation for raid6 Q syndrome

2017-04-27 Thread Matt Brown
The raid6 Q syndrome check has been optimised using the vpermxor
instruction. This instruction was made available with POWER8, ISA version
2.07. It allows for both vperm and vxor instructions to be done in a single
instruction. This has been tested for correctness on a ppc64le vm with a
basic RAID6 setup containing 5 drives.

The performance benchmarks are from the raid6test in the /lib/raid6/test
directory. These results are from an IBM Firestone machine with ppc64le
architecture. The benchmark results show a 35% speed increase over the best
existing algorithm for powerpc (altivec). The raid6test has also been run
on a big-endian ppc64 vm to ensure it also works for big-endian
architectures.

Performance benchmarks:
raid6: altivecx4 gen() 18773 MB/s
raid6: altivecx8 gen() 19438 MB/s

raid6: vpermxor4 gen() 25112 MB/s
raid6: vpermxor8 gen() 26279 MB/s

Signed-off-by: Matt Brown 
---
Changelog
v5
- moved altivec.uc fix into other patch in series
---
 include/linux/raid/pq.h |   4 ++
 lib/raid6/Makefile  |  27 -
 lib/raid6/algos.c   |   4 ++
 lib/raid6/test/Makefile |  14 ++-
 lib/raid6/vpermxor.uc   | 104 
 5 files changed, 151 insertions(+), 2 deletions(-)
 create mode 100644 lib/raid6/vpermxor.uc

diff --git a/include/linux/raid/pq.h b/include/linux/raid/pq.h
index 4d57bba..3df9aa6 100644
--- a/include/linux/raid/pq.h
+++ b/include/linux/raid/pq.h
@@ -107,6 +107,10 @@ extern const struct raid6_calls raid6_avx512x2;
 extern const struct raid6_calls raid6_avx512x4;
 extern const struct raid6_calls raid6_tilegx8;
 extern const struct raid6_calls raid6_s390vx8;
+extern const struct raid6_calls raid6_vpermxor1;
+extern const struct raid6_calls raid6_vpermxor2;
+extern const struct raid6_calls raid6_vpermxor4;
+extern const struct raid6_calls raid6_vpermxor8;
 
 struct raid6_recov_calls {
void (*data2)(int, size_t, int, int, void **);
diff --git a/lib/raid6/Makefile b/lib/raid6/Makefile
index 3057011..db095a7 100644
--- a/lib/raid6/Makefile
+++ b/lib/raid6/Makefile
@@ -4,7 +4,8 @@ raid6_pq-y  += algos.o recov.o tables.o int1.o int2.o int4.o \
    int8.o int16.o int32.o
 
-raid6_pq-$(CONFIG_X86) += recov_ssse3.o recov_avx2.o mmx.o sse1.o sse2.o avx2.o avx512.o recov_avx512.o
-raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o
+raid6_pq-$(CONFIG_ALTIVEC) += altivec1.o altivec2.o altivec4.o altivec8.o \
+  vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
 raid6_pq-$(CONFIG_KERNEL_MODE_NEON) += neon.o neon1.o neon2.o neon4.o neon8.o
 raid6_pq-$(CONFIG_TILEGX) += tilegx8.o
 raid6_pq-$(CONFIG_S390) += s390vx8.o recov_s390xc.o
@@ -88,6 +89,30 @@ $(obj)/altivec8.c:   UNROLL := 8
 $(obj)/altivec8.c:   $(src)/altivec.uc $(src)/unroll.awk FORCE
$(call if_changed,unroll)
 
+CFLAGS_vpermxor1.o += $(altivec_flags)
+targets += vpermxor1.c
+$(obj)/vpermxor1.c: UNROLL := 1
+$(obj)/vpermxor1.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+   $(call if_changed,unroll)
+
+CFLAGS_vpermxor2.o += $(altivec_flags)
+targets += vpermxor2.c
+$(obj)/vpermxor2.c: UNROLL := 2
+$(obj)/vpermxor2.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+   $(call if_changed,unroll)
+
+CFLAGS_vpermxor4.o += $(altivec_flags)
+targets += vpermxor4.c
+$(obj)/vpermxor4.c: UNROLL := 4
+$(obj)/vpermxor4.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+   $(call if_changed,unroll)
+
+CFLAGS_vpermxor8.o += $(altivec_flags)
+targets += vpermxor8.c
+$(obj)/vpermxor8.c: UNROLL := 8
+$(obj)/vpermxor8.c: $(src)/vpermxor.uc $(src)/unroll.awk FORCE
+   $(call if_changed,unroll)
+
 CFLAGS_neon1.o += $(NEON_FLAGS)
 targets += neon1.c
 $(obj)/neon1.c:   UNROLL := 1
diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 7857049..edd4f69 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -74,6 +74,10 @@ const struct raid6_calls * const raid6_algos[] = {
&raid6_altivec2,
&raid6_altivec4,
&raid6_altivec8,
+   &raid6_vpermxor1,
+   &raid6_vpermxor2,
+   &raid6_vpermxor4,
+   &raid6_vpermxor8,
 #endif
 #if defined(CONFIG_TILEGX)
&raid6_tilegx8,
diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 2c7b60e..9c333e9 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -97,6 +97,18 @@ altivec4.c: altivec.uc ../unroll.awk
 altivec8.c: altivec.uc ../unroll.awk
$(AWK) ../unroll.awk -vN=8 < altivec.uc > $@
 
+vpermxor1.c: vpermxor.uc ../unroll.awk
+   $(AWK) ../unroll.awk -vN=1 < vpermxor.uc > $@
+
+vpermxor2.c: vpermxor.uc ../unroll.awk
+   $(AWK) ../unroll.awk -vN=2 < vpermxor.uc > $@
+
+vpermxor4.c: vpermxor.uc ../unroll.awk
+   $(AWK) ../unroll.awk -vN=4 < vpermxor.uc > $@
+
+vpermxor8.c: vpermxor.uc ../unroll.awk
+   $(AWK) ../unroll.awk -vN=8 < vpermxor.uc > $@
+
 int1.c: int.uc ../unroll.awk
$(AWK) ../unroll.awk -vN=1 < 

[v5 1/2] lib/raid6: Build proper files on corresponding arch

2017-04-27 Thread Matt Brown
Previously the raid6 test Makefile did not correctly build the files for
testing on PowerPC. This patch fixes the bug, so that all appropriate files
for PowerPC are built.
This patch also fixes the missing and mismatched ifdef statements to allow the
altivec.uc file to be built correctly.

Signed-off-by: Matt Brown 
---
Changelog
v5
- moved altivec.uc fix into this patch
- updated commit message
---
 lib/raid6/altivec.uc| 3 +++
 lib/raid6/test/Makefile | 8 +---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/raid6/altivec.uc b/lib/raid6/altivec.uc
index 682aae8..d20ed0d 100644
--- a/lib/raid6/altivec.uc
+++ b/lib/raid6/altivec.uc
@@ -24,10 +24,13 @@
 
 #include 
 
+#ifdef CONFIG_ALTIVEC
+
 #include 
 #ifdef __KERNEL__
 # include 
 # include 
+#endif /* __KERNEL__ */
 
 /*
  * This is the C data type to use.  We use a vector of
diff --git a/lib/raid6/test/Makefile b/lib/raid6/test/Makefile
index 9c333e9..b64a267 100644
--- a/lib/raid6/test/Makefile
+++ b/lib/raid6/test/Makefile
@@ -44,10 +44,12 @@ else ifeq ($(HAS_NEON),yes)
 CFLAGS += -DCONFIG_KERNEL_MODE_NEON=1
 else
HAS_ALTIVEC := $(shell printf '\#include \nvector int a;\n' |\
- gcc -c -x c - >&/dev/null && \
- rm ./-.o && echo yes)
+ gcc -c -x c - >/dev/null && rm ./-.o && echo yes)
 ifeq ($(HAS_ALTIVEC),yes)
-OBJS += altivec1.o altivec2.o altivec4.o altivec8.o
+CFLAGS += -I../../../arch/powerpc/include
+CFLAGS += -DCONFIG_ALTIVEC
+OBJS += altivec1.o altivec2.o altivec4.o altivec8.o \
+vpermxor1.o vpermxor2.o vpermxor4.o vpermxor8.o
 endif
 endif
 ifeq ($(ARCH),tilegx)
-- 
2.9.3



[PATCH 2/2] v1 powerpc/powernv: Enable removal of memory for in memory tracing

2017-04-27 Thread Rashmica Gupta
Some powerpc hardware features may want to gain access to a chunk of
undisturbed real memory. This update provides a means to unplug said memory
from the kernel with a set of debugfs calls. By writing an integer containing
the size of memory to be unplugged into
/sys/kernel/debug/powerpc/memtrace/enable, the code will remove that much
memory from the end of each available chip's memory space (i.e. each memory
node). In addition, the contents of the unplugged memory can be read back
through the /sys/kernel/debug/powerpc/memtrace//trace file.

Signed-off-by: Anton Blanchard 
Signed-off-by: Rashmica Gupta 

---
This requires the 'Wire up hpte_removebolted for powernv' patch.

RFC -> v1: Added two missing locks. Replaced the open-coded
flush_memory_region() with the existing flush_inval_dcache_range(start, end).

memtrace_offline_pages() is open-coded because offline_pages is designed to be
called through the sysfs interface - not directly.

We could move the offlining of pages to userspace, which removes some of this
open-coding. This would then require passing info to the kernel such that it
can then remove the memory that has been offlined. This could be done using
notifiers, but this isn't simple due to locking (remove_memory needs
mem_hotplug_begin() which the sysfs interface already has). This could also be
done through the debugfs interface (similar to what is done here). Either way,
this would require the process that needs the memory to have open-coded code
which it shouldn't really be involved with.

As the current remove_memory() function requires the memory to already be
offlined, it makes sense to keep the offlining and removal of memory
functionality grouped together so that a process can simply make one request to
unplug some memory. Ideally there would be a kernel function we could call that
would offline the memory and then remove it.


 arch/powerpc/platforms/powernv/memtrace.c | 276 ++
 1 file changed, 276 insertions(+)
 create mode 100644 arch/powerpc/platforms/powernv/memtrace.c

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
new file mode 100644
index 000..86184b1
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -0,0 +1,276 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) IBM Corporation, 2014
+ *
+ * Author: Anton Blanchard 
+ */
+
+#define pr_fmt(fmt) "powernv-memtrace: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct memtrace_entry {
+   void *mem;
+   u64 start;
+   u64 size;
+   u32 nid;
+   struct dentry *dir;
+   char name[16];
+};
+
+static struct memtrace_entry *memtrace_array;
+static unsigned int memtrace_array_nr;
+
+static ssize_t memtrace_read(struct file *filp, char __user *ubuf,
+size_t count, loff_t *ppos)
+{
+   struct memtrace_entry *ent = filp->private_data;
+
+   return simple_read_from_buffer(ubuf, count, ppos, ent->mem, ent->size);
+}
+
+static bool valid_memtrace_range(struct memtrace_entry *dev,
+unsigned long start, unsigned long size)
+{
+   if ((dev->start <= start) &&
+   ((start + size) <= (dev->start + dev->size)))
+   return true;
+
+   return false;
+}
+
+static int memtrace_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+   unsigned long size = vma->vm_end - vma->vm_start;
+   struct memtrace_entry *dev = filp->private_data;
+
+   if (!valid_memtrace_range(dev, vma->vm_pgoff << PAGE_SHIFT, size))
+   return -EINVAL;
+
+   vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+   if (io_remap_pfn_range(vma, vma->vm_start,
+  vma->vm_pgoff + (dev->start >> PAGE_SHIFT),
+  size, vma->vm_page_prot))
+   return -EAGAIN;
+
+   return 0;
+}
+
+static const struct file_operations memtrace_fops = {
+   .llseek = default_llseek,
+   .read   = memtrace_read,
+   .mmap   = memtrace_mmap,
+   .open   = simple_open,
+};
+
+static int check_memblock_online(struct memory_block *mem, void *arg)
+{
+   if (mem->state != MEM_ONLINE)
+   return -1;
+
+   return 0;
+}
+
+static int change_memblock_state(struct memory_block *mem, void *arg)
+{
+   unsigned long state = (unsigned lon

[PATCH 1/2] powerpc/powernv: Add config option for removal of memory

2017-04-27 Thread Rashmica Gupta
Signed-off-by: Rashmica Gupta 
---
 arch/powerpc/platforms/powernv/Kconfig  | 4 
 arch/powerpc/platforms/powernv/Makefile | 1 +
 2 files changed, 5 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index 6a6f4ef..1b8b3a8 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -30,3 +30,7 @@ config OPAL_PRD
help
  This enables the opal-prd driver, a facility to run processor
  recovery diagnostics on OpenPower machines
+
+config HARDWARE_TRACING
+   bool 'Enable removal of memory for hardware memory tracing'
+   depends on PPC_POWERNV && MEMORY_HOTPLUG
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index b5d98cb..e61be1b 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -12,3 +12,4 @@ obj-$(CONFIG_PPC_SCOM)+= opal-xscom.o
 obj-$(CONFIG_MEMORY_FAILURE)   += opal-memory-errors.o
 obj-$(CONFIG_TRACEPOINTS)  += opal-tracepoints.o
 obj-$(CONFIG_OPAL_PRD) += opal-prd.o
+obj-$(CONFIG_HARDWARE_TRACING) += memtrace.o
-- 
2.9.3



Re: [PATCH] powerpc/xive: Fix/improve verbose debug output

2017-04-27 Thread Benjamin Herrenschmidt
On Fri, 2017-04-28 at 13:07 +1000, Michael Ellerman wrote:
> Benjamin Herrenschmidt  writes:
> 
> > The existing verbose debug code doesn't build when enabled.
> 
> So why don't we convert all the DBG_VERBOSE() to pr_devel()? 

pr_devel provides a bunch of debug at init/setup/mask/unmask etc... but
the system is still usable

DBG_VERBOSE starts spewing stuff on every interrupt and eoi, the system
is no longer usable.

> If there's non-verbose debug that we think would be useful to
> differentiate from verbose then those could be pr_debug() - which means
> they'll be jump labelled off in most production kernels, but still able
> to be enabled.

Maybe... I don't like the giant "debug" switch accross the whole
kernel, though.

Ben.



Re: [PATCH v3] cxl: mask slice error interrupts after first occurrence

2017-04-27 Thread Andrew Donnellan

On 28/04/17 13:20, Alastair D'Silva wrote:

From: Alastair D'Silva 

In some situations, a faulty AFU slice may create an interrupt storm,
rendering the machine unusable. Since these interrupts are informational
only, present the interrupt once, then mask it off to prevent it from
being retriggered until the card is reset.

Signed-off-by: Alastair D'Silva 


LGTM

Reviewed-by: Andrew Donnellan 

--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



linux-next: build failure after merge of the kvm-ppc tree

2017-04-27 Thread Stephen Rothwell
Hi Paul,

After merging the kvm-ppc tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

arch/powerpc/kvm/book3s_xive.c: In function 'xive_debugfs_init':
arch/powerpc/kvm/book3s_xive.c:1852:52: error: 'powerpc_debugfs_root' undeclared (first use in this function)
  xive->dentry = debugfs_create_file(name, S_IRUGO, powerpc_debugfs_root,
^

Caused by commit

  5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")

interacting with commit

  7644d5819cf8 ("powerpc: Create asm/debugfs.h and move powerpc_debugfs_root there")

from the powerpc tree.

I have added the following merge fix patch.

From: Stephen Rothwell 
Date: Fri, 28 Apr 2017 14:28:17 +1000
Subject: [PATCH] powerpc: merge fix for powerpc_debugfs_root move.

Signed-off-by: Stephen Rothwell 
---
 arch/powerpc/kvm/book3s_xive.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 7807ee17af4b..ffe1da95033a 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
-- 
2.11.0

-- 
Cheers,
Stephen Rothwell


Re: [PATCH] powerpc/pseries hotplug: prevent the reserved mem from removing

2017-04-27 Thread Liu ping fan
On Fri, Apr 28, 2017 at 2:06 AM, Hari Bathini
 wrote:
> Hi Pingfan,
>
>
> On Thursday 27 April 2017 01:13 PM, Pingfan Liu wrote:
>>
>> E.g. after fadump reserves memory regions, these regions should not be
>> removed before fadump explicitly frees them.
>> Signed-off-by: Pingfan Liu 
>> ---
>>   arch/powerpc/platforms/pseries/hotplug-memory.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index e104c71..201be23 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -346,6 +346,8 @@ static int pseries_remove_memblock(unsigned long base, unsigned int memblock_size
>>
>> if (!pfn_valid(start_pfn))
>> goto out;
>> +   if (memblock_is_reserved(base))
>> +   return -EINVAL;
>
>
> I think memblock reserved regions are not hot removed even without this
> patch.
> So, can you elaborate on when/why this patch is needed?
>
I have not found any code that prevents the reserved regions from being
freed. Am I missing anything?
I will try to reserve a ppc machine to test this.

Thx,
Pingfan


[PATCH v3] cxl: mask slice error interrupts after first occurrence

2017-04-27 Thread Alastair D'Silva
From: Alastair D'Silva 

In some situations, a faulty AFU slice may create an interrupt storm,
rendering the machine unusable. Since these interrupts are informational
only, present the interrupt once, then mask it off to prevent it from
being retriggered until the card is reset.

Signed-off-by: Alastair D'Silva 
---
Changelog:
v3
Add CXL_PSL_SERR_An_IRQS, CXL_PSL_SERR_An_IRQ_MASKS macros
Explicitly reenable masked interrupts after reset
Issue an info line that subsequent interrupts will be masked
v2
Rebase against linux-next

---
 drivers/misc/cxl/cxl.h| 18 ++
 drivers/misc/cxl/native.c | 19 +--
 2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 452e209..6b00952 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -228,6 +228,24 @@ static const cxl_p2n_reg_t CXL_PSL_WED_An = {0x0A0};
 #define CXL_PSL_SERR_An_llcmdto(1ull << (63-6))
 #define CXL_PSL_SERR_An_afupar (1ull << (63-7))
 #define CXL_PSL_SERR_An_afudup (1ull << (63-8))
+#define CXL_PSL_SERR_An_IRQS   ( \
+   CXL_PSL_SERR_An_afuto | CXL_PSL_SERR_An_afudup | CXL_PSL_SERR_An_afuov | \
+   CXL_PSL_SERR_An_badsrc | CXL_PSL_SERR_An_badctx | CXL_PSL_SERR_An_llcmdis | \
+   CXL_PSL_SERR_An_llcmdto | CXL_PSL_SERR_An_afupar | CXL_PSL_SERR_An_afudup)
+#define CXL_PSL_SERR_An_afuto_mask (1ull << (63-32))
+#define CXL_PSL_SERR_An_afudis_mask(1ull << (63-33))
+#define CXL_PSL_SERR_An_afuov_mask (1ull << (63-34))
+#define CXL_PSL_SERR_An_badsrc_mask(1ull << (63-35))
+#define CXL_PSL_SERR_An_badctx_mask(1ull << (63-36))
+#define CXL_PSL_SERR_An_llcmdis_mask   (1ull << (63-37))
+#define CXL_PSL_SERR_An_llcmdto_mask   (1ull << (63-38))
+#define CXL_PSL_SERR_An_afupar_mask(1ull << (63-39))
+#define CXL_PSL_SERR_An_afudup_mask(1ull << (63-40))
+#define CXL_PSL_SERR_An_IRQ_MASKS  ( \
+   CXL_PSL_SERR_An_afuto_mask | CXL_PSL_SERR_An_afudup_mask | CXL_PSL_SERR_An_afuov_mask | \
+   CXL_PSL_SERR_An_badsrc_mask | CXL_PSL_SERR_An_badctx_mask | CXL_PSL_SERR_An_llcmdis_mask | \
+   CXL_PSL_SERR_An_llcmdto_mask | CXL_PSL_SERR_An_afupar_mask | CXL_PSL_SERR_An_afudup_mask)
+
 #define CXL_PSL_SERR_An_AE (1ull << (63-30))
 
 /** CXL_PSL_SCNTL_An /
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 194c58e..3e7fc86 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -95,12 +95,23 @@ int cxl_afu_disable(struct cxl_afu *afu)
 /* This will disable as well as reset */
 static int native_afu_reset(struct cxl_afu *afu)
 {
+   int rc;
+   u64 serr;
+
pr_devel("AFU reset request\n");
 
-   return afu_control(afu, CXL_AFU_Cntl_An_RA, 0,
+   rc = afu_control(afu, CXL_AFU_Cntl_An_RA, 0,
   CXL_AFU_Cntl_An_RS_Complete | CXL_AFU_Cntl_An_ES_Disabled,
   CXL_AFU_Cntl_An_RS_MASK | CXL_AFU_Cntl_An_ES_MASK,
   false);
+
+   /* Re-enable any masked interrupts */
+   serr = cxl_p1n_read(afu, CXL_PSL_SERR_An);
+   serr &= ~CXL_PSL_SERR_An_IRQ_MASKS;
+   cxl_p1n_write(afu, CXL_PSL_SERR_An, serr);
+
+
+   return rc;
 }
 
 static int native_afu_check_and_enable(struct cxl_afu *afu)
@@ -1205,7 +1216,7 @@ static irqreturn_t native_slice_irq_err(int irq, void *data)
 {
struct cxl_afu *afu = data;
u64 errstat, serr, afu_error, dsisr;
-   u64 fir_slice, afu_debug;
+   u64 fir_slice, afu_debug, irq_mask;
 
/*
 * slice err interrupt is only used with full PSL (no XSL)
@@ -1226,7 +1237,11 @@ static irqreturn_t native_slice_irq_err(int irq, void *data)
dev_crit(&afu->dev, "AFU_ERR_An: 0x%.16llx\n", afu_error);
dev_crit(&afu->dev, "PSL_DSISR_An: 0x%.16llx\n", dsisr);
 
+   /* mask off the IRQ so it won't retrigger until the card is reset */
+   irq_mask = (serr & CXL_PSL_SERR_An_IRQS) >> 32;
+   serr |= irq_mask;
cxl_p1n_write(afu, CXL_PSL_SERR_An, serr);
+   dev_info(&afu->dev, "Further interrupts will be masked until the AFU is reset\n");
 
return IRQ_HANDLED;
 }
-- 
2.9.3



Re: [PATCH] powerpc/xive: Fix/improve verbose debug output

2017-04-27 Thread Michael Ellerman
Benjamin Herrenschmidt  writes:

> The existing verbose debug code doesn't build when enabled.

So why don't we convert all the DBG_VERBOSE() to pr_devel()? 

If there's non-verbose debug that we think would be useful to
differentiate from verbose then those could be pr_debug() - which means
they'll be jump labelled off in most production kernels, but still able
to be enabled.

cheers


Re: [PATCH] Enabled pstore write for powerpc

2017-04-27 Thread Michael Ellerman
Kees Cook  writes:

> On Thu, Apr 27, 2017 at 4:33 AM, Ankit Kumar  wrote:
>> After commit c950fd6f201a kernel registers pstore write based on flag set.
>> Pstore write for powerpc is broken as flags(PSTORE_FLAGS_DMESG) is not set 
>> for
>> powerpc architecture. On panic, kernel doesn't write message to
>> /fs/pstore/dmesg* (entry doesn't get created at all).
>>
>> This patch enables pstore write for powerpc architecture by setting
>> PSTORE_FLAGS_DMESG flag.
>>
>> Fixes: c950fd6f201a ("pstore: Split pstore fragile flags")
>> Signed-off-by: Ankit Kumar 
>
> Argh, thanks! I thought I'd caught all of these. I'll include this for 
> -stable.

I see you've picked it up, thanks.

cheers


Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI

2017-04-27 Thread Sergey Senozhatsky
On (04/27/17 12:14), Steven Rostedt wrote:
[..]
> I tried this patch. It's better because I get the end of the trace, but
> I do lose the beginning of it:
> 
> ** 196358 printk messages dropped ** [  102.321182] perf-59810 12983650us : d_path <-seq_path

many thanks!

so we now drop messages from logbuf, not from per-CPU buffers. that
"queue printk_deferred irq_work on every online CPU when we bypass per-CPU
buffers from NMI" idea *probably* might help here - we need someone to emit
messages from the logbuf while we printk from NMI. there is still a
possibility that we can drop messages, though, since log_store() from NMI
CPU can be much-much faster than call_console_drivers() on another CPU.

-ss


Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI

2017-04-27 Thread Sergey Senozhatsky

On (04/20/17 15:11), Petr Mladek wrote:
[..]
>  void printk_nmi_enter(void)
>  {
> - this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
> + /*
> +  * The size of the extra per-CPU buffer is limited. Use it
> +  * only when really needed.
> +  */
> + if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK ||
> + raw_spin_is_locked(&logbuf_lock)) {

can we please have && here?


[..]
> diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
> index 4e8a30d1c22f..0bc0a3535a8a 100644
> --- a/lib/nmi_backtrace.c
> +++ b/lib/nmi_backtrace.c
> @@ -86,9 +86,11 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
>  
>  bool nmi_cpu_backtrace(struct pt_regs *regs)
>  {
> + static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED;
>   int cpu = smp_processor_id();
>  
>   if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
> + arch_spin_lock(&lock);
>   if (regs && cpu_in_idle(instruction_pointer(regs))) {
> pr_warn("NMI backtrace for cpu %d skipped: idling at pc %#lx\n",
>   cpu, instruction_pointer(regs));
> @@ -99,6 +101,7 @@ bool nmi_cpu_backtrace(struct pt_regs *regs)
>   else
>   dump_stack();
>   }
> + arch_spin_unlock(&lock);
>   cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
>   return true;
>   }

can the nmi_backtrace part be a patch on its own?

-ss


Re: [PATCH 0/8] Fix clean target warnings

2017-04-27 Thread Shuah Khan
On 04/21/2017 05:14 PM, Shuah Khan wrote:
> This patch series consists of changes to lib.mk to allow overriding
> common clean target from Makefiles. This fixes warnings when clean
> overriding and ignoring warnings. Also fixes splice clean target
> removing a script that runs the test from its clean target.
> 
> Shuah Khan (8):
>   selftests: splice: fix clean target to not remove
> default_file_splice_read.sh
>   selftests: lib.mk: define CLEAN macro to allow Makefiles to override
> clean

Applied with amended change log and Michael's ack to linux-kselftest next

>   selftests: futex: override clean in lib.mk to fix warnings
>   selftests: gpio: override clean in lib.mk to fix warnings
>   selftests: powerpc: override clean in lib.mk to fix warnings

Applied all of the above to linux-kseltftest next

>   selftests: splice: override clean in lib.mk to fix warnings
>   selftests: sync: override clean in lib.mk to fix warnings
>   selftests: x86: override clean in lib.mk to fix warnings

Applied v2s addressing Michael's comments to linux-kselftest next
x86 fix also addresses not being able to build ldt_gdt

make -C tools/testing/selftests/x86 ldt_gdt


thanks,
-- Shuah




Re: [PATCH 3/8] selftests: futex: override clean in lib.mk to fix warnings

2017-04-27 Thread Shuah Khan
On 04/27/2017 03:54 PM, Darren Hart wrote:
> On Fri, Apr 21, 2017 at 05:14:45PM -0600, Shuah Khan wrote:
>> Add override for lib.mk clean to fix the following warnings from clean
>> target run.
>>
>> Makefile:36: warning: overriding recipe for target 'clean'
>> ../lib.mk:55: warning: ignoring old recipe for target 'clean'
>>
>> Signed-off-by: Shuah Khan 
>> ---
>>  tools/testing/selftests/futex/Makefile | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/testing/selftests/futex/Makefile b/tools/testing/selftests/futex/Makefile
>> index c8095e6..e2fbb89 100644
>> --- a/tools/testing/selftests/futex/Makefile
>> +++ b/tools/testing/selftests/futex/Makefile
>> @@ -32,9 +32,10 @@ override define EMIT_TESTS
>>  echo "./run.sh"
>>  endef
>>  
>> -clean:
>> +override define CLEAN
>>  for DIR in $(SUBDIRS); do   \
>>  BUILD_TARGET=$(OUTPUT)/$$DIR;   \
>>  mkdir $$BUILD_TARGET  -p;   \
>>  make OUTPUT=$$BUILD_TARGET -C $$DIR $@;\
>>  done
>> +endef
> 
> Taking the move of clean into lib.mk as a given,

Yeah I considered undoing that, and chose to fix the missed
issues instead.
> 
> Acked-by: Darren Hart (VMware) 
> 

thanks,
-- Shuah


Re: [PATCH 3/8] selftests: futex: override clean in lib.mk to fix warnings

2017-04-27 Thread Darren Hart
On Fri, Apr 21, 2017 at 05:14:45PM -0600, Shuah Khan wrote:
> Add override for lib.mk clean to fix the following warnings from clean
> target run.
> 
> Makefile:36: warning: overriding recipe for target 'clean'
> ../lib.mk:55: warning: ignoring old recipe for target 'clean'
> 
> Signed-off-by: Shuah Khan 
> ---
>  tools/testing/selftests/futex/Makefile | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/futex/Makefile b/tools/testing/selftests/futex/Makefile
> index c8095e6..e2fbb89 100644
> --- a/tools/testing/selftests/futex/Makefile
> +++ b/tools/testing/selftests/futex/Makefile
> @@ -32,9 +32,10 @@ override define EMIT_TESTS
>   echo "./run.sh"
>  endef
>  
> -clean:
> +override define CLEAN
>   for DIR in $(SUBDIRS); do   \
>   BUILD_TARGET=$(OUTPUT)/$$DIR;   \
>   mkdir $$BUILD_TARGET  -p;   \
>   make OUTPUT=$$BUILD_TARGET -C $$DIR $@;\
>   done
> +endef

Taking the move of clean into lib.mk as a given,

Acked-by: Darren Hart (VMware) 

-- 
Darren Hart
VMware Open Source Technology Center


Re: [PATCH] Enabled pstore write for powerpc

2017-04-27 Thread Kees Cook
On Thu, Apr 27, 2017 at 4:33 AM, Ankit Kumar  wrote:
> After commit c950fd6f201a kernel registers pstore write based on flag set.
> Pstore write for powerpc is broken as flags(PSTORE_FLAGS_DMESG) is not set for
> powerpc architecture. On panic, kernel doesn't write message to
> /fs/pstore/dmesg* (entry doesn't get created at all).
>
> This patch enables pstore write for powerpc architecture by setting
> PSTORE_FLAGS_DMESG flag.
>
> Fixes: c950fd6f201a ("pstore: Split pstore fragile flags")
> Signed-off-by: Ankit Kumar 

Argh, thanks! I thought I'd caught all of these. I'll include this for -stable.

-Kees

> ---
>
>  arch/powerpc/kernel/nvram_64.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/kernel/nvram_64.c b/arch/powerpc/kernel/nvram_64.c
> index d5e2b83..021db31 100644
> --- a/arch/powerpc/kernel/nvram_64.c
> +++ b/arch/powerpc/kernel/nvram_64.c
> @@ -561,6 +561,7 @@ static ssize_t nvram_pstore_read(u64 *id, enum pstore_type_id *type,
>  static struct pstore_info nvram_pstore_info = {
> .owner = THIS_MODULE,
> .name = "nvram",
> +   .flags = PSTORE_FLAGS_DMESG,
> .open = nvram_pstore_open,
> .read = nvram_pstore_read,
> .write = nvram_pstore_write,
> --
> 2.7.4
>



-- 
Kees Cook
Pixel Security


Re: [PATCH] Enabled pstore write for powerpc

2017-04-27 Thread Anton Blanchard
Hi Ankit,

> After commit c950fd6f201a kernel registers pstore write based on flag
> set. Pstore write for powerpc is broken as flags(PSTORE_FLAGS_DMESG)
> is not set for powerpc architecture. On panic, kernel doesn't write
> message to /fs/pstore/dmesg* (entry doesn't get created at all).
> 
> This patch enables pstore write for powerpc architecture by setting
> PSTORE_FLAGS_DMESG flag.
> 
> Fixes: c950fd6f201a ("pstore: Split pstore fragile flags")

Ouch! We've used pstore to shoot customer bugs, so we should also mark
this for stable. Looks like 4.9 onwards?

Anton

> Signed-off-by: Ankit Kumar 
> ---
> 
>  arch/powerpc/kernel/nvram_64.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kernel/nvram_64.c b/arch/powerpc/kernel/nvram_64.c
> index d5e2b83..021db31 100644
> --- a/arch/powerpc/kernel/nvram_64.c
> +++ b/arch/powerpc/kernel/nvram_64.c
> @@ -561,6 +561,7 @@ static ssize_t nvram_pstore_read(u64 *id, enum pstore_type_id *type,
>  static struct pstore_info nvram_pstore_info = {
>   .owner = THIS_MODULE,
>   .name = "nvram",
> + .flags = PSTORE_FLAGS_DMESG,
>   .open = nvram_pstore_open,
>   .read = nvram_pstore_read,
>   .write = nvram_pstore_write,



[PATCH 1/1] powerpc/traps : Updated MC for E6500 L1D cache err

2017-04-27 Thread Matt Weber
This patch updates the machine check handler of Linux kernel to
handle the e6500 architecture case. In e6500 core, L1 Data Cache Write
Shadow Mode (DCWS) register is not implemented but L1 data cache always
runs in write shadow mode. So, on L1 data cache parity errors, hardware
will automatically invalidate the data cache but will still log a
machine check interrupt.

Signed-off-by: Ronak Desai 
Signed-off-by: Matthew Weber 
---
 arch/powerpc/include/asm/reg_booke.h |  1 +
 arch/powerpc/kernel/traps.c  | 12 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/reg_booke.h b/arch/powerpc/include/asm/reg_booke.h
index 737e012..c811128 100644
--- a/arch/powerpc/include/asm/reg_booke.h
+++ b/arch/powerpc/include/asm/reg_booke.h
@@ -196,6 +196,7 @@
 #define SPRN_DEAR  0x03D   /* Data Error Address Register */
 #define SPRN_ESR   0x03E   /* Exception Syndrome Register */
 #define SPRN_PIR   0x11E   /* Processor Identification Register */
+#define SPRN_PVR   0x11F   /* Processor Version Register */
 #define SPRN_DBSR  0x130   /* Debug Status Register */
 #define SPRN_DBCR0 0x134   /* Debug Control Register 0 */
 #define SPRN_DBCR1 0x135   /* Debug Control Register 1 */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 76f6045..d5bc3ab 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -504,6 +504,7 @@ int machine_check_47x(struct pt_regs *regs)
 int machine_check_e500mc(struct pt_regs *regs)
 {
unsigned long mcsr = mfspr(SPRN_MCSR);
+   unsigned long pvr = mfspr(SPRN_PVR);
unsigned long reason = mcsr;
int recoverable = 1;
 
@@ -545,8 +546,15 @@ int machine_check_e500mc(struct pt_regs *regs)
 * may still get logged and cause a machine check.  We should
 * only treat the non-write shadow case as non-recoverable.
 */
-   if (!(mfspr(SPRN_L1CSR2) & L1CSR2_DCWS))
-   recoverable = 0;
+   /* On e6500 core, L1 DCWS (Data cache write shadow mode) bit is
+* not implemented but L1 data cache is by default configured
+* to run in write shadow mode. Hence on data cache parity errors
+* HW will automatically invalidate the L1 Data Cache.
+*/
+   if (PVR_VER(pvr) != PVR_VER_E6500) {
+   if (!(mfspr(SPRN_L1CSR2) & L1CSR2_DCWS))
+   recoverable = 0;
+   }
}
 
if (reason & MCSR_L2MMU_MHIT) {
-- 
1.9.1



Re: [PATCH] powerpc/pseries hotplug: prevent the reserved mem from removing

2017-04-27 Thread Hari Bathini

Hi Pingfan,


On Thursday 27 April 2017 01:13 PM, Pingfan Liu wrote:

E.g. after fadump reserves memory regions, these regions should not be removed
before fadump explicitly frees them.
Signed-off-by: Pingfan Liu 
---
  arch/powerpc/platforms/pseries/hotplug-memory.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index e104c71..201be23 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -346,6 +346,8 @@ static int pseries_remove_memblock(unsigned long base, unsigned int memblock_siz

if (!pfn_valid(start_pfn))
goto out;
+   if (memblock_is_reserved(base))
+   return -EINVAL;


I think memblock reserved regions are not hot removed even without this 
patch.

So, can you elaborate on when/why this patch is needed?

Thanks
Hari



Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI

2017-04-27 Thread Steven Rostedt
On Thu, 20 Apr 2017 15:11:54 +0200
Petr Mladek  wrote:



> 
> From c530d9dee91c74db5e6a198479e2e63b24cb84a2 Mon Sep 17 00:00:00 2001
> From: Petr Mladek 
> Date: Thu, 20 Apr 2017 10:52:31 +0200
> Subject: [PATCH] printk: Use the main logbuf in NMI when logbuf_lock is
>  available

I tried this patch. It's better because I get the end of the trace, but
I do lose the beginning of it:

** 196358 printk messages dropped ** [  102.321182] perf-59810 12983650us : d_path <-seq_path

The way I tested it was by adding this:

Index: linux-trace.git/kernel/trace/trace_functions.c
===
--- linux-trace.git.orig/kernel/trace/trace_functions.c
+++ linux-trace.git/kernel/trace/trace_functions.c
@@ -469,8 +469,11 @@ ftrace_cpudump_probe(unsigned long ip, u
 struct trace_array *tr, struct ftrace_probe_ops *ops,
 void *data)
 {
-   if (update_count(ops, ip, data))
-   ftrace_dump(DUMP_ORIG);
+   char *killer = NULL;
+
+   panic_on_oops = 1;  /* force panic */
+   wmb();
+   *killer = 1;
 }
 
 static int


Then doing the following:

# echo 1 > /proc/sys/kernel/ftrace_dump_on_oops 
# trace-cmd start -p function
# echo nmi_handle:cpudump > /debug/tracing/set_ftrace_filter
# perf record -c 100 -a sleep 1

And that triggers the crash.

-- Steve


[PATCH] crypto: talitos: Extend max key length for SHA384/512-HMAC

2017-04-27 Thread Martin Hicks

The max keysize for both of these is 128, not 96.  Before, with keysizes
over 96, the memcpy in ahash_setkey() would overwrite memory beyond the
key field.

Signed-off-by: Martin Hicks 
---
 drivers/crypto/talitos.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 0bba6a1..97dc85e 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -816,7 +816,7 @@ static void talitos_unregister_rng(struct device *dev)
  * HMAC_SNOOP_NO_AFEA (HSNA) instead of type IPSEC_ESP
  */
 #define TALITOS_CRA_PRIORITY_AEAD_HSNA (TALITOS_CRA_PRIORITY - 1)
-#define TALITOS_MAX_KEY_SIZE   96
+#define TALITOS_MAX_KEY_SIZE   SHA512_BLOCK_SIZE /* SHA512 has the largest keysize input */
 #define TALITOS_MAX_IV_LENGTH  16 /* max of AES_BLOCK_SIZE, DES3_EDE_BLOCK_SIZE */
 
 struct talitos_ctx {
-- 
1.7.10.4


-- 
Martin Hicks P.Eng.|  m...@bork.org
Bork Consulting Inc.   |  +1 (613) 266-2296


Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI

2017-04-27 Thread Steven Rostedt
On Thu, 27 Apr 2017 17:28:07 +0200
Petr Mladek  wrote:


> > When I get a chance, I'll see if I can insert a trigger to crash the
> > kernel from NMI on another box and see if this patch helps.  
> 
> I actually tested it here using this hack:
> 
> diff --cc lib/nmi_backtrace.c
> index d531f85c0c9b,0bc0a3535a8a..
> --- a/lib/nmi_backtrace.c
> +++ b/lib/nmi_backtrace.c
> @@@ -89,8 -90,7 +90,9 @@@ bool nmi_cpu_backtrace(struct pt_regs *
> int cpu = smp_processor_id();
>   
> if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
>  +  if (in_nmi())
>  +  panic("Simulating panic in NMI\n");
> +   arch_spin_lock(&lock);

I was going to create a ftrace trigger, to crash on demand, but this
may do as well.

> if (regs && cpu_in_idle(instruction_pointer(regs))) {
> pr_warn("NMI backtrace for cpu %d skipped: idling at pc %#lx\n",
> cpu, instruction_pointer(regs));
> 
> and triggered by:
> 
>    echo l > /proc/sysrq-trigger
> 
> The patch really helped to see much more (all) messages from the ftrace
> buffers in NMI mode.
> 
> But the test is a bit artifical. The patch might not help when there
> is a big printk() activity on the system when the panic() is
> triggered. We might wrongly use the small per-CPU buffer when
> the logbuf_lock is tested and taken on another CPU at the same time.
> It means that it will not always help.
> 
> I personally think that the patch might be good enough. I am not sure
> if a perfect (more complex) solution is worth it.

I wasn't asking for perfect, as the previous solutions never were
either. I just want an optimistic dump if possible.

I'll try to get some time today to test this, and let you know. But it
wont be on the machine that I originally had the issue with.

Thanks,

-- Steve


Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI

2017-04-27 Thread Petr Mladek
On Thu 2017-04-27 10:31:18, Steven Rostedt wrote:
> On Thu, 27 Apr 2017 15:38:19 +0200
> Petr Mladek  wrote:
> 
> > > by the way,
> > > does this `nmi_print_seq' bypass even fix anything for Steven?  
> > 
> > I think that this is the most important question.
> > 
> > Steven, does the patch from
> > https://lkml.kernel.org/r/20170420131154.gl3...@pathway.suse.cz
> > help you to see the debug messages, please?
> 
> You'll have to wait for a bit. The box that I was debugging takes 45
> minutes to reboot. And I don't have much more time to play on it before
> I have to give it back. I already found the bug I was looking for and
> I'm trying not to crash it again (due to the huge bring up time).

I see.

> When I get a chance, I'll see if I can insert a trigger to crash the
> kernel from NMI on another box and see if this patch helps.

I actually tested it here using this hack:

diff --cc lib/nmi_backtrace.c
index d531f85c0c9b,0bc0a3535a8a..
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@@ -89,8 -90,7 +90,9 @@@ bool nmi_cpu_backtrace(struct pt_regs *
int cpu = smp_processor_id();
  
if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
 +  if (in_nmi())
 +  panic("Simulating panic in NMI\n");
+   arch_spin_lock(&lock);
if (regs && cpu_in_idle(instruction_pointer(regs))) {
pr_warn("NMI backtrace for cpu %d skipped: idling at pc %#lx\n",
cpu, instruction_pointer(regs));

and triggered by:

   echo  l > /proc/sysrq-trigger

The patch really helped to see much more (all) messages from the ftrace
buffers in NMI mode.

But the test is a bit artifical. The patch might not help when there
is a big printk() activity on the system when the panic() is
triggered. We might wrongly use the small per-CPU buffer when
the logbuf_lock is tested and taken on another CPU at the same time.
It means that it will not always help.

I personally think that the patch might be good enough. I am not sure
if a perfect (more complex) solution is worth it.

Best Regards,
Petr


Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI

2017-04-27 Thread Steven Rostedt
On Thu, 27 Apr 2017 15:38:19 +0200
Petr Mladek  wrote:

> > by the way,
> > does this `nmi_print_seq' bypass even fix anything for Steven?  
> 
> I think that this is the most important question.
> 
> Steven, does the patch from
> https://lkml.kernel.org/r/20170420131154.gl3...@pathway.suse.cz
> help you to see the debug messages, please?

You'll have to wait for a bit. The box that I was debugging takes 45
minutes to reboot. And I don't have much more time to play on it before
I have to give it back. I already found the bug I was looking for and
I'm trying not to crash it again (due to the huge bring up time).

When I get a chance, I'll see if I can insert a trigger to crash the
kernel from NMI on another box and see if this patch helps.

Thanks,

-- Steve


[PATCH] powerpc/xive: Fix/improve verbose debug output

2017-04-27 Thread Benjamin Herrenschmidt
The existing verbose debug code doesn't build when enabled.

This fixes it and generally improves the output to make it
more useful.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/sysdev/xive/common.c | 37 ++---
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 6a98efb..2305aa9 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -143,7 +143,6 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
struct xive_q *q;
 
prio = ffs(xc->pending_prio) - 1;
-   DBG_VERBOSE("scan_irq: trying prio %d\n", prio);
 
/* Try to fetch */
irq = xive_read_eq(&xc->queue[prio], just_peek);
@@ -171,12 +170,18 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool just_peek)
}
 
/* If nothing was found, set CPPR to 0xff */
-   if (irq == 0)
+   if (irq == 0) {
prio = 0xff;
+   DBG_VERBOSE("scan_irq(%d): nothing found\n", just_peek);
+   } else {
+   DBG_VERBOSE("scan_irq(%d): found irq %d prio %d\n",
+   just_peek, irq, prio);
+   }
 
/* Update HW CPPR to match if necessary */
if (prio != xc->cppr) {
-   DBG_VERBOSE("scan_irq: adjusting CPPR to %d\n", prio);
+   DBG_VERBOSE("scan_irq(%d): adjusting CPPR %d->%d\n",
+   just_peek, xc->cppr, prio);
xc->cppr = prio;
out_8(xive_tima + xive_tima_offset + TM_CPPR, prio);
}
@@ -260,7 +265,7 @@ static unsigned int xive_get_irq(void)
/* Scan our queue(s) for interrupts */
irq = xive_scan_interrupts(xc, false);
 
-   DBG_VERBOSE("get_irq: got irq 0x%x, new pending=0x%02x\n",
+   DBG_VERBOSE("get_irq: got irq %d new pending=0x%02x\n",
irq, xc->pending_prio);
 
/* Return pending interrupt if any */
@@ -282,7 +287,7 @@ static unsigned int xive_get_irq(void)
 static void xive_do_queue_eoi(struct xive_cpu *xc)
 {
if (xive_scan_interrupts(xc, true) != 0) {
-   DBG_VERBOSE("eoi: pending=0x%02x\n", xc->pending_prio);
+   DBG_VERBOSE("eoi_irq: more pending !\n");
force_external_irq_replay();
}
 }
@@ -327,11 +332,13 @@ void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data *xd)
in_be64(xd->eoi_mmio);
else {
eoi_val = xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
-   DBG_VERBOSE("eoi_val=%x\n", offset, eoi_val);
+   DBG_VERBOSE("hwirq 0x%x eoi_val=%x\n", hw_irq, eoi_val);
 
/* Re-trigger if needed */
-   if ((eoi_val & XIVE_ESB_VAL_Q) && xd->trig_mmio)
+   if ((eoi_val & XIVE_ESB_VAL_Q) && xd->trig_mmio) {
+   DBG_VERBOSE(" -> eoi retrigger !\n");
out_be64(xd->trig_mmio, 0);
+   }
}
}
 }
@@ -380,10 +387,15 @@ static void xive_do_source_set_mask(struct xive_irq_data *xd,
if (mask) {
val = xive_poke_esb(xd, XIVE_ESB_SET_PQ_01);
xd->saved_p = !!(val & XIVE_ESB_VAL_P);
-   } else if (xd->saved_p)
-   xive_poke_esb(xd, XIVE_ESB_SET_PQ_10);
-   else
-   xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
+   DBG_VERBOSE("masking val=%llx, sp=%d\n",
+   val, xd->saved_p);
+   } else {
+   DBG_VERBOSE("unmasking sp=%d\n", xd->saved_p);
+   if (xd->saved_p)
+   xive_poke_esb(xd, XIVE_ESB_SET_PQ_10);
+   else
+   xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
+   }
 }
 
 /*
@@ -526,6 +538,7 @@ static unsigned int xive_irq_startup(struct irq_data *d)
 
pr_devel("xive_irq_startup: irq %d [0x%x] data @%p\n",
 d->irq, hw_irq, d);
+   pr_devel("  eoi_mmio=%p trig_mmio=%p\n", xd->eoi_mmio, xd->trig_mmio);
 
 #ifdef CONFIG_PCI_MSI
/*
@@ -754,6 +767,8 @@ static int xive_irq_retrigger(struct irq_data *d)
if (WARN_ON(xd->flags & XIVE_IRQ_FLAG_LSI))
return 0;
 
+   DBG_VERBOSE("retrigger irq %d\n", d->irq);
+
/*
 * To perform a retrigger, we first set the PQ bits to
 * 11, then perform an EOI.



Re: [PATCH v5 1/4] printk/nmi: generic solution for safe printk in NMI

2017-04-27 Thread Petr Mladek
On Mon 2017-04-24 11:17:47, Sergey Senozhatsky wrote:
> On (04/21/17 14:06), Petr Mladek wrote:
> [..]
> > > I agree that this_cpu_read(printk_context) covers slightly more than
> > > logbuf_lock scope, so we may get positive this_cpu_read(printk_context)
> > > with unlocked logbuf_lock, but I don't tend to think that it's a big
> > > problem.
> > 
> > PRINTK_SAFE_CONTEXT is set also in call_console_drivers().
> > It might take rather long and logbuf_lock is availe. So, it is
> > noticeable source of false positives.
> 
> yes, agree.
> 
> probably we need additional printk_safe annotations for
>   "logbuf_lock is locked from _this_ CPU"
> 
> false positives there can be very painful.
> 
> [..]
> > if (raw_spin_is_locked(&logbuf_lock))
> > this_cpu_or(printk_context, PRINTK_NMI_CONTEXT_MASK);
> > else
> > this_cpu_or(printk_context, PRINTK_NMI_DEFERRED_CONTEXT_MASK);
> 
> well, if everyone is fine with logbuf_lock access from every CPU from every
> NMI then I won't object either. but maybe it makes sense to reduce the
> possibility of false positives. Steven is losing critically important logs,
> after all.
> 
> 
> by the way,
> does this `nmi_print_seq' bypass even fix anything for Steven?

I think that this is the most important question.

Steven, does the patch from
https://lkml.kernel.org/r/20170420131154.gl3...@pathway.suse.cz
help you to see the debug messages, please?


> it sort of
> can, in theory, but just in theory. so may be we need direct message flush
> from NMI handler (printk->console_unlock), which will be a really big problem.

I thought about it a lot and got scared where this might go.
We need to balance the usefulness and the complexity of the solution.

It took one year to discover this regression. Before it was
suggested to avoid calling printk() in NMI context at all.
Now, we are trying to fix printk() to handle MBs of messages
in NMI context.

If my proposed patch solves the problem for Steven, I would still
like to get a similar solution in. It is not that complex and helps
to bypass the limited per-CPU buffer in most cases. I always thought
that 8kB might be not enough in some cases.

Note that my patch is very defensive. It uses the main log buffer
only when it is really safe. It has higher potential for unneeded
fallback but if it works for Steven (really existing usecase), ...

On the other hand, I would prefer to avoid any much more complex
solution until we have real reports that they are needed.

Also we need to look for alternatives. There is a chance
to create crashdump and get the ftrace messages from it.
Also this might be scenario when we might need to suggest
the early_printk() patchset from Peter Zijlstra.


> logbuf might not be big enough for 4890096 messages (Steven's report
> mentions "Lost 4890096 message(s)!"). we are counting on the fact that
> in case of `nmi_print_seq' bypass some other CPU will call console_unlock()
> and print pending logbuf messages, but this is not guaranteed and the
> messages can be dropped even from logbuf.

Yup. I tested the patch here and I needed to increase the main log buffer
size to see all ftrace messages. Fortunately, it was possible to use a really
huge global buffer. But it is not realistic to use huge per-CPU ones.

Best Regards,
Petr


Re: [PATCH v2 2/3] powerpc/kprobes: un-blacklist system_call() from kprobes

2017-04-27 Thread Naveen N. Rao
On 2017/04/27 08:19PM, Michael Ellerman wrote:
> "Naveen N. Rao"  writes:
> 
> > It is actually safe to probe system_call() in entry_64.S, but only till
> > .Lsyscall_exit. To allow this, convert .Lsyscall_exit to a non-local
> > symbol __system_call() and blacklist that symbol, rather than
> > system_call().
> 
> I'm not sure I like this. The reason we made it a local symbol in the
> first place is because it made backtraces look odd:
> 
>   commit 4c3b21686111e0ac6018469dacbc5549f9915cf8
>   Author: Michael Ellerman 
>   AuthorDate: Fri Dec 5 21:16:59 2014 +1100
>   
>   powerpc/kernel: Make syscall_exit a local label
>   
>   Currently when we back trace something that is in a syscall we see
>   something like this:
>   
>   [c000] [c000] SyS_read+0x6c/0x110
>   [c000] [c000] syscall_exit+0x0/0x98
>   
>   Although it's entirely correct, seeing syscall_exit at the bottom can be
>   confusing - we were exiting from a syscall and then called SyS_read() ?
>   
>   If we instead change syscall_exit to be a local label we get something
>   more intuitive:
>   
>   [c001fa46fde0] [c026719c] SyS_read+0x6c/0x110
>   [c001fa46fe30] [c0009264] system_call+0x38/0xd0
>   
>   ie. we were handling a system call, and it was SyS_read().
> 
> 
> I think you know that, although you didn't mention it in the change log,
> because you've called the new symbol __system_call. But that is not a
> great name either because that's not what it does.

Yes, you're right. I used __system_call since I felt that it wouldn't cause 
confusion like syscall_exit did. I agree it's not a great name, but we 
need _some_ label other than system_call if we want to allow probing at 
this point.

Also, if I'm reading this right, there is no other place to probe if we 
want to capture all system call entries.

So, I felt this would be good to have.

> 
> > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> > index 380361c0bb6a..e030ce34dd66 100644
> > --- a/arch/powerpc/kernel/entry_64.S
> > +++ b/arch/powerpc/kernel/entry_64.S
> > @@ -176,7 +176,7 @@ system_call:/* label this so stack 
> > traces look sane */
> > mtctr   r12
> > bctrl   /* Call handler */
> >  
> > -.Lsyscall_exit:
> > +__system_call:
> > std r3,RESULT(r1)
> > CURRENT_THREAD_INFO(r12, r1)
>   
> Why can't we kprobe the std and the rotate to current thread info?
> 
> Is the real no-probe point just here, prior to the clearing of MSR_RI ?
> 
>   ld  r8,_MSR(r1)
> #ifdef CONFIG_PPC_BOOK3S
>   /* No MSR:RI on BookE */

We can probe at all those places, just not once MSR_RI is unset. So, the 
no-probe point is just *after* the mtmsrd.

However, for kprobe blacklisting, the granularity is at a function level 
(or ASM labels). As such, we will have to blacklist all of 
syscall_exit/__system_call.


Regards,
Naveen



Re: [PATCH v3] KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller

2017-04-27 Thread Michael Ellerman
Paul Mackerras  writes:

> To get this to compile for all my test configs takes this additional
> patch.  I test-build configs with PR KVM and not HV (both modular and
> built-in) and a config with HV enabled but CONFIG_KVM_XICS=n.  Please
> squash this into your topic branch.

Thanks, squashed and pushed as:

5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt 
controller")

cheers


[RFC PATCH 4.8] powerpc/slb: Force a full SLB flush when we insert for a bad EA

2017-04-27 Thread Michael Ellerman
The SLB miss handler calls slb_allocate_realmode() in order to create an
SLB entry for the faulting address. At the very start of that function
we check that the faulting Effective Address (EA) is less than
PGTABLE_RANGE (ignoring the region), ie. is it an address which could
possibly fit in the virtual address space.

For an EA which fails that test, we branch out of line (to label 8), but
we still go on to create an SLB entry for the address. The SLB entry we
create has a VSID of 0, which means it will never match anything in the
hash table and so can't actually translate to a physical address.

However that SLB entry will be inserted in the SLB, and so needs to be
managed properly like any other SLB entry. In particular we need to
insert the SLB entry in the SLB cache, so that it will be flushed when
the process is descheduled.

And that is where the bugs begin. The first bug is that slb_finish_load()
uses cr7 to decide if it should insert the SLB entry into the SLB cache.
When we come from the invalid EA case we don't set cr7; it just has some
junk value from userspace. So we may or may not insert the SLB entry in
the SLB cache. If we fail to insert it, we may then incorrectly leave it
in the SLB when the process is descheduled.

The second bug is that even if we do happen to add the entry to the SLB
cache, we do not have enough bits in the SLB cache to remember the full
ESID value for very large EAs.

For example if a process branches to 0x788c545a1800, that results in
a 256MB SLB entry with an ESID of 0x788c545a1. But each entry in the SLB
cache is only 32-bits, meaning we truncate the ESID to 0x88c545a1. This
has the same effect as the first bug, we incorrectly leave the SLB entry
in the SLB when the process is descheduled.

When a process accesses an invalid EA it results in a SEGV signal being
sent to the process, which typically results in the process being
killed. Process death isn't instantaneous, however; the process may
catch the SEGV signal and continue somehow, or the kernel may start
writing a core dump for the process, either of which means it's possible
for the process to be preempted while it's processing the SEGV but
before it's been killed.

If that happens, when the process is scheduled back onto the CPU we will
allocate a new SLB entry for the NIP, which will insert a second entry
into the SLB for the bad EA. Because we never flushed the original
entry, due to either bug one or two, we now have two SLB entries that
match the same EA.

If another access is made to that EA, either by the process continuing
after catching the SEGV, or by a second process accessing the same bad
EA on the same CPU, we will trigger an SLB multi-hit machine check
exception. This has been observed happening in the wild.

The fix is when we hit the invalid EA case, we mark the SLB cache as
being full. This causes us to not insert the truncated ESID into the SLB
cache, and means when the process is switched out we will flush the
entire SLB. Note that this works both for the original fault and for a
subsequent call to slb_allocate_realmode() from switch_slb().

Because we mark the SLB cache as full, it doesn't really matter what
value is in cr7, but rather than leaving it as something random we set
it to indicate the address was a kernel address. That also skips the
attempt to insert it in the SLB cache which is a nice side effect.

Another way to fix the bug would be to make the entries in the SLB cache
wider, so that we don't truncate the ESID. However this would be a more
intrusive change as it alters the size and layout of the paca.

This bug was fixed in upstream by commit f0f558b131db ("powerpc/mm:
Preserve CFAR value on SLB miss caused by access to bogus address"),
which changed the way we handle a bad EA entirely removing this bug in
the process.

Cc: sta...@vger.kernel.org
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/mm/slb_low.S | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/mm/slb_low.S b/arch/powerpc/mm/slb_low.S
index dfdb90cb4403..1348c4862b08 100644
--- a/arch/powerpc/mm/slb_low.S
+++ b/arch/powerpc/mm/slb_low.S
@@ -174,6 +174,16 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
b   slb_finish_load
 
 8: /* invalid EA */
+   /*
+* It's possible the bad EA is too large to fit in the SLB cache, which
+* would mean we'd fail to invalidate it on context switch. So mark the
+* SLB cache as full so we force a full flush. We also set cr7+eq to
+* mark the address as a kernel address, so slb_finish_load() skips
+* trying to insert it into the SLB cache.
+*/
+   li  r9,SLB_CACHE_ENTRIES + 1
+   sth r9,PACASLBCACHEPTR(r13)
+   crset   4*cr7+eq
li  r10,0   /* BAD_VSID */
li  r9,0/* BAD_VSID */
li  r11,SLB_VSID_USER   /* flags don't much matter */
-- 
2.7.4



[PATCH] Enabled pstore write for powerpc

2017-04-27 Thread Ankit Kumar
After commit c950fd6f201a the kernel registers pstore write based on the
flags set. Pstore write for powerpc is broken because PSTORE_FLAGS_DMESG
is not set for the powerpc architecture. On panic, the kernel doesn't
write the message to /fs/pstore/dmesg* (the entry doesn't get created at
all).

This patch enables pstore write for the powerpc architecture by setting
the PSTORE_FLAGS_DMESG flag.

Fixes: c950fd6f201a ("pstore: Split pstore fragile flags")
Signed-off-by: Ankit Kumar 
---

 arch/powerpc/kernel/nvram_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/nvram_64.c b/arch/powerpc/kernel/nvram_64.c
index d5e2b83..021db31 100644
--- a/arch/powerpc/kernel/nvram_64.c
+++ b/arch/powerpc/kernel/nvram_64.c
@@ -561,6 +561,7 @@ static ssize_t nvram_pstore_read(u64 *id, enum 
pstore_type_id *type,
 static struct pstore_info nvram_pstore_info = {
.owner = THIS_MODULE,
.name = "nvram",
+   .flags = PSTORE_FLAGS_DMESG,
.open = nvram_pstore_open,
.read = nvram_pstore_read,
.write = nvram_pstore_write,
-- 
2.7.4



Re: [PATCH] powerpc/kprobes: refactor kprobe_lookup_name for safer string operations

2017-04-27 Thread Michael Ellerman
"Naveen N. Rao"  writes:
> Excerpts from Masami Hiramatsu's message of April 26, 2017 10:11:
>> On Tue, 25 Apr 2017 21:37:11 +0530
>> "Naveen N. Rao"  wrote:
>>> -   addr = (kprobe_opcode_t *)kallsyms_lookup_name(dot_name);
>>> -   if (!addr && dot_appended) {
>>> -   /* Let's try the original non-dot symbol lookup */
>>> +   ret = strscpy(dot_name + len, c, KSYM_NAME_LEN);
>>> +   if (ret >= 0)
>> 
>> Here, maybe you can skip the case of ret == 0. (Or, would we have
>> a symbol which only has "."?)
>
> Ah, indeed. Good point. We just need the test to be (ret > 0).
>
> Michael,
> If the rest of the patch is fine by you, would you be ok to make the 
> small change above? If not, please let me know and I'll re-spin. Thanks.

I'd rather you change it, test and then resend.

cheers


Re: [PATCH v2] cxl: Prevent IRQ storm

2017-04-27 Thread Michael Ellerman
Andrew Donnellan  writes:

> On 27/04/17 11:37, Alastair D'Silva wrote:
>> From: Alastair D'Silva 
>>
>> In some situations, a faulty AFU slice may create an interrupt storm,
>> rendering the machine unusable. Since these interrupts are informational
>> only, present the interrupt once, then mask it off to prevent it from
>> being retriggered until the card is reset.
>>
>> Changelog:
>> v2
>>  Rebase against linux-next
>
> The patch changelog shouldn't be part of the commit message - it should 
> go under a "---" line after the sign-off so it doesn't get included in 
> the final commit.
>
> Also now that I've taken a second look, I think the summary line of the 
> commit message could be more descriptive, something like:
>
> "cxl: mask slice error interrupt after first occurrence"
^
M

:D

cheers


Re: [PATCH 2/7] mm/follow_page_mask: Split follow_page_mask to smaller functions.

2017-04-27 Thread Naoya Horiguchi
On Mon, Apr 17, 2017 at 10:41:41PM +0530, Aneesh Kumar K.V wrote:
> Makes the code easier to read. No functional changes in this patch. In a followup
> patch, we will be updating the follow_page_mask to handle hugetlb hugepd 
> format
> so that archs like ppc64 can switch to the generic version. This split helps
> in doing that nicely.
> 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: Naoya Horiguchi 


Re: [PATCH 1/7] mm/hugetlb/migration: Use set_huge_pte_at instead of set_pte_at

2017-04-27 Thread Naoya Horiguchi
On Mon, Apr 17, 2017 at 10:41:40PM +0530, Aneesh Kumar K.V wrote:
> The right interface to use to set a hugetlb pte entry is set_huge_pte_at. Use
> that instead of set_pte_at.
> 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: Naoya Horiguchi 


Re: [PATCH 3/7] mm/hugetlb: export hugetlb_entry_migration helper

2017-04-27 Thread Naoya Horiguchi
On Mon, Apr 17, 2017 at 10:41:42PM +0530, Aneesh Kumar K.V wrote:
> We will be using this later from the ppc64 code. Change the return type to 
> bool.
> 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: Naoya Horiguchi 


Re: powerpc/powernv: Fix missing attr initialisation in opal_export_attrs()

2017-04-27 Thread Michael Ellerman
On Thu, 2017-04-27 at 01:37:32 UTC, Michael Ellerman wrote:
> In opal_export_attrs() we dynamically allocate some bin_attributes. They're
> allocated with kmalloc() and although we initialise most of the fields, we 
> don't
> initialise write() or mmap(), and in particular we don't initialise the 
> lockdep
> related fields in the embedded struct attribute.
> 
> This leads to a lockdep warning at boot:
> 
>   BUG: key c000f11906d8 not in .data!
>   WARNING: CPU: 0 PID: 1 at ../kernel/locking/lockdep.c:3136 
> lockdep_init_map+0x28c/0x2a0
>   ...
>   Call Trace:
> lockdep_init_map+0x288/0x2a0 (unreliable)
> __kernfs_create_file+0x8c/0x170
> sysfs_add_file_mode_ns+0xc8/0x240
> __machine_initcall_powernv_opal_init+0x60c/0x684
> do_one_initcall+0x60/0x1c0
> kernel_init_freeable+0x2f4/0x3d4
> kernel_init+0x24/0x160
> ret_from_kernel_thread+0x5c/0xb0
> 
> Fix it by kzalloc'ing the attr, which fixes the uninitialised write() and
> mmap(), and calling sysfs_bin_attr_init() on it to initialise the lockdep
> fields.
> 
> Fixes: 11fe909d2362 ("powerpc/powernv: Add OPAL exports attributes to sysfs")
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/83c4919058459c32138a1ebe35f72b

cheers


Re: [v2,1/2] powerpc/mm/radix: Optimise Page Walk Cache flush

2017-04-27 Thread Michael Ellerman
On Wed, 2017-04-26 at 13:27:19 UTC, Michael Ellerman wrote:
> Currently we implement flushing of the page walk cache (PWC) by calling
> _tlbiel_pid() with a RIC (Radix Invalidation Control) value of 1 which says to
> only flush the PWC.
> 
> But _tlbiel_pid() loops over each set (congruence class) of the TLB, which is
> not necessary when we're just flushing the PWC.
> 
> In fact the set argument is ignored for a PWC flush, so essentially we're just
> flushing the PWC 127 extra times for no benefit.
> 
> Fix it by adding tlbiel_pwc() which just does a single flush of the PWC.
> 
> Signed-off-by: Aneesh Kumar K.V 
> [mpe: Split out of combined patch, drop _ in name, rewrite change log]
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/5a9853946c2e7a5ef9ef5302ecada6

cheers


Re: powerpc/powernv: Fix oops on P9 DD1 in cause_ipi()

2017-04-27 Thread Michael Ellerman
On Wed, 2017-04-26 at 10:57:47 UTC, Michael Ellerman wrote:
> Recently we merged the native xive support for Power9, and then separately 
> some
> reworks for doorbell IPI support. In isolation both series were OK, but the
> merged result had a bug in one case.
> 
> On P9 DD1 we use pnv_p9_dd1_cause_ipi() which tries to use doorbells, and then
> falls back to the interrupt controller. However the fallback is implemented by
> calling icp_ops->cause_ipi. But now that xive support is merged we might be
> using xive, in which case icp_ops is not initialised, it's a xics specific
> structure. This leads to an oops such as:
> 
>   Unable to handle kernel paging request for data at address 0x0028
>   Oops: Kernel access of bad area, sig: 11 [#1]
>   NIP pnv_p9_dd1_cause_ipi+0x74/0xe0
>   LR smp_muxed_ipi_message_pass+0x54/0x70
> 
> To fix it, rather than using icp_ops which might be NULL, have both xics and
> xive set smp_ops->cause_ipi, and then in the powernv code we save that as
> ic_cause_ipi before overriding smp_ops->cause_ipi. For paranoia add a 
> WARN_ON()
> to check if somehow smp_ops->cause_ipi is NULL.
> 
> Fixes: b866cc2199d6 ("powerpc: Change the doorbell IPI calling convention")
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/45b21cfeb22087795f0b49397fbe52

cheers


Re: [REBASED,v4,1/2] powerpc: split ftrace bits into a separate file

2017-04-27 Thread Michael Ellerman
On Tue, 2017-04-25 at 13:55:53 UTC, "Naveen N. Rao" wrote:
> entry_*.S now includes a lot more than just kernel entry/exit code. As a
> first step at cleaning this up, let's split out the ftrace bits into
> separate files. Also move all related tracing code into a new trace/
> subdirectory.
> 
> No functional changes.
> 
> Suggested-by: Michael Ellerman 
> Signed-off-by: Naveen N. Rao 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/4781f015d35ce2e83632b6a938093b

cheers


Re: powerpc/mm: Fix possible out-of-bounds shift in arch_mmap_rnd()

2017-04-27 Thread Michael Ellerman
On Tue, 2017-04-25 at 12:09:41 UTC, Michael Ellerman wrote:
> The recent patch to add runtime configuration of the ASLR limits added a bug 
> in
> arch_mmap_rnd() where we may shift an integer (32-bits) by up to 33 bits,
> leading to undefined behaviour.
> 
> In practice it exhibits as every process seg faulting instantly, presumably
> because the rnd value hasn't been restricted by the modulus at all. We didn't
> notice because it only happens under certain kernel configurations and if the
> number of bits is actually set to a large value.
> 
> Fix it by switching to unsigned long.
> 
> Fixes: 9fea59bd7ca5 ("powerpc/mm: Add support for runtime configuration of 
> ASLR limits")
> Reported-by: Balbir Singh 
> Signed-off-by: Michael Ellerman 
> Reviewed-by: Kees Cook 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/b409946b2a3c1ddcde75e5f35a77e0

cheers


Re: [v2] powerpc/mm: Fix page table dump build on PPC32

2017-04-27 Thread Michael Ellerman
On Tue, 2017-04-18 at 06:20:13 UTC, Christophe Leroy wrote:
> On PPC32 (ex: mpc885_ads_defconfig), page table dump compilation
> fails as follows. This is because the memory layout is slightly
> different on PPC32. This patch adapts it.
> 
>   CC  arch/powerpc/mm/dump_linuxpagetables.o
> arch/powerpc/mm/dump_linuxpagetables.c: In function 'walk_pagetables':
> arch/powerpc/mm/dump_linuxpagetables.c:369:10: error: 'KERN_VIRT_START' 
> undeclared (first use in this function)
> arch/powerpc/mm/dump_linuxpagetables.c:369:10: note: each undeclared 
> identifier is reported only once for each function it appears in
> arch/powerpc/mm/dump_linuxpagetables.c: In function 'populate_markers':
> arch/powerpc/mm/dump_linuxpagetables.c:383:37: error: 'ISA_IO_BASE' 
> undeclared (first use in this function)
> arch/powerpc/mm/dump_linuxpagetables.c:384:37: error: 'ISA_IO_END' undeclared 
> (first use in this function)
> arch/powerpc/mm/dump_linuxpagetables.c:385:37: error: 'PHB_IO_BASE' 
> undeclared (first use in this function)
> arch/powerpc/mm/dump_linuxpagetables.c:386:37: error: 'PHB_IO_END' undeclared 
> (first use in this function)
> arch/powerpc/mm/dump_linuxpagetables.c:387:37: error: 'IOREMAP_BASE' 
> undeclared (first use in this function)
> arch/powerpc/mm/dump_linuxpagetables.c:388:37: error: 'IOREMAP_END' 
> undeclared (first use in this function)
> arch/powerpc/mm/dump_linuxpagetables.c:392:38: error: 'VMEMMAP_BASE' 
> undeclared (first use in this function)
> arch/powerpc/mm/dump_linuxpagetables.c: In function 'ptdump_show':
> arch/powerpc/mm/dump_linuxpagetables.c:400:20: error: 'KERN_VIRT_START' 
> undeclared (first use in this function)
> make[1]: *** [arch/powerpc/mm/dump_linuxpagetables.o] Error 1
> make: *** [arch/powerpc/mm] Error 2
> 
> Fixes: 8eb07b187000d ("powerpc/mm: Dump linux pagetables")
> Signed-off-by: Christophe Leroy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/2fab9fe1f9ff6836a82bf8bdb26e67

cheers


Re: powerpc/mm: Rename table dump file name

2017-04-27 Thread Michael Ellerman
On Tue, 2017-04-18 at 06:20:15 UTC, Christophe Leroy wrote:
> Page table dump debugfs file is named 'kernel_page_tables' on
> all other architectures implementing it, while it is named
> 'kernel_pagetables' on powerpc. This patch renames it.
> 
> Signed-off-by: Christophe Leroy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/ec95e15862e31f8dfb6218ca111548

cheers


Re: [v2] powerpc/mm: Fix missing page attributes in page table dump

2017-04-27 Thread Michael Ellerman
On Fri, 2017-04-14 at 05:45:16 UTC, Christophe Leroy wrote:
> On some targets, _PAGE_RW is 0 and it is _PAGE_RO which is used.
> _PAGE_SHARED is also missing.
> 
> Signed-off-by: Christophe Leroy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c99317953323d6251245022cb3af54

cheers


Re: powerpc/mm: On PPC32, display 32 bits addresses in page table dump

2017-04-27 Thread Michael Ellerman
On Thu, 2017-04-13 at 12:41:40 UTC, Christophe Leroy wrote:
> Signed-off-by: Christophe Leroy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/e1f2c9d97d932812d17509e86246c6

cheers


Re: [PATCH v2 2/3] powerpc/kprobes: un-blacklist system_call() from kprobes

2017-04-27 Thread Michael Ellerman
"Naveen N. Rao"  writes:

> It is actually safe to probe system_call() in entry_64.S, but only till
> .Lsyscall_exit. To allow this, convert .Lsyscall_exit to a non-local
> symbol __system_call() and blacklist that symbol, rather than
> system_call().

I'm not sure I like this. The reason we made it a local symbol in the
first place is because it made backtraces look odd:

  commit 4c3b21686111e0ac6018469dacbc5549f9915cf8
  Author: Michael Ellerman 
  AuthorDate: Fri Dec 5 21:16:59 2014 +1100
  
  powerpc/kernel: Make syscall_exit a local label
  
  Currently when we back trace something that is in a syscall we see
  something like this:
  
  [c000] [c000] SyS_read+0x6c/0x110
  [c000] [c000] syscall_exit+0x0/0x98
  
  Although it's entirely correct, seeing syscall_exit at the bottom can be
  confusing - we were exiting from a syscall and then called SyS_read() ?
  
  If we instead change syscall_exit to be a local label we get something
  more intuitive:
  
  [c001fa46fde0] [c026719c] SyS_read+0x6c/0x110
  [c001fa46fe30] [c0009264] system_call+0x38/0xd0
  
  ie. we were handling a system call, and it was SyS_read().


I think you know that, although you didn't mention it in the change log,
because you've called the new symbol __system_call. But that is not a
great name either because that's not what it does.

> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 380361c0bb6a..e030ce34dd66 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -176,7 +176,7 @@ system_call:  /* label this so stack 
> traces look sane */
>   mtctr   r12
>   bctrl   /* Call handler */
>  
> -.Lsyscall_exit:
> +__system_call:
>   std r3,RESULT(r1)
>   CURRENT_THREAD_INFO(r12, r1)
  
Why can't we kprobe the std and the rotate to current thread info?

Is the real no-probe point just here, prior to the clearing of MSR_RI ?

ld  r8,_MSR(r1)
#ifdef CONFIG_PPC_BOOK3S
/* No MSR:RI on BookE */

cheers


Re: [PATCH v3] KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller

2017-04-27 Thread Paul Mackerras
To get this to compile for all my test configs takes this additional
patch.  I test-build configs with PR KVM and not HV (both modular and
built-in) and a config with HV enabled but CONFIG_KVM_XICS=n.  Please
squash this into your topic branch.

Paul.

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index c56939ecc554..24de532c1736 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -200,7 +200,7 @@ config KVM_XICS
 config KVM_XIVE
bool
default y
-   depends on KVM_XICS && PPC_XIVE_NATIVE
+   depends on KVM_XICS && PPC_XIVE_NATIVE && KVM_BOOK3S_HV_POSSIBLE
 
 source drivers/vhost/Kconfig
 
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 5c00813e1e0e..846b40cb3a62 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -513,6 +513,7 @@ static long kvmppc_read_one_intr(bool *again)
return kvmppc_check_passthru(xisr, xirr, again);
 }
 
+#ifdef CONFIG_KVM_XICS
 static inline bool is_rm(void)
 {
return !(mfmsr() & MSR_DR);
@@ -591,3 +592,4 @@ int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long 
xirr)
} else
return xics_rm_h_eoi(vcpu, xirr);
 }
+#endif /* CONFIG_KVM_XICS */
diff --git a/arch/powerpc/kvm/book3s_xics.h b/arch/powerpc/kvm/book3s_xics.h
index 5016676847c9..453c9e518c19 100644
--- a/arch/powerpc/kvm/book3s_xics.h
+++ b/arch/powerpc/kvm/book3s_xics.h
@@ -10,6 +10,7 @@
 #ifndef _KVM_PPC_BOOK3S_XICS_H
 #define _KVM_PPC_BOOK3S_XICS_H
 
+#ifdef CONFIG_KVM_XICS
 /*
  * We use a two-level tree to store interrupt source information.
  * There are up to 1024 ICS nodes, each of which can represent
@@ -150,4 +151,5 @@ extern int xics_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned 
long server,
 extern int xics_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr);
 extern int xics_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr);
 
+#endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index fcccfbc2c4f4..5938f7644dc1 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -9,6 +9,7 @@
 #ifndef _KVM_PPC_BOOK3S_XIVE_H
 #define _KVM_PPC_BOOK3S_XIVE_H
 
+#ifdef CONFIG_KVM_XICS
 #include "book3s_xics.h"
 
 /*
@@ -251,4 +252,5 @@ extern int (*__xive_vm_h_ipi)(struct kvm_vcpu *vcpu, 
unsigned long server,
 extern int (*__xive_vm_h_cppr)(struct kvm_vcpu *vcpu, unsigned long cppr);
 extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr);
 
+#endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index 9d312c96a897..6feac0a758e1 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -267,6 +267,7 @@ static int xive_native_get_ipi(unsigned int cpu, struct 
xive_cpu *xc)
}
return 0;
 }
+#endif /* CONFIG_SMP */
 
 u32 xive_native_alloc_irq(void)
 {
@@ -295,6 +296,7 @@ void xive_native_free_irq(u32 irq)
 }
 EXPORT_SYMBOL_GPL(xive_native_free_irq);
 
+#ifdef CONFIG_SMP
 static void xive_native_put_ipi(unsigned int cpu, struct xive_cpu *xc)
 {
s64 rc;


[PATCH v5 3/3] kdump: Protect vmcoreinfo data under the crash memory

2017-04-27 Thread Xunlei Pang
Currently vmcoreinfo data is updated at boot time in a subsys_initcall();
this runs the risk of it being modified by errant code while the system
is running.

As a result, the vmcore dumped may contain wrong vmcoreinfo. Later on,
when using the "crash", "makedumpfile", etc. utilities to parse this
vmcore, we will probably get a "Segmentation fault" or other unexpected
errors.

E.g. 1) errant code overwrites vmcoreinfo_data; 2) the system later
crashes; 3) kdump is triggered, and we obviously fail to recognize the
crash context correctly due to the corrupted vmcoreinfo.

Now, except for vmcoreinfo, all the crash data is well protected
(including the cpu note, which is fully updated in the crash path, so
its correctness is guaranteed). Given that vmcoreinfo data is a large
chunk prepared for kdump, we had better protect it as well.

To solve this, we relocate and copy vmcoreinfo_data into the crash memory
when kdump is loaded via the kexec syscalls. Because the whole crash
memory is protected by the existing arch_kexec_protect_crashkres()
mechanism, we naturally protect vmcoreinfo_data from write (and even
read) access via the kernel direct mapping after kdump is loaded.

Since kdump is usually loaded at the very early stage after boot, we can
trust the correctness of the vmcoreinfo data copied.

On the other hand, we still need to access the vmcoreinfo safe copy when
a crash happens, in order to generate vmcoreinfo_note again. For that we
rely on vmap() to map a new kernel virtual address for it, and switch to
using that address in the subsequent crash_save_vmcoreinfo().

Note that we do not touch vmcoreinfo_note itself, because it is fully
regenerated from the protected vmcoreinfo_data after the crash, and is
therefore correct, just like the cpu crash note.

Tested-by: Michael Holzheu 
Signed-off-by: Xunlei Pang 
---
v4->v5:
- Moved vunmap(image->vmcoreinfo_data_copy) above to avoid confusion.
- No functional change.

v3->v4:
-Rebased on the latest linux-next
-Copy vmcoreinfo after machine_kexec_prepare()

 include/linux/crash_core.h |  2 +-
 include/linux/kexec.h  |  2 ++
 kernel/crash_core.c| 17 -
 kernel/kexec.c |  8 
 kernel/kexec_core.c| 39 +++
 kernel/kexec_file.c|  8 
 6 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index 4555c09..e9de6b4 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -23,6 +23,7 @@
 
 typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
 
+void crash_update_vmcoreinfo_safecopy(void *ptr);
 void crash_save_vmcoreinfo(void);
 void arch_crash_save_vmcoreinfo(void);
 __printf(1, 2)
@@ -54,7 +55,6 @@
vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long)value)
 
 extern u32 *vmcoreinfo_note;
-extern size_t vmcoreinfo_size;
 
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
  void *data, size_t data_len);
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index c9481eb..3ea8275 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -181,6 +181,7 @@ struct kimage {
unsigned long start;
struct page *control_code_page;
struct page *swap_page;
+   void *vmcoreinfo_data_copy; /* locates in the crash memory */
 
unsigned long nr_segments;
struct kexec_segment segment[KEXEC_SEGMENT_MAX];
@@ -250,6 +251,7 @@ extern void *kexec_purgatory_get_symbol_addr(struct kimage 
*image,
 int kexec_should_crash(struct task_struct *);
 int kexec_crash_loaded(void);
 void crash_save_cpu(struct pt_regs *regs, int cpu);
+extern int kimage_crash_copy_vmcoreinfo(struct kimage *image);
 
 extern struct kimage *kexec_image;
 extern struct kimage *kexec_crash_image;
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index c2fd0d2..4a4a4ba 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -15,9 +15,12 @@
 
 /* vmcoreinfo stuff */
 static unsigned char *vmcoreinfo_data;
-size_t vmcoreinfo_size;
+static size_t vmcoreinfo_size;
 u32 *vmcoreinfo_note;
 
+/* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
+static unsigned char *vmcoreinfo_data_safecopy;
+
 /*
  * parsing the "crashkernel" commandline
  *
@@ -323,11 +326,23 @@ static void update_vmcoreinfo_note(void)
final_note(buf);
 }
 
+void crash_update_vmcoreinfo_safecopy(void *ptr)
+{
+   if (ptr)
+   memcpy(ptr, vmcoreinfo_data, vmcoreinfo_size);
+
+   vmcoreinfo_data_safecopy = ptr;
+}
+
 void crash_save_vmcoreinfo(void)
 {
if (!vmcoreinfo_note)
return;
 
+   /* Use the safe copy to generate vmcoreinfo note if have */
+   if (vmcoreinfo_data_safecopy)
+   vmcoreinfo_data = vmcoreinfo_data_safecopy;
+
vmcoreinfo_append_str("CRASHTIME=%ld\n", get_seconds());
update_vmcoreinfo_note();
 }
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 980936a..e62ec4d 100644
--- a/kernel/kexec.c
+++ b/kernel/k

[PATCH v5 2/3] powerpc/fadump: Use the correct VMCOREINFO_NOTE_SIZE for phdr

2017-04-27 Thread Xunlei Pang
vmcoreinfo_max_size is the size of vmcoreinfo_data; the correct size
to use here is that of vmcoreinfo_note, whose total size is
VMCOREINFO_NOTE_SIZE.

As explained in commit 77019967f06b ("kdump: fix exported
size of vmcoreinfo note"), it should not affect the actual
function, but we had better fix it; this change should also be
safe and backward compatible.

After this, we can get rid of the variable vmcoreinfo_max_size and use
the corresponding macros directly; fewer variables means more safety
for vmcoreinfo operations.

Cc: Hari Bathini 
Reviewed-by: Mahesh Salgaonkar 
Reviewed-by: Dave Young 
Signed-off-by: Xunlei Pang 
---
v4->v5:
No change.

v3->v4:
-Rebased on the latest linux-next

 arch/powerpc/kernel/fadump.c | 3 +--
 include/linux/crash_core.h   | 1 -
 kernel/crash_core.c  | 3 +--
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 466569e..7bd6cd0 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -893,8 +893,7 @@ static int fadump_create_elfcore_headers(char *bufp)
 
phdr->p_paddr   = fadump_relocate(paddr_vmcoreinfo_note());
phdr->p_offset  = phdr->p_paddr;
-   phdr->p_memsz   = vmcoreinfo_max_size;
-   phdr->p_filesz  = vmcoreinfo_max_size;
+   phdr->p_memsz   = phdr->p_filesz = VMCOREINFO_NOTE_SIZE;
 
/* Increment number of program headers. */
(elf->e_phnum)++;
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index ec9d415..4555c09 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -55,7 +55,6 @@
 
 extern u32 *vmcoreinfo_note;
 extern size_t vmcoreinfo_size;
-extern size_t vmcoreinfo_max_size;
 
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
  void *data, size_t data_len);
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 2837d61..c2fd0d2 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -16,7 +16,6 @@
 /* vmcoreinfo stuff */
 static unsigned char *vmcoreinfo_data;
 size_t vmcoreinfo_size;
-size_t vmcoreinfo_max_size = VMCOREINFO_BYTES;
 u32 *vmcoreinfo_note;
 
 /*
@@ -343,7 +342,7 @@ void vmcoreinfo_append_str(const char *fmt, ...)
r = vscnprintf(buf, sizeof(buf), fmt, args);
va_end(args);
 
-   r = min(r, vmcoreinfo_max_size - vmcoreinfo_size);
+   r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size);
 
memcpy(&vmcoreinfo_data[vmcoreinfo_size], buf, r);
 
-- 
1.8.3.1



[PATCH v5 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-04-27 Thread Xunlei Pang
As Eric said,
"what we need to do is move the variable vmcoreinfo_note out
of the kernel's .bss section.  And modify the code to regenerate
and keep this information in something like the control page.

Definitely something like this needs a page all to itself, and ideally
far away from any other kernel data structures.  I clearly was not
watching closely the day someone decided to keep this silly thing
in the kernel's .bss section."

This patch allocates extra pages for these vmcoreinfo_XXX variables.
One advantage is that it improves the safety of vmcoreinfo, because
vmcoreinfo is now kept far away from other kernel data structures.
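
A user-space sketch of the idea (not the kernel code -- calloc() stands in
for the kernel's get_zeroed_page(), and the names only mirror those in
kernel/crash_core.c): the buffers move from static .bss arrays to pages
allocated at init time.

```c
/* Sketch: vmcoreinfo buffers in pages of their own instead of .bss.
 * calloc() stands in for the kernel's get_zeroed_page(). */
#include <stddef.h>
#include <stdlib.h>

#define VMCOREINFO_BYTES 4096  /* one page, per this patch */

static unsigned char *vmcoreinfo_data; /* was: static array in .bss */
static unsigned int *vmcoreinfo_note;  /* was: u32 array in .bss */

int crash_save_vmcoreinfo_init(void)
{
	/* A page all to itself, far away from other kernel data
	 * structures, so corruption elsewhere is less likely to
	 * clobber the vmcoreinfo note. */
	vmcoreinfo_data = calloc(1, VMCOREINFO_BYTES);
	if (!vmcoreinfo_data)
		return -1;

	vmcoreinfo_note = calloc(1, VMCOREINFO_BYTES);
	if (!vmcoreinfo_note) {
		free(vmcoreinfo_data);
		vmcoreinfo_data = NULL;
		return -1;
	}
	return 0;
}
```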

Cc: Juergen Gross 
Suggested-by: Eric Biederman 
Tested-by: Michael Holzheu 
Reviewed-by: Juergen Gross 
Signed-off-by: Xunlei Pang 
---
v4->v5:
Changed VMCOREINFO_BYTES definition to PAGE_SIZE according to Dave's comment

v3->v4:
-Rebased on the latest linux-next
-Handle S390 vmcoreinfo_note properly
-Handle the newly-added xen/mmu_pv.c

 arch/ia64/kernel/machine_kexec.c |  5 -
 arch/s390/kernel/machine_kexec.c |  1 +
 arch/s390/kernel/setup.c |  6 --
 arch/x86/kernel/crash.c  |  2 +-
 arch/x86/xen/mmu_pv.c|  4 ++--
 include/linux/crash_core.h   |  4 ++--
 kernel/crash_core.c  | 26 ++
 kernel/ksysfs.c  |  2 +-
 8 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/arch/ia64/kernel/machine_kexec.c b/arch/ia64/kernel/machine_kexec.c
index 599507b..c14815d 100644
--- a/arch/ia64/kernel/machine_kexec.c
+++ b/arch/ia64/kernel/machine_kexec.c
@@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
 #endif
 }
 
-phys_addr_t paddr_vmcoreinfo_note(void)
-{
-   return ia64_tpa((unsigned long)(char *)&vmcoreinfo_note);
-}
-
diff --git a/arch/s390/kernel/machine_kexec.c b/arch/s390/kernel/machine_kexec.c
index 49a6bd4..3d0b14a 100644
--- a/arch/s390/kernel/machine_kexec.c
+++ b/arch/s390/kernel/machine_kexec.c
@@ -246,6 +246,7 @@ void arch_crash_save_vmcoreinfo(void)
VMCOREINFO_SYMBOL(lowcore_ptr);
VMCOREINFO_SYMBOL(high_memory);
VMCOREINFO_LENGTH(lowcore_ptr, NR_CPUS);
+   mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
 }
 
 void machine_shutdown(void)
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index 3ae756c..3d1d808 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -496,11 +496,6 @@ static void __init setup_memory_end(void)
pr_notice("The maximum memory size is %luMB\n", memory_end >> 20);
 }
 
-static void __init setup_vmcoreinfo(void)
-{
-   mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
-}
-
 #ifdef CONFIG_CRASH_DUMP
 
 /*
@@ -939,7 +934,6 @@ void __init setup_arch(char **cmdline_p)
 #endif
 
setup_resources();
-   setup_vmcoreinfo();
setup_lowcore();
smp_fill_possible_mask();
cpu_detect_mhz_feature();
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 22217ec..44404e2 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -457,7 +457,7 @@ static int prepare_elf64_headers(struct crash_elf_data *ced,
bufp += sizeof(Elf64_Phdr);
phdr->p_type = PT_NOTE;
phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
-   phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
+   phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
(ehdr->e_phnum)++;
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 9d9ae66..35543fa 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2723,8 +2723,8 @@ void xen_destroy_contiguous_region(phys_addr_t pstart, unsigned int order)
 phys_addr_t paddr_vmcoreinfo_note(void)
 {
if (xen_pv_domain())
-   return virt_to_machine(&vmcoreinfo_note).maddr;
+   return virt_to_machine(vmcoreinfo_note).maddr;
else
-   return __pa_symbol(&vmcoreinfo_note);
+   return __pa(vmcoreinfo_note);
 }
 #endif /* CONFIG_KEXEC_CORE */
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index eb71a70..ec9d415 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -14,7 +14,7 @@
 CRASH_CORE_NOTE_NAME_BYTES +   \
 CRASH_CORE_NOTE_DESC_BYTES)
 
-#define VMCOREINFO_BYTES  (4096)
+#define VMCOREINFO_BYTES  PAGE_SIZE
 #define VMCOREINFO_NOTE_NAME  "VMCOREINFO"
 #define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
 #define VMCOREINFO_NOTE_SIZE  ((CRASH_CORE_NOTE_HEAD_BYTES * 2) +  \
@@ -53,7 +53,7 @@
 #define VMCOREINFO_PHYS_BASE(value) \
vmcoreinfo_append_str("PHYS_BASE=%lx\n", (unsigned long)value)
 
-extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
+extern u32 *vmcoreinfo_note;
 extern size_t vmcoreinfo_size;
 extern size_t vmcoreinfo_max_size;
 
diff --git a

[PATCH] powerpc/pseries hotplug: prevent the reserved mem from removing

2017-04-27 Thread Pingfan Liu
For example, after fadump reserves memory regions, these regions should
not be removed before fadump explicitly frees them.
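
A user-space sketch of the guard this patch adds (the reserved set is a toy
array here; in the kernel the check is memblock_is_reserved(), and the
addresses are illustrative only): removal bails out with -EINVAL while the
base address is still reserved.

```c
/* Sketch: refuse to remove a memory block whose base is still
 * reserved (e.g. by fadump).  Toy stand-in for memblock. */
#include <errno.h>
#include <stddef.h>

static unsigned long reserved_bases[] = { 0x40000000UL }; /* e.g. a fadump region */

static int memblock_is_reserved(unsigned long base)
{
	size_t i;

	for (i = 0; i < sizeof(reserved_bases) / sizeof(reserved_bases[0]); i++)
		if (reserved_bases[i] == base)
			return 1;
	return 0;
}

int pseries_remove_memblock(unsigned long base, unsigned int memblock_size)
{
	(void)memblock_size;

	if (memblock_is_reserved(base))
		return -EINVAL;
	/* ... proceed with offlining and removing the sections ... */
	return 0;
}
```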

Signed-off-by: Pingfan Liu 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index e104c71..201be23 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -346,6 +346,8 @@ static int pseries_remove_memblock(unsigned long base, unsigned int memblock_siz
 
if (!pfn_valid(start_pfn))
goto out;
+   if (memblock_is_reserved(base))
+   return -EINVAL;
 
block_sz = pseries_memory_block_size();
sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
@@ -388,8 +390,7 @@ static int pseries_remove_mem_node(struct device_node *np)
base = be64_to_cpu(*(unsigned long *)regs);
lmb_size = be32_to_cpu(regs[3]);
 
-   pseries_remove_memblock(base, lmb_size);
-   return 0;
+   return pseries_remove_memblock(base, lmb_size);
 }
 
 static bool lmb_is_removable(struct of_drconf_cell *lmb)
-- 
2.7.4



[PATCH v2 2/3] powerpc/kprobes: un-blacklist system_call() from kprobes

2017-04-27 Thread Naveen N. Rao
It is actually safe to probe system_call() in entry_64.S, but only until
.Lsyscall_exit. To allow this, convert .Lsyscall_exit to a non-local
symbol __system_call() and blacklist that symbol, rather than
system_call().
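
A user-space sketch of what the blacklist buys us (the names here only
illustrate the idea; the real kprobes core checks address ranges via
within_kprobe_blacklist(), and this register_kprobe() is a stand-in, not
the kernel API): registration fails for a blacklisted symbol.

```c
/* Sketch: probe registration consults a blacklist and refuses
 * to arm a probe on a listed symbol. */
#include <string.h>

static const char *const blacklist[] = {
	"system_call_common",
	"__system_call",	/* was .Lsyscall_exit; a .L local label has
				 * no symbol-table entry, so it had to be
				 * promoted to a real symbol before it
				 * could be blacklisted */
};

static int is_blacklisted(const char *sym)
{
	unsigned int i;

	for (i = 0; i < sizeof(blacklist) / sizeof(blacklist[0]); i++)
		if (strcmp(sym, blacklist[i]) == 0)
			return 1;
	return 0;
}

int register_kprobe(const char *sym)
{
	if (is_blacklisted(sym))
		return -22;	/* -EINVAL, as the real kprobes core returns */
	return 0;
}
```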

Reviewed-by: Masami Hiramatsu 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/entry_64.S | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 380361c0bb6a..e030ce34dd66 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -176,7 +176,7 @@ system_call: /* label this so stack traces look sane */
mtctr   r12
bctrl   /* Call handler */
 
-.Lsyscall_exit:
+__system_call:
std r3,RESULT(r1)
CURRENT_THREAD_INFO(r12, r1)
 
@@ -294,12 +294,12 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
blt+system_call
 
/* Return code is already in r3 thanks to do_syscall_trace_enter() */
-   b   .Lsyscall_exit
+   b   __system_call
 
 
 .Lsyscall_enosys:
li  r3,-ENOSYS
-   b   .Lsyscall_exit
+   b   __system_call

 .Lsyscall_exit_work:
 #ifdef CONFIG_PPC_BOOK3S
@@ -388,7 +388,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
b   .   /* prevent speculative execution */
 #endif
 _ASM_NOKPROBE_SYMBOL(system_call_common);
-_ASM_NOKPROBE_SYMBOL(system_call);
+_ASM_NOKPROBE_SYMBOL(__system_call);
 
 /* Save non-volatile GPRs, if not already saved. */
 _GLOBAL(save_nvgprs)
@@ -413,38 +413,38 @@ _GLOBAL(save_nvgprs)
 _GLOBAL(ppc_fork)
bl  save_nvgprs
bl  sys_fork
-   b   .Lsyscall_exit
+   b   __system_call
 
 _GLOBAL(ppc_vfork)
bl  save_nvgprs
bl  sys_vfork
-   b   .Lsyscall_exit
+   b   __system_call
 
 _GLOBAL(ppc_clone)
bl  save_nvgprs
bl  sys_clone
-   b   .Lsyscall_exit
+   b   __system_call
 
 _GLOBAL(ppc32_swapcontext)
bl  save_nvgprs
bl  compat_sys_swapcontext
-   b   .Lsyscall_exit
+   b   __system_call
 
 _GLOBAL(ppc64_swapcontext)
bl  save_nvgprs
bl  sys_swapcontext
-   b   .Lsyscall_exit
+   b   __system_call
 
 _GLOBAL(ppc_switch_endian)
bl  save_nvgprs
bl  sys_switch_endian
-   b   .Lsyscall_exit
+   b   __system_call
 
 _GLOBAL(ret_from_fork)
bl  schedule_tail
REST_NVGPRS(r1)
li  r3,0
-   b   .Lsyscall_exit
+   b   __system_call
 
 _GLOBAL(ret_from_kernel_thread)
bl  schedule_tail
@@ -456,7 +456,7 @@ _GLOBAL(ret_from_kernel_thread)
 #endif
blrl
li  r3,0
-   b   .Lsyscall_exit
+   b   __system_call
 
 /*
  * This routine switches between two different tasks.  The process
-- 
2.12.2



[PATCH v2 3/3] powerpc/kprobes: blacklist functions invoked on a trap

2017-04-27 Thread Naveen N. Rao
Blacklist all functions involved in handling a trap. We:
- convert some of the labels into private labels,
- remove the duplicate 'restore' label, and
- blacklist most functions involved in handling a trap.

Reviewed-by: Masami Hiramatsu 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/entry_64.S   | 47 +---
 arch/powerpc/kernel/exceptions-64s.S |  1 +
 arch/powerpc/kernel/traps.c  |  3 +++
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index e030ce34dd66..e7e05eb590a5 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -184,7 +184,7 @@ __system_call:
 #ifdef CONFIG_PPC_BOOK3S
/* No MSR:RI on BookE */
andi.   r10,r8,MSR_RI
-   beq-unrecov_restore
+   beq-.Lunrecov_restore
 #endif
/*
 * Disable interrupts so current_thread_info()->flags can't change,
@@ -399,6 +399,7 @@ _GLOBAL(save_nvgprs)
clrrdi  r0,r11,1
std r0,_TRAP(r1)
blr
+_ASM_NOKPROBE_SYMBOL(save_nvgprs);
 

 /*
@@ -642,18 +643,18 @@ _GLOBAL(ret_from_except_lite)
 * Use the internal debug mode bit to do this.
 */
andis.  r0,r3,DBCR0_IDM@h
-   beq restore
+   beq fast_exc_return_irq
mfmsr   r0
rlwinm  r0,r0,0,~MSR_DE /* Clear MSR.DE */
mtmsr   r0
mtspr   SPRN_DBCR0,r3
li  r10, -1
mtspr   SPRN_DBSR,r10
-   b   restore
+   b   fast_exc_return_irq
 #else
addir3,r1,STACK_FRAME_OVERHEAD
bl  restore_math
-   b   restore
+   b   fast_exc_return_irq
 #endif
 1: andi.   r0,r4,_TIF_NEED_RESCHED
beq 2f
@@ -666,7 +667,7 @@ _GLOBAL(ret_from_except_lite)
bne 3f  /* only restore TM if nothing else to do */
addir3,r1,STACK_FRAME_OVERHEAD
bl  restore_tm_state
-   b   restore
+   b   fast_exc_return_irq
 3:
 #endif
bl  save_nvgprs
@@ -718,14 +719,14 @@ resume_kernel:
 #ifdef CONFIG_PREEMPT
/* Check if we need to preempt */
andi.   r0,r4,_TIF_NEED_RESCHED
-   beq+restore
+   beq+fast_exc_return_irq
/* Check that preempt_count() == 0 and interrupts are enabled */
lwz r8,TI_PREEMPT(r9)
cmpwi   cr1,r8,0
ld  r0,SOFTE(r1)
cmpdi   r0,0
crandc  eq,cr1*4+eq,eq
-   bne restore
+   bne fast_exc_return_irq
 
/*
 * Here we are preempting the current task. We want to make
@@ -756,7 +757,6 @@ resume_kernel:
 
.globl  fast_exc_return_irq
 fast_exc_return_irq:
-restore:
/*
 * This is the main kernel exit path. First we check if we
 * are about to re-enable interrupts
@@ -764,11 +764,11 @@ restore:
ld  r5,SOFTE(r1)
lbz r6,PACASOFTIRQEN(r13)
cmpwi   cr0,r5,0
-   beq restore_irq_off
+   beq .Lrestore_irq_off
 
/* We are enabling, were we already enabled ? Yes, just return */
cmpwi   cr0,r6,1
-   beq cr0,do_restore
+   beq cr0,.Ldo_restore
 
/*
 * We are about to soft-enable interrupts (we are hard disabled
@@ -777,14 +777,14 @@ restore:
 */
lbz r0,PACAIRQHAPPENED(r13)
cmpwi   cr0,r0,0
-   bne-restore_check_irq_replay
+   bne-.Lrestore_check_irq_replay
 
/*
 * Get here when nothing happened while soft-disabled, just
 * soft-enable and move-on. We will hard-enable as a side
 * effect of rfi
 */
-restore_no_replay:
+.Lrestore_no_replay:
TRACE_ENABLE_INTS
li  r0,1
stb r0,PACASOFTIRQEN(r13);
@@ -792,7 +792,7 @@ restore_no_replay:
/*
 * Final return path. BookE is handled in a different file
 */
-do_restore:
+.Ldo_restore:
 #ifdef CONFIG_PPC_BOOK3E
b   exception_return_book3e
 #else
@@ -826,7 +826,7 @@ fast_exception_return:
REST_8GPRS(5, r1)
 
andi.   r0,r3,MSR_RI
-   beq-unrecov_restore
+   beq-.Lunrecov_restore
 
/* Load PPR from thread struct before we clear MSR:RI */
 BEGIN_FTR_SECTION
@@ -884,7 +884,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 * make sure that in this case, we also clear PACA_IRQ_HARD_DIS
 * or that bit can get out of sync and bad things will happen
 */
-restore_irq_off:
+.Lrestore_irq_off:
ld  r3,_MSR(r1)
lbz r7,PACAIRQHAPPENED(r13)
andi.   r0,r3,MSR_EE
@@ -894,13 +894,13 @@ restore_irq_off:
 1: li  r0,0
stb r0,PACASOFTIRQEN(r13);
TRACE_DISABLE_INTS
-   b   do_restore
+   b   .Ldo_restore
 
/*
 * Something did happen, check if a re-emit is needed
 * (this also clears paca->irq_happened)
 */
-restore_check_irq_replay:
+.Lre

[PATCH v2 1/3] powerpc/kprobes: cleanup system_call_common and blacklist it from kprobes

2017-04-27 Thread Naveen N. Rao
Convert some of the labels into private labels and blacklist
system_call_common() and system_call() from kprobes. We can't take a
trap at parts of these functions as either MSR_RI is unset or the
kernel stack pointer is not yet set up.

Reviewed-by: Masami Hiramatsu 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/entry_64.S | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 9b541d22595a..380361c0bb6a 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -52,12 +52,11 @@ exception_marker:
.section".text"
.align 7
 
-   .globl system_call_common
-system_call_common:
+_GLOBAL(system_call_common)
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 BEGIN_FTR_SECTION
extrdi. r10, r12, 1, (63-MSR_TS_T_LG) /* transaction active? */
-   bne tabort_syscall
+   bne .Ltabort_syscall
 END_FTR_SECTION_IFSET(CPU_FTR_TM)
 #endif
andi.   r10,r12,MSR_PR
@@ -152,9 +151,9 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
CURRENT_THREAD_INFO(r11, r1)
ld  r10,TI_FLAGS(r11)
andi.   r11,r10,_TIF_SYSCALL_DOTRACE
-   bne syscall_dotrace /* does not return */
+   bne .Lsyscall_dotrace   /* does not return */
cmpldi  0,r0,NR_syscalls
-   bge-syscall_enosys
+   bge-.Lsyscall_enosys
 
 system_call:   /* label this so stack traces look sane */
 /*
@@ -208,7 +207,7 @@ system_call: /* label this so stack traces look sane */
ld  r9,TI_FLAGS(r12)
li  r11,-MAX_ERRNO
	andi.	r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
-   bne-syscall_exit_work
+   bne-.Lsyscall_exit_work
 
andi.   r0,r8,MSR_FP
beq 2f
@@ -232,7 +231,7 @@ system_call: /* label this so stack traces look sane */
 
 3: cmpld   r3,r11
ld  r5,_CCR(r1)
-   bge-syscall_error
+   bge-.Lsyscall_error
 .Lsyscall_error_cont:
ld  r7,_NIP(r1)
 BEGIN_FTR_SECTION
@@ -258,14 +257,14 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
RFI
b   .   /* prevent speculative execution */
 
-syscall_error: 
+.Lsyscall_error:
orisr5,r5,0x1000/* Set SO bit in CR */
neg r3,r3
std r5,_CCR(r1)
b   .Lsyscall_error_cont

 /* Traced system call support */
-syscall_dotrace:
+.Lsyscall_dotrace:
bl  save_nvgprs
addir3,r1,STACK_FRAME_OVERHEAD
bl  do_syscall_trace_enter
@@ -298,11 +297,11 @@ syscall_dotrace:
b   .Lsyscall_exit
 
 
-syscall_enosys:
+.Lsyscall_enosys:
li  r3,-ENOSYS
b   .Lsyscall_exit

-syscall_exit_work:
+.Lsyscall_exit_work:
 #ifdef CONFIG_PPC_BOOK3S
li  r10,MSR_RI
mtmsrd  r10,1   /* Restore RI */
@@ -362,7 +361,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
b   ret_from_except
 
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-tabort_syscall:
+.Ltabort_syscall:
/* Firstly we need to enable TM in the kernel */
mfmsr   r10
li  r9, 1
@@ -388,6 +387,8 @@ tabort_syscall:
rfid
b   .   /* prevent speculative execution */
 #endif
+_ASM_NOKPROBE_SYMBOL(system_call_common);
+_ASM_NOKPROBE_SYMBOL(system_call);
 
 /* Save non-volatile GPRs, if not already saved. */
 _GLOBAL(save_nvgprs)
-- 
2.12.2



[PATCH v2 0/3] powerpc: build out kprobes blacklist

2017-04-27 Thread Naveen N. Rao
v2 changes:
- Patches 3 and 4 from the previous series have been merged.
- Updated to no longer blacklist functions involved with stolen time
  accounting.

v1:
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg117514.html
--
This is the second in the series of patches to build out an appropriate
kprobes blacklist. This series blacklists system_call() and functions
involved when handling the trap itself. Not everything is covered, but
this is the first set of functions that I have tested with. More
patches to follow once I expand my tests.

I have converted many labels into private ones -- these are labels that I
felt are not necessary for reading stack traces. If any of those are
important to keep, please let me know.

- Naveen

Naveen N. Rao (3):
  powerpc/kprobes: cleanup system_call_common and blacklist it from
kprobes
  powerpc/kprobes: un-blacklist system_call() from kprobes
  powerpc/kprobes: blacklist functions invoked on a trap

 arch/powerpc/kernel/entry_64.S   | 94 +++-
 arch/powerpc/kernel/exceptions-64s.S |  1 +
 arch/powerpc/kernel/traps.c  |  3 ++
 3 files changed, 55 insertions(+), 43 deletions(-)

-- 
2.12.2



Re: [PATCH v4 2/3] powerpc/fadump: Use the correct VMCOREINFO_NOTE_SIZE for phdr

2017-04-27 Thread Mahesh Jagannath Salgaonkar
On 04/26/2017 12:41 PM, Dave Young wrote:
> Ccing ppc list
> On 04/20/17 at 07:39pm, Xunlei Pang wrote:
>> vmcoreinfo_max_size stands for the vmcoreinfo_data, the
>> correct one we should use is vmcoreinfo_note whose total
>> size is VMCOREINFO_NOTE_SIZE.
>>
>> Like explained in commit 77019967f06b ("kdump: fix exported
>> size of vmcoreinfo note"), it should not affect the actual
>> function, but we better fix it, also this change should be
>> safe and backward compatible.
>>
>> After this, we can get rid of variable vmcoreinfo_max_size,
>> let's use the corresponding macros directly, fewer variables
>> means more safety for vmcoreinfo operation.
>>
>> Cc: Mahesh Salgaonkar 
>> Cc: Hari Bathini 
>> Signed-off-by: Xunlei Pang 

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.

>> ---
>> v3->v4:
>> -Rebased on the latest linux-next
>>
>>  arch/powerpc/kernel/fadump.c | 3 +--
>>  include/linux/crash_core.h   | 1 -
>>  kernel/crash_core.c  | 3 +--
>>  3 files changed, 2 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>> index 466569e..7bd6cd0 100644
>> --- a/arch/powerpc/kernel/fadump.c
>> +++ b/arch/powerpc/kernel/fadump.c
>> @@ -893,8 +893,7 @@ static int fadump_create_elfcore_headers(char *bufp)
>>  
>>  phdr->p_paddr   = fadump_relocate(paddr_vmcoreinfo_note());
>>  phdr->p_offset  = phdr->p_paddr;
>> -phdr->p_memsz   = vmcoreinfo_max_size;
>> -phdr->p_filesz  = vmcoreinfo_max_size;
>> +phdr->p_memsz   = phdr->p_filesz = VMCOREINFO_NOTE_SIZE;
>>  
>>  /* Increment number of program headers. */
>>  (elf->e_phnum)++;
>> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
>> index ba283a2..7d6bc7b 100644
>> --- a/include/linux/crash_core.h
>> +++ b/include/linux/crash_core.h
>> @@ -55,7 +55,6 @@
>>  
>>  extern u32 *vmcoreinfo_note;
>>  extern size_t vmcoreinfo_size;
>> -extern size_t vmcoreinfo_max_size;
>>  
>>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>>void *data, size_t data_len);
>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>> index 0321f04..43cdb00 100644
>> --- a/kernel/crash_core.c
>> +++ b/kernel/crash_core.c
>> @@ -16,7 +16,6 @@
>>  /* vmcoreinfo stuff */
>>  static unsigned char *vmcoreinfo_data;
>>  size_t vmcoreinfo_size;
>> -size_t vmcoreinfo_max_size = VMCOREINFO_BYTES;
>>  u32 *vmcoreinfo_note;
>>  
>>  /*
>> @@ -343,7 +342,7 @@ void vmcoreinfo_append_str(const char *fmt, ...)
>>  r = vscnprintf(buf, sizeof(buf), fmt, args);
>>  va_end(args);
>>  
>> -r = min(r, vmcoreinfo_max_size - vmcoreinfo_size);
>> +r = min(r, VMCOREINFO_BYTES - vmcoreinfo_size);
>>  
>>  memcpy(&vmcoreinfo_data[vmcoreinfo_size], buf, r);
>>  
>> -- 
>> 1.8.3.1
>>
>>
>> ___
>> kexec mailing list
>> ke...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec
> 
> Reviewed-by: Dave Young 
> 
> Thanks
> Dave
>