Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls

2017-02-05 Thread Sachin Sant

>>> I've seen it on tip. It looks like hot unplug goes really slow when
>>> there's running tasks on the CPU being taken down.
>>> 
>>> What I did was something like:
>>> 
>>>  taskset -p $((1<<1)) $$
>>>  for ((i=0; i<20; i++)) do while :; do :; done & done
>>> 
>>>  taskset -p $((1<<0)) $$
>>>  echo 0 > /sys/devices/system/cpu/cpu1/online
>>> 
>>> And with those 20 tasks stuck sucking cycles on CPU1, the unplug goes
>>> _really_ slow and the RCU stall triggers. What I suspect happens is that
>>> hotplug stops participating in the RCU state machine early, but only
>>> tells RCU about it really late, and in between it gets suspicious it
>>> takes too long.
>>> 
>>> I've yet to dig through the RCU code to figure out the exact sequence of
>>> events, but found the above to be fairly reliable in triggering the
>>> issue.
> 
>> If you send me the full splat from the dmesg and the RCU portions of
>> .config, I will take a look.  Is this new behavior, or a new test?
> 

I have sent the required files to you via separate email.

> If new behavior, I would be most suspicious of these commits in -rcu which
> recently entered -tip:
> 
> 19e4d983cda1 rcu: Place guard on rcu_all_qs() and rcu_note_context_switch() actions
> 913324b1364f rcu: Eliminate flavor scan in rcu_momentary_dyntick_idle()
> fcdcfefafa45 rcu: Pull rcu_qs_ctr into rcu_dynticks structure
> 0919a0b7e7a5 rcu: Pull rcu_sched_qs_mask into rcu_dynticks structure
> caa7c8e34293 rcu: Make rcu_note_context_switch() do deferred NOCB wakeups
> 41e4b159d516 rcu: Make rcu_all_qs() do deferred NOCB wakeups
> b457a3356a68 rcu: Make call_rcu() do deferred NOCB wakeups
> 
> Does reverting any of these help?

I tried reverting the above commits. That does not help. I can still recreate 
the issue.

Thanks
-Sachin


RE: [PATCH v3 2/2] cpufreq: qoriq: Don't look at clock implementation details

2017-02-05 Thread Y.T. Tang
> -----Original Message-----
> From: Leo Li [mailto:pku@gmail.com]
> Sent: Friday, February 03, 2017 2:12 AM
> To: Y.T. Tang
> Cc: Scott Wood; Michael Turquette; Russell King; Stephen Boyd;
> Viresh Kumar; Rafael J. Wysocki; linux-c...@vger.kernel.org;
> linux...@vger.kernel.org; linuxppc-d...@lists.ozlabs.org; Leo Li;
> X.F. Ren
> Subject: Re: [PATCH v3 2/2] cpufreq: qoriq: Don't look at clock
> implementation details
> 
> On Tue, Jul 19, 2016 at 10:02 PM, Yuantian Tang wrote:
> >
> > PING.
> >
> > Regards,
> > Yuantian
> >
> > > -----Original Message-----
> > > From: Scott Wood [mailto:o...@buserror.net]
> > > Sent: Saturday, July 09, 2016 5:07 AM
> > > To: Michael Turquette; Russell King; Stephen Boyd; Viresh Kumar;
> > > Rafael J. Wysocki
> > > Cc: linux-...@vger.kernel.org; linux...@vger.kernel.org;
> > > linuxppc-d...@lists.ozlabs.org; Yuantian Tang; Yang-Leo Li;
> > > Xiaofeng Ren
> > > Subject: Re: [PATCH v3 2/2] cpufreq: qoriq: Don't look at clock
> > > implementation details
> > >
> > > On Thu, 2016-07-07 at 19:26 -0700, Michael Turquette wrote:
> > > > Quoting Scott Wood (2016-07-06 21:13:23)
> > > > >
> > > > > On Wed, 2016-07-06 at 18:30 -0700, Michael Turquette wrote:
> > > > > >
> > > > > > Quoting Scott Wood (2016-06-15 23:21:25)
> > > > > > >
> > > > > > >
> > > > > > > -static struct device_node *cpu_to_clk_node(int cpu)
> > > > > > > +static struct clk *cpu_to_clk(int cpu)
> > > > > > >  {
> > > > > > > -   struct device_node *np, *clk_np;
> > > > > > > +   struct device_node *np;
> > > > > > > +   struct clk *clk;
> > > > > > >
> > > > > > > >     if (!cpu_present(cpu))
> > > > > > > >             return NULL;
> > > > > > > > @@ -112,37 +80,28 @@ static struct device_node *cpu_to_clk_node(int cpu)
> > > > > > > >     if (!np)
> > > > > > > >             return NULL;
> > > > > > >
> > > > > > > -   clk_np = of_parse_phandle(np, "clocks", 0);
> > > > > > > -   if (!clk_np)
> > > > > > > -   return NULL;
> > > > > > > -
> > > > > > > +   clk = of_clk_get(np, 0);
> > > > > > Why not use devm_clk_get here?
> > > > > devm_clk_get() is a wrapper around clk_get() which is not the
> > > > > same as of_clk_get().  What device would you pass to
> > > > > devm_clk_get(), and what name would you pass?
> > > > I'm fuzzy on whether or not you get a struct device from a cpufreq
> > > > driver. If so, then that would be the one to use. I would hope
> > > > that cpufreq drivers model cpus as devices, but I'm really not
> > > > sure without looking into the code.
> > >
> > > It's not the cpufreq code that provides it, but get_cpu_device()
> > > could be used.
> > >
> > > Do you have any comments on the first patch of this set?
> 
> 
> Any action on this patch?  This patch is still a dependency for cpufreq to
> work on all QorIQ platforms.
> 
This patch can be accepted only once the attached patch is accepted.
Unfortunately, the attached patch was sent a long time ago and has received
no feedback.

Regards,
Yuantian

> Regards,
> Leo
--- Begin Message ---
From: Scott Wood 

Commit fc4a05d4b0eb ("clk: Remove unused provider APIs") removed
__clk_get_num_parents() and clk_hw_get_parent_by_index(), leaving only
true provider API versions that operate on struct clk_hw.

qoriq-cpufreq needs these functions in order to determine the options
it has for calling clk_set_parent() and thus populate the cpufreq
table, so revive them as legitimate consumer APIs.

Signed-off-by: Scott Wood 
---
Previously sent as http://patchwork.ozlabs.org/patch/519803/

Russell, could you please either ACK this or comment, as CLK API
maintainer?

 drivers/clk/clk.c   | 19 +++
 include/linux/clk.h | 31 +++
 2 files changed, 50 insertions(+)

diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
index d584004..d61a3fe 100644
--- a/drivers/clk/clk.c
+++ b/drivers/clk/clk.c
@@ -290,6 +290,12 @@ struct clk_hw *__clk_get_hw(struct clk *clk)
 }
 EXPORT_SYMBOL_GPL(__clk_get_hw);

+unsigned int clk_get_num_parents(struct clk *clk)
+{
+   return !clk ? 0 : clk->core->num_parents;
+}
+EXPORT_SYMBOL_GPL(clk_get_num_parents);
+
 unsigned int clk_hw_get_num_parents(const struct clk_hw *hw)
 {
return hw->core->num_parents;
@@ -358,6 +364,19 @@ static struct clk_core *clk_core_get_parent_by_index(struct clk_core *core,
return core->parents[index];
 }

+struct clk *clk_get_parent_by_index(struct clk *clk, unsigned int index)
+{
+   struct clk_core *parent;
+
+   if (!clk)
+   return NULL;
+
+   parent = clk_core_get_parent_by_index(clk->core, index);
+
+   return !parent ? NULL : parent->hw->clk;
+}
+EXPORT_SYMBOL_GPL(clk_get_parent_by_index);
+
 struct clk_hw *
 clk_hw_get_parent_by_index(const struct clk_hw *hw, unsigned int index)
 {
diff --git a/include/linux/clk.h b/include/linux/clk.h
index 0df4a51..937de0e 100644
--- a/include/linux/cl
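
[Archive note] The commit message above says qoriq-cpufreq wants to walk a
clock's possible parents to populate its frequency table. A minimal userspace
sketch of that pattern follows. It is not kernel code: struct mock_clk and
the mock_* helpers are hypothetical stand-ins that only mirror the NULL-safe
behaviour of the two accessors added in the diff, plus an index bounds check
that is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct clk; not the kernel type. */
struct mock_clk {
    unsigned long rate;          /* rate in Hz */
    unsigned int num_parents;    /* number of selectable parents */
    struct mock_clk **parents;   /* parent candidates, may hold NULLs */
};

/* Mirrors the NULL-safe shape of clk_get_num_parents() in the patch. */
static unsigned int mock_clk_get_num_parents(struct mock_clk *clk)
{
    return !clk ? 0 : clk->num_parents;
}

/* Mirrors clk_get_parent_by_index(); the bounds check is an assumption. */
static struct mock_clk *mock_clk_get_parent_by_index(struct mock_clk *clk,
                                                     unsigned int index)
{
    if (!clk || index >= clk->num_parents)
        return NULL;
    return clk->parents[index];
}

/* Collect the rate of every available parent; returns entries written. */
static unsigned int fill_freq_table(struct mock_clk *clk,
                                    unsigned long *table, unsigned int max)
{
    unsigned int i, n = 0;
    unsigned int count = mock_clk_get_num_parents(clk);

    for (i = 0; i < count && n < max; i++) {
        struct mock_clk *parent = mock_clk_get_parent_by_index(clk, i);

        if (parent)
            table[n++] = parent->rate;
    }
    return n;
}
```

With a mux that has two parents, fill_freq_table() writes two rates; with a
NULL clock it writes none, which is the reason for the NULL checks in the
accessors.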

Re: Timekeeping oddities on MacMini G4s

2017-02-05 Thread Hal Murray
> Tell me what to look for, and I’ll take as many hi-res pictures as you want.

I'm looking for the frequency printed on the oscillator/crystal.

Here is a picture with several examples:
  https://en.wikipedia.org/wiki/File:Crystal_Packages.jpg

The row of Oscillators is most likely.  They also come in plastic packages.  
You will probably be able to see that they have 2 or 4 connections.  They 
will probably be quite a bit thicker than normal surface mount plastic 
packages.

Likely numbers are 166, 166.6, or any integer submultiple of those.  41.5 or
41.65 is a good possibility.

-- 
These are my opinions.  I hate spam.
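
[Archive note] The candidate numbers above are consistent with a timebase
running at one quarter of the bus clock, as mentioned elsewhere in this
thread: 166 / 4 = 41.5 and 166.6 / 4 = 41.65. A small sketch of that
arithmetic, assuming rates are given in kHz and a fixed integer divisor;
the helper names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: timebase rate in kHz for a bus clock and divisor. */
static uint32_t timebase_khz(uint32_t bus_khz, uint32_t divisor)
{
    return divisor ? bus_khz / divisor : 0;
}

/*
 * Convert a timebase tick count to nanoseconds at a given rate in kHz.
 * ns = ticks * 1e6 / kHz. Overflow for very large tick counts is not
 * handled in this sketch.
 */
static uint64_t ticks_to_ns(uint64_t ticks, uint32_t tb_khz)
{
    return tb_khz ? ticks * 1000000ull / tb_khz : 0;
}
```

For example, timebase_khz(166600, 4) gives 41650, and 41500 ticks at
41500 kHz is one millisecond.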





[PATCH] powerpc/opal-irqchip: Use interrupt names if present

2017-02-05 Thread Benjamin Herrenschmidt
Recent versions of OPAL can provide names for the various OPAL interrupts,
so let's use them. This also modernises the code that fetches the
interrupt array to use the helpers provided by the generic code instead
of hand-parsing the property.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/platforms/powernv/opal-irqchip.c | 45 ---
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-irqchip.c b/arch/powerpc/platforms/powernv/opal-irqchip.c
index 998316b..fe9b029 100644
--- a/arch/powerpc/platforms/powernv/opal-irqchip.c
+++ b/arch/powerpc/platforms/powernv/opal-irqchip.c
@@ -183,8 +183,9 @@ void opal_event_shutdown(void)
 int __init opal_event_init(void)
 {
struct device_node *dn, *opal_node;
-   const __be32 *irqs;
-   int i, irqlen, rc = 0;
+   const char **names;
+   u32 *irqs;
+   int i, rc = 0;
 
opal_node = of_find_node_by_path("/ibm,opal");
if (!opal_node) {
@@ -209,37 +210,57 @@ int __init opal_event_init(void)
goto out;
}
 
-   /* Get interrupt property */
-   irqs = of_get_property(opal_node, "opal-interrupts", &irqlen);
-   opal_irq_count = irqs ? (irqlen / 4) : 0;
+   /* Get opal-interrupts property and names if present */
+   rc = of_property_count_u32_elems(opal_node, "opal-interrupts");
+   if (rc < 0)
+   goto out;
+   opal_irq_count = rc;
pr_debug("Found %d interrupts reserved for OPAL\n", opal_irq_count);
+   irqs = kzalloc(rc * sizeof(u32), GFP_KERNEL);
+   if (WARN_ON(!irqs))
+   goto out;
+   rc = of_property_read_u32_array(opal_node, "opal-interrupts",
+   irqs, opal_irq_count);
+   if (rc < 0) {
+   pr_err("Error %d reading opal-interrupts array\n", rc);
+   goto out;
+   }
+   names = kzalloc(opal_irq_count * sizeof(char *), GFP_KERNEL);
+   of_property_read_string_array(opal_node, "opal-interrupts-names",
+ names, opal_irq_count);
 
/* Install interrupt handlers */
opal_irqs = kcalloc(opal_irq_count, sizeof(*opal_irqs), GFP_KERNEL);
-   for (i = 0; irqs && i < opal_irq_count; i++, irqs++) {
-   unsigned int irq, virq;
+   for (i = 0; i < opal_irq_count; i++) {
+   unsigned int virq;
+   char *name;
 
/* Get hardware and virtual IRQ */
-   irq = be32_to_cpup(irqs);
-   virq = irq_create_mapping(NULL, irq);
+   virq = irq_create_mapping(NULL, irqs[i]);
if (!virq) {
-   pr_warn("Failed to map irq 0x%x\n", irq);
+   pr_warn("Failed to map irq 0x%x\n", irqs[i]);
continue;
}
+   if (names && names[i] && strlen(names[i]))
+   name = kasprintf(GFP_KERNEL, "opal-%s", names[i]);
+   else
+   name = kasprintf(GFP_KERNEL, "opal");
 
/* Install interrupt handler */
rc = request_irq(virq, opal_interrupt, IRQF_TRIGGER_LOW,
-"opal", NULL);
+name, NULL);
if (rc) {
irq_dispose_mapping(virq);
pr_warn("Error %d requesting irq %d (0x%x)\n",
-rc, virq, irq);
+rc, virq, irqs[i]);
continue;
}
 
/* Cache IRQ */
opal_irqs[i] = virq;
}
+   kfree(irqs);
+   kfree(names);
 
 out:
of_node_put(opal_node);
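
[Archive note] The naming rule in the loop above (use "opal-<name>" when a
non-empty name is present, otherwise fall back to plain "opal") can be
sketched in userspace as follows. This is an illustration only: it uses
snprintf() into a caller-supplied buffer instead of kasprintf(), and
opal_irq_name() is a hypothetical helper, not something in the patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Pick the interrupt name for entry i. names may be NULL, and individual
 * entries may be NULL or empty; all of those fall back to "opal".
 * Returns the snprintf() result.
 */
static int opal_irq_name(char *buf, size_t len,
                         const char *const *names, size_t i)
{
    if (names && names[i] && names[i][0] != '\0')
        return snprintf(buf, len, "opal-%s", names[i]);
    return snprintf(buf, len, "opal");
}
```

Given a names array such as { "psi", "", NULL } (the entry "psi" is an
invented example), index 0 yields "opal-psi" and the other two yield "opal".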



Re: [PATCH] powerpc/mm: Fix typo in set_pte_at()

2017-02-05 Thread Gavin Shan
On Mon, Feb 06, 2017 at 08:03:57AM +0530, Aneesh Kumar K.V wrote:
>Gavin Shan  writes:
>
>> This fixes the typo about the _PAGE_PTE in set_pte_at() by changing
>> "tryint" to "trying to".
>>
>> Fixes: 6a119eae942 ("powerpc/mm: Add a _PAGE_PTE bit")
>
>I guess this is not needed. We add that when we want to hint whether the
>patch needs backporting. 
>

Thanks for the review. I used the tag to indicate the commit introducing
the typo. For this trivial patch, we won't backport it to stable or
distro kernels. If you want, I can drop the tag, or Michael can drop it
when merging.

Thanks,
Gavin

>
>> Signed-off-by: Gavin Shan 
>> ---
>>  arch/powerpc/mm/pgtable.c | 4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
>> index cb39c8b..a03ff3d 100644
>> --- a/arch/powerpc/mm/pgtable.c
>> +++ b/arch/powerpc/mm/pgtable.c
>> @@ -193,9 +193,7 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
>>   */
>>  VM_WARN_ON(pte_present(*ptep) && !pte_protnone(*ptep));
>>
>> -/*
>> - * Add the pte bit when tryint set a pte
>> - */
>> +/* Add the pte bit when trying to set a pte */
>>  pte = __pte(pte_val(pte) | _PAGE_PTE);
>>
>>  /* Note: mm->context.id might not yet have been assigned as
>> -- 
>> 2.7.4



[PATCH 2/2] powerpc/powernv: Add more BPF in defconfig

2017-02-05 Thread Michael Neuling
This enables BCC (https://github.com/iovisor/bcc) on powernv.

This adds 225KB to the vmlinux size.

Signed-off-by: Michael Neuling 
---
 arch/powerpc/configs/powernv_defconfig | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/configs/powernv_defconfig b/arch/powerpc/configs/powernv_defconfig
index 7028dbc753..911b43e2c7 100644
--- a/arch/powerpc/configs/powernv_defconfig
+++ b/arch/powerpc/configs/powernv_defconfig
@@ -29,6 +29,7 @@ CONFIG_CGROUP_CPUACCT=y
 CONFIG_CGROUP_PERF=y
 CONFIG_USER_NS=y
 CONFIG_BLK_DEV_INITRD=y
+CONFIG_BPF_SYSCALL=y
 # CONFIG_COMPAT_BRK is not set
 CONFIG_PROFILING=y
 CONFIG_OPROFILE=y
@@ -79,6 +80,10 @@ CONFIG_NETFILTER=y
 # CONFIG_NETFILTER_ADVANCED is not set
 CONFIG_BRIDGE=m
 CONFIG_VLAN_8021Q=m
+CONFIG_NET_SCHED=y
+CONFIG_NET_CLS_BPF=y
+CONFIG_NET_CLS_ACT=y
+CONFIG_NET_ACT_BPF=y
 CONFIG_BPF_JIT=y
 CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
 CONFIG_DEVTMPFS=y
-- 
2.9.3



[PATCH 1/2] powerpc/powernv: Add XHCI and USB storage to defconfig

2017-02-05 Thread Michael Neuling
These are common on bare metal machines, so put them in the defconfig.

This adds 216KB to the vmlinux size.

Signed-off-by: Michael Neuling 
---
 arch/powerpc/configs/powernv_defconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/configs/powernv_defconfig b/arch/powerpc/configs/powernv_defconfig
index b793550fac..7028dbc753 100644
--- a/arch/powerpc/configs/powernv_defconfig
+++ b/arch/powerpc/configs/powernv_defconfig
@@ -214,10 +214,11 @@ CONFIG_HID_SUNPLUS=y
 CONFIG_USB_HIDDEV=y
 CONFIG_USB=y
 CONFIG_USB_MON=m
+CONFIG_USB_XHCI_HCD=y
 CONFIG_USB_EHCI_HCD=y
 # CONFIG_USB_EHCI_HCD_PPC_OF is not set
 CONFIG_USB_OHCI_HCD=y
-CONFIG_USB_STORAGE=m
+CONFIG_USB_STORAGE=y
 CONFIG_NEW_LEDS=y
 CONFIG_LEDS_CLASS=m
 CONFIG_LEDS_POWERNV=m
-- 
2.9.3



Re: [PATCH] powerpc/mm: fix a hardcode on memory boundary checking

2017-02-05 Thread Rui Teng

On 31/01/2017 5:11 PM, Michael Ellerman wrote:

Rui Teng  writes:


The offset of the hugepage block will not be 16G if more than one
page is expected. Calculate the total size instead of using the
hardcoded value.


I assume you found this by code inspection and not by triggering an
actual bug?


Yes, I found this problem only by code inspection. We were looking for
ways to enable 16G huge pages other than changing the device tree, for
example by providing a new interface to set the size and page-count
parameters.

So I think this check may cause a problem there.



cheers


diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 8033493..b829f8e 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -506,7 +506,7 @@ static int __init htab_dt_scan_hugepage_blocks(unsigned long node,
printk(KERN_INFO "Huge page(16GB) memory: "
"addr = 0x%lX size = 0x%lX pages = %d\n",
phys_addr, block_size, expected_pages);
-   if (phys_addr + (16 * GB) <= memblock_end_of_DRAM()) {
+   if (phys_addr + block_size * expected_pages <= memblock_end_of_DRAM()) {
memblock_reserve(phys_addr, block_size * expected_pages);
add_gpage(phys_addr, block_size, expected_pages);
}
--
2.9.0
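
[Archive note] The difference between the old and new boundary checks shows
up only when expected_pages is greater than one. A userspace sketch of both
conditions, assuming 64-bit arithmetic; GB is defined locally here and the
function names are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define GB (1024ull * 1024 * 1024)

/* Old check: assumes the block is exactly one 16GB page. */
static bool fits_hardcoded(uint64_t phys_addr, uint64_t end_of_dram)
{
    return phys_addr + 16 * GB <= end_of_dram;
}

/* New check: uses the total size that is actually reserved. */
static bool fits_total(uint64_t phys_addr, uint64_t block_size,
                       uint64_t expected_pages, uint64_t end_of_dram)
{
    return phys_addr + block_size * expected_pages <= end_of_dram;
}
```

With a block of two 16GB pages starting at 32GB and DRAM ending at 56GB, the
hardcoded check passes (48GB <= 56GB) while the total-size check correctly
fails (64GB > 56GB).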






Re: [PATCH] powerpc/mm: Fix typo in set_pte_at()

2017-02-05 Thread Aneesh Kumar K.V
Gavin Shan  writes:

> This fixes the typo about the _PAGE_PTE in set_pte_at() by changing
> "tryint" to "trying to".
>
> Fixes: 6a119eae942 ("powerpc/mm: Add a _PAGE_PTE bit")

I guess this is not needed. We add that when we want to hint whether the
patch needs backporting. 


> Signed-off-by: Gavin Shan 
> ---
>  arch/powerpc/mm/pgtable.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index cb39c8b..a03ff3d 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -193,9 +193,7 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
>*/
>   VM_WARN_ON(pte_present(*ptep) && !pte_protnone(*ptep));
>
> - /*
> -  * Add the pte bit when tryint set a pte
> -  */
> + /* Add the pte bit when trying to set a pte */
>   pte = __pte(pte_val(pte) | _PAGE_PTE);
>
>   /* Note: mm->context.id might not yet have been assigned as
> -- 
> 2.7.4



Re: Timekeeping oddities on MacMini G4s

2017-02-05 Thread Segher Boessenkool
On Mon, Feb 06, 2017 at 10:22:01AM +1100, Benjamin Herrenschmidt wrote:
> >   On the plus side, that means that the values are guaranteed not
> > to be core-specific.  On the minus side, it means that its count rate is
> > lower, and it's sufficiently "distant" that accessing it is somewhat more
> > expensive.
> 
> Right so there are various configuration options and ways to feed the timebase
> to PowerPC chips depending on the generation and manufacturer. On the old
> 32-bit chips, typically it was either a divisor of the bus frequency or
> externally clocked. Apple typically used the latter.

On all 6xx and most 7xx/7xxx it is 1:4 of the bus clock.  And on the
newer machines the clock chip uses clock spreading.  So you then cannot
calibrate with a dumb fast routine (the time base ticks pretty slowly
anyhow, so you cannot calibrate quickly if you want decent results; but
with clock spreading you either have to measure for many seconds, or you
need to find the period of the spreading and work with that).

> > The PowerPC architecture permits the timebase frequency to be variable,
> > but I'm not aware of any implementations that take advantage of that.
> 
> I think it's pretty much accepted that this would be a very bad idea
> and no implementation did it.

See above.

> >   The
> > Motorola 32-bit implementations in general run it on the "bus clock",
> > which is independent of processor-clock multipliers, and is also common
> > across processor chips in systems with more than one.
> 
> There's also a TBEN external pin iirc which can be used to feed it.

Some implementations have an MSR bit to stop the TB as well (7450 for
example).


Segher


[PATCH] powerpc/mm/radix: Update ERAT flushes when invalidating TLB

2017-02-05 Thread Benjamin Herrenschmidt
Three tiny changes to the ERAT flushing logic: First don't make
it depend on DD1. It hasn't been decided yet but we might run
DD2 in a mode that also requires explicit flushes for performance
reasons so make it unconditional. We also add a missing isync, and
finally remove the flush from _tlbiel_va as it is only necessary
for congruence-class invalidations (PID, LPID and full TLB), not
targeted invalidations.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/mm/tlb-radix.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 61b7911..e0c8162 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -50,9 +50,7 @@ static inline void _tlbiel_pid(unsigned long pid, unsigned long ric)
for (set = 0; set < POWER9_TLB_SETS_RADIX ; set++) {
__tlbiel_pid(pid, set, ric);
}
-   if (cpu_has_feature(CPU_FTR_POWER9_DD1))
-   asm volatile(PPC_INVALIDATE_ERAT : : :"memory");
-   return;
+   asm volatile(PPC_INVALIDATE_ERAT ";isync" : : :"memory");
 }
 
 static inline void _tlbie_pid(unsigned long pid, unsigned long ric)
@@ -85,8 +83,6 @@ static inline void _tlbiel_va(unsigned long va, unsigned long pid,
asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
 : : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : "memory");
asm volatile("ptesync": : :"memory");
-   if (cpu_has_feature(CPU_FTR_POWER9_DD1))
-   asm volatile(PPC_INVALIDATE_ERAT : : :"memory");
 }
 
 static inline void _tlbie_va(unsigned long va, unsigned long pid,




[PATCH] powerpc: Fix holes in DD1 TLB workarounds

2017-02-05 Thread Benjamin Herrenschmidt
This patch gets rid of the TLB multihits observed in the lab.

Sadly it does disable whatever remaining optimizations we had for
TLB invalidations. It fixes 2 problems:

 - We do sadly need to invalidate in radix__pte_update() even when
the new pte is clear because what might happen otherwise is that we
clear a bunch of PTEs, we drop the PTL, then before we get to do
the flush_tlb_mm(), another thread puts/faults some new things in.
It's rather unlikely and probably requires funky mappings blown by
unmap_mapping_range() (otherwise we probably are protected by the
mmap sem) but possible.

 - In some rare cases we call set_pte_at() on top of a protnone PTE
which is valid, and thus we need to apply the workaround.

Now, I'm working on ways to restore batching by instead coping with
the multi-hits after the fact, but this hasn't yet been proven solid
so this will have to do in the meantime.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/include/asm/book3s/64/radix.h | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index b4d1302..b17d4a1 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -149,7 +149,7 @@ static inline unsigned long radix__pte_update(struct mm_struct *mm,
 * the below sequence and batch the tlb flush. The
 * tlb flush batching is done by mmu gather code
 */
-   if (new_pte) {
+   if (1 || new_pte) {
asm volatile("ptesync" : : : "memory");
radix__flush_tlb_pte_p9_dd1(old_pte, mm, addr);
__radix_pte_update(ptep, 0, new_pte);
@@ -179,7 +179,7 @@ static inline void radix__ptep_set_access_flags(struct mm_struct *mm,
 
unsigned long old_pte, new_pte;
 
-   old_pte = __radix_pte_update(ptep, ~0, 0);
+   old_pte = __radix_pte_update(ptep, ~0ul, 0);
asm volatile("ptesync" : : : "memory");
/*
 * new value of pte
@@ -202,9 +202,18 @@ static inline int radix__pte_none(pte_t pte)
return (pte_val(pte) & ~RADIX_PTE_NONE_MASK) == 0;
 }
 
+static inline int __pte_present(pte_t pte)
+{
+   return !!(pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT));
+}
 static inline void radix__set_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t *ptep, pte_t pte, int percpu)
 {
+   if (__pte_present(*ptep)) {
+   unsigned long old_pte = __radix_pte_update(ptep, ~0ul, 0);
+asm volatile("ptesync" : : : "memory");
+   radix__flush_tlb_pte_p9_dd1(old_pte, mm, addr);
+   }
*ptep = pte;
asm volatile("ptesync" : : : "memory");
 }
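
[Archive note] One hunk above changes the mask argument from ~0 to ~0ul.
The sketch below only shows how standard C converts three spellings of
"all bits set" to unsigned long. It makes no claim about the author's
motivation for the change. The function names are invented.

```c
/* int -1 converted to unsigned long: every bit set. */
static unsigned long mask_from_int(void)
{
    return ~0;
}

/* unsigned int, zero-extended: only the low bits set where long is wider. */
static unsigned long mask_from_uint(void)
{
    return ~0u;
}

/* Already unsigned long: every bit set. */
static unsigned long mask_from_ulong(void)
{
    return ~0ul;
}
```

On an LP64 target, mask_from_int() and mask_from_ulong() return the same
value, while mask_from_uint() returns only the low 32 bits set. Writing ~0ul
makes the intended width explicit.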




Re: [v2] cxl: prevent read/write to AFU config space while AFU not configured

2017-02-05 Thread Andrew Donnellan

On 27/01/17 11:57, Andrew Donnellan wrote:

On 27/01/17 11:40, Michael Ellerman wrote:

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/14a3ae34bfd0bcb1cc12d55b06a858


Will fix the remaining locking issue in a follow up patch...


Stable team - please make sure this doesn't go in without 
http://patchwork.ozlabs.org/patch/724315/ once that's merged.



Thanks,
--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



[PATCH] cxl: fix nested locking hang during EEH hotplug

2017-02-05 Thread Andrew Donnellan
Commit 14a3ae34bfd0 ("cxl: Prevent read/write to AFU config space while AFU
not configured") introduced a rwsem to fix an invalid memory access that
occurred when someone attempts to access the config space of an AFU on a
vPHB whilst the AFU is deconfigured, such as during EEH recovery.

It turns out that it's possible to run into a nested locking issue when EEH
recovery fails and a full device hotplug is required.
cxl_pci_error_detected() deconfigures the AFU, taking a writer lock on
configured_rwsem. When EEH recovery fails, the EEH code calls
pci_hp_remove_devices() to remove the device, which in turn calls
cxl_remove() -> cxl_pci_remove_afu() -> pci_deconfigure_afu(), which tries
to grab the writer lock that's already held.

Standard rwsem semantics don't express what we really want to do here and
don't allow for nested locking. Fix this by replacing the rwsem with an
atomic_t which we can control more finely. Allow the AFU to be locked
multiple times so long as there are no readers.

Fixes: 14a3ae34bfd0 ("cxl: Prevent read/write to AFU config space while AFU not configured")
Cc: sta...@vger.kernel.org # v4.9+
Signed-off-by: Andrew Donnellan 

---

I've asked Uma and Pradipta to give this a test.

---
 drivers/misc/cxl/cxl.h  |  5 +++--
 drivers/misc/cxl/main.c |  3 +--
 drivers/misc/cxl/pci.c  | 11 +--
 drivers/misc/cxl/vphb.c | 18 ++
 4 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index b4a43fd14b99..08e7d3a54425 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -418,8 +418,9 @@ struct cxl_afu {
struct dentry *debugfs;
struct mutex contexts_lock;
spinlock_t afu_cntl_lock;
-   /* Used to block access to AFU config space while deconfigured */
-   struct rw_semaphore configured_rwsem;
+
+   /* -1: AFU deconfigured/locked, >= 0: number of readers */
+   atomic_t configured_state;
 
/* AFU error buffer fields and bin attribute for sysfs */
u64 eb_len, eb_offset;
diff --git a/drivers/misc/cxl/main.c b/drivers/misc/cxl/main.c
index 2a6bf1d0a3a4..cc1706a92ace 100644
--- a/drivers/misc/cxl/main.c
+++ b/drivers/misc/cxl/main.c
@@ -268,8 +268,7 @@ struct cxl_afu *cxl_alloc_afu(struct cxl *adapter, int slice)
idr_init(&afu->contexts_idr);
mutex_init(&afu->contexts_lock);
spin_lock_init(&afu->afu_cntl_lock);
-   init_rwsem(&afu->configured_rwsem);
-   down_write(&afu->configured_rwsem);
+   atomic_set(&afu->configured_state, -1);
afu->prefault_mode = CXL_PREFAULT_NONE;
afu->irqs_max = afu->adapter->user_irqs;
 
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index f512e13ec0f2..bdfa5ff11aea 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1129,7 +1129,7 @@ static int pci_configure_afu(struct cxl_afu *afu, struct cxl *adapter, struct pc
if ((rc = cxl_native_register_psl_irq(afu)))
goto err2;
 
-   up_write(&afu->configured_rwsem);
+   atomic_set(&afu->configured_state, 0);
return 0;
 
 err2:
@@ -1142,7 +1142,14 @@ static int pci_configure_afu(struct cxl_afu *afu, struct cxl *adapter, struct pc
 
 static void pci_deconfigure_afu(struct cxl_afu *afu)
 {
-   down_write(&afu->configured_rwsem);
+   /*
+* It's okay to deconfigure when AFU is already locked, otherwise wait
+* until there are no readers
+*/
+   if (atomic_read(&afu->configured_state) != -1) {
+   while (atomic_cmpxchg(&afu->configured_state, 0, -1) != -1)
+   schedule();
+   }
cxl_native_release_psl_irq(afu);
if (afu->adapter->native->sl_ops->release_serr_irq)
afu->adapter->native->sl_ops->release_serr_irq(afu);
diff --git a/drivers/misc/cxl/vphb.c b/drivers/misc/cxl/vphb.c
index 639a343b7836..512a4897dbf6 100644
--- a/drivers/misc/cxl/vphb.c
+++ b/drivers/misc/cxl/vphb.c
@@ -83,6 +83,16 @@ static inline struct cxl_afu *pci_bus_to_afu(struct pci_bus *bus)
return phb ? phb->private_data : NULL;
 }
 
+static void cxl_afu_configured_put(struct cxl_afu *afu)
+{
+   atomic_dec_if_positive(&afu->configured_state);
+}
+
+static bool cxl_afu_configured_get(struct cxl_afu *afu)
+{
+   return atomic_inc_unless_negative(&afu->configured_state);
+}
+
 static inline int cxl_pcie_config_info(struct pci_bus *bus, unsigned int devfn,
   struct cxl_afu *afu, int *_record)
 {
@@ -107,7 +117,7 @@ static int cxl_pcie_read_config(struct pci_bus *bus, unsigned int devfn,
 
afu = pci_bus_to_afu(bus);
/* Grab a reader lock on afu. */
-   if (afu == NULL || !down_read_trylock(&afu->configured_rwsem))
+   if (afu == NULL || !cxl_afu_configured_get(afu))
return PCIBIOS_DEVICE_NOT_FOUND;
 
rc = cxl_pcie_config_info(bus, devfn, afu, &record);
@@ -132,7 +142,7 @@ static int cxl_pcie_read_config
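
[Archive note] The state encoding described in the commit message (-1 means
deconfigured/locked, >= 0 counts readers, and locking an already locked AFU
is allowed) can be modelled in userspace with C11 atomics. This is a sketch
under assumptions, not the driver code: struct afu_state and the afu_*
helpers are hypothetical, and afu_try_lock() returns immediately instead of
yielding in a loop the way the patch does with schedule().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* -1: deconfigured/locked, >= 0: number of readers. */
struct afu_state {
    atomic_int configured_state;
};

/* Take a reader reference unless locked (models inc-unless-negative). */
static bool afu_get(struct afu_state *s)
{
    int v = atomic_load(&s->configured_state);

    while (v >= 0) {
        /* On failure v is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&s->configured_state, &v, v + 1))
            return true;
    }
    return false;
}

/* Drop a reader reference if one is held (models dec-if-positive). */
static void afu_put(struct afu_state *s)
{
    int v = atomic_load(&s->configured_state);

    while (v > 0) {
        if (atomic_compare_exchange_weak(&s->configured_state, &v, v - 1))
            return;
    }
}

/* Lock for deconfiguration: succeeds if already locked or no readers. */
static bool afu_try_lock(struct afu_state *s)
{
    int expected = 0;

    if (atomic_load(&s->configured_state) == -1)
        return true;    /* nested lock is permitted */
    return atomic_compare_exchange_strong(&s->configured_state,
                                          &expected, -1);
}

/* Mark the AFU configured again with zero readers. */
static void afu_unlock(struct afu_state *s)
{
    atomic_store(&s->configured_state, 0);
}
```

Starting from -1, afu_get() fails and a second afu_try_lock() succeeds, which
is the nested case a plain rwsem cannot express. Once a reader holds a
reference, afu_try_lock() fails until afu_put() brings the count back to 0.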

[PATCH] powerpc/mm: Fix typo in set_pte_at()

2017-02-05 Thread Gavin Shan
This fixes the typo about the _PAGE_PTE in set_pte_at() by changing
"tryint" to "trying to".

Fixes: 6a119eae942 ("powerpc/mm: Add a _PAGE_PTE bit")
Signed-off-by: Gavin Shan 
---
 arch/powerpc/mm/pgtable.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb39c8b..a03ff3d 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -193,9 +193,7 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 */
VM_WARN_ON(pte_present(*ptep) && !pte_protnone(*ptep));
 
-   /*
-* Add the pte bit when tryint set a pte
-*/
+   /* Add the pte bit when trying to set a pte */
pte = __pte(pte_val(pte) | _PAGE_PTE);
 
/* Note: mm->context.id might not yet have been assigned as
-- 
2.7.4



Re: Timekeeping oddities on MacMini G4s

2017-02-05 Thread Benjamin Herrenschmidt
On Sat, 2017-02-04 at 16:19 -0800, Fred Wright wrote:
> On Tue, 31 Jan 2017, Hal Murray wrote:
> 
> > > b...@kernel.crashing.org said:
> > > Right, we just use the value provided by Open Firmware. Any chance you can
> 
> That seems inconsistent with the following comment in
> arch/powerpc/kernel/time.c:
> 
>  * TODO (not necessarily in this file):
>  * - improve precision and reproducibility of timebase frequency
>  * measurement at boot time.

That comment is probably ancient ;-) Different platforms use different
methods of calculating or obtaining the TB freq within arch/powerpc.

The most common however for anything recent is to just pick the value
from the device-tree. However, I noticed that MacOS X does "calibrate"
it using the timers provided by the KeyLargo chip.

> Unless it's an outdated comment that nobody bothered to remove.
> 
> > > From the value in the properties you showed me (and the ones I have in
> > > some DT snapshots) it looks like the value isn't fixed but somewhat
> > > calibrated by Open Firmware during boot.
> 
> Or by the OS, if the comment is to be believed.  It would be interesting
> to check OF values guaranteed to come directly from OF.

We don't change the DT values. Looking at some old dumps of the Apple OF
implementation I have lying around, it appears that the timebase either comes
from some specific configuration area of the flash or from some very early
boot asm calibration.

> Runtime calibration often has issues of its own.  For example, on x86, the
> kernel likes to calibrate the TSC against the RTC at boot time.  But if an
> SMI intervenes during the calibration loop (which is not prevented by
> disabling interrupts), it throws the calibration so badly out of whack
> that the system can't keep time properly until it's rebooted.  At Google,
> we had to disable ECC-related SMIs on at least one server model for that
> reason.

Right. We don't have SMIs on Power and we can probably make sure we disable
(or catch & retry) things like Machine Checks. So we can make it slightly
more accurate.

> When you think about it, the manufacturer knows perfectly well the
> nominal frequency of the crystal being stuffed, and is also programming
> onboard nonvolatile memory (typically EEPROM) with various parameters, so
> directly reporting the nominal frequency should be much more reliable than
> trying to measure it in a short test at boot time.  And detecting that
> it's reported incorrectly should be the job of a diagnostic, not an OS.

Right. On recent POWER servers it's architected. The core always sees 512MHz,
though I don't know how precise that is (see below).

> One would, of course, like to base timekeeping on the *actual* frequency
> rather than the nominal frequency, but measuring that accurately enough to
> be useful takes longer than one would like to spend in early startup,
> especially if the only accurate time source is Internet-based NTP.  The
> RTC is *not* good enough for this purpose, since *its* crystal has its own
> errors.
> 
> > I rebooted several times.  It always got the exact same clock speed numbers.
> 
> Most likely not runtime calibration, then.

Yup

> > I don't know anything about the insides of the PowerPC chip.  Can you
> > confirm that the kernel time keeping works off an always ticking register
> > similar to the Intel TSC and uses the timebase-frequency as the scale
> > factor?
> 
> That's certainly the way it's normally done on PowerPC, and a cursory
> examination of the sources looks consistent with that.  The PowerPC
> timebase is a 64-bit free-running counter.  Unlike the TSC, it's not
> per-core.

Actually it is, see below :-)

>   On the plus side, that means that the values are guaranteed not
> to be core-specific.  On the minus side, it means that its count rate is
> lower, and it's sufficiently "distant" that accessing it is somewhat more
> expensive.

Right so there are various configuration options and ways to feed the timebase
to PowerPC chips depending on the generation and manufacturer. On the old
32-bit chips, typically it was either a divisor of the bus frequency or
externally clocked. Apple typically used the latter.

However there was always an architectural requirement that it was perfectly
synchronized between cores.

On IBM POWER chips since P6 at least, there's a unit in the chip called the
ChipTOD that provides a reference clock to all the cores at a 16th of the
timebase frequency iirc.

There's a special protocol to slave the TODs of secondary chips to the primary
along with an automatic fallback to a backup network in case of failure.

The cores feed the top bits of the TB from that. The bottom bits are locally
generated by each core in such a way that guarantees that the TB can never
be observed going backward.
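
To make the "top bits from the TOD, bottom bits local" idea concrete, here is
a toy model. It is emphatically not how the hardware does it: it assumes the
16:1 ratio recalled above (so 4 low bits), and it assumes the local bits simply
hold at their maximum until the next reference step arrives. All names
(toy_core_tb, toy_local_tick, ...) are invented for the sketch.

```c
#include <stdint.h>

#define TOD_SHIFT 4                          /* assumed: one TOD step per 16 TB ticks */
#define LOW_MAX   ((1u << TOD_SHIFT) - 1)    /* largest value of the local low bits */

struct toy_core_tb {
    uint64_t tod;   /* last reference count received from the shared TOD */
    uint32_t low;   /* locally generated low bits, 0..LOW_MAX */
};

/* One local tick: advance the low bits, but hold at LOW_MAX rather than wrap. */
void toy_local_tick(struct toy_core_tb *c)
{
    if (c->low < LOW_MAX)
        c->low++;
}

/* A reference step arrives: accept it only if it moves forward. */
void toy_tod_step(struct toy_core_tb *c, uint64_t new_tod)
{
    if (new_tod > c->tod) {
        c->tod = new_tod;
        c->low = 0;
    }
}

/* The value software would read: TOD in the top bits, local count below. */
uint64_t toy_read_tb(const struct toy_core_tb *c)
{
    return (c->tod << TOD_SHIFT) | c->low;
}
```

In this model a read can stall at ...f if the local clock runs fast, but it can
never go backward: a new TOD value is only accepted if it is larger, and
(tod + 1) << 4 is always greater than (tod << 4) | 15.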

> The PowerPC architecture permits the timebase frequency to be variable,
> but I'm not aware of any implementations that take advantage of that.

I think it's pretty much accepted that this would be a very 

Re: Timekeeping oddities on MacMini G4s

2017-02-05 Thread Frank Nicholas
I’ve had mine apart many times (for memory upgrade, sensors for use in a 
vehicle, etc. - http://mt.nfshost.com )

Tell me what to look for, and I’ll take as many hi-res pictures as you want.

Thanks,
Frank

> On Feb 4, 2017, at 10:32 PM, Hal Murray  wrote:
> 
> Mumble.  If it were easier to take apart, I'd look inside to see if I could 
> find the crystal and see what was printed on it.
> 




[PATCH v6 2/2] KVM: PPC: Exit guest upon MCE when FWNMI capability is enabled

2017-02-05 Thread Mahesh J Salgaonkar
From: Aravinda Prasad 

Enhance KVM to cause a guest exit with the KVM_EXIT_NMI
exit reason upon a machine check exception (MCE) in
the guest address space if the KVM_CAP_PPC_FWNMI
capability is enabled (instead of delivering a 0x200
interrupt to the guest). This enables QEMU to build an
error log and deliver the machine check exception to the
guest via the guest-registered machine check handler.

This approach simplifies the delivery of machine
check exception to guest OS compared to the earlier
approach of KVM directly invoking 0x200 guest interrupt
vector.

This design/approach is based on the feedback for the
QEMU patches to handle machine check exception. Details
of earlier approach of handling machine check exception
in QEMU and related discussions can be found at:

https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html

Note:

This patch introduces a hook which is invoked at the time
of guest exit to facilitate the host-side handling of
machine check exception before the exception is passed
on to the guest. Hence, the host-side handling which was
performed earlier via machine_check_fwnmi is removed.

The reasons for this approach are: (i) it is not possible
to distinguish whether the exception occurred in the
guest or the host from the pt_regs passed on to
machine_check_exception(). Hence machine_check_exception()
calls panic, instead of passing on the exception to
the guest, if the machine check exception is not
recoverable. (ii) the approach introduced in this
patch gives the host kernel an opportunity to perform
actions in virtual mode before passing on the exception
to the guest. This approach does not require complex
tweaks to machine_check_fwnmi and friends.

Signed-off-by: Aravinda Prasad 
Reviewed-by: David Gibson 
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/kvm_host.h     |    2 +
 arch/powerpc/include/asm/machdep.h      |    7
 arch/powerpc/include/asm/opal.h         |    4 ++
 arch/powerpc/include/uapi/asm/kvm.h     |    6
 arch/powerpc/kvm/book3s_hv.c            |   24 ++
 arch/powerpc/kvm/book3s_hv_ras.c        |   18 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   52 ++-
 arch/powerpc/platforms/powernv/opal.c   |   26
 arch/powerpc/platforms/powernv/setup.c  |    3 ++
 9 files changed, 112 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 5f031a7..f0ea9af 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define KVM_MAX_VCPUS  NR_CPUS
 #define KVM_MAX_VCORES NR_CPUS
@@ -660,6 +661,7 @@ struct kvm_vcpu_arch {
int thread_cpu;
bool timer_running;
wait_queue_head_t cpu_run;
+   struct machine_check_event mce_evt; /* Valid if trap == 0x200 */
 
struct kvm_vcpu_arch_shared *shared;
 #if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_PR_POSSIBLE)
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 5011b69..9d74e7a 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -15,6 +15,7 @@
 #include 
 
 #include 
+#include 
 
 /* We export this macro for external modules like Alsa to know if
  * ppc_md.feature_call is implemented or not
@@ -112,6 +113,12 @@ struct machdep_calls {
/* Called during machine check exception to retrive fixup address. */
bool(*mce_check_early_recovery)(struct pt_regs *regs);
 
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+   /* Called after KVM interrupt handler finishes handling MCE for guest */
+   int (*machine_check_exception_guest)
+   (struct machine_check_event *evt);
+#endif
+
/* Motherboard/chipset features. This is a kind of general purpose
 * hook used to control some machine specific features (like reset
 * lines, chip power control, etc...).
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 5c7db0f..0c2f62f 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -17,6 +17,7 @@
 #ifndef __ASSEMBLY__
 
 #include 
+#include 
 
 /* We calculate number of sg entries based on PAGE_SIZE */
 #define SG_ENTRIES_PER_NODE ((PAGE_SIZE - 16) / sizeof(struct opal_sg_entry))
@@ -279,6 +280,9 @@ extern int opal_hmi_handler_init(void);
 extern int opal_event_init(void);
 
 extern int opal_machine_check(struct pt_regs *regs);
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+extern int opal_machine_check_guest(struct machine_check_event *evt);
+#endif
 extern bool opal_mce_check_early_recovery(struct pt_regs *regs);
 extern int opal_hmi_exception_early(struct pt_regs *regs);
 extern int opal_handle_hmi_exception(struct pt_regs *regs);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h

[PATCH v6 1/2] KVM: PPC: Add new capability to control MCE behaviour

2017-02-05 Thread Mahesh J Salgaonkar
From: Aravinda Prasad 

This patch introduces a new KVM capability to control
how KVM behaves on machine check exception (MCE).
Without this capability, KVM redirects machine check
exceptions to guest's 0x200 vector, if the address in
error belongs to the guest. With this capability KVM
causes a guest exit with NMI exit reason.

The new capability is required to avoid problems if
a new kernel/KVM is used with an old QEMU for guests
that don't issue "ibm,nmi-register". As old QEMU does
not understand the NMI exit type, it treats it as a
fatal error. With an old QEMU, however, the guest could
have handled the machine check error had the exception
been delivered to the guest's 0x200 interrupt vector
instead of causing an NMI exit.

Signed-off-by: Aravinda Prasad 
Reviewed-by: David Gibson 
Signed-off-by: Mahesh Salgaonkar 
---
 Documentation/virtual/kvm/api.txt   |   11 +++
 arch/powerpc/include/asm/kvm_host.h |    1 +
 arch/powerpc/kernel/asm-offsets.c   |    1 +
 arch/powerpc/kvm/powerpc.c          |    7 +++
 include/uapi/linux/kvm.h            |    1 +
 5 files changed, 21 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 03145b7..e2960e0 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3914,6 +3914,17 @@ to take care of that.
 This capability can be enabled dynamically even if VCPUs were already
 created and are running.
 
+7.9 KVM_CAP_PPC_FWNMI
+
+Architectures: ppc
+Parameters: none
+
+With this capability a machine check exception in the guest address
+space will cause KVM to exit the guest with NMI exit reason. This
+enables QEMU to build error log and branch to guest kernel registered
+machine check handling routine. Without this capability KVM will
+branch to guests' 0x200 interrupt vector.
+
 8. Other capabilities.
 --
 
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e59b172..5f031a7 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -266,6 +266,7 @@ struct kvm_arch {
int hpt_cma_alloc;
struct dentry *debugfs_dir;
struct dentry *htab_dentry;
+   u8 fwnmi_enabled;
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
struct mutex hpt_mutex;
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 0601e6a..df29caf 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -498,6 +498,7 @@ int main(void)
DEFINE(KVM_NEED_FLUSH, offsetof(struct kvm, arch.need_tlb_flush.bits));
DEFINE(KVM_ENABLED_HCALLS, offsetof(struct kvm, arch.enabled_hcalls));
DEFINE(KVM_VRMA_SLB_V, offsetof(struct kvm, arch.vrma_slb_v));
+   DEFINE(KVM_FWNMI, offsetof(struct kvm, arch.fwnmi_enabled));
DEFINE(VCPU_DSISR, offsetof(struct kvm_vcpu, arch.shregs.dsisr));
DEFINE(VCPU_DAR, offsetof(struct kvm_vcpu, arch.shregs.dar));
DEFINE(VCPU_VPA, offsetof(struct kvm_vcpu, arch.vpa.pinned_addr));
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index cd892de..14377bb 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -610,6 +610,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = cpu_has_feature(CPU_FTR_TM_COMP) &&
is_kvmppc_hv_enabled(kvm);
break;
+   case KVM_CAP_PPC_FWNMI:
+   r = 1;
+   break;
default:
r = 0;
break;
@@ -1210,6 +1213,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
break;
}
 #endif /* CONFIG_KVM_XICS */
+   case KVM_CAP_PPC_FWNMI:
+   r = 0;
+   vcpu->kvm->arch.fwnmi_enabled = true;
+   break;
default:
r = -EINVAL;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index cac48ed..4783d11 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -871,6 +871,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_USER_INSTR0 130
 #define KVM_CAP_MSI_DEVID 131
 #define KVM_CAP_PPC_HTM 132
+#define KVM_CAP_PPC_FWNMI 133
 
 #ifdef KVM_CAP_IRQ_ROUTING
 



[PATCH v6 0/2] KVM: PPC: Add FWNMI support for KVM guests on POWER

2017-02-05 Thread Mahesh J Salgaonkar
From: Aravinda Prasad 

This series of patches adds FWNMI support for KVM guests
on POWER.

Memory errors such as bit flips that cannot be corrected
by hardware are passed on to the kernel for handling
by raising a machine check exception (an NMI). Upon such
machine check exceptions, if the address in error belongs
to the guest, the error is passed on to the guest
kernel for handling. However, for guest kernels that
have issued the "ibm,nmi-register" call, QEMU should build
an error log and pass it on to the guest-kernel
registered machine check handler routine.

This patch series adds the functionality to pass on the
machine check exception to the guest kernel by
giving control to QEMU. QEMU builds the error log
and invokes the guest-kernel registered handler.
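
The routing decision the series describes can be summarised with a small
model. This is a sketch only, not the code in book3s_hv.c: the enum and
function names are invented, the flag corresponds to the fwnmi_enabled field
added in patch 1/2, and treating an address outside the guest as "host
handles it" is an assumption made for completeness.

```c
#include <stdbool.h>

/* Hypothetical outcome labels for this sketch; not kernel identifiers. */
enum mce_route {
    MCE_ROUTE_HOST,          /* assumed: address in error is not the guest's */
    MCE_ROUTE_GUEST_0x200,   /* deliver to the guest's 0x200 interrupt vector */
    MCE_ROUTE_EXIT_NMI       /* exit the guest so QEMU can build the error log */
};

/*
 * Decide where a machine check goes.
 *   addr_in_guest: the address in error belongs to the guest
 *   fwnmi_enabled: the FWNMI capability has been enabled for this VM
 */
enum mce_route mce_route_for(bool addr_in_guest, bool fwnmi_enabled)
{
    if (!addr_in_guest)
        return MCE_ROUTE_HOST;

    /* Without the capability, keep the old behaviour so an old QEMU still works. */
    return fwnmi_enabled ? MCE_ROUTE_EXIT_NMI : MCE_ROUTE_GUEST_0x200;
}
```

The point of gating on the capability is visible in the last line: only a QEMU
that has explicitly opted in ever sees the NMI exit.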

QEMU part can be found at:
http://lists.nongnu.org/archive/html/qemu-ppc/2015-12/msg00199.html

Change Log v6:
  - Deliver all MCE errors (handled/unhandled) for FWNMI capable guest.
  - Use kvm_run->flags to pass NMI disposition status.

Change Log v5:
  - Added capability documentation. No functionality/code change.

Change Log v4:
  - Allow host-side handling of the machine check exception before
passing on the exception to the guest.

Change Log v3:
  - Split the patch into 2. First patch introduces the
new capability while the second one enhances KVM to
redirect MCE.
  - Fix access width bug

Change Log v2:
  - Added KVM capability

---

Aravinda Prasad (2):
  KVM: PPC: Add new capability to control MCE behaviour
  KVM: PPC: Exit guest upon MCE when FWNMI capability is enabled


 Documentation/virtual/kvm/api.txt       |   11 +++
 arch/powerpc/include/asm/kvm_host.h     |    3 ++
 arch/powerpc/include/asm/machdep.h      |    7
 arch/powerpc/include/asm/opal.h         |    4 ++
 arch/powerpc/include/uapi/asm/kvm.h     |    6
 arch/powerpc/kernel/asm-offsets.c       |    1 +
 arch/powerpc/kvm/book3s_hv.c            |   24 ++
 arch/powerpc/kvm/book3s_hv_ras.c        |   18 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   52 ++-
 arch/powerpc/kvm/powerpc.c              |    7
 arch/powerpc/platforms/powernv/opal.c   |   26
 arch/powerpc/platforms/powernv/setup.c  |    3 ++
 include/uapi/linux/kvm.h                |    1 +
 13 files changed, 133 insertions(+), 30 deletions(-)

--