Re: [PATCH 2/2] powerpc/time: Only cap decrementer when watchdog is enabled

2018-09-28 Thread Nicholas Piggin
On Sat, 29 Sep 2018 11:26:07 +1000
Anton Blanchard  wrote:

> If CONFIG_PPC_WATCHDOG is enabled, we always cap the decrementer to
> 0x7fffffff. As suggested by Nick, add a run time check of the watchdog
> cpumask, so if it is disabled we use the large decrementer.
> 
> Signed-off-by: Anton Blanchard 
> ---

Thanks for tracking this down. It's a fix for my breakage:

a7cba02deced ("powerpc: allow soft-NMI watchdog to cover timer
interrupts with large decrementers")

Taking another look... what I had expected here is that the timer subsystem
would have stopped the decrementer device after it processed the timer
and found nothing left. And we should have set DEC to max at that time.

The above patch was really intended to only cover the timer interrupt
itself locking up. I wonder if we need to add

.set_state_oneshot_stopped = decrementer_shutdown

in our decrementer clockevent device?

Thanks,
Nick
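
A minimal sketch of that suggestion, assuming the 4.19-era
decrementer_clockevent definition in arch/powerpc/kernel/time.c
(decrementer_shutdown() already exists there and parks DEC at
decrementer_max). Only the set_state_oneshot_stopped line would be new,
and the wiring shown is illustrative rather than an applied patch:

static struct clock_event_device decrementer_clockevent = {
	.name			= "decrementer",
	.rating			= 200,
	.irq			= 0,
	.set_next_event		= decrementer_set_next_event,
	.set_state_oneshot_stopped = decrementer_shutdown,	/* proposed */
	.set_state_shutdown	= decrementer_shutdown,
	.tick_resume		= decrementer_shutdown,
	.features		= CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_C3STOP,
};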


Re: [PATCH 2/2] powerpc/time: Only cap decrementer when watchdog is enabled

2018-09-28 Thread kbuild test robot
Hi Anton,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.19-rc5 next-20180928]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Anton-Blanchard/powerpc-time-Use-clockevents_register_device-fixing-an-issue-with-large-decrementer/20180929-093322
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allnoconfig (attached as .config)
compiler: powerpc-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   arch/powerpc/kernel/time.c: In function 'timer_interrupt':
>> arch/powerpc/kernel/time.c:580:44: error: 'watchdog_cpumask' undeclared (first use in this function); did you mean 'proc_watchdog_cpumask'?
     cpumask_test_cpu(smp_processor_id(), &watchdog_cpumask))
                                           ^~~~~~~~~~~~~~~~
                                           proc_watchdog_cpumask
   arch/powerpc/kernel/time.c:580:44: note: each undeclared identifier is reported only once for each function it appears in

vim +580 arch/powerpc/kernel/time.c

   549  
   550  /*
   551   * timer_interrupt - gets called when the decrementer overflows,
   552   * with interrupts disabled.
   553   */
   554  void timer_interrupt(struct pt_regs *regs)
   555  {
   556  struct clock_event_device *evt = this_cpu_ptr(&decrementers);
   557  u64 *next_tb = this_cpu_ptr(&decrementers_next_tb);
   558  struct pt_regs *old_regs;
   559  u64 now;
   560  
   561  /* Some implementations of hotplug will get timer interrupts while
   562   * offline, just ignore these and we also need to set
   563   * decrementers_next_tb as MAX to make sure __check_irq_replay
   564   * don't replay timer interrupt when return, otherwise we'll trap
   565   * here infinitely :(
   566   */
   567  if (unlikely(!cpu_online(smp_processor_id()))) {
   568  *next_tb = ~(u64)0;
   569  set_dec(decrementer_max);
   570  return;
   571  }
   572  
   573  /* Ensure a positive value is written to the decrementer, or else
   574   * some CPUs will continue to take decrementer exceptions. When the
   575   * PPC_WATCHDOG (decrementer based) is configured, keep this at most
   576   * 31 bits, which is about 4 seconds on most systems, which gives
   577   * the watchdog a chance of catching timer interrupt hard lockups.
   578   */
   579  if (IS_ENABLED(CONFIG_PPC_WATCHDOG) &&
 > 580      cpumask_test_cpu(smp_processor_id(), &watchdog_cpumask))
   581  set_dec(0x7fffffff);
   582  else
   583  set_dec(decrementer_max);
   584  
   585  /* Conditionally hard-enable interrupts now that the DEC has been
   586   * bumped to its maximum value
   587   */
   588  may_hard_irq_enable();
   589  
   590  

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: New mail archive

2018-09-28 Thread Stephen Rothwell
Hi all,

On Sat, 29 Sep 2018 13:56:28 +1000 Stephen Rothwell  wrote:
>
> This mailing list is now also archived at
> https://lore.kernel.org/linuxppc-dev/

A few of the earliest emails have been misdated in the archive (to today
or yesterday), sorry.

Also, this archive includes as much as we had of the linuxppc-embedded
archive as well.

-- 
Cheers,
Stephen Rothwell




New mail archive

2018-09-28 Thread Stephen Rothwell
Hi all,

This mailing list is now also archived at
https://lore.kernel.org/linuxppc-dev/

-- 
Cheers,
Stephen Rothwell




[PATCH 2/2] powerpc/time: Only cap decrementer when watchdog is enabled

2018-09-28 Thread Anton Blanchard
If CONFIG_PPC_WATCHDOG is enabled, we always cap the decrementer to
0x7fffffff. As suggested by Nick, add a run time check of the watchdog
cpumask, so if it is disabled we use the large decrementer.

Signed-off-by: Anton Blanchard 
---
 arch/powerpc/kernel/time.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 6a1f0a084ca3..3372019f52bd 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -60,6 +60,7 @@
 #include 
 #include 
 #include 
+#include <linux/nmi.h>
 #include 
 
 #include 
@@ -575,7 +576,8 @@ void timer_interrupt(struct pt_regs *regs)
 * 31 bits, which is about 4 seconds on most systems, which gives
 * the watchdog a chance of catching timer interrupt hard lockups.
 */
-   if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
+   if (IS_ENABLED(CONFIG_PPC_WATCHDOG) &&
+   cpumask_test_cpu(smp_processor_id(), &watchdog_cpumask))
	set_dec(0x7fffffff);
else
set_dec(decrementer_max);
-- 
2.17.1



[PATCH 1/2] powerpc/time: Use clockevents_register_device(), fixing an issue with large decrementer

2018-09-28 Thread Anton Blanchard
We currently cap the decrementer clockevent at 4 seconds, even on systems
with large decrementer support. Fix this by converting the code to use
clockevents_register_device() which calculates the upper bound based on
the max_delta passed in.

Signed-off-by: Anton Blanchard 
---
 arch/powerpc/kernel/time.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 70f145e02487..6a1f0a084ca3 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -984,10 +984,10 @@ static void register_decrementer_clockevent(int cpu)
*dec = decrementer_clockevent;
dec->cpumask = cpumask_of(cpu);
 
+   clockevents_config_and_register(dec, ppc_tb_freq, 2, decrementer_max);
+
printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
dec->name, dec->mult, dec->shift, cpu);
-
-   clockevents_register_device(dec);
 }
 
 static void enable_large_decrementer(void)
@@ -1035,18 +1035,7 @@ static void __init set_decrementer_max(void)
 
 static void __init init_decrementer_clockevent(void)
 {
-   int cpu = smp_processor_id();
-
-   clockevents_calc_mult_shift(&decrementer_clockevent, ppc_tb_freq, 4);
-
-   decrementer_clockevent.max_delta_ns =
-   clockevent_delta2ns(decrementer_max, &decrementer_clockevent);
-   decrementer_clockevent.max_delta_ticks = decrementer_max;
-   decrementer_clockevent.min_delta_ns =
-   clockevent_delta2ns(2, &decrementer_clockevent);
-   decrementer_clockevent.min_delta_ticks = 2;
-
-   register_decrementer_clockevent(cpu);
+   register_decrementer_clockevent(smp_processor_id());
 }
 
 void secondary_cpu_time_init(void)
-- 
2.17.1



Re: [GIT PULL] Please pull powerpc/linux.git powerpc-4.19-3 tag

2018-09-28 Thread Greg KH
On Fri, Sep 28, 2018 at 09:39:10PM +1000, Michael Ellerman wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA512
> 
> Hi Greg,
> 
> Please pull some more powerpc fixes for 4.19:
> 
> The following changes since commit 11da3a7f84f19c26da6f86af878298694ede0804:
> 
>   Linux 4.19-rc3 (2018-09-09 17:26:43 -0700)
> 
> are available in the git repository at:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-4.19-3

Now pulled, thanks.

greg k-h


[PATCH v2] i2c: Convert to using %pOFn instead of device_node.name

2018-09-28 Thread Rob Herring
In preparation to remove the node name pointer from struct device_node,
convert printf users to use the %pOFn format specifier.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Peter Rosin 
Cc: linux-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Rob Herring 
---
v2:
- Remove initialization of parent

 drivers/i2c/busses/i2c-powermac.c | 17 +
 drivers/i2c/muxes/i2c-mux-gpmux.c |  4 ++--
 2 files changed, 11 insertions(+), 10 deletions(-)
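
The conversion pattern in miniature (illustrative, not taken from the
diff below; np stands for any struct device_node pointer):

	/* before: dereferences the node name pointer directly */
	pr_info("probing node '%s'\n", np->name);

	/* after: the printf core derives the name from the device_node */
	pr_info("probing node '%pOFn'\n", np);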

diff --git a/drivers/i2c/busses/i2c-powermac.c b/drivers/i2c/busses/i2c-powermac.c
index f2a2067525ef..f6f4ed8afc93 100644
--- a/drivers/i2c/busses/i2c-powermac.c
+++ b/drivers/i2c/busses/i2c-powermac.c
@@ -388,9 +388,8 @@ static void i2c_powermac_register_devices(struct i2c_adapter *adap,
 static int i2c_powermac_probe(struct platform_device *dev)
 {
	struct pmac_i2c_bus *bus = dev_get_platdata(&dev->dev);
-   struct device_node *parent = NULL;
+   struct device_node *parent;
struct i2c_adapter *adapter;
-   const char *basename;
int rc;
 
if (bus == NULL)
@@ -407,23 +406,25 @@ static int i2c_powermac_probe(struct platform_device *dev)
parent = of_get_parent(pmac_i2c_get_controller(bus));
if (parent == NULL)
return -EINVAL;
-   basename = parent->name;
+   snprintf(adapter->name, sizeof(adapter->name), "%pOFn %d",
+parent,
+pmac_i2c_get_channel(bus));
+   of_node_put(parent);
break;
case pmac_i2c_bus_pmu:
-   basename = "pmu";
+   snprintf(adapter->name, sizeof(adapter->name), "pmu %d",
+pmac_i2c_get_channel(bus));
break;
case pmac_i2c_bus_smu:
/* This is not what we used to do but I'm fixing drivers at
 * the same time as this change
 */
-   basename = "smu";
+   snprintf(adapter->name, sizeof(adapter->name), "smu %d",
+pmac_i2c_get_channel(bus));
break;
default:
return -EINVAL;
}
-   snprintf(adapter->name, sizeof(adapter->name), "%s %d", basename,
-pmac_i2c_get_channel(bus));
-   of_node_put(parent);
 
platform_set_drvdata(dev, adapter);
	adapter->algo = &i2c_powermac_algorithm;
diff --git a/drivers/i2c/muxes/i2c-mux-gpmux.c b/drivers/i2c/muxes/i2c-mux-gpmux.c
index 92cf5f48afe6..f60b670deff7 100644
--- a/drivers/i2c/muxes/i2c-mux-gpmux.c
+++ b/drivers/i2c/muxes/i2c-mux-gpmux.c
@@ -120,8 +120,8 @@ static int i2c_mux_probe(struct platform_device *pdev)
 
	ret = of_property_read_u32(child, "reg", &chan);
if (ret < 0) {
-   dev_err(dev, "no reg property for node '%s'\n",
-   child->name);
+   dev_err(dev, "no reg property for node '%pOFn'\n",
+   child);
goto err_children;
}
 
-- 
2.17.1



Re: [PATCH] tty: Convert to using %pOFn instead of device_node.name

2018-09-28 Thread Rob Herring
On Fri, Sep 28, 2018 at 5:09 PM Rob Herring  wrote:
>
> On Mon, Aug 27, 2018 at 8:55 PM Rob Herring  wrote:
> >
> > In preparation to remove the node name pointer from struct device_node,
> > convert printf users to use the %pOFn format specifier.
> >
> > Cc: Greg Kroah-Hartman 
> > Cc: Jiri Slaby 
> > Cc: Benjamin Herrenschmidt 
> > Cc: Paul Mackerras 
> > Cc: Michael Ellerman 
> > Cc: linux-ser...@vger.kernel.org
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Signed-off-by: Rob Herring 
> > ---
> >  drivers/tty/ehv_bytechan.c  | 12 ++--
> >  drivers/tty/serial/cpm_uart/cpm_uart_core.c |  8 
> >  drivers/tty/serial/pmac_zilog.c |  4 ++--
> >  3 files changed, 12 insertions(+), 12 deletions(-)
>
> Hey Greg, Is this still in your queue? Maybe you've just been extra
> busy lately. ;)

NM. I see it's applied now. Sorry for the noise.

Rob


Re: [PATCH] tty: Convert to using %pOFn instead of device_node.name

2018-09-28 Thread Rob Herring
On Mon, Aug 27, 2018 at 8:55 PM Rob Herring  wrote:
>
> In preparation to remove the node name pointer from struct device_node,
> convert printf users to use the %pOFn format specifier.
>
> Cc: Greg Kroah-Hartman 
> Cc: Jiri Slaby 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: linux-ser...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Rob Herring 
> ---
>  drivers/tty/ehv_bytechan.c  | 12 ++--
>  drivers/tty/serial/cpm_uart/cpm_uart_core.c |  8 
>  drivers/tty/serial/pmac_zilog.c |  4 ++--
>  3 files changed, 12 insertions(+), 12 deletions(-)

Hey Greg, Is this still in your queue? Maybe you've just been extra
busy lately. ;)

Rob
>
> diff --git a/drivers/tty/ehv_bytechan.c b/drivers/tty/ehv_bytechan.c
> index eea4049b5dcc..769e0a5d1dfc 100644
> --- a/drivers/tty/ehv_bytechan.c
> +++ b/drivers/tty/ehv_bytechan.c
> @@ -128,8 +128,8 @@ static int find_console_handle(void)
>  */
> iprop = of_get_property(np, "hv-handle", NULL);
> if (!iprop) {
> -   pr_err("ehv-bc: no 'hv-handle' property in %s node\n",
> -  np->name);
> +   pr_err("ehv-bc: no 'hv-handle' property in %pOFn node\n",
> +  np);
> return 0;
> }
> stdout_bc = be32_to_cpu(*iprop);
> @@ -661,8 +661,8 @@ static int ehv_bc_tty_probe(struct platform_device *pdev)
>
> iprop = of_get_property(np, "hv-handle", NULL);
> if (!iprop) {
> -   dev_err(&pdev->dev, "no 'hv-handle' property in %s node\n",
> -   np->name);
> +   dev_err(&pdev->dev, "no 'hv-handle' property in %pOFn node\n",
> +   np);
> return -ENODEV;
> }
>
> @@ -682,8 +682,8 @@ static int ehv_bc_tty_probe(struct platform_device *pdev)
> bc->rx_irq = irq_of_parse_and_map(np, 0);
> bc->tx_irq = irq_of_parse_and_map(np, 1);
> if ((bc->rx_irq == NO_IRQ) || (bc->tx_irq == NO_IRQ)) {
> -   dev_err(>dev, "no 'interrupts' property in %s node\n",
> -   np->name);
> +   dev_err(>dev, "no 'interrupts' property in %pOFn 
> node\n",
> +   np);
> ret = -ENODEV;
> goto error;
> }
> diff --git a/drivers/tty/serial/cpm_uart/cpm_uart_core.c b/drivers/tty/serial/cpm_uart/cpm_uart_core.c
> index 24a5f05e769b..ea7204d75022 100644
> --- a/drivers/tty/serial/cpm_uart/cpm_uart_core.c
> +++ b/drivers/tty/serial/cpm_uart/cpm_uart_core.c
> @@ -1151,8 +1151,8 @@ static int cpm_uart_init_port(struct device_node *np,
> if (!pinfo->clk) {
> data = of_get_property(np, "fsl,cpm-brg", );
> if (!data || len != 4) {
> -   printk(KERN_ERR "CPM UART %s has no/invalid "
> -   "fsl,cpm-brg property.\n", np->name);
> +   printk(KERN_ERR "CPM UART %pOFn has no/invalid "
> +   "fsl,cpm-brg property.\n", np);
> return -EINVAL;
> }
> pinfo->brg = *data;
> @@ -1160,8 +1160,8 @@ static int cpm_uart_init_port(struct device_node *np,
>
> data = of_get_property(np, "fsl,cpm-command", );
> if (!data || len != 4) {
> -   printk(KERN_ERR "CPM UART %s has no/invalid "
> -   "fsl,cpm-command property.\n", np->name);
> +   printk(KERN_ERR "CPM UART %pOFn has no/invalid "
> +   "fsl,cpm-command property.\n", np);
> return -EINVAL;
> }
> pinfo->command = *data;
> diff --git a/drivers/tty/serial/pmac_zilog.c b/drivers/tty/serial/pmac_zilog.c
> index 3d21790d961e..a4ec22d1f214 100644
> --- a/drivers/tty/serial/pmac_zilog.c
> +++ b/drivers/tty/serial/pmac_zilog.c
> @@ -1566,9 +1566,9 @@ static int pmz_attach(struct macio_dev *mdev, const struct of_device_id *match)
>  * to work around bugs in ancient Apple device-trees
>  */
> if (macio_request_resources(uap->dev, "pmac_zilog"))
> -   printk(KERN_WARNING "%s: Failed to request resource"
> +   printk(KERN_WARNING "%pOFn: Failed to request resource"
>", port still active\n",
> -  uap->node->name);
> +  uap->node);
> else
> uap->flags |= PMACZILOG_FLAG_RSRC_REQUESTED;
>
> --
> 2.17.1
>


Re: drivers binding to device node with multiple compatible strings

2018-09-28 Thread Li Yang
On Fri, Sep 28, 2018 at 4:00 PM Li Yang  wrote:
>
> On Fri, Sep 28, 2018 at 3:07 PM Rob Herring  wrote:
> >
> > On Thu, Sep 27, 2018 at 5:25 PM Li Yang  wrote:
> > >
> > > Hi Rob and Grant,
> > >
> > > Various device tree specs are recommending to include all the
> > > potential compatible strings in the device node, with the order from
> > > most specific to most general.  But it looks like Linux kernel doesn't
> > > provide a way to bind the device to the most specific driver, however,
> > > the first registered compatible driver will be bound.
> > >
> > > As more and more generic drivers are added to the Linux kernel, they
> > > are competing with the more specific vendor drivers, which causes problems
> > > when both are built into the kernel.  I'm wondering if there is a
> > > generic solution (or in plan) to make the most specific driver bound
> > > to the device.   Or we have to disable the more general driver or
> > > remove the more general compatible string from the device tree?
> >
> > It's been a known limitation for a long time. However, in practice it
> > doesn't seem to be a common problem. Perhaps folks just remove the
> > less specific compatible from their DT (though that's not ideal). For
> > most modern bindings, there's so many other resources beyond
> > compatible (clocks, resets, pinctrl, etc.) that there are few generic
> > drivers that can work.
> >
> > I guess if we want to fix this, we'd need to have weighted matching in
> > the driver core and unbind drivers when we get a better match. Though
> > it could get messy if the better driver probe fails. Then we've got to
> > rebind to the original driver.
>
> Probably we can populate the platform devices from device tree after
> the device_init phase?  So that all built-in drivers are already
> registered when the devices are created and we can try find the best
> match in one go?  For more specific loadable modules we probably need
> to unbind from the old driver and bind to the new one.  But I agree
> with you that it could be messy.
>
> >
> > Do you have a specific case where you hit this?
>
> Maybe not a new issue but "snps,dw-pcie" is competing with various
> "fsl,-pcie" compatibles.  Also a specific PHY
> "ethernet-phy-id." with generic "ethernet-phy-ieee802.3-c45".

The ethernet-phy issue is not related to the general device binding
framework; it should be an issue with the of_mdio framework. But it is
still misaligned with the device tree recommendation.

Regards,
Leo


Re: drivers binding to device node with multiple compatible strings

2018-09-28 Thread Li Yang
On Fri, Sep 28, 2018 at 3:07 PM Rob Herring  wrote:
>
> On Thu, Sep 27, 2018 at 5:25 PM Li Yang  wrote:
> >
> > Hi Rob and Grant,
> >
> > Various device tree specs are recommending to include all the
> > potential compatible strings in the device node, with the order from
> > most specific to most general.  But it looks like Linux kernel doesn't
> > provide a way to bind the device to the most specific driver, however,
> > the first registered compatible driver will be bound.
> >
> > As more and more generic drivers are added to the Linux kernel, they
> > are competing with the more specific vendor drivers, which causes problems
> > when both are built into the kernel.  I'm wondering if there is a
> > generic solution (or in plan) to make the most specific driver bound
> > to the device.   Or we have to disable the more general driver or
> > remove the more general compatible string from the device tree?
>
> It's been a known limitation for a long time. However, in practice it
> doesn't seem to be a common problem. Perhaps folks just remove the
> less specific compatible from their DT (though that's not ideal). For
> most modern bindings, there's so many other resources beyond
> compatible (clocks, resets, pinctrl, etc.) that there are few generic
> drivers that can work.
>
> I guess if we want to fix this, we'd need to have weighted matching in
> the driver core and unbind drivers when we get a better match. Though
> it could get messy if the better driver probe fails. Then we've got to
> rebind to the original driver.

Probably we can populate the platform devices from device tree after
the device_init phase?  So that all built-in drivers are already
registered when the devices are created and we can try find the best
match in one go?  For more specific loadable modules we probably need
to unbind from the old driver and bind to the new one.  But I agree
with you that it could be messy.

>
> Do you have a specific case where you hit this?

Maybe not a new issue but "snps,dw-pcie" is competing with various
"fsl,-pcie" compatibles.  Also a specific PHY
"ethernet-phy-id." with generic "ethernet-phy-ieee802.3-c45".

Regards,
Leo


Re: [PATCH v3 5/6] arm64: dts: add QorIQ LX2160A SoC support

2018-09-28 Thread Li Yang
On Mon, Sep 24, 2018 at 7:47 AM Vabhav Sharma  wrote:
>
> LX2160A SoC is based on Layerscape Chassis Generation 3.2 Architecture.
>
> LX2160A features 16 advanced 64-bit ARM v8 Cortex-A72 processor cores
> in 8 clusters, CCN508, GICv3, two 64-bit DDR4 memory controllers, 8 I2C
> controllers, 3 DSPI, 2 eSDHC, 2 USB 3.0, MMU-500, 3 SATA, 4 PL011 SBSA
> UARTs, etc.
>
> Signed-off-by: Ramneek Mehresh 
> Signed-off-by: Zhang Ying-22455 
> Signed-off-by: Nipun Gupta 
> Signed-off-by: Priyanka Jain 
> Signed-off-by: Yogesh Gaur 
> Signed-off-by: Sriram Dash 
> Signed-off-by: Vabhav Sharma 
> ---
>  arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi | 693 
> +
>  1 file changed, 693 insertions(+)
>  create mode 100644 arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
>
> diff --git a/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi 
> b/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
> new file mode 100644
> index 000..46eea16
> --- /dev/null
> +++ b/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
> @@ -0,0 +1,693 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR MIT)
> +//
> +// Device Tree Include file for Layerscape-LX2160A family SoC.
> +//
> +// Copyright 2018 NXP
> +
> +#include <dt-bindings/interrupt-controller/arm-gic.h>

You included the header file, but you didn't use the MACROs in most of
the interrupts properties below.  It is recommended to use them for
better readability.

> +
> +/memreserve/ 0x80000000 0x00010000;
> +
> +/ {
> +   compatible = "fsl,lx2160a";
> +   interrupt-parent = <&gic>;
> +   #address-cells = <2>;
> +   #size-cells = <2>;
> +
> +   cpus {
> +   #address-cells = <1>;
> +   #size-cells = <0>;
> +
> +   // 8 clusters having 2 Cortex-A72 cores each
> +   cpu@0 {
> +   device_type = "cpu";
> +   compatible = "arm,cortex-a72";
> +   reg = <0x0>;
> +   clocks = <&clockgen 1 0>;
> +   d-cache-size = <0x8000>;
> +   d-cache-line-size = <64>;
> +   d-cache-sets = <128>;
> +   i-cache-size = <0xC000>;
> +   i-cache-line-size = <64>;
> +   i-cache-sets = <192>;
> +   next-level-cache = <&cluster0_l2>;

enable-method is a required property for this cpu and the ones below.

> +   };
> +
> +   cpu@1 {
> +   device_type = "cpu";
> +   compatible = "arm,cortex-a72";
> +   reg = <0x1>;
> +   clocks = <&clockgen 1 0>;
> +   d-cache-size = <0x8000>;
> +   d-cache-line-size = <64>;
> +   d-cache-sets = <128>;
> +   i-cache-size = <0xC000>;
> +   i-cache-line-size = <64>;
> +   i-cache-sets = <192>;
> +   next-level-cache = <&cluster0_l2>;
> +   };
> +
> +   cpu@100 {
> +   device_type = "cpu";
> +   compatible = "arm,cortex-a72";
> +   reg = <0x100>;
> +   clocks = <&clockgen 1 1>;
> +   d-cache-size = <0x8000>;
> +   d-cache-line-size = <64>;
> +   d-cache-sets = <128>;
> +   i-cache-size = <0xC000>;
> +   i-cache-line-size = <64>;
> +   i-cache-sets = <192>;
> +   next-level-cache = <&cluster1_l2>;
> +   };
> +
> +   cpu@101 {
> +   device_type = "cpu";
> +   compatible = "arm,cortex-a72";
> +   reg = <0x101>;
> +   clocks = <&clockgen 1 1>;
> +   d-cache-size = <0x8000>;
> +   d-cache-line-size = <64>;
> +   d-cache-sets = <128>;
> +   i-cache-size = <0xC000>;
> +   i-cache-line-size = <64>;
> +   i-cache-sets = <192>;
> +   next-level-cache = <&cluster1_l2>;
> +   };
> +
> +   cpu@200 {
> +   device_type = "cpu";
> +   compatible = "arm,cortex-a72";
> +   reg = <0x200>;
> +   clocks = <&clockgen 1 2>;
> +   d-cache-size = <0x8000>;
> +   d-cache-line-size = <64>;
> +   d-cache-sets = <128>;
> +   i-cache-size = <0xC000>;
> +   i-cache-line-size = <64>;
> +   i-cache-sets = <192>;
> +   next-level-cache = <&cluster2_l2>;
> +   };
> +
> +   cpu@201 {
> +   device_type = "cpu";
> +   compatible = "arm,cortex-a72";
> +   reg = <0x201>;
> +   clocks = <&clockgen 1 2>;
> +   

Re: [PATCH] powerpc/rtas: Fix a potential race between CPU-Offline & Migration

2018-09-28 Thread Nathan Fontenot
On 09/28/2018 02:02 AM, Gautham R Shenoy wrote:
> Hi Nathan,
> 
> On Thu, Sep 27, 2018 at 12:31:34PM -0500, Nathan Fontenot wrote:
>> On 09/27/2018 11:51 AM, Gautham R. Shenoy wrote:
>>> From: "Gautham R. Shenoy" 
>>>
>>> Live Partition Migrations require all the present CPUs to execute the
>>> H_JOIN call, and hence rtas_ibm_suspend_me() onlines any offline CPUs
>>> before initiating the migration for this purpose.
>>>
>>> The commit 85a88cabad57
>>> ("powerpc/pseries: Disable CPU hotplug across migrations")
>>> disables any CPU-hotplug operations once all the offline CPUs are
>>> brought online to prevent any further state change. Once the
>>> CPU-Hotplug operation is disabled, the code assumes that all the CPUs
>>> are online.
>>>
>>> However, there is a minor window in rtas_ibm_suspend_me() between
>>> onlining the offline CPUs and disabling CPU-Hotplug when a concurrent
>>> CPU-offline operation initiated by userspace can succeed, thereby
>>> nullifying the aforementioned assumption. In this unlikely case
>>> these offlined CPUs will not call H_JOIN, resulting in a system hang.
>>>
>>> Fix this by verifying that all the present CPUs are actually online
>>> after CPU-Hotplug has been disabled, failing which we return from
>>> rtas_ibm_suspend_me() with -EBUSY.
>>
>> Would we also want to have the ability to re-try onlining all of the cpus
>> before failing the migration?
> 
> Given that we haven't been able to hit this issue in practice after your
> fix to disable CPU Hotplug after migrations, it indicates that the
> race-window, if it is not merely a theoretical one, is extremely
> narrow. So, this current patch addresses the safety aspect, as in,
> should someone manage to exploit this narrow race-window, it ensures
> that the system doesn't go to a hang state.
> 
> Having the ability to retry onlining all the CPUs is only required for
> progress of LPM in this rarest of cases. We should add the code to
> retry onlining the CPUs if the consequence of failing an LPM is high,
> even in this rarest of cases. Otherwise IMHO we should be OK not adding
> the additional code.

I believe you're correct. One small update to the patch below...

> 
>>
>> This would involve a bigger code change as the current code to online all
>> CPUs would work in its current form.
>>
>> -Nathan
>>
>>>
>>> Cc: Nathan Fontenot 
>>> Cc: Tyrel Datwyler 
>>> Suggested-by: Michael Ellerman 
>>> Signed-off-by: Gautham R. Shenoy 
>>> ---
>>>  arch/powerpc/kernel/rtas.c | 10 ++
>>>  1 file changed, 10 insertions(+)
>>>
>>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>>> index 2c7ed31..27f6fd3 100644
>>> --- a/arch/powerpc/kernel/rtas.c
>>> +++ b/arch/powerpc/kernel/rtas.c
>>> @@ -982,6 +982,16 @@ int rtas_ibm_suspend_me(u64 handle)
>>> }
>>>
>>> cpu_hotplug_disable();
>>> +
>>> +   /* Check if we raced with a CPU-Offline Operation */
>>> +   if (unlikely(!cpumask_equal(cpu_present_mask, cpu_online_mask))) {
>>> +   pr_err("%s: Raced against a concurrent CPU-Offline\n",
>>> +  __func__);
>>> +   atomic_set(&data.error, -EBUSY);
>>> +   cpu_hotplug_enable();

Before returning, we should put all CPUs that were offline prior to the
migration back in the offline state. We should be doing that here as well.
This should be as simple as adding a call to rtas_offline_cpus_mask() here.

-Nathan

>>> +   goto out;
>>> +   }
>>> +
>>> stop_topology_update();
>>>
>>> /* Call function on all CPUs.  One of us will make the
>>>
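
A sketch of that addition, assuming the offline_mask cpumask that
rtas_ibm_suspend_me() already fills via rtas_online_cpus_mask() earlier
in the function (illustrative, not the final patch):

	cpu_hotplug_disable();

	/* Check if we raced with a CPU-Offline Operation */
	if (unlikely(!cpumask_equal(cpu_present_mask, cpu_online_mask))) {
		pr_err("%s: Raced against a concurrent CPU-Offline\n",
		       __func__);
		atomic_set(&data.error, -EBUSY);
		/* put originally-offline CPUs back in the offline state */
		rtas_offline_cpus_mask(offline_mask);
		cpu_hotplug_enable();
		goto out;
	}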



Re: [PATCH -next] powerpc/pseries/memory-hotplug: Fix return value type of find_aa_index

2018-09-28 Thread Nathan Fontenot
On 09/21/2018 05:37 AM, YueHaibing wrote:
> find_aa_index() will return -1 when dlpar_clone_property() fails, so
> its return value type should be int. The caller,
> update_lmb_associativity_index(), should also use an int variable to
> receive it, and then compare it with 0.

The aa_index that we are handling here is defined as an unsigned value
in the PAPR, so I'm a little hesitant to change it to a signed value.
Also, even after changing aa_index to be signed, we still assign it to
the u32 lmb->aa_index.

There are some other places where the aa_index is treated as a signed value
in find_aa_index(). Perhaps the better solution is to use an rc value to track
the validation of finding the aa_index instead of the aa_index value itself.

-Nathan 
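
A sketch of that shape (illustrative only; the lookup-array walk is
elided and would stay as in the existing function):

static int find_aa_index(struct device_node *dr_node,
			 struct property *ala_prop,
			 const u32 *lmb_assoc, u32 *aa_index)
{
	bool found = false;

	/* walk ibm,associativity-lookup-arrays as before, setting
	 * *aa_index and found = true on a match, or appending a new
	 * entry via dlpar_clone_property()
	 */

	return found ? 0 : -ENOENT;
}

static int update_lmb_associativity_index(struct drmem_lmb *lmb)
{
	u32 aa_index;
	int rc;

	/* ... */
	rc = find_aa_index(dr_node, ala_prop, lmb_assoc, &aa_index);
	if (rc)
		return rc;

	lmb->aa_index = aa_index;
	/* ... */
	return 0;
}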

> 
> Fixes: c05a5a40969e ("powerpc/pseries: Dynamic add entires to associativity lookup array")
> Signed-off-by: YueHaibing 
> ---
>  arch/powerpc/platforms/pseries/hotplug-memory.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 9a15d39..6aad17c 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -101,13 +101,12 @@ static struct property *dlpar_clone_property(struct property *prop,
>   return new_prop;
>  }
> 
> -static u32 find_aa_index(struct device_node *dr_node,
> +static int find_aa_index(struct device_node *dr_node,
>struct property *ala_prop, const u32 *lmb_assoc)
>  {
>   u32 *assoc_arrays;
> - u32 aa_index;
>   int aa_arrays, aa_array_entries, aa_array_sz;
> - int i, index;
> + int i, index, aa_index;
> 
>   /*
>* The ibm,associativity-lookup-arrays property is defined to be
> @@ -168,7 +167,7 @@ static int update_lmb_associativity_index(struct drmem_lmb *lmb)
>   struct device_node *parent, *lmb_node, *dr_node;
>   struct property *ala_prop;
>   const u32 *lmb_assoc;
> - u32 aa_index;
> + int aa_index;
> 
>   parent = of_find_node_by_path("/");
>   if (!parent)
> 



Re: drivers binding to device node with multiple compatible strings

2018-09-28 Thread Lucas Stach
Hi,

Am Freitag, den 28.09.2018, 12:43 -0700 schrieb Frank Rowand:
> + Frank
> 
> On 09/27/18 15:25, Li Yang wrote:
> > Hi Rob and Grant,
> > 
> > Various device tree specs are recommending to include all the
> > potential compatible strings in the device node, with the order from
> > most specific to most general.  But it looks like Linux kernel doesn't
> > provide a way to bind the device to the most specific driver, however,
> > the first registered compatible driver will be bound.
> > 
> > As more and more generic drivers are added to the Linux kernel, they
> > are competing with the more specific vendor drivers, which causes problems
> > when both are built into the kernel.  I'm wondering if there is a
> > generic solution (or in plan) to make the most specific driver bound
> > to the device.   Or we have to disable the more general driver or
> > remove the more general compatible string from the device tree?

Not really contributing to the solution, but the hard question to
answer is when do you know what the most specific driver is? The most
specific driver might well be a module that can be loaded at any time,
while there might already be other less specific drivers around.

In general I would say that if your device is specific enough to
warrant a whole new driver, it should not declare compatibility with
the generic thing in the compatible, but then this is kind of exporting
a Linux implementation detail to DT.

Regards,
Lucas



Re: drivers binding to device node with multiple compatible strings

2018-09-28 Thread Rob Herring
On Thu, Sep 27, 2018 at 5:25 PM Li Yang  wrote:
>
> Hi Rob and Grant,
>
> Various device tree specs are recommending to include all the
> potential compatible strings in the device node, with the order from
> most specific to most general.  But it looks like Linux kernel doesn't
> provide a way to bind the device to the most specific driver, however,
> the first registered compatible driver will be bound.
>
> As more and more generic drivers are added to the Linux kernel, they
> are competing with the more specific vendor drivers, which causes problems
> when both are built into the kernel.  I'm wondering if there is a
> generic solution (or in plan) to make the most specific driver bound
> to the device.   Or we have to disable the more general driver or
> remove the more general compatible string from the device tree?

It's been a known limitation for a long time. However, in practice it
doesn't seem to be a common problem. Perhaps folks just remove the
less specific compatible from their DT (though that's not ideal). For
most modern bindings, there's so many other resources beyond
compatible (clocks, resets, pinctrl, etc.) that there are few generic
drivers that can work.

I guess if we want to fix this, we'd need to have weighted matching in
the driver core and unbind drivers when we get a better match. Though
it could get messy if the better driver probe fails. Then we've got to
rebind to the original driver.

Do you have a specific case where you hit this?

Rob
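
One way to make the weighted matching concrete, using only existing OF
helpers: score a candidate driver by the position of its matched string
in the node's compatible list, position 0 being the most specific by
convention. An illustrative sketch, not actual driver-core code:

/* Lower return value = more specific match; -ENODEV = no match.
 * A weighted driver core could prefer the lowest-scoring driver and
 * consider unbinding when a lower-scoring one registers later.
 */
static int of_match_specificity(const struct device_node *np,
				const struct of_device_id *matches)
{
	int best = INT_MAX;

	for (; matches->compatible[0]; matches++) {
		int idx = of_property_match_string(np, "compatible",
						   matches->compatible);
		if (idx >= 0 && idx < best)
			best = idx;
	}

	return best == INT_MAX ? -ENODEV : best;
}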


Re: drivers binding to device node with multiple compatible strings

2018-09-28 Thread Frank Rowand


+ Frank

On 09/27/18 15:25, Li Yang wrote:
> Hi Rob and Grant,
> 
> Various device tree specs are recommending to include all the
> potential compatible strings in the device node, with the order from
> most specific to most general.  But it looks like Linux kernel doesn't
> provide a way to bind the device to the most specific driver, however,
> the first registered compatible driver will be bound.
> 
> As more and more generic drivers are added to the Linux kernel, they
> are competing with the more specific vendor drivers, which causes problems
> when both are built into the kernel.  I'm wondering if there is a
> generic solution (or in plan) to make the most specific driver bound
> to the device.   Or we have to disable the more general driver or
> remove the more general compatible string from the device tree?
> 
> Regards,
> Leo
> 



Re: [PATCH v3 6/6] arm64: dts: add LX2160ARDB board support

2018-09-28 Thread Li Yang
On Mon, Sep 24, 2018 at 7:51 AM Vabhav Sharma  wrote:
>
> LX2160A reference design board (RDB) is a high-performance
> computing, evaluation, and development platform with LX2160A
> SoC.

Please send the next version with Shawn Guo and me in the "to" recipients
so that it's less likely we will miss it.

>
> Signed-off-by: Priyanka Jain 
> Signed-off-by: Sriram Dash 
> Signed-off-by: Vabhav Sharma 
> ---
>  arch/arm64/boot/dts/freescale/Makefile|  1 +
>  arch/arm64/boot/dts/freescale/fsl-lx2160a-rdb.dts | 88 +++
>  2 files changed, 89 insertions(+)
>  create mode 100644 arch/arm64/boot/dts/freescale/fsl-lx2160a-rdb.dts
>
> diff --git a/arch/arm64/boot/dts/freescale/Makefile b/arch/arm64/boot/dts/freescale/Makefile
> index 86e18ad..445b72b 100644
> --- a/arch/arm64/boot/dts/freescale/Makefile
> +++ b/arch/arm64/boot/dts/freescale/Makefile
> @@ -13,3 +13,4 @@ dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls2080a-rdb.dtb
>  dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls2080a-simu.dtb
>  dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls2088a-qds.dtb
>  dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls2088a-rdb.dtb
> +dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-lx2160a-rdb.dtb
> diff --git a/arch/arm64/boot/dts/freescale/fsl-lx2160a-rdb.dts b/arch/arm64/boot/dts/freescale/fsl-lx2160a-rdb.dts
> new file mode 100644
> index 000..1bbe663
> --- /dev/null
> +++ b/arch/arm64/boot/dts/freescale/fsl-lx2160a-rdb.dts
> @@ -0,0 +1,88 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR MIT)
> +//
> +// Device Tree file for LX2160ARDB
> +//
> +// Copyright 2018 NXP
> +
> +/dts-v1/;
> +
> +#include "fsl-lx2160a.dtsi"
> +
> +/ {
> +   model = "NXP Layerscape LX2160ARDB";
> +   compatible = "fsl,lx2160a-rdb", "fsl,lx2160a";
> +
> +   chosen {
> +   stdout-path = "serial0:115200n8";
> +   };
> +};
> +
> + {
> +   status = "okay";
> +};
> +
> + {
> +   status = "okay";
> +};
> +
> + {
> +   status = "okay";
> +   i2c-mux@77 {
> +   compatible = "nxp,pca9547";
> +   reg = <0x77>;
> +   #address-cells = <1>;
> +   #size-cells = <0>;
> +
> +   i2c@2 {
> +   #address-cells = <1>;
> +   #size-cells = <0>;
> +   reg = <0x2>;
> +
> +   power-monitor@40 {
> +   compatible = "ti,ina220";
> +   reg = <0x40>;
> +   shunt-resistor = <1000>;
> +   };
> +   };
> +
> +   i2c@3 {
> +   #address-cells = <1>;
> +   #size-cells = <0>;
> +   reg = <0x3>;
> +
> +   temperature-sensor@4c {
> +   compatible = "nxp,sa56004";
> +   reg = <0x4c>;

Need a vcc-supply property according to the binding.

> +   };
> +
> +   temperature-sensor@4d {
> +   compatible = "nxp,sa56004";
> +   reg = <0x4d>;

Ditto.

> +   };
> +   };
> +   };
> +};
> +
> + {
> +   status = "okay";
> +
> +   rtc@51 {
> +   compatible = "nxp,pcf2129";
> +   reg = <0x51>;
> +   // IRQ10_B
> +   interrupts = <0 150 0x4>;
> +   };
> +
> +};
> +
> + {
> +   status = "okay";
> +};
> +
> + {
> +   status = "okay";
> +};
> +
> + {
> +   status = "okay";
> +};
> --
> 2.7.4
>


Re: [PATCH v3 6/9] kbuild: consolidate Devicetree dtb build rules

2018-09-28 Thread Rob Herring
On Fri, Sep 28, 2018 at 12:21 PM Andreas Färber  wrote:
>
> Hi Geert,
>
> On 13.09.18 17:51, Geert Uytterhoeven wrote:
> > On Wed, Sep 12, 2018 at 3:02 AM Masahiro Yamada  wrote:
> >> Even x86 can enable OF and OF_UNITTEST.
> >>
> >> Another solution might be,
> >> guard it by 'depends on ARCH_SUPPORTS_OF'.
> >>
> >> This is actually what ACPI does.
> >>
> >> menuconfig ACPI
> >> bool "ACPI (Advanced Configuration and Power Interface) Support"
> >> depends on ARCH_SUPPORTS_ACPI
> >>  ...
> >
> > ACPI is a real platform feature, as it depends on firmware.
> >
> > CONFIG_OF can be enabled, and DT overlays can be loaded, on any platform,
> > even if it has ACPI ;-)
>
> How would loading a DT overlay work on an ACPI platform? I.e., what
> would it overlay against and how to practically load such a file?

The DT unittests do just that. I run them on x86 and UM builds. In
this case, the loading source is built-in.

> I wonder whether that could be helpful for USB devices and serdev...

How to load the overlays is pretty orthogonal to the issues to be
solved here. It would certainly be possible to move forward with
prototyping this and just have the overlay built-in. It may not even
need to be an overlay if we can support multiple root nodes.

Rob
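
For reference, roughly what the built-in route looks like. This is a
hedged sketch: my_overlay is a placeholder for an overlay blob linked in
via the usual dtb.S wrapper, which provides the __dtb_<name>_begin/_end
symbols:

extern u8 __dtb_my_overlay_begin[];
extern u8 __dtb_my_overlay_end[];

static int __init apply_builtin_overlay(void)
{
	u32 size = __dtb_my_overlay_end - __dtb_my_overlay_begin;
	int ovcs_id = 0;

	/* unflattens and applies the overlay to the live tree */
	return of_overlay_fdt_apply(__dtb_my_overlay_begin, size, &ovcs_id);
}
late_initcall(apply_builtin_overlay);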


Re: [PATCH -next] PCI: hotplug: Use kmemdup rather than duplicating its implementation in pnv_php_add_devtree()

2018-09-28 Thread Bjorn Helgaas
On Thu, Sep 27, 2018 at 06:52:21AM +, YueHaibing wrote:
> Use kmemdup rather than duplicating its implementation
> 
> Signed-off-by: YueHaibing 

Applied with Michael's ack to pci/hotplug for v4.20, thanks!

> ---
>  drivers/pci/hotplug/pnv_php.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/hotplug/pnv_php.c b/drivers/pci/hotplug/pnv_php.c
> index 5070620..ee54f5b 100644
> --- a/drivers/pci/hotplug/pnv_php.c
> +++ b/drivers/pci/hotplug/pnv_php.c
> @@ -275,14 +275,13 @@ static int pnv_php_add_devtree(struct pnv_php_slot *php_slot)
>   goto free_fdt1;
>   }
>  
> - fdt = kzalloc(fdt_totalsize(fdt1), GFP_KERNEL);
> + fdt = kmemdup(fdt1, fdt_totalsize(fdt1), GFP_KERNEL);
>   if (!fdt) {
>   ret = -ENOMEM;
>   goto free_fdt1;
>   }
>  
>   /* Unflatten device tree blob */
> - memcpy(fdt, fdt1, fdt_totalsize(fdt1));
>   dt = of_fdt_unflatten_tree(fdt, php_slot->dn, NULL);
>   if (!dt) {
>   ret = -EINVAL;
> 
> 
> 


Re: [PATCH v3 6/9] kbuild: consolidate Devicetree dtb build rules

2018-09-28 Thread Andreas Färber
Hi Geert,

On 13.09.18 17:51, Geert Uytterhoeven wrote:
> On Wed, Sep 12, 2018 at 3:02 AM Masahiro Yamada  wrote:
>> Even x86 can enable OF and OF_UNITTEST.
>>
>> Another solution might be,
>> guard it by 'depends on ARCH_SUPPORTS_OF'.
>>
>> This is actually what ACPI does.
>>
>> menuconfig ACPI
>> bool "ACPI (Advanced Configuration and Power Interface) Support"
>> depends on ARCH_SUPPORTS_ACPI
>>  ...
> 
> ACPI is a real platform feature, as it depends on firmware.
> 
> CONFIG_OF can be enabled, and DT overlays can be loaded, on any platform,
> even if it has ACPI ;-)

How would loading a DT overlay work on an ACPI platform? I.e., what
would it overlay against and how to practically load such a file?

I wonder whether that could be helpful for USB devices and serdev...

Cheers,
Andreas

-- 
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-09-28 Thread Dave Hansen
It's really nice if these kinds of things are broken up.  First, replace
the old want_memblock parameter, then add the parameter to the
__add_page() calls.

> +/*
> + * NONE: No memory block is to be created (e.g. device memory).
> + * NORMAL:   Memory block that represents normal (boot or hotplugged) memory
> + *   (e.g. ACPI DIMMs) that should be onlined either automatically
> + *   (memhp_auto_online) or manually by user space to select a
> + *   specific zone.
> + *   Applicable to memhp_auto_online.
> + * STANDBY:  Memory block that represents standby memory that should only
> + *   be onlined on demand by user space (e.g. standby memory on
> + *   s390x), but never automatically by the kernel.
> + *   Not applicable to memhp_auto_online.
> + * PARAVIRT: Memory block that represents memory added by
> + *   paravirtualized mechanisms (e.g. hyper-v, xen) that will
> + *   always automatically get onlined. Memory will be unplugged
> + *   using ballooning, not by relying on the MOVABLE ZONE.
> + *   Not applicable to memhp_auto_online.
> + */
> +enum {
> + MEMORY_BLOCK_NONE,
> + MEMORY_BLOCK_NORMAL,
> + MEMORY_BLOCK_STANDBY,
> + MEMORY_BLOCK_PARAVIRT,
> +};

This does not seem like the best way to expose these.

STANDBY, for instance, seems to be essentially a replacement for a check
against running on s390 in userspace to implement a _typical_ s390
policy.  It seems rather weird to try to make the userspace policy
determination easier by telling userspace about the typical s390 policy
via the kernel.

As for the OOM issues, that sounds like something we need to fix by
refusing to do (or delaying) hot-add operations once we consume too much
ZONE_NORMAL from memmap[]s rather than trying to indirectly tell
userspace to hurry things along.

So, to my eye, we need:

 +enum {
 +  MEMORY_BLOCK_NONE,
 +  MEMORY_BLOCK_STANDBY, /* the default */
 +  MEMORY_BLOCK_AUTO_ONLINE,
 +};

and we can probably collapse NONE into AUTO_ONLINE because userspace
ends up doing the same thing for both: nothing.
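
Collapsed, that would look something like (illustrative):

 enum {
	MEMORY_BLOCK_STANDBY,		/* default: onlined on demand by user space */
	MEMORY_BLOCK_AUTO_ONLINE,	/* onlined automatically by the kernel */
 };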

>  struct memory_block {
>   unsigned long start_section_nr;
>   unsigned long end_section_nr;
> @@ -34,6 +58,7 @@ struct memory_block {
>   int (*phys_callback)(struct memory_block *);
>   struct device dev;
>   int nid;/* NID for this memory block */
> + int type;   /* type of this memory block */
>  };

Shouldn't we just be creating and using an actual named enum type?


Re: [PATCH v4 1/2] powerpc/32: add stack protector support

2018-09-28 Thread Segher Boessenkool
On Fri, Sep 28, 2018 at 10:56:07PM +1000, Michael Ellerman wrote:
> The problem of low entropy at boot on systems without a good hardware
> source is sort of unsolvable.
> 
> As you say it's up to the core kernel/random code, we shouldn't be
> trying to do anything tricky in the arch code.
> 
> You don't want your system to take 3 hours to boot because it's waiting
> for entropy for the stack canary.
> 
> If we can update the canary later once the entropy pool is setup that
> would be ideal.

Yup, I agree with all that.

But we should *also* not say "oh, there may be cases where we cannot
do the right thing, so just do not even try, ever, anywhere".


Segher


[PATCH 4/4] powerpc/64s/hash: add more barriers for slb preloading

2018-09-28 Thread Nicholas Piggin
In several places, more care has to be taken to prevent compiler or
CPU re-ordering of memory accesses into critical sections that must
not take SLB faults.

Fixes: 5e46e29e6a97 ("powerpc/64s/hash: convert SLB miss handlers to C")
Fixes: 89ca4e126a3f ("powerpc/64s/hash: Add a SLB preload cache")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/slb.c | 37 +++--
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index c1425853af5d..f93ed8afbac6 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -344,6 +344,9 @@ void slb_setup_new_exec(void)
if (preload_add(ti, mm->mmap_base))
slb_allocate_user(mm, mm->mmap_base);
}
+
+   /* see switch_slb */
+   asm volatile("isync" : : : "memory");
 }
 
 void preload_new_slb_context(unsigned long start, unsigned long sp)
@@ -373,6 +376,9 @@ void preload_new_slb_context(unsigned long start, unsigned long sp)
if (preload_add(ti, heap))
slb_allocate_user(mm, heap);
}
+
+   /* see switch_slb */
+   asm volatile("isync" : : : "memory");
 }
 
 
@@ -389,6 +395,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 * which would update the slb_cache/slb_cache_ptr fields in the PACA.
 */
hard_irq_disable();
+   asm volatile("isync" : : : "memory");
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
/*
 * SLBIA IH=3 invalidates all Class=1 SLBEs and their
@@ -396,7 +403,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 * switch_slb wants. So ARCH_300 does not use the slb
 * cache.
 */
-   asm volatile("isync ; " PPC_SLBIA(3)" ; isync");
+   asm volatile(PPC_SLBIA(3));
} else {
unsigned long offset = get_paca()->slb_cache_ptr;
 
@@ -404,7 +411,6 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
offset <= SLB_CACHE_ENTRIES) {
unsigned long slbie_data = 0;
 
-   asm volatile("isync" : : : "memory");
for (i = 0; i < offset; i++) {
/* EA */
slbie_data = (unsigned long)
@@ -419,7 +425,6 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
if (!cpu_has_feature(CPU_FTR_ARCH_207S) && offset == 1)
asm volatile("slbie %0" : : "r" (slbie_data));
 
-   asm volatile("isync" : : : "memory");
} else {
struct slb_shadow *p = get_slb_shadow();
unsigned long ksp_esid_data =
@@ -427,8 +432,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
unsigned long ksp_vsid_data =
be64_to_cpu(p->save_area[KSTACK_INDEX].vsid);
 
-   asm volatile("isync\n"
-PPC_SLBIA(1) "\n"
+   asm volatile(PPC_SLBIA(1) "\n"
 "slbmte%0,%1\n"
 "isync"
 :: "r"(ksp_vsid_data),
@@ -464,6 +468,13 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 
slb_allocate_user(mm, ea);
}
+
+   /*
+* Synchronize slbmte preloads with possible subsequent user memory
+* address accesses by the kernel (user mode won't happen until
+* rfid, which is safe).
+*/
+   asm volatile("isync" : : : "memory");
 }
 
 void slb_set_size(u16 size)
@@ -625,6 +636,17 @@ static long slb_insert_entry(unsigned long ea, unsigned long context,
if (!vsid)
return -EFAULT;
 
+   /*
+* There must not be a kernel SLB fault in alloc_slb_index or before
+* slbmte here or the allocation bitmaps could get out of whack with
+* the SLB.
+*
+* User SLB faults or preloads take this path which might get inlined
+* into the caller, so add compiler barriers here to ensure unsafe
+* memory accesses do not come between
+*/
+   barrier();
+
index = alloc_slb_index(kernel);
 
vsid_data = __mk_vsid_data(vsid, ssize, flags);
@@ -633,10 +655,13 @@ static long slb_insert_entry(unsigned long ea, unsigned long context,
/*
 * No need for an isync before or after this slbmte. The exception
 * we enter with and the rfid we exit with are context synchronizing.
-* Also we only handle user segments here.
+* User preloads should add isync afterwards in case the kernel
+* accesses user memory before it returns to userspace with rfid.
 */
asm volatile("slbmte %0, %1" : : "r" 

[PATCH 3/4] powerpc/64s/hash: Fix preloading of SLB entries

2018-09-28 Thread Nicholas Piggin
slb_setup_new_exec and preload_new_slb_context assumed that if an address
missed the preload cache, then it would not be in the SLB and could
be added. This is wrong if the preload cache has started to overflow.
This can cause SLB multi-hits on user addresses.

That assumption came from an earlier version of the patch which
cleared the preload cache when copying the task, but even that was
technically wrong because some user accesses occur before these
preloads, and the preloads themselves could overflow the cache
depending on the size.

Fixes: 89ca4e126a3f ("powerpc/64s/hash: Add a SLB preload cache")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/slb.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index b438220c4336..c1425853af5d 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -311,6 +311,13 @@ void slb_setup_new_exec(void)
struct mm_struct *mm = current->mm;
unsigned long exec = 0x1000;
 
+   /*
+* preload cache can only be used to determine whether a SLB
+* entry exists if it does not start to overflow.
+*/
+   if (ti->slb_preload_nr + 2 > SLB_PRELOAD_NR)
+   return;
+
/*
 * We have no good place to clear the slb preload cache on exec,
 * flush_thread is about the earliest arch hook but that happens
@@ -345,6 +352,10 @@ void preload_new_slb_context(unsigned long start, unsigned long sp)
struct mm_struct *mm = current->mm;
unsigned long heap = mm->start_brk;
 
+   /* see above */
+   if (ti->slb_preload_nr + 3 > SLB_PRELOAD_NR)
+   return;
+
/* Userspace entry address. */
if (!is_kernel_addr(start)) {
if (preload_add(ti, start))
-- 
2.18.0



[PATCH 2/4] powerpc/64: interrupts save PPR on stack rather than thread_struct

2018-09-28 Thread Nicholas Piggin
PPR is the odd register out when it comes to interrupt handling,
it is saved in current->thread.ppr while all others are saved on
the stack.

The difficulty with this is that accessing thread.ppr can cause a
SLB fault, but the change converting the SLB fault handlers to C had
assumed the normal exception entry handlers would not cause an SLB
fault.

Fix this by allocating room in the interrupt stack to save PPR.

Fixes: 5e46e29e6a97 ("powerpc/64s/hash: convert SLB miss handlers to C")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/exception-64s.h |  9 -
 arch/powerpc/include/asm/processor.h | 12 +++-
 arch/powerpc/include/asm/ptrace.h|  1 +
 arch/powerpc/kernel/asm-offsets.c|  2 +-
 arch/powerpc/kernel/entry_64.S   | 15 +--
 arch/powerpc/kernel/process.c|  2 +-
 arch/powerpc/kernel/ptrace.c |  4 ++--
 7 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h
index 47578b79f0fb..3b4767ed3ec5 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -228,11 +228,10 @@
  * PPR save/restore macros used in exceptions_64s.S  
  * Used for P7 or later processors
  */
-#define SAVE_PPR(area, ra, rb) \
+#define SAVE_PPR(area, ra) \
 BEGIN_FTR_SECTION_NESTED(940)  \
-   ld  ra,PACACURRENT(r13);\
-   ld  rb,area+EX_PPR(r13);/* Read PPR from paca */\
-   std rb,TASKTHREADPPR(ra);   \
+   ld  ra,area+EX_PPR(r13);/* Read PPR from paca */\
+   std ra,_PPR(r1);\
 END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,940)
 
 #define RESTORE_PPR_PACA(area, ra) \
@@ -500,7 +499,7 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 3: EXCEPTION_PROLOG_COMMON_1();   \
beq 4f; /* if from kernel mode  */ \
ACCOUNT_CPU_USER_ENTRY(r13, r9, r10);  \
-   SAVE_PPR(area, r9, r10);   \
+   SAVE_PPR(area, r9);\
 4: EXCEPTION_PROLOG_COMMON_2(area)\
EXCEPTION_PROLOG_COMMON_3(n)   \
ACCOUNT_STOLEN_TIME
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 350c584ca179..07251598056c 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -32,9 +32,9 @@
 /* Default SMT priority is set to 3. Use 11- 13bits to save priority. */
 #define PPR_PRIORITY 3
 #ifdef __ASSEMBLY__
-#define INIT_PPR (PPR_PRIORITY << 50)
+#define DEFAULT_PPR (PPR_PRIORITY << 50)
 #else
-#define INIT_PPR ((u64)PPR_PRIORITY << 50)
+#define DEFAULT_PPR ((u64)PPR_PRIORITY << 50)
 #endif /* __ASSEMBLY__ */
 #endif /* CONFIG_PPC64 */
 
@@ -247,7 +247,11 @@ struct thread_struct {
 #ifdef CONFIG_PPC64
unsigned long   ksp_vsid;
 #endif
-   struct pt_regs  *regs;  /* Pointer to saved register state */
+   union {
+   struct int_regs *iregs; /* Pointer to saved register state */
+   struct pt_regs  *regs;  /* Pointer to saved register state */
+   };
+
mm_segment_taddr_limit; /* for get_fs() validation */
 #ifdef CONFIG_BOOKE
/* BookE base exception scratch space; align on cacheline */
@@ -342,7 +346,6 @@ struct thread_struct {
 * onwards.
 */
int dscr_inherit;
-   unsigned long   ppr;/* used to save/restore SMT priority */
unsigned long   tidr;
 #endif
 #ifdef CONFIG_PPC_BOOK3S_64
@@ -390,7 +393,6 @@ struct thread_struct {
.regs = (struct pt_regs *)INIT_SP - 1, /* XXX bogus, I think */ \
.addr_limit = KERNEL_DS, \
.fpexc_mode = 0, \
-   .ppr = INIT_PPR, \
.fscr = FSCR_TAR | FSCR_EBB \
 }
 #endif
diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
index 1a98cd8c49f6..9a5a1cc85bd0 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -78,6 +78,7 @@
 struct int_regs {
/* pt_regs must be offset 0 so r1 + STACK_FRAME_OVERHEAD points to it */
struct pt_regs pt_regs;
+   unsigned long ppr;
 };
 
 #define GET_IP(regs)   ((regs)->nip)
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 8db740a3a8c7..32908a08908b 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -88,7 +88,6 @@ int main(void)
 #ifdef CONFIG_PPC64

[PATCH 1/4] powerpc/64: add struct int_regs to save additional registers on stack

2018-09-28 Thread Nicholas Piggin
struct pt_regs is part of the user ABI and also the fundamental
structure for saving registers at interrupt.

The generic kernel code makes it difficult to completely decouple
these, but it's easy enough to add additional space required to save
more registers. Create a struct int_regs with struct pt_regs at
offset 0.

This is required for a following fix to save the PPR SPR on stack.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/ptrace.h | 17 +++---
 arch/powerpc/kernel/asm-offsets.c | 21 -
 arch/powerpc/kernel/process.c | 52 ---
 arch/powerpc/kernel/stacktrace.c  |  2 +-
 4 files changed, 53 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
index 447cbd1bee99..1a98cd8c49f6 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -26,7 +26,6 @@
 #include 
 #include 
 
-
 #ifdef __powerpc64__
 
 /*
@@ -44,7 +43,7 @@
 #define STACK_FRAME_OVERHEAD   112 /* size of minimum stack frame */
 #define STACK_FRAME_LR_SAVE2   /* Location of LR in stack frame */
 #define STACK_FRAME_REGS_MARKERASM_CONST(0x7265677368657265)
-#define STACK_INT_FRAME_SIZE   (sizeof(struct pt_regs) + \
+#define STACK_INT_FRAME_SIZE   (sizeof(struct int_regs) + \
 STACK_FRAME_OVERHEAD + KERNEL_REDZONE_SIZE)
 #define STACK_FRAME_MARKER 12
 
@@ -76,6 +75,11 @@
 
 #ifndef __ASSEMBLY__
 
+struct int_regs {
+   /* pt_regs must be offset 0 so r1 + STACK_FRAME_OVERHEAD points to it */
+   struct pt_regs pt_regs;
+};
+
 #define GET_IP(regs)   ((regs)->nip)
 #define GET_USP(regs)  ((regs)->gpr[1])
 #define GET_FP(regs)   (0)
@@ -119,8 +123,11 @@ extern int ptrace_get_reg(struct task_struct *task, int regno,
 extern int ptrace_put_reg(struct task_struct *task, int regno,
  unsigned long data);
 
-#define current_pt_regs() \
-   ((struct pt_regs *)((unsigned long)current_thread_info() + THREAD_SIZE) - 1)
+#define current_int_regs() \
+ ((struct int_regs *)((unsigned long)current_thread_info() + THREAD_SIZE) - 1)
+
+#define current_pt_regs() (_int_regs()->pt_regs)
+
 /*
  * We use the least-significant bit of the trap field to indicate
  * whether we have saved the full set of registers, or only a
@@ -137,7 +144,7 @@ extern int ptrace_put_reg(struct task_struct *task, int regno,
 #define TRAP(regs) ((regs)->trap & ~0xF)
 #ifdef __powerpc64__
 #define NV_REG_POISON  0xdeadbeefdeadbeefUL
-#define CHECK_FULL_REGS(regs)  BUG_ON(regs->trap & 1)
+#define CHECK_FULL_REGS(regs)  BUG_ON((regs)->trap & 1)
 #else
 #define NV_REG_POISON  0xdeadbeef
 #define CHECK_FULL_REGS(regs)\
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index ba9d0fc98730..8db740a3a8c7 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -72,8 +72,13 @@
 #include 
 #endif
 
-#define STACK_PT_REGS_OFFSET(sym, val) \
-   DEFINE(sym, STACK_FRAME_OVERHEAD + offsetof(struct pt_regs, val))
+#define STACK_INT_REGS_OFFSET(sym, val)\
+   DEFINE(sym, STACK_FRAME_OVERHEAD + offsetof(struct int_regs, val))
+
+#define STACK_PT_REGS_OFFSET(sym, val) \
+   DEFINE(sym, STACK_FRAME_OVERHEAD +  \
+   offsetof(struct int_regs, pt_regs) +\
+   offsetof(struct pt_regs, val))
 
 int main(void)
 {
@@ -150,7 +155,7 @@ int main(void)
OFFSET(THREAD_CKFPSTATE, thread_struct, ckfp_state.fpr);
/* Local pt_regs on stack for Transactional Memory funcs. */
DEFINE(TM_FRAME_SIZE, STACK_FRAME_OVERHEAD +
-  sizeof(struct pt_regs) + 16);
+  sizeof(struct int_regs) + 16);
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
 
OFFSET(TI_FLAGS, thread_info, flags);
@@ -264,11 +269,11 @@ int main(void)
 
/* Interrupt register frame */
DEFINE(INT_FRAME_SIZE, STACK_INT_FRAME_SIZE);
-   DEFINE(SWITCH_FRAME_SIZE, STACK_FRAME_OVERHEAD + sizeof(struct pt_regs));
+   DEFINE(SWITCH_FRAME_SIZE, STACK_FRAME_OVERHEAD + sizeof(struct int_regs));
 #ifdef CONFIG_PPC64
/* Create extra stack space for SRR0 and SRR1 when calling prom/rtas. */
-   DEFINE(PROM_FRAME_SIZE, STACK_FRAME_OVERHEAD + sizeof(struct pt_regs) + 16);
-   DEFINE(RTAS_FRAME_SIZE, STACK_FRAME_OVERHEAD + sizeof(struct pt_regs) + 16);
+   DEFINE(PROM_FRAME_SIZE, STACK_FRAME_OVERHEAD + sizeof(struct int_regs) + 16);
+   DEFINE(RTAS_FRAME_SIZE, STACK_FRAME_OVERHEAD + sizeof(struct int_regs) + 16);
 #endif /* CONFIG_PPC64 */
STACK_PT_REGS_OFFSET(GPR0, gpr[0]);
STACK_PT_REGS_OFFSET(GPR1, gpr[1]);
@@ -315,8 +320,8 @@ int main(void)
STACK_PT_REGS_OFFSET(SOFTE, softe);
 
/* These _only_ to be used with 

[PATCH 0/4] Fixes for SLB to C series

2018-09-28 Thread Nicholas Piggin
These are some fixes I've got so far to solve hangs and multi-hits
particularly on P8 with 256MB segments (but can also be reproduced
on P9).

I'm not yet sure these solve all the problems, and they need some
good review and testing. So far they have been solid for me.

Thanks,
Nick

Nicholas Piggin (4):
  powerpc/64: add struct int_regs to save additional registers on stack
  powerpc/64: interrupts save PPR on stack rather than thread_struct
  powerpc/64s/hash: Fix preloading of SLB entries
  powerpc/64s/hash: add more barriers for slb preloading

 arch/powerpc/include/asm/exception-64s.h |  9 ++--
 arch/powerpc/include/asm/processor.h | 12 +++---
 arch/powerpc/include/asm/ptrace.h| 18 +---
 arch/powerpc/kernel/asm-offsets.c| 23 ++
 arch/powerpc/kernel/entry_64.S   | 15 +++
 arch/powerpc/kernel/process.c| 54 
 arch/powerpc/kernel/ptrace.c |  4 +-
 arch/powerpc/kernel/stacktrace.c |  2 +-
 arch/powerpc/mm/slb.c| 48 ++---
 9 files changed, 116 insertions(+), 69 deletions(-)

-- 
2.18.0



Re: [PATCH 3/5] dma-direct: refine dma_direct_alloc zone selection

2018-09-28 Thread Christoph Hellwig
On Fri, Sep 28, 2018 at 10:06:48AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2018-09-27 at 15:49 +0200, Christoph Hellwig wrote:
> > On Thu, Sep 27, 2018 at 11:45:15AM +1000, Benjamin Herrenschmidt wrote:
> > > I'm not sure this is entirely right.
> > > 
> > > Let's say the mask is 30 bits. You will return GFP_DMA32, which will
> > > fail if you allocate something above 1G (which is legit for
> > > ZONE_DMA32).
> > 
> > And then we will try GFP_DMA further down in the function:
> > 
> > if (IS_ENABLED(CONFIG_ZONE_DMA) &&
> > dev->coherent_dma_mask < DMA_BIT_MASK(32) &&
> > !(gfp & GFP_DMA)) {
> > gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
> > goto again;
> > }
> > 
> > This is and old optimization from x86, because chances are high that
> > GFP_DMA32 will give you suitable memory for the infamous 31-bit
> > dma mask devices (at least at boot time) and thus we don't have
> > to deplete the tiny ZONE_DMA pool.
> 
> I see, it's rather confusing :-) Wouldn't it be better to check against
> top of 32-bit memory instead here too ?

Where is here?  In __dma_direct_optimal_gfp_mask we already handled
it due to the optimistic zone selection we are discussing.

In the fallback quoted above there is no point for it, as with a
physical memory size smaller than ZONE_DMA32 (or ZONE_DMA for that matter)
we will have succeeded with the optimistic zone selection and not hit
the fallback path.

Either way this code probably needs much better comments.  I'll send
a patch on top of the recent series.
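To make the two-step policy concrete, here is a simplified sketch of the
selection plus fallback being described (illustrative names only, not the
actual kernel/dma/direct.c code):

#include <stdbool.h>

enum zone_pick { PICK_DMA, PICK_DMA32, PICK_NORMAL };

/* Optimistic first pick based on the device's coherent DMA mask. */
static enum zone_pick optimal_zone(unsigned long long coherent_mask)
{
	if (coherent_mask < (1ULL << 24))	/* below the ZONE_DMA limit */
		return PICK_DMA;
	if (coherent_mask < (1ULL << 32))	/* e.g. 30/31-bit masks */
		return PICK_DMA32;
	return PICK_NORMAL;
}

/* On allocation failure, retry one zone lower (the "goto again" above). */
static bool fall_back(enum zone_pick *z)
{
	if (*z == PICK_NORMAL) { *z = PICK_DMA32; return true; }
	if (*z == PICK_DMA32)  { *z = PICK_DMA;   return true; }
	return false;		/* already at ZONE_DMA: give up */
}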


Re: [PATCH v3 6/9] kbuild: consolidate Devicetree dtb build rules

2018-09-28 Thread Rob Herring
On Sun, Sep 23, 2018 at 06:31:14AM -0400, Masahiro Yamada wrote:
> 2018-09-13 11:51 GMT-04:00 Geert Uytterhoeven :
> > Hi Yamada-san,
> >
> > On Wed, Sep 12, 2018 at 3:02 AM Masahiro Yamada
> >  wrote:
> >> 2018-09-12 0:40 GMT+09:00 Rob Herring :
> >> > On Mon, Sep 10, 2018 at 10:04 AM Rob Herring  wrote:
> >> >> There is nothing arch specific about building dtb files other than their
> >> >> location under /arch/*/boot/dts/. Keeping each arch aligned is a pain.
> >> >> The dependencies and supported targets are all slightly different.
> >> >> Also, a cross-compiler for each arch is needed, but really the host
> >> >> compiler preprocessor is perfectly fine for building dtbs. Move the
> >> >> build rules to a common location and remove the arch specific ones. This
> >> >> is done in a single step to avoid warnings about overriding rules.
> >> >>
> >> >> The build dependencies had been a mixture of 'scripts' and/or 'prepare'.
> >> >> These pull in several dependencies some of which need a target compiler
> >> >> (specifically devicetable-offsets.h) and aren't needed to build dtbs.
> >> >> All that is really needed is dtc, so adjust the dependencies to only be
> >> >> dtc.
> >> >>
> >> >> This change enables support 'dtbs_install' on some arches which were
> >> >> missing the target.
> >> >
> >> > [...]
> >> >
> >> >> @@ -1215,6 +1215,33 @@ kselftest-merge:
> >> >> $(srctree)/tools/testing/selftests/*/config
> >> >> +$(Q)$(MAKE) -f $(srctree)/Makefile olddefconfig
> >> >>
> >> >> +# 
> >> >> ---
> >> >> +# Devicetree files
> >> >> +
> >> >> +ifneq ($(wildcard $(srctree)/arch/$(SRCARCH)/boot/dts/),)
> >> >> +dtstree := arch/$(SRCARCH)/boot/dts
> >> >> +endif
> >> >> +
> >> >> +ifdef CONFIG_OF_EARLY_FLATTREE
> >> >
> >> > This can be true when dtstree is unset. So this line should be this
> >> > instead to fix the 0-day reported error:
> >> >
> >> > ifneq ($(dtstree),)
> >> >
> >> >> +
> >> >> +%.dtb : scripts_dtc
> >> >> +   $(Q)$(MAKE) $(build)=$(dtstree) $(dtstree)/$@
> >> >> +
> >> >> +PHONY += dtbs dtbs_install
> >> >> +dtbs: scripts_dtc
> >> >> +   $(Q)$(MAKE) $(build)=$(dtstree)
> >> >> +
> >> >> +dtbs_install: dtbs
> >> >> +   $(Q)$(MAKE) $(dtbinst)=$(dtstree)
> >> >> +
> >> >> +all: dtbs
> >> >> +
> >> >> +endif
> >>
> >>
> >> Ah, right.
> >> Even x86 can enable OF and OF_UNITTEST.
> >>
> >>
> >>
> >> Another solution might be,
> >> guard it by 'depends on ARCH_SUPPORTS_OF'.
> >>
> >>
> >>
> >> This is actually what ACPI does.
> >>
> >> menuconfig ACPI
> >> bool "ACPI (Advanced Configuration and Power Interface) Support"
> >> depends on ARCH_SUPPORTS_ACPI
> >>  ...
> >
> > ACPI is a real platform feature, as it depends on firmware.
> >
> > CONFIG_OF can be enabled, and DT overlays can be loaded, on any platform,
> > even if it has ACPI ;-)
> >
> 
> OK, understood.

Any other comments on this? I'd like to get the series into linux-next 
soon.

There was one other problem 0-day reported when building with 
CONFIG_OF=n while setting CONFIG_OF_ALL_DTBS=y on the kernel command 
line. The problem is dtc is not built as setting options on the command 
line doesn't invoke kconfig select(s). This can be fixed by also 
adding CONFIG_DTC=y to the command line, always building dtc regardless 
of Kconfig, or making 'all' conditionally dependent on 'dtbs'. I've gone 
with the last option as that is how this problem was avoided before. 

So the hunk in question with the 2 fixes now looks like this:

@@ -1215,6 +1215,35 @@ kselftest-merge:
$(srctree)/tools/testing/selftests/*/config
+$(Q)$(MAKE) -f $(srctree)/Makefile olddefconfig
 
+# ---------------------------------------------------------------------------
+# Devicetree files
+
+ifneq ($(wildcard $(srctree)/arch/$(SRCARCH)/boot/dts/),)
+dtstree := arch/$(SRCARCH)/boot/dts
+endif
+
+ifneq ($(dtstree),)
+
+%.dtb : scripts_dtc
+   $(Q)$(MAKE) $(build)=$(dtstree) $(dtstree)/$@
+
+PHONY += dtbs dtbs_install
+dtbs: scripts_dtc
+   $(Q)$(MAKE) $(build)=$(dtstree)
+
+dtbs_install: dtbs
+   $(Q)$(MAKE) $(dtbinst)=$(dtstree)
+
+ifdef CONFIG_OF_EARLY_FLATTREE
+all: dtbs
+endif
+
+endif
+
+PHONY += scripts_dtc
+scripts_dtc: scripts_basic
+   $(Q)$(MAKE) $(build)=scripts/dtc
+
# ---------------------------------------------------------------------------
 # Modules
 


[PATCH v3] powerpc: wire up memtest

2018-09-28 Thread Christophe Leroy
Add a call to early_memtest() so that a kernel compiled with
CONFIG_MEMTEST really performs a memtest at startup when requested
via the 'memtest' boot parameter.
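For example, booting with 'memtest=4' makes the kernel run four passes over
free memory, each filling it with a test pattern and verifying it, and
reserving any bad regions it finds; the default of 0 disables the test.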

Tested-by: Daniel Axtens 
Signed-off-by: Christophe Leroy 
---
 v3: updated kernel parameters documentation to mention PPC

 v2: moved the test after initmem_init() as PPC64 sets max_low_pfn later than 
PPC32.

 Documentation/admin-guide/kernel-parameters.txt | 2 +-
 arch/powerpc/kernel/setup-common.c  | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 92eb1f42240d..a5ad67d5cb16 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2404,7 +2404,7 @@
seconds.  Use this parameter to check at some
other rate.  0 disables periodic checking.
 
-   memtest=[KNL,X86,ARM] Enable memtest
+   memtest=[KNL,X86,ARM,PPC] Enable memtest
Format: 
default : 0 
Specifies the number of memtest passes to be
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 93fa0c99681e..9ca9db707bcb 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -966,6 +967,8 @@ void __init setup_arch(char **cmdline_p)
 
initmem_init();
 
+   early_memtest(min_low_pfn << PAGE_SHIFT, max_low_pfn << PAGE_SHIFT);
+
 #ifdef CONFIG_DUMMY_CONSOLE
	conswitchp = &dummy_con;
 #endif
-- 
2.13.3



[PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-09-28 Thread David Hildenbrand
How and when to online hotplugged memory is hard for distributions to
manage, because different memory types have to be treated differently.
Right now, we need complicated udev rules that e.g. check if we are
running on s390x, on a physical system or on a virtualized system. But
sometimes there is also a demand to online memory immediately in the
kernel as it is added, without waiting for user space to make a
decision. And on virtualized systems there might be different
requirements, depending on "how" the memory was added (and if it will
eventually get unplugged again - DIMM vs. paravirtualized mechanisms).

On the one hand, we have physical systems where we sometimes
want to be able to unplug memory again - e.g. a DIMM - so we have to online
it to the MOVABLE zone optionally. That decision is usually made in user
space.

On the other hand, we have memory that should never be onlined
automatically, only when asked for by an administrator. Such memory only
applies to virtualized environments like s390x, where the concept of
"standby" memory exists. Memory is detected and added during boot, so it
can be onlined when requested by the administrator or some tooling.
Only upon onlining is the memory actually allocated in the hypervisor.

But then, we also have paravirtualized devices (namely Xen and Hyper-V
balloons) that hotplug memory which, as of today, will never be removed
from the system via offline_pages/remove_memory. If anything, this memory
is logically unplugged and handed back to the hypervisor via ballooning.

For paravirtualized devices it is relevant that memory is onlined as
quickly as possible after adding - and that it is added to the NORMAL
zone. Otherwise, it could happen that too much memory in a row is added
(but not onlined), resulting in out-of-memory conditions due to the
additional memory needed for "struct page" and friends. The MOVABLE
zone, as well as onlining delays, can be very problematic and lead to
crashes (e.g. zone imbalance).

Therefore, introduce memory block types and online memory depending on
the type when adding it. Expose the memory type to user space, so user
space handlers can start to process only "normal" memory. Other memory
block types can be ignored; that is one less thing to worry about in
user space.
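As a rough sketch of the direction (illustrative names only, not necessarily
the final API of this patch):

/*
 * A type recorded per memory block so the kernel and user space can
 * apply the right onlining policy.
 */
enum memory_block_type {
	MEMORY_BLOCK_NONE = 0,		/* no memory block is created */
	MEMORY_BLOCK_NORMAL,		/* e.g. a DIMM: user space decides */
	MEMORY_BLOCK_STANDBY,		/* s390x standby: online on request only */
	MEMORY_BLOCK_PARAVIRT,		/* Xen/Hyper-V: online immediately */
};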

Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Greg Kroah-Hartman 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: "Jérôme Glisse" 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Stephen Rothwell 
Cc: Michal Hocko 
Cc: "Kirill A. Shutemov" 
Cc: David Hildenbrand 
Cc: Nicholas Piggin 
Cc: "Jonathan Neuschäfer" 
Cc: Joe Perches 
Cc: Michael Neuling 
Cc: Mauricio Faria de Oliveira 
Cc: Balbir Singh 
Cc: Rashmica Gupta 
Cc: Pavel Tatashin 
Cc: Rob Herring 
Cc: Philippe Ombredanne 
Cc: Kate Stewart 
Cc: "mike.tra...@hpe.com" 
Cc: Joonsoo Kim 
Cc: Oscar Salvador 
Cc: Mathieu Malaterre 
Signed-off-by: David Hildenbrand 
---

This patch is based on the current mm-tree, where some related
patches from me are currently residing that touched the add_memory()
functions.

 arch/ia64/mm/init.c   |  4 +-
 arch/powerpc/mm/mem.c |  4 +-
 arch/powerpc/platforms/powernv/memtrace.c |  3 +-
 arch/s390/mm/init.c   |  4 +-
 arch/sh/mm/init.c |  4 +-
 arch/x86/mm/init_32.c |  4 +-
 arch/x86/mm/init_64.c |  8 +--
 drivers/acpi/acpi_memhotplug.c|  3 +-
 drivers/base/memory.c | 63 ---
 drivers/hv/hv_balloon.c   | 33 ++--
 drivers/s390/char/sclp_cmd.c  |  3 +-
 drivers/xen/balloon.c |  2 +-
 include/linux/memory.h| 28 +-
 include/linux/memory_hotplug.h| 17 +++---
 mm/hmm.c  |  6 ++-
 mm/memory_hotplug.c   | 31 ++-
 16 files changed, 139 insertions(+), 78 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index d5e12ff1d73c..813d1d86bf95 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -646,13 +646,13 @@ mem_init (void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-   bool want_memblock)
+   int memory_block_type)
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
 
-   ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);
+   ret = __add_pages(nid, start_pfn, nr_pages, altmap, 

Re: [PATCH] selftests/powerpc: Fix Makefiles for headers_install change

2018-09-28 Thread Anders Roxell
On Fri, 28 Sep 2018 at 07:43, Michael Ellerman  wrote:
>
> Commit b2d35fa5fc80 ("selftests: add headers_install to lib.mk")
> introduced a requirement that Makefiles more than one level below the
> selftests directory need to define top_srcdir, but it didn't update
> any of the powerpc Makefiles.
>
> This broke building all the powerpc selftests with eg:
>
>   make[1]: Entering directory '/src/linux/tools/testing/selftests/powerpc'
>   BUILD_TARGET=/src/linux/tools/testing/selftests/powerpc/alignment; mkdir -p 
> $BUILD_TARGET; make OUTPUT=$BUILD_TARGET -k -C alignment all
>   make[2]: Entering directory 
> '/src/linux/tools/testing/selftests/powerpc/alignment'
>   ../../lib.mk:20: ../../../../scripts/subarch.include: No such file or 
> directory
>   make[2]: *** No rule to make target '../../../../scripts/subarch.include'.
>   make[2]: Failed to remake makefile '../../../../scripts/subarch.include'.
>   Makefile:38: recipe for target 'alignment' failed
>
> Fix it by setting top_srcdir in the affected Makefiles.
>
> Fixes: b2d35fa5fc80 ("selftests: add headers_install to lib.mk")
> Signed-off-by: Michael Ellerman 

oops, I'm sorry =/

Reviewed-by: Anders Roxell 

> ---
>  tools/testing/selftests/powerpc/alignment/Makefile | 1 +
>  tools/testing/selftests/powerpc/benchmarks/Makefile| 1 +
>  tools/testing/selftests/powerpc/cache_shape/Makefile   | 1 +
>  tools/testing/selftests/powerpc/copyloops/Makefile | 1 +
>  tools/testing/selftests/powerpc/dscr/Makefile  | 1 +
>  tools/testing/selftests/powerpc/math/Makefile  | 1 +
>  tools/testing/selftests/powerpc/mm/Makefile| 1 +
>  tools/testing/selftests/powerpc/pmu/Makefile   | 1 +
>  tools/testing/selftests/powerpc/pmu/ebb/Makefile   | 1 +
>  tools/testing/selftests/powerpc/primitives/Makefile| 1 +
>  tools/testing/selftests/powerpc/ptrace/Makefile| 1 +
>  tools/testing/selftests/powerpc/signal/Makefile| 1 +
>  tools/testing/selftests/powerpc/stringloops/Makefile   | 1 +
>  tools/testing/selftests/powerpc/switch_endian/Makefile | 1 +
>  tools/testing/selftests/powerpc/syscalls/Makefile  | 1 +
>  tools/testing/selftests/powerpc/tm/Makefile| 1 +
>  tools/testing/selftests/powerpc/vphn/Makefile  | 1 +
>  17 files changed, 17 insertions(+)
>
> diff --git a/tools/testing/selftests/powerpc/alignment/Makefile 
> b/tools/testing/selftests/powerpc/alignment/Makefile
> index 93baacab7693..d056486f49de 100644
> --- a/tools/testing/selftests/powerpc/alignment/Makefile
> +++ b/tools/testing/selftests/powerpc/alignment/Makefile
> @@ -1,5 +1,6 @@
>  TEST_GEN_PROGS := copy_first_unaligned alignment_handler
>
> +top_srcdir = ../../../../..
>  include ../../lib.mk
>
>  $(TEST_GEN_PROGS): ../harness.c ../utils.c
> diff --git a/tools/testing/selftests/powerpc/benchmarks/Makefile 
> b/tools/testing/selftests/powerpc/benchmarks/Makefile
> index b4d7432a0ecd..d40300a65b42 100644
> --- a/tools/testing/selftests/powerpc/benchmarks/Makefile
> +++ b/tools/testing/selftests/powerpc/benchmarks/Makefile
> @@ -4,6 +4,7 @@ TEST_GEN_FILES := exec_target
>
>  CFLAGS += -O2
>
> +top_srcdir = ../../../../..
>  include ../../lib.mk
>
>  $(TEST_GEN_PROGS): ../harness.c
> diff --git a/tools/testing/selftests/powerpc/cache_shape/Makefile 
> b/tools/testing/selftests/powerpc/cache_shape/Makefile
> index 1be547434a49..ede4d3dae750 100644
> --- a/tools/testing/selftests/powerpc/cache_shape/Makefile
> +++ b/tools/testing/selftests/powerpc/cache_shape/Makefile
> @@ -5,6 +5,7 @@ all: $(TEST_PROGS)
>
>  $(TEST_PROGS): ../harness.c ../utils.c
>
> +top_srcdir = ../../../../..
>  include ../../lib.mk
>
>  clean:
> diff --git a/tools/testing/selftests/powerpc/copyloops/Makefile 
> b/tools/testing/selftests/powerpc/copyloops/Makefile
> index 1cf89a34d97c..44574f3818b3 100644
> --- a/tools/testing/selftests/powerpc/copyloops/Makefile
> +++ b/tools/testing/selftests/powerpc/copyloops/Makefile
> @@ -17,6 +17,7 @@ TEST_GEN_PROGS := copyuser_64_t0 copyuser_64_t1 
> copyuser_64_t2 \
>
>  EXTRA_SOURCES := validate.c ../harness.c stubs.S
>
> +top_srcdir = ../../../../..
>  include ../../lib.mk
>
>  $(OUTPUT)/copyuser_64_t%:  copyuser_64.S $(EXTRA_SOURCES)
> diff --git a/tools/testing/selftests/powerpc/dscr/Makefile 
> b/tools/testing/selftests/powerpc/dscr/Makefile
> index 55d7db7a616b..5df476364b4d 100644
> --- a/tools/testing/selftests/powerpc/dscr/Makefile
> +++ b/tools/testing/selftests/powerpc/dscr/Makefile
> @@ -3,6 +3,7 @@ TEST_GEN_PROGS := dscr_default_test dscr_explicit_test 
> dscr_user_test   \
>   dscr_inherit_test dscr_inherit_exec_test dscr_sysfs_test  \
>   dscr_sysfs_thread_test
>
> +top_srcdir = ../../../../..
>  include ../../lib.mk
>
>  $(OUTPUT)/dscr_default_test: LDLIBS += -lpthread
> diff --git a/tools/testing/selftests/powerpc/math/Makefile 
> b/tools/testing/selftests/powerpc/math/Makefile
> index 0dd3a01fdab9..11a10d7a2bbd 100644
> --- 

Re: [PATCH v2] powerpc: wire up memtest

2018-09-28 Thread Daniel Axtens
Hi Christophe,

> Add call to early_memtest() so that kernel compiled with
> CONFIG_MEMTEST really perform memtest at startup when requested
> via 'memtest' boot parameter.
>
This works for me on an e6500.

Tested-by: Daniel Axtens 

However, you should also change Documentation/admin-guide/kernel-parameters.txt
to reflect that memtest is supported natively on ppc with your patch.

Regards,
Daniel

> Signed-off-by: Christophe Leroy 
> ---
>  v2: moved the test after initmem_init() as PPC64 sets max_low_pfn later than 
> PPC32.
>
>  arch/powerpc/kernel/setup-common.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/kernel/setup-common.c 
> b/arch/powerpc/kernel/setup-common.c
> index 93fa0c99681e..9ca9db707bcb 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -966,6 +967,8 @@ void __init setup_arch(char **cmdline_p)
>  
>   initmem_init();
>  
> + early_memtest(min_low_pfn << PAGE_SHIFT, max_low_pfn << PAGE_SHIFT);
> +
>  #ifdef CONFIG_DUMMY_CONSOLE
>   conswitchp = &dummy_con;
>  #endif
> -- 
> 2.13.3


Re: [PATCH v6] selftests: add headers_install to lib.mk

2018-09-28 Thread Shuah Khan
On 09/27/2018 10:52 PM, Michael Ellerman wrote:
> [ + linuxppc-dev ]
> 
> Anders Roxell  writes:
>> If the kernel headers aren't installed we can't build all the tests.
>> Add a new make target rule 'khdr' in the file lib.mk to generate the
>> kernel headers and that gets include for every test-dir Makefile that
>> includes lib.mk If the testdir in turn have its own sub-dirs the
>> top_srcdir needs to be set to the linux-rootdir to be able to generate
>> the kernel headers.
>>
>> Signed-off-by: Anders Roxell 
>> Reviewed-by: Fathi Boudra 
>> ---
>>
>> I sent this (v5) a month ago and wondered if it got lost. Resending
>> unchanged.
>>
>> Cheers,
>> Anders
>>
>>  Makefile   | 14 +-
>>  scripts/subarch.include| 13 +
>>  tools/testing/selftests/android/Makefile   |  2 +-
>>  tools/testing/selftests/android/ion/Makefile   |  2 ++
>>  tools/testing/selftests/futex/functional/Makefile  |  1 +
>>  tools/testing/selftests/gpio/Makefile  |  7 ++-
>>  tools/testing/selftests/kvm/Makefile   |  7 ++-
>>  tools/testing/selftests/lib.mk | 12 
>>  tools/testing/selftests/net/Makefile   |  1 +
>>  .../selftests/networking/timestamping/Makefile |  1 +
>>  tools/testing/selftests/vm/Makefile|  4 
>>  11 files changed, 36 insertions(+), 28 deletions(-)
>>  create mode 100644 scripts/subarch.include
> 
> This broke all the powerpc selftests :(

Sorry for the breakage.

> 
> Why did it go in at rc5?
> 

This patch had been in linux-next for some time before I decided to send it.
My original intent was to send this for rc2, but my schedule was messed up by
traveling. Since I didn't hear of any issues from it soaking in linux-next, I
made the call to send it in for rc5.

On second thought I should have waited until 4.20. Sorry about that.

thanks,
-- Shuah


Re: [PATCH] kdb: use correct pointer when 'btc' calls 'btt'

2018-09-28 Thread Michael Ellerman
Christophe LEROY  writes:
> Le 27/09/2018 à 13:09, Michael Ellerman a écrit :
>> Christophe LEROY  writes:
>>> Le 26/09/2018 à 13:11, Daniel Thompson a écrit :
 On 16/09/2018 20:06, Daniel Thompson wrote:
> On Fri, Sep 14, 2018 at 12:35:44PM +, Christophe Leroy wrote:
>> On a powerpc 8xx, 'btc' fails as follows:
>> Entering kdb (current=0x(ptrval), pid 282) due to Keyboard Entry
>> ...
>>
>> Signed-off-by: Christophe Leroy 
>> Cc:  # 4.15+
>
> Would a Fixes: be better here?
> Fixes: ad67b74d2469d9b82 ("printk: hash addresses printed with %p")

 Christophe, When you add the Fixes: could you also add my

 Reviewed-by: Daniel Thompson 
>>>
>>> Ok, thanks for the review, but do I have to do anything really ?
>>>
>>> The Fixes: and now your Reviewed-by: appear automatically in patchwork
>>> (https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=65715),
>>> so I believe they'll be automatically included when Jason or someone
>>> else takes the patch, no ?
>> 
>> patchwork won't add the Fixes tag from the reply, it needs to be in the
>> original mail.
>> 
>> See:
>>https://github.com/getpatchwork/patchwork/issues/151
>> 
>
> Ok, so it accounts it and adds a '1' in the F column in the patches 
> list, but won't take it into account.

Yes. The logic that populates the columns is separate from the logic
that scrapes the tags, which is a bug :)

> Then I'll send a v2 with revised commit text.

Thanks.

cheers


Re: [PATCH v4 1/2] powerpc/32: add stack protector support

2018-09-28 Thread Michael Ellerman
Christophe LEROY  writes:
> Le 27/09/2018 à 09:45, Segher Boessenkool a écrit :
>> On Thu, Sep 27, 2018 at 08:20:00AM +0200, Christophe LEROY wrote:
...
>> 
>>> However this is the canary for initial startup only. Only idle() still
>>> uses this canary once the system is running. A new canary is set for any
>>> new forked task.
>> 
>> Ah, that makes things a lot better!  Do those new tasks get a canary
>> from something with sufficient entropy though?
>
> For the kernel threads that are started early, probably not. For the 
> ones started a bit later, and for user processes, I believe they have 
> better entropy. Anyway, all this is handled by the kernel core and is 
> out of control of individual arches, as it's done in kernel/fork.c in 
> function dup_task_struct(). However this function is declared as
> static __latent_entropy struct task_struct *copy_process(). This 
> __latent_entropy attribute must help in a way.
>
>> 
>>> Maybe should the idle canary be updated later once there is more entropy
>> 
>> That is tricky to do, but sure, if you can, that should help.
>> 
>>> ? Today there is a new call to boot_init_stack_canary() in
>>> cpu_startup_entry(), but it is enclosed inside #ifdef CONFIG_X86.
>> 
>> It needs to know the details of how ssp works on each platform.
>
> Well, that could be for another patch in the future. That's probably 
> feasible on x86 and PPC because they both use TLS guard, but not for 
> other arches where the guard is global at the moment. So we'll have to 
> do it carefully.
>
> I agree with you that we may not for the time being get all the expected 
> security against hackers out of it due to that little entropy, but my 
> main concern for the time being is to deter developer's bugs clobbering 
> the stack, and for that any non-trivial canary should make it, shouldn't 
> it ?

Yes.

The problem of low entropy at boot on systems without a good hardware
source is sort of unsolvable.

As you say it's up to the core kernel/random code, we shouldn't be
trying to do anything tricky in the arch code.

You don't want your system to take 3 hours to boot because it's waiting
for entropy for the stack canary.

If we can update the canary later once the entropy pool is setup that
would be ideal.
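For reference, the per-task canary assignment discussed above sits in
dup_task_struct() in kernel/fork.c and is roughly just:

	#ifdef CONFIG_STACKPROTECTOR
		tsk->stack_canary = get_random_canary();
	#endif

so the quality of each new task's canary tracks whatever entropy the core
random code has accumulated by fork time.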

cheers


Re: [PATCH kernel] cxl: Remove unused include

2018-09-28 Thread Michael Ellerman
Alexey Kardashevskiy  writes:
> The included opal.h gives a wrong idea that CXL makes PPC OPAL calls
> while it does not, so let's remove it.

But it does use eg.

  OPAL_PHB_CAPI_MODE_SNOOP_ON
  OPAL_PHB_CAPI_MODE_CAPI

Which come from opal-api.h via opal.h.

So you should at least include opal-api.h.

cheers

> diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
> index b66d832..8cbcbb7 100644
> --- a/drivers/misc/cxl/pci.c
> +++ b/drivers/misc/cxl/pci.c
> @@ -17,7 +17,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> -- 
> 2.11.0


Re: [PATCH] powerpc: wire up memtest

2018-09-28 Thread Michael Ellerman
Christophe LEROY  writes:
> Le 28/09/2018 à 05:41, Michael Ellerman a écrit :
>> Christophe Leroy  writes:
>>> Add call to early_memtest() so that kernel compiled with
>>> CONFIG_MEMTEST really perform memtest at startup when requested
>>> via 'memtest' boot parameter.
>>>
>>> Signed-off-by: Christophe Leroy 
>>> ---
>>>   arch/powerpc/kernel/setup-common.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/arch/powerpc/kernel/setup-common.c 
>>> b/arch/powerpc/kernel/setup-common.c
>>> index 93fa0c99681e..904b728eb20d 100644
>>> --- a/arch/powerpc/kernel/setup-common.c
>>> +++ b/arch/powerpc/kernel/setup-common.c
>>> @@ -33,6 +33,7 @@
>>>   #include 
>>>   #include 
>>>   #include 
>>> +#include 
>>>   #include 
>>>   #include 
>>>   #include 
>>> @@ -917,6 +918,8 @@ void __init setup_arch(char **cmdline_p)
>>> /* Parse memory topology */
>>> mem_topology_setup();
>>>   
>>> +   early_memtest(min_low_pfn << PAGE_SHIFT, max_low_pfn << PAGE_SHIFT);
>> 
>> On a ppc64le VM this boils down to early_memtest(0, 0) for me.
>> 
>> I think it's too early, we don't set up max_low_pfn until
>> initmem_init().
>> 
>> If I move it after initmem_init() then it does something more useful:
>
> Ok. On my 8xx max_low_pfn is set in mem_topology_setup().
>
> Moving the test afte initmem_init() still works on the 8xx so I'll do that.

Great, thanks.

cheers


Re: [PATCH] selftests/powerpc: Fix Makefiles for headers_install change

2018-09-28 Thread Michael Ellerman
Anders Roxell  writes:
> On Fri, 28 Sep 2018 at 07:43, Michael Ellerman  wrote:
>>
>> Commit b2d35fa5fc80 ("selftests: add headers_install to lib.mk")
>> introduced a requirement that Makefiles more than one level below the
>> selftests directory need to define top_srcdir, but it didn't update
>> any of the powerpc Makefiles.
>>
>> This broke building all the powerpc selftests with eg:
>>
>>   make[1]: Entering directory '/src/linux/tools/testing/selftests/powerpc'
>>   BUILD_TARGET=/src/linux/tools/testing/selftests/powerpc/alignment; mkdir 
>> -p $BUILD_TARGET; make OUTPUT=$BUILD_TARGET -k -C alignment all
>>   make[2]: Entering directory 
>> '/src/linux/tools/testing/selftests/powerpc/alignment'
>>   ../../lib.mk:20: ../../../../scripts/subarch.include: No such file or 
>> directory
>>   make[2]: *** No rule to make target '../../../../scripts/subarch.include'.
>>   make[2]: Failed to remake makefile '../../../../scripts/subarch.include'.
>>   Makefile:38: recipe for target 'alignment' failed
>>
>> Fix it by setting top_srcdir in the affected Makefiles.
>>
>> Fixes: b2d35fa5fc80 ("selftests: add headers_install to lib.mk")
>> Signed-off-by: Michael Ellerman 
>
> oops, I'm sorry =/

No worries, it happens :)

> Reviewed-by: Anders Roxell 

Thanks.

cheers


Re: selftests/powerpc: Fix Makefiles for headers_install change

2018-09-28 Thread Michael Ellerman
On Fri, 2018-09-28 at 05:43:23 UTC, Michael Ellerman wrote:
> Commit b2d35fa5fc80 ("selftests: add headers_install to lib.mk")
> introduced a requirement that Makefiles more than one level below the
> selftests directory need to define top_srcdir, but it didn't update
> any of the powerpc Makefiles.
> 
> This broke building all the powerpc selftests with eg:
> 
>   make[1]: Entering directory '/src/linux/tools/testing/selftests/powerpc'
>   BUILD_TARGET=/src/linux/tools/testing/selftests/powerpc/alignment; mkdir -p 
> $BUILD_TARGET; make OUTPUT=$BUILD_TARGET -k -C alignment all
>   make[2]: Entering directory 
> '/src/linux/tools/testing/selftests/powerpc/alignment'
>   ../../lib.mk:20: ../../../../scripts/subarch.include: No such file or 
> directory
>   make[2]: *** No rule to make target '../../../../scripts/subarch.include'.
>   make[2]: Failed to remake makefile '../../../../scripts/subarch.include'.
>   Makefile:38: recipe for target 'alignment' failed
> 
> Fix it by setting top_srcdir in the affected Makefiles.
> 
> Fixes: b2d35fa5fc80 ("selftests: add headers_install to lib.mk")
> Signed-off-by: Michael Ellerman 

Applied to powerpc fixes.

https://git.kernel.org/powerpc/c/7e0cf1c983b5b24426d130fd949a05

cheers


[GIT PULL] Please pull powerpc/linux.git powerpc-4.19-3 tag

2018-09-28 Thread Michael Ellerman

Hi Greg,

Please pull some more powerpc fixes for 4.19:

The following changes since commit 11da3a7f84f19c26da6f86af878298694ede0804:

  Linux 4.19-rc3 (2018-09-09 17:26:43 -0700)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
tags/powerpc-4.19-3

for you to fetch changes up to 7e0cf1c983b5b24426d130fd949a055d520acc9a:

  selftests/powerpc: Fix Makefiles for headers_install change (2018-09-28 
15:07:45 +1000)

----------------------------------------------------------------
powerpc fixes for 4.19 #3

A reasonably big batch of fixes due to me being away for a few weeks.

A fix for the TM emulation support on Power9, which could result in corrupting
the guest r11 when running under KVM.

Two fixes to the TM code which could lead to userspace GPR corruption if we take
an SLB miss at exactly the wrong time.

Our dynamic patching code had a bug that meant we could patch freed __init text,
which could lead to corrupting userspace memory.

csum_ipv6_magic() didn't work on little endian platforms since we optimised it
recently.

A fix for an endian bug when reading a device tree property telling us how many
storage keys the machine has available.

Fix a crash seen on some configurations of PowerVM when migrating the partition
from one machine to another.

A fix for a regression in the setup of our CPU to NUMA node mapping in KVM
guests.

A fix to our selftest Makefiles to make them work since a recent change to the
shared Makefile logic.

Thanks to:
  Alexey Kardashevskiy, Breno Leitao, Christophe Leroy, Michael Bringmann,
  Michael Neuling, Nicholas Piggin, Paul Mackerras, Srikar Dronamraju, Thiago
  Jung Bauermann, Xin Long.

----------------------------------------------------------------
Alexey Kardashevskiy (1):
  powerpc/powernv/ioda2: Reduce upper limit for DMA window size (again)

Christophe Leroy (1):
  powerpc: fix csum_ipv6_magic() on little endian platforms

Michael Bringmann (1):
  powerpc/pseries: Fix unitialized timer reset on migration

Michael Ellerman (1):
  selftests/powerpc: Fix Makefiles for headers_install change

Michael Neuling (4):
  KVM: PPC: Book3S HV: Fix guest r11 corruption with POWER9 TM workarounds
  powerpc: Avoid code patching freed init sections
  powerpc/tm: Fix userspace r13 corruption
  powerpc/tm: Avoid possible userspace r1 corruption on reclaim

Srikar Dronamraju (1):
  powerpc/numa: Use associativity if VPHN hcall is successful

Thiago Jung Bauermann (1):
  powerpc/pkeys: Fix reading of ibm, processor-storage-keys property


 arch/powerpc/include/asm/setup.h |  1 +
 arch/powerpc/kernel/exceptions-64s.S |  4 ++--
 arch/powerpc/kernel/tm.S | 20 +---
 arch/powerpc/lib/checksum_64.S   |  3 +++
 arch/powerpc/lib/code-patching.c |  6 ++
 arch/powerpc/mm/mem.c|  2 ++
 arch/powerpc/mm/numa.c   |  7 +--
 arch/powerpc/mm/pkeys.c  |  2 +-
 arch/powerpc/platforms/powernv/pci-ioda-tce.c|  2 +-
 tools/testing/selftests/powerpc/alignment/Makefile   |  1 +
 tools/testing/selftests/powerpc/benchmarks/Makefile  |  1 +
 tools/testing/selftests/powerpc/cache_shape/Makefile |  1 +
 tools/testing/selftests/powerpc/copyloops/Makefile   |  1 +
 tools/testing/selftests/powerpc/dscr/Makefile|  1 +
 tools/testing/selftests/powerpc/math/Makefile|  1 +
 tools/testing/selftests/powerpc/mm/Makefile  |  1 +
 tools/testing/selftests/powerpc/pmu/Makefile |  1 +
 tools/testing/selftests/powerpc/pmu/ebb/Makefile |  1 +
 tools/testing/selftests/powerpc/primitives/Makefile  |  1 +
 tools/testing/selftests/powerpc/ptrace/Makefile  |  1 +
 tools/testing/selftests/powerpc/signal/Makefile  |  1 +
 tools/testing/selftests/powerpc/stringloops/Makefile |  1 +
 .../testing/selftests/powerpc/switch_endian/Makefile |  1 +
 tools/testing/selftests/powerpc/syscalls/Makefile|  1 +
 tools/testing/selftests/powerpc/tm/Makefile  |  1 +
 tools/testing/selftests/powerpc/vphn/Makefile|  1 +
 26 files changed, 55 insertions(+), 9 deletions(-)
mPhI37020GLBfi+ymUIoJ8vRwahQEcr2aH+uWndjV8+4FgNw3ygLbTGCAnsh4nHq

[PATCH v2 33/33] KVM: PPC: Book3S HV: Use hypercalls for TLB invalidation when nested

2018-09-28 Thread Paul Mackerras
This adds code to call the H_TLB_INVALIDATE hypercall when running as
a guest, in the cases where we need to invalidate TLBs (or other MMU
caches) as part of managing the mappings for a nested guest.  Calling
H_TLB_INVALIDATE is an alternative to doing the tlbie instruction and
having it be emulated by our hypervisor.
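(For reference: the first hypercall parameter encodes the same RIC/PRS/R
fields as the tlbie instruction itself; RIC=0 invalidates a single TLB
entry, RIC=1 the page-walk cache, and RIC=2 all translations for the LPID,
which is what the three call sites below use.)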

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  5 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c   | 30 --
 arch/powerpc/kvm/book3s_hv_nested.c  | 19 ---
 3 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 6066913..703924f 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_PPC_PSERIES
 static inline bool kvmhv_on_pseries(void)
@@ -121,6 +122,10 @@ struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, 
int l1_lpid,
 void kvmhv_put_nested(struct kvm_nested_guest *gp);
 int kvmhv_nested_next_lpid(struct kvm *kvm, int lpid);
 
+/* Encoding of first parameter for H_TLB_INVALIDATE */
+#define H_TLBIE_P1_ENC(ric, prs, r)(___PPC_RIC(ric) | ___PPC_PRS(prs) | \
+___PPC_R(r))
+
 /* Power architecture requires HPT is at least 256kiB, at most 64TiB */
 #define PPC_MIN_HPT_ORDER  18
 #define PPC_MAX_HPT_ORDER  46
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index b74abdd..6c93f5c 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -201,17 +201,43 @@ static void kvmppc_radix_tlbie_page(struct kvm *kvm, 
unsigned long addr,
unsigned int pshift, unsigned int lpid)
 {
unsigned long psize = PAGE_SIZE;
+   int psi;
+   long rc;
+   unsigned long rb;
 
if (pshift)
psize = 1UL << pshift;
+   else
+   pshift = PAGE_SHIFT;
 
addr &= ~(psize - 1);
-   radix__flush_tlb_lpid_page(lpid, addr, psize);
+
+   if (!kvmhv_on_pseries()) {
+   radix__flush_tlb_lpid_page(lpid, addr, psize);
+   return;
+   }
+
+   psi = shift_to_mmu_psize(pshift);
+   rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58));
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(0, 0, 1),
+   lpid, rb);
+   if (rc)
+   pr_err("KVM: TLB page invalidation hcall failed, rc=%ld\n", rc);
 }
 
 static void kvmppc_radix_flush_pwc(struct kvm *kvm, unsigned int lpid)
 {
-   radix__flush_pwc_lpid(lpid);
+   long rc;
+
+   if (!kvmhv_on_pseries()) {
+   radix__flush_pwc_lpid(lpid);
+   return;
+   }
+
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(1, 0, 1),
+   lpid, TLBIEL_INVAL_SET_LPID);
+   if (rc)
+   pr_err("KVM: TLB PWC invalidation hcall failed, rc=%ld\n", rc);
 }
 
 static unsigned long kvmppc_radix_update_pte(struct kvm *kvm, pte_t *ptep,
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 5879c8d..e35ee4f 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -366,15 +366,20 @@ void kvmhv_nested_exit(void)
 
 void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1)
 {
-   if (cpu_has_feature(CPU_FTR_HVMODE)) {
+   long rc;
+
+   if (!kvmhv_on_pseries()) {
mmu_partition_table_set_entry(lpid, dw0, dw1);
-   } else {
-   pseries_partition_tb[lpid].patb0 = cpu_to_be64(dw0);
-   pseries_partition_tb[lpid].patb1 = cpu_to_be64(dw1);
-   /* this will be emulated, L0 will do the necessary barriers */
-   asm volatile(PPC_TLBIE_5(%0, %1, 2, 0, 1) : :
-"r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
+   return;
}
+
+   pseries_partition_tb[lpid].patb0 = cpu_to_be64(dw0);
+   pseries_partition_tb[lpid].patb1 = cpu_to_be64(dw1);
+   /* L0 will do the necessary barriers */
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(2, 0, 1),
+   lpid, TLBIEL_INVAL_SET_LPID);
+   if (rc)
+   pr_err("KVM: TLB LPID invalidation hcall failed, rc=%ld\n", rc);
 }
 
 static void kvmhv_set_nested_ptbl(struct kvm_nested_guest *gp)
-- 
2.7.4



[PATCH v2 32/33] KVM: PPC: Book3S HV: Add nested shadow page tables to debugfs

2018-09-28 Thread Paul Mackerras
This adds a list of valid shadow PTEs for each nested guest to
the 'radix' file for the guest in debugfs.  This can be useful for
debugging.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c   | 39 +---
 arch/powerpc/kvm/book3s_hv_nested.c  | 15 
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 2273101..6066913 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -119,6 +119,7 @@ struct rmap_nested {
 struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int l1_lpid,
  bool create);
 void kvmhv_put_nested(struct kvm_nested_guest *gp);
+int kvmhv_nested_next_lpid(struct kvm *kvm, int lpid);
 
 /* Power architecture requires HPT is at least 256kiB, at most 64TiB */
 #define PPC_MIN_HPT_ORDER  18
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 4f0fae2..b74abdd 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -978,6 +978,7 @@ struct debugfs_radix_state {
struct kvm  *kvm;
struct mutexmutex;
unsigned long   gpa;
+   int lpid;
int chars_left;
int buf_index;
charbuf[128];
@@ -1019,6 +1020,7 @@ static ssize_t debugfs_radix_read(struct file *file, char 
__user *buf,
struct kvm *kvm;
unsigned long gpa;
pgd_t *pgt;
+   struct kvm_nested_guest *nested;
pgd_t pgd, *pgdp;
pud_t pud, *pudp;
pmd_t pmd, *pmdp;
@@ -1053,10 +1055,39 @@ static ssize_t debugfs_radix_read(struct file *file, 
char __user *buf,
}
 
gpa = p->gpa;
-   pgt = kvm->arch.pgtable;
-   while (len != 0 && gpa < RADIX_PGTABLE_RANGE) {
+   nested = NULL;
+   pgt = NULL;
+   while (len != 0 && p->lpid >= 0) {
+   if (gpa >= RADIX_PGTABLE_RANGE) {
+   gpa = 0;
+   pgt = NULL;
+   if (nested) {
+   kvmhv_put_nested(nested);
+   nested = NULL;
+   }
+   p->lpid = kvmhv_nested_next_lpid(kvm, p->lpid);
+   p->hdr = 0;
+   if (p->lpid < 0)
+   break;
+   }
+   if (!pgt) {
+   if (p->lpid == 0) {
+   pgt = kvm->arch.pgtable;
+   } else {
+   nested = kvmhv_get_nested(kvm, p->lpid, false);
+   if (!nested) {
+   gpa = RADIX_PGTABLE_RANGE;
+   continue;
+   }
+   pgt = nested->shadow_pgtable;
+   }
+   }
+   n = 0;
if (!p->hdr) {
-   n = scnprintf(p->buf, sizeof(p->buf),
+   if (p->lpid > 0)
+   n = scnprintf(p->buf, sizeof(p->buf),
+ "\nNested LPID %d: ", p->lpid);
+   n += scnprintf(p->buf + n, sizeof(p->buf) - n,
  "pgdir: %lx\n", (unsigned long)pgt);
p->hdr = 1;
goto copy;
@@ -1122,6 +1153,8 @@ static ssize_t debugfs_radix_read(struct file *file, char 
__user *buf,
}
}
p->gpa = gpa;
+   if (nested)
+   kvmhv_put_nested(nested);
 
  out:
	mutex_unlock(&p->mutex);
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 8199908..5879c8d 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -1296,3 +1296,18 @@ long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu)
	mutex_unlock(&gp->tlb_lock);
return ret;
 }
+
+int kvmhv_nested_next_lpid(struct kvm *kvm, int lpid)
+{
+   int ret = -1;
+
+   spin_lock(&kvm->mmu_lock);
+   while (++lpid <= kvm->arch.max_nested_lpid) {
+   if (kvm->arch.nested_guests[lpid]) {
+   ret = lpid;
+   break;
+   }
+   }
+   spin_unlock(&kvm->mmu_lock);
+   return ret;
+}
-- 
2.7.4



[PATCH v2 31/33] KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode

2018-09-28 Thread Paul Mackerras
With this, the KVM-HV module can be loaded in a guest running under
KVM-HV, and if the hypervisor supports nested virtualization, this
guest can now act as a nested hypervisor and run nested guests.

This also adds some checks to inform userspace that HPT guests are not
supported by nested hypervisors, and to prevent userspace from
configuring a guest to use HPT mode.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index dd7dafa..741631a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4234,6 +4234,10 @@ static int kvm_vm_ioctl_get_smmu_info_hv(struct kvm *kvm,
 {
struct kvm_ppc_one_seg_page_size *sps;
 
+   /* If we're a nested hypervisor, we only support radix guests */
+   if (kvmhv_on_pseries())
+   return -EINVAL;
+
/*
 * POWER7, POWER8 and POWER9 all support 32 storage keys for data.
 * POWER7 doesn't support keys for instruction accesses,
@@ -4819,11 +4823,15 @@ static int kvmppc_core_emulate_mfspr_hv(struct kvm_vcpu 
*vcpu, int sprn,
 
 static int kvmppc_core_check_processor_compat_hv(void)
 {
-   if (!cpu_has_feature(CPU_FTR_HVMODE) ||
-   !cpu_has_feature(CPU_FTR_ARCH_206))
-   return -EIO;
+   if (cpu_has_feature(CPU_FTR_HVMODE) &&
+   cpu_has_feature(CPU_FTR_ARCH_206))
+   return 0;
 
-   return 0;
+   /* Can run as nested hypervisor on POWER9 in radix mode. */
+   if (cpu_has_feature(CPU_FTR_ARCH_300) && radix_enabled())
+   return 0;
+
+   return -EIO;
 }
 
 #ifdef CONFIG_KVM_XICS
@@ -5141,6 +5149,10 @@ static int kvmhv_configure_mmu(struct kvm *kvm, struct 
kvm_ppc_mmuv3_cfg *cfg)
if (radix && !radix_enabled())
return -EINVAL;
 
+   /* If we're a nested hypervisor, we currently only support radix */
+   if (kvmhv_on_pseries() && !radix)
+   return -EINVAL;
+
	mutex_lock(&kvm->lock);
if (radix != kvm_is_radix(kvm)) {
if (kvm->arch.mmu_ready) {
-- 
2.7.4



[PATCH v2 30/33] KVM: PPC: Book3S HV: Handle differing endianness for H_ENTER_NESTED

2018-09-28 Thread Paul Mackerras
From: Suraj Jitindar Singh 

The hcall H_ENTER_NESTED takes as the two parameters the address in
L1 guest memory of a hv_regs struct and a pt_regs struct which the
L1 guest would like to use to run a L2 guest and in which are returned
the exit state of the L2 guest.  For efficiency, these are in the
endianness of the L1 guest, rather than being always big-endian as is
usually the case for PAPR hypercalls.

When reading/writing these structures, this patch handles the case
where the endianness of the L1 guest differs from that of the L0
hypervisor, by byteswapping the structures after reading and before
writing them back.

Since all the fields of the pt_regs are of the same type, i.e.,
unsigned long, we treat it as an array of unsigned longs.  The fields
of struct hv_guest_state are not all the same, so its fields are
byteswapped individually.

Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_nested.c | 51 -
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index c3fb171..8199908 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -51,6 +51,48 @@ void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct 
hv_guest_state *hr)
hr->ppr = vcpu->arch.ppr;
 }
 
+static void byteswap_pt_regs(struct pt_regs *regs)
+{
+   unsigned long *addr = (unsigned long *) regs;
+
+   for (; addr < ((unsigned long *) (regs + 1)); addr++)
+   *addr = swab64(*addr);
+}
+
+static void byteswap_hv_regs(struct hv_guest_state *hr)
+{
+   hr->version = swab64(hr->version);
+   hr->lpid = swab32(hr->lpid);
+   hr->vcpu_token = swab32(hr->vcpu_token);
+   hr->lpcr = swab64(hr->lpcr);
+   hr->pcr = swab64(hr->pcr);
+   hr->amor = swab64(hr->amor);
+   hr->dpdes = swab64(hr->dpdes);
+   hr->hfscr = swab64(hr->hfscr);
+   hr->tb_offset = swab64(hr->tb_offset);
+   hr->dawr0 = swab64(hr->dawr0);
+   hr->dawrx0 = swab64(hr->dawrx0);
+   hr->ciabr = swab64(hr->ciabr);
+   hr->hdec_expiry = swab64(hr->hdec_expiry);
+   hr->purr = swab64(hr->purr);
+   hr->spurr = swab64(hr->spurr);
+   hr->ic = swab64(hr->ic);
+   hr->vtb = swab64(hr->vtb);
+   hr->hdar = swab64(hr->hdar);
+   hr->hdsisr = swab64(hr->hdsisr);
+   hr->heir = swab64(hr->heir);
+   hr->asdr = swab64(hr->asdr);
+   hr->srr0 = swab64(hr->srr0);
+   hr->srr1 = swab64(hr->srr1);
+   hr->sprg[0] = swab64(hr->sprg[0]);
+   hr->sprg[1] = swab64(hr->sprg[1]);
+   hr->sprg[2] = swab64(hr->sprg[2]);
+   hr->sprg[3] = swab64(hr->sprg[3]);
+   hr->pidr = swab64(hr->pidr);
+   hr->cfar = swab64(hr->cfar);
+   hr->ppr = swab64(hr->ppr);
+}
+
 static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap,
 struct hv_guest_state *hr)
 {
@@ -175,6 +217,8 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
  sizeof(struct hv_guest_state));
if (err)
return H_PARAMETER;
+   if (kvmppc_need_byteswap(vcpu))
+   byteswap_hv_regs(&l2_hv);
if (l2_hv.version != HV_GUEST_STATE_VERSION)
return H_P2;
 
@@ -183,7 +227,8 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
  sizeof(struct pt_regs));
if (err)
return H_PARAMETER;
-
+   if (kvmppc_need_byteswap(vcpu))
+   byteswap_pt_regs(&l2_regs);
if (l2_hv.vcpu_token >= NR_CPUS)
return H_PARAMETER;
 
@@ -255,6 +300,10 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
kvmhv_put_nested(l2);
 
/* copy l2_hv_state and regs back to guest */
+   if (kvmppc_need_byteswap(vcpu)) {
+   byteswap_hv_regs(&l2_hv);
+   byteswap_pt_regs(&l2_regs);
+   }
	err = kvm_vcpu_write_guest(vcpu, hv_ptr, &l2_hv,
   sizeof(struct hv_guest_state));
if (err)
-- 
2.7.4



[PATCH v2 29/33] KVM: PPC: Book3S HV: Sanitise hv_regs on nested guest entry

2018-09-28 Thread Paul Mackerras
From: Suraj Jitindar Singh 

restore_hv_regs() is used to copy the hv_regs L1 wants to set to run the
nested (L2) guest into the vcpu structure. We need to sanitise these
values to ensure we don't let the L1 guest hypervisor do things we don't
want it to.

We don't let data address watchpoints or completed instruction address
breakpoints be set to match in hypervisor state.

We also don't let L1 enable features in the hypervisor facility status
and control register (HFSCR) for L2 which we have disabled for L1. That
is, L2 will get the intersection of the features which the L0 hypervisor has
enabled for L1 and the features L1 wants to enable for L2. This could
mean we give L1 a hypervisor facility unavailable interrupt for a
facility it thinks it has enabled, however it shouldn't have enabled a
facility it itself doesn't have for the L2 guest.

We sanitise the registers when copying in the L2 hv_regs. We don't need
to sanitise when copying back the L1 hv_regs since these shouldn't be
able to contain invalid values as they're just what was copied out.

Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/reg.h  |  1 +
 arch/powerpc/kvm/book3s_hv_nested.c | 17 +
 2 files changed, 18 insertions(+)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 9c42abf..47489f6 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -415,6 +415,7 @@
 #define   HFSCR_DSCR   __MASK(FSCR_DSCR_LG)
 #define   HFSCR_VECVSX __MASK(FSCR_VECVSX_LG)
 #define   HFSCR_FP __MASK(FSCR_FP_LG)
+#define   HFSCR_INTR_CAUSE (ASM_CONST(0xFF) << 56) /* interrupt cause */
 #define SPRN_TAR   0x32f   /* Target Address Register */
 #define SPRN_LPCR  0x13E   /* LPAR Control Register */
 #define   LPCR_VPM0ASM_CONST(0x8000)
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 2abe0cf..c3fb171 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -86,6 +86,22 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, int 
trap,
}
 }
 
+static void sanitise_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
+{
+   /*
+* Don't let L1 enable features for L2 which we've disabled for L1,
+* but preserve the interrupt cause field.
+*/
+   hr->hfscr &= (HFSCR_INTR_CAUSE | vcpu->arch.hfscr);
+
+   /* Don't let data address watchpoint match in hypervisor state */
+   hr->dawrx0 &= ~DAWRX_HYP;
+
+   /* Don't let completed instruction address breakpt match in HV state */
+   if ((hr->ciabr & CIABR_PRIV) == CIABR_PRIV_HYPER)
+   hr->ciabr &= ~CIABR_PRIV;
+}
+
 static void restore_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
@@ -198,6 +214,7 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
mask = LPCR_DPFD | LPCR_ILE | LPCR_TC | LPCR_AIL | LPCR_LD |
LPCR_LPES | LPCR_MER;
lpcr = (vc->lpcr & ~mask) | (l2_hv.lpcr & mask);
+   sanitise_hv_regs(vcpu, &l2_hv);
	restore_hv_regs(vcpu, &l2_hv);
 
vcpu->arch.ret = RESUME_GUEST;
-- 
2.7.4



[PATCH v2 28/33] KVM: PPC: Book3S HV: Add one-reg interface to virtual PTCR register

2018-09-28 Thread Paul Mackerras
This adds a one-reg register identifier which can be used to read and
set the virtual PTCR for the guest.  This register identifies the
address and size of the virtual partition table for the guest, which
contains information about the nested guests under this guest.

Migrating this value is the only extra requirement for migrating a
guest which has nested guests (assuming of course that the destination
host supports nested virtualization in the kvm-hv module).
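To illustrate, reading it from user space goes through the usual one-reg
ioctl; a hypothetical sketch (assumes an already-open vcpu fd, error
handling omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Read the virtual PTCR so it can be carried across migration. */
static uint64_t read_ptcr(int vcpu_fd)
{
	uint64_t val = 0;
	struct kvm_one_reg reg = {
		.id   = KVM_REG_PPC_PTCR,
		.addr = (uint64_t)(uintptr_t)&val,
	};

	ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
	return val;
}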

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 Documentation/virtual/kvm/api.txt   | 1 +
 arch/powerpc/include/uapi/asm/kvm.h | 1 +
 arch/powerpc/kvm/book3s_hv.c| 6 ++
 3 files changed, 8 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 647f941..2f5f9b7 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1922,6 +1922,7 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_TIDR  | 64
   PPC   | KVM_REG_PPC_PSSCR | 64
   PPC   | KVM_REG_PPC_DEC_EXPIRY| 64
+  PPC   | KVM_REG_PPC_PTCR  | 64
   PPC   | KVM_REG_PPC_TM_GPR0   | 64
   ...
   PPC   | KVM_REG_PPC_TM_GPR31  | 64
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 1b32b56..8c876c1 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -634,6 +634,7 @@ struct kvm_ppc_cpu_char {
 
 #define KVM_REG_PPC_DEC_EXPIRY (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xbe)
 #define KVM_REG_PPC_ONLINE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xbf)
+#define KVM_REG_PPC_PTCR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc0)
 
 /* Transactional Memory checkpointed state:
  * This is all GPRs, all VSX regs and a subset of SPRs
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index c99b5fb..dd7dafa 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1733,6 +1733,9 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
case KVM_REG_PPC_ONLINE:
*val = get_reg_val(id, vcpu->arch.online);
break;
+   case KVM_REG_PPC_PTCR:
+   *val = get_reg_val(id, vcpu->kvm->arch.l1_ptcr);
+   break;
default:
r = -EINVAL;
break;
@@ -1964,6 +1967,9 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
	atomic_dec(&vcpu->arch.vcore->online_count);
vcpu->arch.online = i;
break;
+   case KVM_REG_PPC_PTCR:
+   vcpu->kvm->arch.l1_ptcr = set_reg_val(id, *val);
+   break;
default:
r = -EINVAL;
break;
-- 
2.7.4



[PATCH v2 27/33] KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nested

2018-09-28 Thread Paul Mackerras
When running as a nested hypervisor, this avoids reading hypervisor
privileged registers (specifically HFSCR, LPIDR and LPCR) at startup;
instead reasonable default values are used.  This also avoids writing
LPIDR in the single-vcpu entry/exit path.

Also, this removes the check for CPU_FTR_HVMODE in kvmppc_mmu_hv_init()
since its only caller already checks this.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  7 +++
 arch/powerpc/kvm/book3s_hv.c| 33 +
 2 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 68e14af..c615617 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -268,14 +268,13 @@ int kvmppc_mmu_hv_init(void)
 {
unsigned long host_lpid, rsvd_lpid;
 
-   if (!cpu_has_feature(CPU_FTR_HVMODE))
-   return -EINVAL;
-
if (!mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE))
return -EINVAL;
 
/* POWER7 has 10-bit LPIDs (12-bit in POWER8) */
-   host_lpid = mfspr(SPRN_LPID);
+   host_lpid = 0;
+   if (cpu_has_feature(CPU_FTR_HVMODE))
+   host_lpid = mfspr(SPRN_LPID);
rsvd_lpid = LPID_RSVD;
 
kvmppc_init_lpid(rsvd_lpid + 1);
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 5d0c257..c99b5fb 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2197,15 +2197,18 @@ static struct kvm_vcpu 
*kvmppc_core_vcpu_create_hv(struct kvm *kvm,
 * Set the default HFSCR for the guest from the host value.
 * This value is only used on POWER9.
 * On POWER9, we want to virtualize the doorbell facility, so we
-* turn off the HFSCR bit, which causes those instructions to trap.
+* don't set the HFSCR_MSGP bit, and that causes those instructions
+* to trap and then we emulate them.
 */
-   vcpu->arch.hfscr = mfspr(SPRN_HFSCR);
-   if (cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
+   vcpu->arch.hfscr = HFSCR_TAR | HFSCR_EBB | HFSCR_PM | HFSCR_BHRB |
+   HFSCR_DSCR | HFSCR_VECVSX | HFSCR_FP;
+   if (cpu_has_feature(CPU_FTR_HVMODE)) {
+   vcpu->arch.hfscr &= mfspr(SPRN_HFSCR);
+   if (cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
+   vcpu->arch.hfscr |= HFSCR_TM;
+   }
+   if (cpu_has_feature(CPU_FTR_TM_COMP))
vcpu->arch.hfscr |= HFSCR_TM;
-   else if (!cpu_has_feature(CPU_FTR_TM_COMP))
-   vcpu->arch.hfscr &= ~HFSCR_TM;
-   if (cpu_has_feature(CPU_FTR_ARCH_300))
-   vcpu->arch.hfscr &= ~HFSCR_MSGP;
 
kvmppc_mmu_book3s_hv_init(vcpu);
 
@@ -4021,8 +4024,10 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run,
 
srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
 
-   mtspr(SPRN_LPID, vc->kvm->arch.host_lpid);
-   isync();
+   if (cpu_has_feature(CPU_FTR_HVMODE)) {
+   mtspr(SPRN_LPID, vc->kvm->arch.host_lpid);
+   isync();
+   }
 
trace_hardirqs_off();
set_irq_happened(trap);
@@ -4642,9 +4647,13 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
kvm->arch.host_sdr1 = mfspr(SPRN_SDR1);
 
/* Init LPCR for virtual RMA mode */
-   kvm->arch.host_lpid = mfspr(SPRN_LPID);
-   kvm->arch.host_lpcr = lpcr = mfspr(SPRN_LPCR);
-   lpcr &= LPCR_PECE | LPCR_LPES;
+   if (cpu_has_feature(CPU_FTR_HVMODE)) {
+   kvm->arch.host_lpid = mfspr(SPRN_LPID);
+   kvm->arch.host_lpcr = lpcr = mfspr(SPRN_LPCR);
+   lpcr &= LPCR_PECE | LPCR_LPES;
+   } else {
+   lpcr = 0;
+   }
lpcr |= (4UL << LPCR_DPFD_SH) | LPCR_HDICE |
LPCR_VPM0 | LPCR_VPM1;
kvm->arch.vrma_slb_v = SLB_VSID_B_1T |
-- 
2.7.4



[PATCH v2 26/33] KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu

2018-09-28 Thread Paul Mackerras
From: Suraj Jitindar Singh 

This is only done at level 0, since only level 0 knows which physical
CPU a vcpu is running on.  This does for nested guests what L0 already
did for its own guests, which is to flush the TLB on a pCPU when it
goes to run a vCPU there, and there is another vCPU in the same VM
which previously ran on this pCPU and has now started to run on another
pCPU.  This is to handle the situation where the other vCPU touched
a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
on that new pCPU and thus left behind a stale TLB entry on this pCPU.

Reviewed-by: David Gibson 
Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  3 +
 arch/powerpc/kvm/book3s_hv.c | 98 +++-
 arch/powerpc/kvm/book3s_hv_nested.c  |  5 ++
 3 files changed, 68 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 38614f0..2273101 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -51,6 +51,9 @@ struct kvm_nested_guest {
long refcnt;/* number of pointers to this struct */
struct mutex tlb_lock;  /* serialize page faults and tlbies */
struct kvm_nested_guest *next;
+   cpumask_t need_tlb_flush;
+   cpumask_t cpu_in_guest;
+   int prev_cpu[NR_CPUS];
 };
 
 /*
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 910484b..5d0c257 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2420,10 +2420,18 @@ static void kvmppc_release_hwthread(int cpu)
 
 static void radix_flush_cpu(struct kvm *kvm, int cpu, struct kvm_vcpu *vcpu)
 {
+   struct kvm_nested_guest *nested = vcpu->arch.nested;
+   cpumask_t *cpu_in_guest;
int i;
 
cpu = cpu_first_thread_sibling(cpu);
-   cpumask_set_cpu(cpu, &kvm->arch.need_tlb_flush);
+   if (nested) {
+   cpumask_set_cpu(cpu, &nested->need_tlb_flush);
+   cpu_in_guest = &nested->cpu_in_guest;
+   } else {
+   cpumask_set_cpu(cpu, &kvm->arch.need_tlb_flush);
+   cpu_in_guest = &kvm->arch.cpu_in_guest;
+   }
/*
 * Make sure setting of bit in need_tlb_flush precedes
 * testing of cpu_in_guest bits.  The matching barrier on
@@ -2431,13 +2439,23 @@ static void radix_flush_cpu(struct kvm *kvm, int cpu, 
struct kvm_vcpu *vcpu)
 */
smp_mb();
for (i = 0; i < threads_per_core; ++i)
-   if (cpumask_test_cpu(cpu + i, &kvm->arch.cpu_in_guest))
+   if (cpumask_test_cpu(cpu + i, cpu_in_guest))
smp_call_function_single(cpu + i, do_nothing, NULL, 1);
 }
 
 static void kvmppc_prepare_radix_vcpu(struct kvm_vcpu *vcpu, int pcpu)
 {
+   struct kvm_nested_guest *nested = vcpu->arch.nested;
struct kvm *kvm = vcpu->kvm;
+   int *prev_cpu;
+
+   if (!cpu_has_feature(CPU_FTR_HVMODE))
+   return;
+
+   if (nested)
+   prev_cpu = &nested->prev_cpu[vcpu->arch.nested_vcpu_id];
+   else
+   prev_cpu = &vcpu->arch.prev_cpu;
 
/*
 * With radix, the guest can do TLB invalidations itself,
@@ -2451,12 +2469,43 @@ static void kvmppc_prepare_radix_vcpu(struct kvm_vcpu 
*vcpu, int pcpu)
 * ran to flush the TLB.  The TLB is shared between threads,
 * so we use a single bit in .need_tlb_flush for all 4 threads.
 */
-   if (vcpu->arch.prev_cpu != pcpu) {
-   if (vcpu->arch.prev_cpu >= 0 &&
-   cpu_first_thread_sibling(vcpu->arch.prev_cpu) !=
+   if (*prev_cpu != pcpu) {
+   if (*prev_cpu >= 0 &&
+   cpu_first_thread_sibling(*prev_cpu) !=
cpu_first_thread_sibling(pcpu))
-   radix_flush_cpu(kvm, vcpu->arch.prev_cpu, vcpu);
-   vcpu->arch.prev_cpu = pcpu;
+   radix_flush_cpu(kvm, *prev_cpu, vcpu);
+   *prev_cpu = pcpu;
+   }
+}
+
+static void kvmppc_radix_check_need_tlb_flush(struct kvm *kvm, int pcpu,
+ struct kvm_nested_guest *nested)
+{
+   cpumask_t *need_tlb_flush;
+   int lpid;
+
+   if (!cpu_has_feature(CPU_FTR_HVMODE))
+   return;
+
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   pcpu &= ~0x3UL;
+
+   if (nested) {
+   lpid = nested->shadow_lpid;
+   need_tlb_flush = &nested->need_tlb_flush;
+   } else {
+   lpid = kvm->arch.lpid;
+   need_tlb_flush = &kvm->arch.need_tlb_flush;
+   }
+
+   mtspr(SPRN_LPID, lpid);
+   isync();
+   smp_mb();
+
+   if (cpumask_test_cpu(pcpu, need_tlb_flush)) {
+   radix__local_flush_tlb_lpid_guest(lpid);
+   /* Clear the bit after the TLB flush */
+   

[PATCH v2 25/33] KVM: PPC: Book3S HV: Emulate Privileged TLBIE for guest hypervisors

2018-09-28 Thread Paul Mackerras
From: Suraj Jitindar Singh 

When running a nested (L2) guest the guest (L1) hypervisor will use
hypervisor privileged tlb invalidation instructions (to manage the
partition scoped page tables) which will result in hypervisor
emulation assistance interrupts. We emulate these instructions on behalf
of the L1 guest.

The tlbie instruction can invalidate different scopes:

Invalidate TLB for a given target address:
- This invalidates a single L2 -> L1 pte
- We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2
  address space which is being invalidated. This is because a single
  L2 -> L1 pte may have been mapped with more than one pte in the
  L2 -> L0 page tables.

Invalidate the entire TLB for a given LPID or for all LPIDs:
- Invalidate the entire shadow_pgtable for a given nested guest, or
  for all nested guests.

Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs:
- We don't cache the PWC, so nothing to do

Invalidate the entire TLB, PWC and partition table for a given/all LPIDs:
- Here we re-read the partition table entry and remove the nested state
  for any nested guest for which the first doubleword of the partition
  table entry is now zero.

This also implements the H_TLB_INVALIDATE hcall.  It takes as parameters
the tlbie instruction word (of which the RIC, PRS and R fields are used),
the rS value (giving the lpid, where required) and the rB value (giving
the IS, AP and EPN values).
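
Decoding those parameters is plain mask-and-shift work.  As a
standalone sketch (kernel-style helpers mirroring the ones the patch
adds in book3s_hv_nested.c; field positions follow the ISA encoding of
tlbie):

static inline int get_ric(unsigned int instr)
{
        return (instr >> 18) & 0x3;     /* RIC: 0=TLB, 1=PWC, 2=all */
}

static inline int get_prs(unsigned int instr)
{
        return (instr >> 17) & 0x1;     /* PRS: process-scoped */
}

static inline int get_r(unsigned int instr)
{
        return (instr >> 16) & 0x1;     /* R: radix */
}

static inline int get_lpid(unsigned long rs_val)        /* from rS */
{
        return rs_val & 0xffffffff;
}

static inline int get_is(unsigned long rb_val)          /* from rB */
{
        return (rb_val >> 10) & 0x3;
}

static inline long get_epn(unsigned long rb_val)        /* from rB */
{
        return rb_val >> 12;
}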

[pau...@ozlabs.org - adapted to having the partition table in guest
memory, added the H_TLB_INVALIDATE implementation.]

Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  12 ++
 arch/powerpc/include/asm/kvm_book3s.h |   1 +
 arch/powerpc/include/asm/ppc-opcode.h |   1 +
 arch/powerpc/kvm/book3s_emulate.c |   1 -
 arch/powerpc/kvm/book3s_hv.c  |   3 +
 arch/powerpc/kvm/book3s_hv_nested.c   | 210 +-
 6 files changed, 225 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index b3520b5..66db23e 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -203,6 +203,18 @@ static inline unsigned int mmu_psize_to_shift(unsigned int 
mmu_psize)
BUG();
 }
 
+static inline unsigned int ap_to_shift(unsigned long ap)
+{
+   int psize;
+
+   for (psize = 0; psize < MMU_PAGE_COUNT; psize++) {
+   if (mmu_psize_defs[psize].ap == ap)
+   return mmu_psize_defs[psize].shift;
+   }
+
+   return -1;
+}
+
 static inline unsigned long get_sllp_encoding(int psize)
 {
unsigned long sllp;
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 00688cd..c94ef3b 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -301,6 +301,7 @@ long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
 void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
+long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu);
 int kvmhv_run_single_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu,
  u64 time_limit, unsigned long lpcr);
 void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 665af14..6093bc8 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -104,6 +104,7 @@
 #define OP_31_XOP_LHZUX 311
 #define OP_31_XOP_MSGSNDP   142
 #define OP_31_XOP_MSGCLRP   174
+#define OP_31_XOP_TLBIE 306
 #define OP_31_XOP_MFSPR 339
 #define OP_31_XOP_LWAX  341
 #define OP_31_XOP_LHAX  343
diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index 2654df2..8c7e933 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -36,7 +36,6 @@
 #define OP_31_XOP_MTSR 210
 #define OP_31_XOP_MTSRIN   242
 #define OP_31_XOP_TLBIEL   274
-#define OP_31_XOP_TLBIE306
 /* Opcode is officially reserved, reuse it as sc 1 when sc 1 doesn't trap */
 #define OP_31_XOP_FAKE_SC1 308
 #define OP_31_XOP_SLBMTE   402
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 27b07cb..910484b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -974,6 +974,9 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
break;
case H_TLB_INVALIDATE:
ret = H_FUNCTION;
+   if (!vcpu->kvm->arch.nested_enable)
+   break;
+   ret = kvmhv_do_nested_tlbie(vcpu);
break;
 
default:
diff 

[PATCH v2 24/33] KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings

2018-09-28 Thread Paul Mackerras
From: Suraj Jitindar Singh 

When a host (L0) page which is mapped into a (L1) guest is in turn
mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
so that these mappings can be retrieved later.

Whenever we create an entry in a shadow_pgtable for a nested guest we
create a corresponding rmap entry and add it to the list for the
L1 guest memslot at the index of the L1 guest page it maps. This means
at the L1 guest memslot we end up with lists of rmaps.

When we are notified of a host page being invalidated which has been
mapped through to a (L1) guest, we can then walk the rmap list for that
guest page, and find and invalidate all of the corresponding
shadow_pgtable entries.

In order to reduce memory consumption, we compress the information for
each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
for the guest real page frame number -- which will fit in a single
unsigned long.  To avoid a scenario where a guest can trigger
unbounded memory allocations, we scan the list when adding an entry to
see if there is already an entry with the contents we need.  Such a
duplicate can exist because we don't ever remove entries from the
middle of a list.
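
Packing and unpacking an entry under this layout is then simple
shift-and-mask work; a minimal sketch (the helper names here are
illustrative, the real mask definitions are in the kvm_book3s_64.h
hunk below):

static inline u64 nested_rmap_encode(unsigned int lpid, unsigned long gpa)
{
        return ((u64)lpid << 52) | (gpa & 0x000FFFFFFFFFF000UL);
}

static inline unsigned int nested_rmap_lpid(u64 rmap)
{
        return (rmap >> 52) & 0xfff;
}

static inline unsigned long nested_rmap_gpa(u64 rmap)
{
        return rmap & 0x000FFFFFFFFFF000UL;
}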

A struct nested guest rmap is a list pointer and an rmap entry;

        ----------------
        | next pointer |
        ----------------
        | rmap entry   |
        ----------------

Thus the rmap pointer for each guest frame number in the memslot can be
either NULL, a single entry, or a pointer to a list of nested rmap entries.

gfn      memslot rmap array
        -------------------------
 0      | NULL                  |   (no rmap entry)
        -------------------------
 1      | single rmap entry     |   (rmap entry with low bit set)
        -------------------------
 2      | list head pointer     |   (list of rmap entries)
        -------------------------

The final entry always has the lowest bit set and is stored in the next
pointer of the last list entry, or as a single rmap entry.
With a list of rmap entries looking like;

 -----------------      ----------------       ---------------------
 | list head ptr | ---> | next pointer | --->  | single rmap entry |
 -----------------      ----------------       ---------------------
                        | rmap entry   |       | rmap entry        |
                        ----------------       ---------------------

Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s.h|   3 +
 arch/powerpc/include/asm/kvm_book3s_64.h |  70 -
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  44 +++
 arch/powerpc/kvm/book3s_hv.c |   1 +
 arch/powerpc/kvm/book3s_hv_nested.c  | 130 ++-
 5 files changed, 233 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 045ab15..00688cd 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -196,6 +196,9 @@ extern int kvmppc_mmu_radix_translate_table(struct kvm_vcpu 
*vcpu, gva_t eaddr,
int table_index, u64 *pte_ret_p);
 extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
struct kvmppc_pte *gpte, bool data, bool iswrite);
+extern void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, unsigned long gpa,
+   unsigned int shift, struct kvm_memory_slot *memslot,
+   unsigned int lpid);
 extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
bool writing, unsigned long gpa,
unsigned int lpid);
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 5496152..38614f0 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -53,6 +53,66 @@ struct kvm_nested_guest {
struct kvm_nested_guest *next;
 };
 
+/*
+ * We define a nested rmap entry as a single 64-bit quantity
+ * 0xFFF0000000000000  12-bit lpid field
+ * 0x000FFFFFFFFFF000  40-bit guest physical address field
+ * 0x0000000000000001  1-bit  single entry flag
+ */
+#define RMAP_NESTED_LPID_MASK  0xFFF0000000000000UL
+#define RMAP_NESTED_LPID_SHIFT (52)
+#define RMAP_NESTED_GPA_MASK   0x000FFFFFFFFFF000UL
+#define RMAP_NESTED_IS_SINGLE_ENTRY    0x0000000000000001UL
+
+/* Structure for a nested guest rmap entry */
+struct rmap_nested {
+   struct llist_node list;
+   u64 rmap;
+};
+
+/*
+ * for_each_nest_rmap_safe - iterate over the list of nested rmap entries
+ *  safe against removal of the list entry or NULL list
+ * @pos:   a (struct rmap_nested *) to use as a loop cursor
+ * @node:  pointer to the first entry
+ * NOTE: this can be NULL
+ * @rmapp: an (unsigned long *) in which to return the rmap entries on 

[PATCH v2 23/33] KVM: PPC: Book3S HV: Handle page fault for a nested guest

2018-09-28 Thread Paul Mackerras
From: Suraj Jitindar Singh 

Consider a normal (L1) guest running under the main hypervisor (L0),
and then a nested guest (L2) running under the L1 guest which is acting
as a nested hypervisor. L0 has page tables to map the address space for
L1 providing the translation from L1 real address -> L0 real address;

L1
|
| (L1 -> L0)
|
----> L0

There are also page tables in L1 used to map the address space for L2
providing the translation from L2 real address -> L1 real address. Since
the hardware can only walk a single level of page table, we need to
maintain in L0 a "shadow_pgtable" for L2 which provides the translation
from L2 real address -> L0 real address. Which looks like;

L2                        L2
 |                         |
 | (L2 -> L1)              |
 |                         |
 ----> L1                  | (L2 -> L0)
       |                   |
       | (L1 -> L0)        |
       |                   |
       ----> L0            --------> L0

When a page fault occurs while running a nested (L2) guest we need to
insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
do this we need to:

1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
   provide a page fault to L1 if this mapping doesn't exist.
2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
   or try to insert a pte for that mapping if it doesn't exist.
3. Now we have an L2 -> L0 mapping; insert this into our shadow_pgtable
   (sketched below)
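
In outline, the fault path reads like this (a sketch only: every helper
named here is a hypothetical stand-in for the real routines in the diff
below):

static long nested_fault_sketch(struct kvm_vcpu *vcpu,
                                unsigned long l2_gpa, bool writing)
{
        unsigned long l1_gpa;
        unsigned int level;
        pte_t pte;

        /* 1. Walk the L2 -> L1 tables, which live in L1 memory */
        if (walk_l2_to_l1(vcpu, l2_gpa, &l1_gpa, writing))
                return reflect_fault_to_l1(vcpu);

        /* 2. Translate L1 -> L0, inserting an L1 -> L0 pte if needed */
        if (instantiate_l1_page(vcpu, l1_gpa, writing, &pte, &level))
                return RESUME_GUEST;

        /* 3. Insert the combined L2 -> L0 pte into the shadow_pgtable */
        return insert_shadow_pte(vcpu, l2_gpa, pte, level);
}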

Once this mapping exists we can take rc faults when hardware is unable
to automatically set the reference and change bits in the pte. On these
we need to:

1. Check that the rc bits on the L2 -> L1 pte permit the access, and
   otherwise reflect the fault down to L1.
2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
   host page.
3. Set the rc bits in the L2 -> L0 pte, as in the sketch below.
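
And correspondingly for the rc path (same caveat: the helpers are
hypothetical stand-ins for the real routines in the diff):

static long nested_rc_fault_sketch(struct kvm_vcpu *vcpu,
                                   unsigned long l2_gpa, bool writing)
{
        /* 1. The rc bits on the L2 -> L1 pte must permit the access */
        if (!l2_pte_permits_rc(vcpu, l2_gpa, writing))
                return reflect_fault_to_l1(vcpu);

        /* 2. Set R (and C if writing) in the L1 -> L0 pte ... */
        set_rc_l1_to_l0(vcpu, l2_gpa, writing);

        /* 3. ... and in the L2 -> L0 shadow pte */
        set_rc_shadow(vcpu, l2_gpa, writing);
        return RESUME_GUEST;
}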

Since we reuse a large number of functions in book3s_64_mmu_radix.c for
this, we also needed to refactor a number of them to take an lpid
parameter so that the correct lpid is used for tlb invalidations.
The functionality, however, remains the same.

Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 .../powerpc/include/asm/book3s/64/tlbflush-radix.h |   1 +
 arch/powerpc/include/asm/kvm_book3s.h  |  19 ++
 arch/powerpc/include/asm/kvm_book3s_64.h   |   4 +
 arch/powerpc/include/asm/kvm_host.h|   2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 196 +++-
 arch/powerpc/kvm/book3s_hv_nested.c| 333 -
 arch/powerpc/mm/tlb-radix.c|   9 +
 7 files changed, 477 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 1154a6d..671316f 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -53,6 +53,7 @@ extern void radix__flush_tlb_lpid_page(unsigned int lpid,
unsigned long addr,
unsigned long page_size);
 extern void radix__flush_pwc_lpid(unsigned int lpid);
+extern void radix__flush_tlb_lpid(unsigned int lpid);
 extern void radix__local_flush_tlb_lpid(unsigned int lpid);
 extern void radix__local_flush_tlb_lpid_guest(unsigned int lpid);
 
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 80b43ac..045ab15 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,17 +188,34 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm 
*kvm, unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
+extern int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
+ struct kvmppc_pte *gpte, u64 root,
+ u64 *pte_ret_p);
 extern int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
struct kvmppc_pte *gpte, u64 table,
int table_index, u64 *pte_ret_p);
 extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
struct kvmppc_pte *gpte, bool data, bool iswrite);
+extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
+   bool writing, unsigned long gpa,
+   unsigned int lpid);
+extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
+   unsigned long gpa,
+   struct kvm_memory_slot *memslot,
+   bool 

[PATCH v2 22/33] KVM: PPC: Book3S HV: Framework to handle HV Emulation Assist Interrupt

2018-09-28 Thread Paul Mackerras
From: Suraj Jitindar Singh 

An HEAI (hypervisor emulation assistance interrupt) occurs when a
hypervisor resource or instruction is used in a privileged but
non-hypervisor state and the LPCR_EVIRT bit is set in LPCR.  When
this occurs bit 45 is set in HSRR1.  Detect the occurrence of this,
and if userspace has enabled the nested virtualization capability
on the VM, then call the code to handle it accordingly.
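
For reference, bit 45 here is in IBM (MSB-first) numbering, so the
check is against mask 1UL << (63 - 45) == 0x40000, the same value as
SRR1_PROGPRIV, which is what the diff below tests.  A small sketch:

/* Detect the "hypervisor-privileged in privileged state" HEAI case. */
static inline int hsrr1_is_hv_priv_emul(unsigned long hsrr1)
{
        return (hsrr1 & (1UL << (63 - 45))) != 0;       /* SRR1_PROGPRIV */
}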

With LPCR[EVIRT] set, we also get HEAI (without bit 45 set) for
mfspr or mtspr to unimplemented SPR numbers.  For these accesses,
we emulate the EVIRT=0 behaviour, which is to make the access
a no-op for privileged software unless it is accessing SPR 0,
4, 5 or 6.  Problem-state accesses and accesses to SPR 0, 4, 5
or 6 generate an illegal-instruction type program interrupt.

Reviewed-by: David Gibson 
Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s.h |  2 +
 arch/powerpc/kvm/book3s_hv.c  | 87 ++-
 arch/powerpc/kvm/book3s_hv_nested.c   | 55 ++
 3 files changed, 112 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 2dd996c..80b43ac 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -287,6 +287,8 @@ void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct 
hv_guest_state *hr);
 void kvmhv_restore_hv_return_state(struct kvm_vcpu *vcpu,
   struct hv_guest_state *hr);
 long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu);
+int kvmhv_emulate_priv(struct kvm_run *run, struct kvm_vcpu *vcpu,
+   unsigned int instr);
 
 void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 84d08d5..b705668 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1027,30 +1027,6 @@ static int kvmppc_hcall_impl_hv(unsigned long cmd)
return kvmppc_hcall_impl_hv_realmode(cmd);
 }
 
-static int kvmppc_emulate_debug_inst(struct kvm_run *run,
-   struct kvm_vcpu *vcpu)
-{
-   u32 last_inst;
-
-   if (kvmppc_get_last_inst(vcpu, INST_GENERIC, &last_inst) !=
-   EMULATE_DONE) {
-   /*
-* Fetch failed, so return to guest and
-* try executing it again.
-*/
-   return RESUME_GUEST;
-   }
-
-   if (last_inst == KVMPPC_INST_SW_BREAKPOINT) {
-   run->exit_reason = KVM_EXIT_DEBUG;
-   run->debug.arch.address = kvmppc_get_pc(vcpu);
-   return RESUME_HOST;
-   } else {
-   kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
-   return RESUME_GUEST;
-   }
-}
-
 static void do_nothing(void *x)
 {
 }
@@ -1144,6 +1120,23 @@ static int kvmppc_emulate_doorbell_instr(struct kvm_vcpu 
*vcpu)
return RESUME_GUEST;
 }
 
+static int kvmhv_emulate_unknown_spr(struct kvm_vcpu *vcpu, u32 instr)
+{
+   u32 spr = get_sprn(instr);
+
+   /*
+* In privileged state, access to unimplemented SPRs is a no-op
+* except for SPR 0, 4, 5 and 6.  All other accesses get turned
+* into illegal-instruction program interrupts.
+*/
+   if ((vcpu->arch.shregs.msr & MSR_PR) ||
+   spr == 0 || (4 <= spr && spr <= 6))
+   return EMULATE_FAIL;
+
+   kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
+   return RESUME_GUEST;
+}
+
 static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
 struct task_struct *tsk)
 {
@@ -1260,19 +1253,49 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
 * to the guest. If guest debug is enabled, we need to check
 * whether the instruction is a software breakpoint instruction.
 * Accordingly return to Guest or Host.
+* With LPCR[EVIRT] set, we also get these for accesses to
+* unknown SPRs and for guests executing hypervisor privileged
+* instructions.
 */
case BOOK3S_INTERRUPT_H_EMUL_ASSIST:
-   if (vcpu->arch.emul_inst != KVM_INST_FETCH_FAILED)
-   vcpu->arch.last_inst = kvmppc_need_byteswap(vcpu) ?
-   swab32(vcpu->arch.emul_inst) :
-   vcpu->arch.emul_inst;
-   if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP) {
-   r = kvmppc_emulate_debug_inst(run, vcpu);
+   {
+   u32 instr = vcpu->arch.emul_inst;
+   unsigned long srr1_bit = SRR1_PROGILL;
+
+   vcpu->arch.last_inst = kvmppc_need_byteswap(vcpu) ?
+   swab32(instr) : instr;
+
+   r = EMULATE_FAIL;
+   if (vcpu->arch.shregs.msr & SRR1_PROGPRIV) {
+   

[PATCH v2 21/33] KVM: PPC: Book3S HV: Handle hypercalls correctly when nested

2018-09-28 Thread Paul Mackerras
When we are running as a nested hypervisor, we use a hypercall to
enter the guest rather than code in book3s_hv_rmhandlers.S.  This means
that the hypercall handlers listed in hcall_real_table never get called.
There are some hypercalls that are handled there and not in
kvmppc_pseries_do_hcall(), which therefore won't get processed for
a nested guest.

To fix this, we add cases to kvmppc_pseries_do_hcall() to handle those
hypercalls, with the following exceptions:

- The HPT hypercalls (H_ENTER, H_REMOVE, etc.) are not handled because
  we only support radix mode for nested guests.

- H_CEDE has to be handled specially because the cede logic in
  kvmhv_run_single_vcpu assumes that it has been processed by the time
  that kvmhv_p9_guest_entry() returns.  Therefore we put a special
  case for H_CEDE in kvmhv_p9_guest_entry().

For the XICS hypercalls, if real-mode processing is enabled, then the
virtual-mode handlers assume that they are being called only to finish
up the operation.  Therefore we turn off the real-mode flag in the XICS
code when running as a nested hypervisor.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/asm-prototypes.h |  4 +++
 arch/powerpc/kvm/book3s_hv.c  | 43 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  2 ++
 arch/powerpc/kvm/book3s_xics.c|  3 ++-
 4 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 5c9b00c..c55ba3b 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -167,4 +167,8 @@ void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu);
 
 int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);
 
+long kvmppc_h_set_dabr(struct kvm_vcpu *vcpu, unsigned long dabr);
+long kvmppc_h_set_xdabr(struct kvm_vcpu *vcpu, unsigned long dabr,
+   unsigned long dabrx);
+
 #endif /* _ASM_POWERPC_ASM_PROTOTYPES_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8d2f91f..84d08d5 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include <asm/archrandom.h>
 #include 
 #include 
 #include 
@@ -915,6 +916,19 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
break;
}
return RESUME_HOST;
+   case H_SET_DABR:
+   ret = kvmppc_h_set_dabr(vcpu, kvmppc_get_gpr(vcpu, 4));
+   break;
+   case H_SET_XDABR:
+   ret = kvmppc_h_set_xdabr(vcpu, kvmppc_get_gpr(vcpu, 4),
+   kvmppc_get_gpr(vcpu, 5));
+   break;
+   case H_GET_TCE:
+   ret = kvmppc_h_get_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+   kvmppc_get_gpr(vcpu, 5));
+   if (ret == H_TOO_HARD)
+   return RESUME_HOST;
+   break;
case H_PUT_TCE:
ret = kvmppc_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
kvmppc_get_gpr(vcpu, 5),
@@ -938,6 +952,10 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
if (ret == H_TOO_HARD)
return RESUME_HOST;
break;
+   case H_RANDOM:
+   if (!powernv_get_random_long(&vcpu->arch.regs.gpr[4]))
+   ret = H_HARDWARE;
+   break;
 
case H_SET_PARTITION_TABLE:
ret = H_FUNCTION;
@@ -966,6 +984,24 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
return RESUME_GUEST;
 }
 
+/*
+ * Handle H_CEDE in the nested virtualization case where we haven't
+ * called the real-mode hcall handlers in book3s_hv_rmhandlers.S.
+ * This has to be done early, not in kvmppc_pseries_do_hcall(), so
+ * that the cede logic in kvmppc_run_single_vcpu() works properly.
+ */
+static void kvmppc_nested_cede(struct kvm_vcpu *vcpu)
+{
+   vcpu->arch.shregs.msr |= MSR_EE;
+   vcpu->arch.ceded = 1;
+   smp_mb();
+   if (vcpu->arch.prodded) {
+   vcpu->arch.prodded = 0;
+   smp_mb();
+   vcpu->arch.ceded = 0;
+   }
+}
+
 static int kvmppc_hcall_impl_hv(unsigned long cmd)
 {
switch (cmd) {
@@ -3422,6 +3458,13 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 
time_limit,
vcpu->arch.shregs.msr = vcpu->arch.regs.msr;
vcpu->arch.shregs.dar = mfspr(SPRN_DAR);
vcpu->arch.shregs.dsisr = mfspr(SPRN_DSISR);
+
+   /* H_CEDE has to be handled now, not later */
+   if (trap == BOOK3S_INTERRUPT_SYSCALL && !vcpu->arch.nested &&
+   kvmppc_get_gpr(vcpu, 3) == H_CEDE) {
+   kvmppc_nested_cede(vcpu);
+   trap = 0;
+   }
} else {
trap = 

[PATCH v2 20/33] KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested hypervisor

2018-09-28 Thread Paul Mackerras
This adds code to call the H_IPI and H_EOI hypercalls when we are
running as a nested hypervisor (i.e. without the CPU_FTR_HVMODE cpu
feature) and we would otherwise access the XICS interrupt controller
directly or via an OPAL call.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv.c |  7 +-
 arch/powerpc/kvm/book3s_hv_builtin.c | 44 +---
 arch/powerpc/kvm/book3s_hv_rm_xics.c |  8 +++
 3 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 60adf47..8d2f91f 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -173,6 +173,10 @@ static bool kvmppc_ipi_thread(int cpu)
 {
unsigned long msg = PPC_DBELL_TYPE(PPC_DBELL_SERVER);
 
+   /* If we're a nested hypervisor, fall back to ordinary IPIs for now */
+   if (kvmhv_on_pseries())
+   return false;
+
/* On POWER9 we can use msgsnd to IPI any cpu */
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
msg |= get_hard_smp_processor_id(cpu);
@@ -5164,7 +5168,8 @@ static int kvmppc_book3s_init_hv(void)
 * indirectly, via OPAL.
 */
 #ifdef CONFIG_SMP
-   if (!xive_enabled() && !local_paca->kvm_hstate.xics_phys) {
+   if (!xive_enabled() && !kvmhv_on_pseries() &&
+   !local_paca->kvm_hstate.xics_phys) {
struct device_node *np;
 
np = of_find_compatible_node(NULL, NULL, "ibm,opal-intc");
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index ccfea5b..a71e2fc 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -231,6 +231,15 @@ void kvmhv_rm_send_ipi(int cpu)
void __iomem *xics_phys;
unsigned long msg = PPC_DBELL_TYPE(PPC_DBELL_SERVER);
 
+   /* For a nested hypervisor, use the XICS via hcall */
+   if (kvmhv_on_pseries()) {
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+   plpar_hcall_raw(H_IPI, retbuf, get_hard_smp_processor_id(cpu),
+   IPI_PRIORITY);
+   return;
+   }
+
/* On POWER9 we can use msgsnd for any destination cpu. */
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
msg |= get_hard_smp_processor_id(cpu);
@@ -460,12 +469,19 @@ static long kvmppc_read_one_intr(bool *again)
return 1;
 
/* Now read the interrupt from the ICP */
-   xics_phys = local_paca->kvm_hstate.xics_phys;
-   rc = 0;
-   if (!xics_phys)
-   rc = opal_int_get_xirr(&xirr, false);
-   else
-   xirr = __raw_rm_readl(xics_phys + XICS_XIRR);
+   if (kvmhv_on_pseries()) {
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+   rc = plpar_hcall_raw(H_XIRR, retbuf, 0xFF);
+   xirr = cpu_to_be32(retbuf[0]);
+   } else {
+   xics_phys = local_paca->kvm_hstate.xics_phys;
+   rc = 0;
+   if (!xics_phys)
+   rc = opal_int_get_xirr(&xirr, false);
+   else
+   xirr = __raw_rm_readl(xics_phys + XICS_XIRR);
+   }
if (rc < 0)
return 1;
 
@@ -494,7 +510,13 @@ static long kvmppc_read_one_intr(bool *again)
 */
if (xisr == XICS_IPI) {
rc = 0;
-   if (xics_phys) {
+   if (kvmhv_on_pseries()) {
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+   plpar_hcall_raw(H_IPI, retbuf,
+   hard_smp_processor_id(), 0xff);
+   plpar_hcall_raw(H_EOI, retbuf, h_xirr);
+   } else if (xics_phys) {
__raw_rm_writeb(0xff, xics_phys + XICS_MFRR);
__raw_rm_writel(xirr, xics_phys + XICS_XIRR);
} else {
@@ -520,7 +542,13 @@ static long kvmppc_read_one_intr(bool *again)
/* We raced with the host,
 * we need to resend that IPI, bummer
 */
-   if (xics_phys)
+   if (kvmhv_on_pseries()) {
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+   plpar_hcall_raw(H_IPI, retbuf,
+   hard_smp_processor_id(),
+   IPI_PRIORITY);
+   } else if (xics_phys)
__raw_rm_writeb(IPI_PRIORITY,
xics_phys + XICS_MFRR);
else
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c 
b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index 8b9f356..b3f5786 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -767,6 +767,14 @@ static void icp_eoi(struct irq_chip *c, 

[PATCH v2 19/33] KVM: PPC: Book3S HV: Nested guest entry via hypercall

2018-09-28 Thread Paul Mackerras
This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
hypervisor to enter one of its nested guests.  The hypercall supplies
register values in two structs.  Those values are copied by the level 0
(L0) hypervisor (the one which is running in hypervisor mode) into the
vcpu struct of the L1 guest, and then the guest is run until an
interrupt or error occurs which needs to be reported to L1 via the
hypercall return value.

Currently this assumes that the L0 and L1 hypervisors are the same
endianness, and the structs passed as arguments are in native
endianness.  If they are of different endianness, the version number
check will fail and the hcall will be rejected.
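
From the L1 side, entry then looks roughly like this (a sketch: both
structs live in L1 memory and are passed by real address, and
enter_l2_guest() is an illustrative name):

static long enter_l2_guest(struct hv_guest_state *l2_hv,
                           struct pt_regs *l2_regs)
{
        l2_hv->version = HV_GUEST_STATE_VERSION;
        /* returns the interrupt vector that caused the exit back to L1 */
        return plpar_hcall_norets(H_ENTER_NESTED, __pa(l2_hv),
                                  __pa(l2_regs));
}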

Nested hypervisors do not support indep_threads_mode=N, so this adds
code to print a warning message if the administrator has set
indep_threads_mode=N, and to treat it as Y.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/hvcall.h   |  36 +
 arch/powerpc/include/asm/kvm_book3s.h   |   7 +
 arch/powerpc/include/asm/kvm_host.h |   5 +
 arch/powerpc/kernel/asm-offsets.c   |   1 +
 arch/powerpc/kvm/book3s_hv.c| 212 +
 arch/powerpc/kvm/book3s_hv_nested.c | 230 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   8 ++
 7 files changed, 470 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index c95c651..45e8789 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -466,6 +466,42 @@ struct h_cpu_char_result {
u64 behaviour;
 };
 
+/* Register state for entering a nested guest with H_ENTER_NESTED */
+struct hv_guest_state {
+   u64 version;/* version of this structure layout */
+   u32 lpid;
+   u32 vcpu_token;
+   /* These registers are hypervisor privileged (at least for writing) */
+   u64 lpcr;
+   u64 pcr;
+   u64 amor;
+   u64 dpdes;
+   u64 hfscr;
+   s64 tb_offset;
+   u64 dawr0;
+   u64 dawrx0;
+   u64 ciabr;
+   u64 hdec_expiry;
+   u64 purr;
+   u64 spurr;
+   u64 ic;
+   u64 vtb;
+   u64 hdar;
+   u64 hdsisr;
+   u64 heir;
+   u64 asdr;
+   /* These are OS privileged but need to be set late in guest entry */
+   u64 srr0;
+   u64 srr1;
+   u64 sprg[4];
+   u64 pidr;
+   u64 cfar;
+   u64 ppr;
+};
+
+/* Latest version of hv_guest_state structure */
+#define HV_GUEST_STATE_VERSION 1
+
 #endif /* __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_HVCALL_H */
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 7719ca5..2dd996c 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -280,6 +280,13 @@ void kvmhv_vm_nested_init(struct kvm *kvm);
 long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
 void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
+long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
+int kvmhv_run_single_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu,
+ u64 time_limit, unsigned long lpcr);
+void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
+void kvmhv_restore_hv_return_state(struct kvm_vcpu *vcpu,
+  struct hv_guest_state *hr);
+long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu);
 
 void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index c35d4f2..ceb9f20 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -95,6 +95,7 @@ struct dtl_entry;
 
 struct kvmppc_vcpu_book3s;
 struct kvmppc_book3s_shadow_vcpu;
+struct kvm_nested_guest;
 
 struct kvm_vm_stat {
ulong remote_tlb_flush;
@@ -786,6 +787,10 @@ struct kvm_vcpu_arch {
u32 emul_inst;
 
u32 online;
+
+   /* For support of nested guests */
+   struct kvm_nested_guest *nested;
+   u32 nested_vcpu_id;
 #endif
 
 #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 7c3738d..d0abcbb 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -503,6 +503,7 @@ int main(void)
OFFSET(VCPU_VPA, kvm_vcpu, arch.vpa.pinned_addr);
OFFSET(VCPU_VPA_DIRTY, kvm_vcpu, arch.vpa.dirty);
OFFSET(VCPU_HEIR, kvm_vcpu, arch.emul_inst);
+   OFFSET(VCPU_NESTED, kvm_vcpu, arch.nested);
OFFSET(VCPU_CPU, kvm_vcpu, cpu);
OFFSET(VCPU_THREAD_CPU, kvm_vcpu, arch.thread_cpu);
 #endif
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 43c607b..60adf47 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -942,6 +942,13 @@ int 

[PATCH v2 18/33] KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization

2018-09-28 Thread Paul Mackerras
This starts the process of adding the code to support nested HV-style
virtualization.  It defines a new H_SET_PARTITION_TABLE hypercall which
a nested hypervisor can use to set the base address and size of a
partition table in its memory (analogous to the PTCR register).
On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE
hypercall from the guest is handled by code that saves the virtual
PTCR value for the guest.
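
Registration from the nested hypervisor's side is then a single hcall.
A minimal sketch, assuming the PTCR-style encoding mentioned above
(real base address in the upper bits, table-size field in the low
bits; register_partition_table() is an illustrative name):

static long register_partition_table(unsigned long ptbl_real_addr,
                                     unsigned long size_field)
{
        return plpar_hcall_norets(H_SET_PARTITION_TABLE,
                                  ptbl_real_addr | size_field);
}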

This also adds code for creating and destroying nested guests and for
reading the partition table entry for a nested guest from L1 memory.
Each nested guest has its own shadow LPID value, different in general
from the LPID value used by the nested hypervisor to refer to it.  The
shadow LPID value is allocated at nested guest creation time.

Nested hypervisor functionality is only available for a radix guest,
which therefore means a radix host on a POWER9 (or later) processor.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/hvcall.h |   5 +
 arch/powerpc/include/asm/kvm_book3s.h |  10 +-
 arch/powerpc/include/asm/kvm_book3s_64.h  |  33 
 arch/powerpc/include/asm/kvm_book3s_asm.h |   3 +
 arch/powerpc/include/asm/kvm_host.h   |   5 +
 arch/powerpc/kvm/Makefile |   3 +-
 arch/powerpc/kvm/book3s_hv.c  |  26 ++-
 arch/powerpc/kvm/book3s_hv_nested.c   | 283 ++
 8 files changed, 361 insertions(+), 7 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_nested.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index a0b17f9..c95c651 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -322,6 +322,11 @@
 #define H_GET_24X7_DATA 0xF07C
 #define H_GET_PERF_COUNTER_INFO 0xF080
 
+/* Platform-specific hcalls used for nested HV KVM */
+#define H_SET_PARTITION_TABLE  0xF800
+#define H_ENTER_NESTED 0xF804
+#define H_TLB_INVALIDATE   0xF808
+
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
 #define H_SET_MODE_RESOURCE_SET_DAWR   2
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 91c9779..7719ca5 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -274,6 +274,13 @@ static inline void kvmppc_save_tm_sprs(struct kvm_vcpu 
*vcpu) {}
 static inline void kvmppc_restore_tm_sprs(struct kvm_vcpu *vcpu) {}
 #endif
 
+bool kvmhv_nested_init(void);
+void kvmhv_nested_exit(void);
+void kvmhv_vm_nested_init(struct kvm *kvm);
+long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
+void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
+void kvmhv_release_all_nested(struct kvm *kvm);
+
 void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
 
 extern int kvm_irq_bypass;
@@ -387,9 +394,6 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
 /* TO = 31 for unconditional trap */
 #define INS_TW 0x7fe80008
 
-/* LPIDs we support with this build -- runtime limit may be lower */
-#define KVMPPC_NR_LPIDS(LPID_RSVD + 1)
-
 #define SPLIT_HACK_MASK 0xff000000
 #define SPLIT_HACK_OFFS 0xfb000000
 
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 5c0e2d9..6d67b6a 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -23,6 +23,39 @@
 #include 
 #include 
 #include 
+#include <asm/cpu_has_feature.h>
+
+#ifdef CONFIG_PPC_PSERIES
+static inline bool kvmhv_on_pseries(void)
+{
+   return !cpu_has_feature(CPU_FTR_HVMODE);
+}
+#else
+static inline bool kvmhv_on_pseries(void)
+{
+   return false;
+}
+#endif
+
+/*
+ * Structure for a nested guest, that is, for a guest that is managed by
+ * one of our guests.
+ */
+struct kvm_nested_guest {
+   struct kvm *l1_host;/* L1 VM that owns this nested guest */
+   int l1_lpid;/* lpid L1 guest thinks this guest is */
+   int shadow_lpid;/* real lpid of this nested guest */
+   pgd_t *shadow_pgtable;  /* our page table for this guest */
+   u64 l1_gr_to_hr;/* L1's addr of part'n-scoped table */
+   u64 process_table;  /* process table entry for this guest */
+   long refcnt;/* number of pointers to this struct */
+   struct mutex tlb_lock;  /* serialize page faults and tlbies */
+   struct kvm_nested_guest *next;
+};
+
+struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int l1_lpid,
+ bool create);
+void kvmhv_put_nested(struct kvm_nested_guest *gp);
 
 /* Power architecture requires HPT is at least 256kiB, at most 64TiB */
 #define PPC_MIN_HPT_ORDER  18
diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 

[PATCH v2 17/33] KVM: PPC: Book3S HV: Use kvmppc_unmap_pte() in kvm_unmap_radix()

2018-09-28 Thread Paul Mackerras
kvmppc_unmap_pte() does a sequence of operations that are open-coded in
kvm_unmap_radix().  This extends kvmppc_unmap_pte() a little so that it
can be used by kvm_unmap_radix(), and makes kvm_unmap_radix() call it.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 33 +
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 47f2b18..bd06a95 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -240,19 +240,22 @@ static void kvmppc_pmd_free(pmd_t *pmdp)
 }
 
 static void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte,
-unsigned long gpa, unsigned int shift)
+unsigned long gpa, unsigned int shift,
+struct kvm_memory_slot *memslot)
 
 {
-   unsigned long page_size = 1ul << shift;
unsigned long old;
 
old = kvmppc_radix_update_pte(kvm, pte, ~0UL, 0, gpa, shift);
kvmppc_radix_tlbie_page(kvm, gpa, shift);
if (old & _PAGE_DIRTY) {
unsigned long gfn = gpa >> PAGE_SHIFT;
-   struct kvm_memory_slot *memslot;
+   unsigned long page_size = PAGE_SIZE;
 
-   memslot = gfn_to_memslot(kvm, gfn);
+   if (shift)
+   page_size = 1ul << shift;
+   if (!memslot)
+   memslot = gfn_to_memslot(kvm, gfn);
if (memslot && memslot->dirty_bitmap)
kvmppc_update_dirty_map(memslot, gfn, page_size);
}
@@ -282,7 +285,7 @@ static void kvmppc_unmap_free_pte(struct kvm *kvm, pte_t 
*pte, bool full)
WARN_ON_ONCE(1);
kvmppc_unmap_pte(kvm, p,
 pte_pfn(*p) << PAGE_SHIFT,
-PAGE_SHIFT);
+PAGE_SHIFT, NULL);
}
}
 
@@ -304,7 +307,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t 
*pmd, bool full)
WARN_ON_ONCE(1);
kvmppc_unmap_pte(kvm, (pte_t *)p,
 pte_pfn(*(pte_t *)p) << PAGE_SHIFT,
-PMD_SHIFT);
+PMD_SHIFT, NULL);
}
} else {
pte_t *pte;
@@ -468,7 +471,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pgd_t 
*pgtable, pte_t pte,
goto out_unlock;
}
/* Valid 1GB page here already, remove it */
-   kvmppc_unmap_pte(kvm, (pte_t *)pud, hgpa, PUD_SHIFT);
+   kvmppc_unmap_pte(kvm, (pte_t *)pud, hgpa, PUD_SHIFT, NULL);
}
if (level == 2) {
if (!pud_none(*pud)) {
@@ -517,7 +520,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pgd_t 
*pgtable, pte_t pte,
goto out_unlock;
}
/* Valid 2MB page here already, remove it */
-   kvmppc_unmap_pte(kvm, pmdp_ptep(pmd), lgpa, PMD_SHIFT);
+   kvmppc_unmap_pte(kvm, pmdp_ptep(pmd), lgpa, PMD_SHIFT, NULL);
}
if (level == 1) {
if (!pmd_none(*pmd)) {
@@ -780,20 +783,10 @@ int kvm_unmap_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
pte_t *ptep;
unsigned long gpa = gfn << PAGE_SHIFT;
unsigned int shift;
-   unsigned long old;
 
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
-   if (ptep && pte_present(*ptep)) {
-   old = kvmppc_radix_update_pte(kvm, ptep, ~0UL, 0,
- gpa, shift);
-   kvmppc_radix_tlbie_page(kvm, gpa, shift);
-   if ((old & _PAGE_DIRTY) && memslot->dirty_bitmap) {
-   unsigned long psize = PAGE_SIZE;
-   if (shift)
-   psize = 1ul << shift;
-   kvmppc_update_dirty_map(memslot, gfn, psize);
-   }
-   }
+   if (ptep && pte_present(*ptep))
+   kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot);
return 0;   
 }
 
-- 
2.7.4



[PATCH v2 16/33] KVM: PPC: Book3S HV: Refactor radix page fault handler

2018-09-28 Thread Paul Mackerras
From: Suraj Jitindar Singh 

The radix page fault handler accounts for all cases, including just
needing to insert a pte.  This breaks it up into separate functions for
the two main cases: setting rc and inserting a pte.

This allows us to make the setting of rc and inserting of a pte
generic for any pgtable, not specific to the one for this guest.

[pau...@ozlabs.org - reduced diffs from previous code]

Reviewed-by: David Gibson 
Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 210 +++--
 1 file changed, 123 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index f2976f4..47f2b18 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -400,8 +400,9 @@ static void kvmppc_unmap_free_pud_entry_table(struct kvm 
*kvm, pud_t *pud,
  */
 #define PTE_BITS_MUST_MATCH (~(_PAGE_WRITE | _PAGE_DIRTY | _PAGE_ACCESSED))
 
-static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
-unsigned int level, unsigned long mmu_seq)
+static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
+unsigned long gpa, unsigned int level,
+unsigned long mmu_seq)
 {
pgd_t *pgd;
pud_t *pud, *new_pud = NULL;
@@ -410,7 +411,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, 
unsigned long gpa,
int ret;
 
/* Traverse the guest's 2nd-level tree, allocate new levels needed */
-   pgd = kvm->arch.pgtable + pgd_index(gpa);
+   pgd = pgtable + pgd_index(gpa);
pud = NULL;
if (pgd_present(*pgd))
pud = pud_offset(pgd, gpa);
@@ -565,95 +566,49 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, 
unsigned long gpa,
return ret;
 }
 
-int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
-  unsigned long ea, unsigned long dsisr)
+static bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
+   bool writing, unsigned long gpa)
+{
+   unsigned long pgflags;
+   unsigned int shift;
+   pte_t *ptep;
+
+   /*
+* Need to set an R or C bit in the 2nd-level tables;
+* since we are just helping out the hardware here,
+* it is sufficient to do what the hardware does.
+*/
+   pgflags = _PAGE_ACCESSED;
+   if (writing)
+   pgflags |= _PAGE_DIRTY;
+   /*
+* We are walking the secondary (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
+   ptep = __find_linux_pte(pgtable, gpa, NULL, &shift);
+   if (ptep && pte_present(*ptep) && (!writing || pte_write(*ptep))) {
+   kvmppc_radix_update_pte(kvm, ptep, 0, pgflags, gpa, shift);
+   return true;
+   }
+   return false;
+}
+
+static int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
+   unsigned long gpa,
+   struct kvm_memory_slot *memslot,
+   bool writing, bool kvm_ro,
+   pte_t *inserted_pte, unsigned int *levelp)
 {
struct kvm *kvm = vcpu->kvm;
-   unsigned long mmu_seq;
-   unsigned long gpa, gfn, hva;
-   struct kvm_memory_slot *memslot;
struct page *page = NULL;
-   long ret;
-   bool writing;
+   unsigned long mmu_seq;
+   unsigned long hva, gfn = gpa >> PAGE_SHIFT;
bool upgrade_write = false;
bool *upgrade_p = &upgrade_write;
pte_t pte, *ptep;
-   unsigned long pgflags;
unsigned int shift, level;
-
-   /* Check for unusual errors */
-   if (dsisr & DSISR_UNSUPP_MMU) {
-   pr_err("KVM: Got unsupported MMU fault\n");
-   return -EFAULT;
-   }
-   if (dsisr & DSISR_BADACCESS) {
-   /* Reflect to the guest as DSI */
-   pr_err("KVM: Got radix HV page fault with DSISR=%lx\n", dsisr);
-   kvmppc_core_queue_data_storage(vcpu, ea, dsisr);
-   return RESUME_GUEST;
-   }
-
-   /* Translate the logical address and get the page */
-   gpa = vcpu->arch.fault_gpa & ~0xfffUL;
-   gpa &= ~0xF000000000000000ul;
-   gfn = gpa >> PAGE_SHIFT;
-   if (!(dsisr & DSISR_PRTABLE_FAULT))
-   gpa |= ea & 0xfff;
-   memslot = gfn_to_memslot(kvm, gfn);
-
-   /* No memslot means it's an emulated MMIO region */
-   if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID)) {
-   if (dsisr & (DSISR_PRTABLE_FAULT | DSISR_BADACCESS |
-DSISR_SET_RC)) {
-   /*
-* Bad address in guest 

[PATCH v2 14/33] KVM: PPC: Book3S HV: Clear partition table entry on vm teardown

2018-09-28 Thread Paul Mackerras
From: Suraj Jitindar Singh 

When destroying a VM we return the LPID to the pool, however we never
zero the partition table entry. This is instead done when we reallocate
the LPID.

Zero the partition table entry on VM teardown before returning the LPID
to the pool. This means if we were running as a nested hypervisor the
real hypervisor could use this to determine when it can free resources.

Reviewed-by: David Gibson 
Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 0ad5541..b8703be 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4501,13 +4501,19 @@ static void kvmppc_core_destroy_vm_hv(struct kvm *kvm)
 
kvmppc_free_vcores(kvm);
 
-   kvmppc_free_lpid(kvm->arch.lpid);
 
if (kvm_is_radix(kvm))
kvmppc_free_radix(kvm);
else
kvmppc_free_hpt(&kvm->arch.hpt);
 
+   /* Perform global invalidation and return lpid to the pool */
+   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   kvm->arch.process_table = 0;
+   kvmppc_setup_partition_table(kvm);
+   }
+   kvmppc_free_lpid(kvm->arch.lpid);
+
kvmppc_free_pimap(kvm);
 }
 
-- 
2.7.4



[PATCH v2 15/33] KVM: PPC: Book3S HV: Make kvmppc_mmu_radix_xlate process/partition table agnostic

2018-09-28 Thread Paul Mackerras
From: Suraj Jitindar Singh 

kvmppc_mmu_radix_xlate() is used to translate an effective address
through the process tables. The process table and partition tables have
identical layout. Exploit this fact to make the kvmppc_mmu_radix_xlate()
function able to translate either an effective address through the
process tables or a guest real address through the partition tables.
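
Concretely, both tables are arrays of 16-byte entries whose first
doubleword carries the radix tree root, which is what lets a single
walker serve both.  The shared entry layout, for reference (this is
the existing struct that the diff below starts using):

struct prtb_entry {
        __be64 prtb0;   /* radix tree base/size; PTCR-compatible layout */
        __be64 prtb1;   /* process table base/size, in a partition
                           table entry */
};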

[pau...@ozlabs.org - reduced diffs from previous code]

Reviewed-by: David Gibson 
Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s.h  |   3 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 109 +++--
 2 files changed, 78 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index dd18d81..91c9779 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,6 +188,9 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm *kvm, 
unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
+extern int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
+   struct kvmppc_pte *gpte, u64 table,
+   int table_index, u64 *pte_ret_p);
 extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
struct kvmppc_pte *gpte, bool data, bool iswrite);
 extern int kvmppc_init_vm_radix(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 71951b5..f2976f4 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -29,83 +29,92 @@
  */
 static int p9_supported_radix_bits[4] = { 5, 9, 9, 13 };
 
-int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
-  struct kvmppc_pte *gpte, bool data, bool iswrite)
+/*
+ * Used to walk a partition or process table radix tree in guest memory
+ * Note: We exploit the fact that a partition table and a process
+ * table have the same layout, a partition-scoped page table and a
+ * process-scoped page table have the same layout, and the 2nd
+ * doubleword of a partition table entry has the same layout as
+ * the PTCR register.
+ */
+int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
+struct kvmppc_pte *gpte, u64 table,
+int table_index, u64 *pte_ret_p)
 {
struct kvm *kvm = vcpu->kvm;
-   u32 pid;
int ret, level, ps;
-   __be64 prte, rpte;
-   unsigned long ptbl;
-   unsigned long root, pte, index;
+   unsigned long ptbl, root;
unsigned long rts, bits, offset;
-   unsigned long gpa;
-   unsigned long proc_tbl_size;
+   unsigned long size, index;
+   struct prtb_entry entry;
+   u64 pte, base, gpa;
+   __be64 rpte;
 
-   /* Work out effective PID */
-   switch (eaddr >> 62) {
-   case 0:
-   pid = vcpu->arch.pid;
-   break;
-   case 3:
-   pid = 0;
-   break;
-   default:
+   if ((table & PRTS_MASK) > 24)
return -EINVAL;
-   }
-   proc_tbl_size = 1 << ((kvm->arch.process_table & PRTS_MASK) + 12);
-   if (pid * 16 >= proc_tbl_size)
+   size = 1ul << ((table & PRTS_MASK) + 12);
+
+   /* Is the table big enough to contain this entry? */
+   if ((table_index * sizeof(entry)) >= size)
return -EINVAL;
 
-   /* Read partition table to find root of tree for effective PID */
-   ptbl = (kvm->arch.process_table & PRTB_MASK) + (pid * 16);
-   ret = kvm_read_guest(kvm, ptbl, &prte, sizeof(prte));
+   /* Read the table to find the root of the radix tree */
+   ptbl = (table & PRTB_MASK) + (table_index * sizeof(entry));
+   ret = kvm_read_guest(kvm, ptbl, &entry, sizeof(entry));
if (ret)
return ret;
 
-   root = be64_to_cpu(prte);
+   /* Root is stored in the first double word */
+   root = be64_to_cpu(entry.prtb0);
rts = ((root & RTS1_MASK) >> (RTS1_SHIFT - 3)) |
((root & RTS2_MASK) >> RTS2_SHIFT);
bits = root & RPDS_MASK;
-   root = root & RPDB_MASK;
+   base = root & RPDB_MASK;
 
offset = rts + 31;
 
-   /* current implementations only support 52-bit space */
+   /* Current implementations only support 52-bit space */
if (offset != 52)
return -EINVAL;
 
+   /* Walk each level of the radix tree */
for (level = 3; level >= 0; --level) {
+   /* Check a valid size */
if (level && bits != p9_supported_radix_bits[level])
return -EINVAL;
if (level == 0 && !(bits == 5 || bits == 9))

[PATCH v2 13/33] KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct

2018-09-28 Thread Paul Mackerras
When the 'regs' field was added to struct kvm_vcpu_arch, the code
was changed to use several of the fields inside regs (e.g., gpr, lr,
etc.) but not the ccr field, because the ccr field in struct pt_regs
is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
only 32 bits.  This changes the code to use the regs.ccr field
instead of cr, and changes the assembly code on 64-bit platforms to
use 64-bit loads and stores instead of 32-bit ones.
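
For example, in the guest entry path a CR load that used to be a
32-bit access becomes a doubleword access of the regs.ccr field
(illustrative; the exact sites and register numbers are in the
book3s_hv_rmhandlers.S hunks):

-       lwz     r5, VCPU_CR(r4)
+       ld      r5, VCPU_CR(r4)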

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s.h|  4 ++--
 arch/powerpc/include/asm/kvm_book3s_64.h |  4 ++--
 arch/powerpc/include/asm/kvm_booke.h |  4 ++--
 arch/powerpc/include/asm/kvm_host.h  |  2 --
 arch/powerpc/kernel/asm-offsets.c|  4 ++--
 arch/powerpc/kvm/book3s_emulate.c| 12 ++--
 arch/powerpc/kvm/book3s_hv.c |  4 ++--
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |  4 ++--
 arch/powerpc/kvm/book3s_hv_tm.c  |  6 +++---
 arch/powerpc/kvm/book3s_hv_tm_builtin.c  |  5 +++--
 arch/powerpc/kvm/book3s_pr.c |  4 ++--
 arch/powerpc/kvm/bookehv_interrupts.S|  8 
 arch/powerpc/kvm/emulate_loadstore.c |  1 -
 13 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 83a9aa3..dd18d81 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -301,12 +301,12 @@ static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, 
int num)
 
 static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val)
 {
-   vcpu->arch.cr = val;
+   vcpu->arch.regs.ccr = val;
 }
 
 static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.cr;
+   return vcpu->arch.regs.ccr;
 }
 
 static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, ulong val)
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index af25aaa..5c0e2d9 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -483,7 +483,7 @@ static inline u64 sanitize_msr(u64 msr)
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 static inline void copy_from_checkpoint(struct kvm_vcpu *vcpu)
 {
-   vcpu->arch.cr  = vcpu->arch.cr_tm;
+   vcpu->arch.regs.ccr  = vcpu->arch.cr_tm;
vcpu->arch.regs.xer = vcpu->arch.xer_tm;
vcpu->arch.regs.link  = vcpu->arch.lr_tm;
vcpu->arch.regs.ctr = vcpu->arch.ctr_tm;
@@ -500,7 +500,7 @@ static inline void copy_from_checkpoint(struct kvm_vcpu 
*vcpu)
 
 static inline void copy_to_checkpoint(struct kvm_vcpu *vcpu)
 {
-   vcpu->arch.cr_tm  = vcpu->arch.cr;
+   vcpu->arch.cr_tm  = vcpu->arch.regs.ccr;
vcpu->arch.xer_tm = vcpu->arch.regs.xer;
vcpu->arch.lr_tm  = vcpu->arch.regs.link;
vcpu->arch.ctr_tm = vcpu->arch.regs.ctr;
diff --git a/arch/powerpc/include/asm/kvm_booke.h 
b/arch/powerpc/include/asm/kvm_booke.h
index d513e3e..f0cef62 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -46,12 +46,12 @@ static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, 
int num)
 
 static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val)
 {
-   vcpu->arch.cr = val;
+   vcpu->arch.regs.ccr = val;
 }
 
 static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.cr;
+   return vcpu->arch.regs.ccr;
 }
 
 static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, ulong val)
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index a3d4f61..c9cc42f 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -538,8 +538,6 @@ struct kvm_vcpu_arch {
ulong tar;
 #endif
 
-   u32 cr;
-
 #ifdef CONFIG_PPC_BOOK3S
ulong hflags;
ulong guest_owned_ext;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 89cf155..7c3738d 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -438,7 +438,7 @@ int main(void)
 #ifdef CONFIG_PPC_BOOK3S
OFFSET(VCPU_TAR, kvm_vcpu, arch.tar);
 #endif
-   OFFSET(VCPU_CR, kvm_vcpu, arch.cr);
+   OFFSET(VCPU_CR, kvm_vcpu, arch.regs.ccr);
OFFSET(VCPU_PC, kvm_vcpu, arch.regs.nip);
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
OFFSET(VCPU_MSR, kvm_vcpu, arch.shregs.msr);
@@ -695,7 +695,7 @@ int main(void)
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #else /* CONFIG_PPC_BOOK3S */
-   OFFSET(VCPU_CR, kvm_vcpu, arch.cr);
+   OFFSET(VCPU_CR, kvm_vcpu, arch.regs.ccr);
OFFSET(VCPU_XER, kvm_vcpu, arch.regs.xer);
OFFSET(VCPU_LR, kvm_vcpu, arch.regs.link);
OFFSET(VCPU_CTR, kvm_vcpu, arch.regs.ctr);
diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index 36b11c5..2654df2 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ 

[PATCH v2 12/33] powerpc: Turn off CPU_FTR_P9_TM_HV_ASSIST in non-hypervisor mode

2018-09-28 Thread Paul Mackerras
When doing nested virtualization, it is only necessary to do the
transactional memory hypervisor assist at level 0, that is, when
we are in hypervisor mode.  Nested hypervisors can just use the TM
facilities as architected.  Therefore we should clear the
CPU_FTR_P9_TM_HV_ASSIST bit when we are not in hypervisor mode,
along with the CPU_FTR_HVMODE bit.

Doing this will not change anything at this stage because the only
code that tests CPU_FTR_P9_TM_HV_ASSIST is in HV KVM, which currently
can only be used when CPU_FTR_HVMODE is set.
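
The xor-to-andc change matters beyond covering the extra bit: xor only
toggles bits, so it assumed CPU_FTR_HVMODE was set to begin with,
whereas andc (AND with complement) clears the bits unconditionally.
In C terms, the new sequence computes (a one-line sketch):

	features &= ~(CPU_FTR_HVMODE | CPU_FTR_P9_TM_HV_ASSIST);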

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kernel/cpu_setup_power.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
b/arch/powerpc/kernel/cpu_setup_power.S
index 458b928..c317080 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -147,8 +147,8 @@ __init_hvmode_206:
rldicl. r0,r3,4,63
bnelr
ld  r5,CPU_SPEC_FEATURES(r4)
-   LOAD_REG_IMMEDIATE(r6,CPU_FTR_HVMODE)
-   xor r5,r5,r6
+   LOAD_REG_IMMEDIATE(r6,CPU_FTR_HVMODE | CPU_FTR_P9_TM_HV_ASSIST)
+   andcr5,r5,r6
std r5,CPU_SPEC_FEATURES(r4)
blr
 
-- 
2.7.4



[PATCH v2 11/33] powerpc: Add LPCR_EVIRT define

2018-09-28 Thread Paul Mackerras
From: Suraj Jitindar Singh 

Add definition of the LPCR EVIRT (enhanced virtualisation) bit.

Reviewed-by: David Gibson 
Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/reg.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 6fda746..9c42abf 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -456,6 +456,7 @@
 #define   LPCR_HVICE   ASM_CONST(0x0000000000000002)  /* P9: HV interrupt enable */
 #define   LPCR_HDICE   ASM_CONST(0x0000000000000001)  /* Hyp Decr enable (HV,PR,EE) */
 #define   LPCR_UPRT    ASM_CONST(0x0000000000400000)  /* Use Process Table (ISA 3) */
+#define   LPCR_EVIRT   ASM_CONST(0x0000000000200000)  /* Enhanced Virtualisation */
 #define   LPCR_HR      ASM_CONST(0x0000000000100000)
 #ifndef SPRN_LPID
 #define SPRN_LPID  0x13F   /* Logical Partition Identifier */
-- 
2.7.4



[PATCH v2 10/33] KVM: PPC: Book3S HV: Add a debugfs file to dump radix mappings

2018-09-28 Thread Paul Mackerras
This adds a file called 'radix' in the debugfs directory for the
guest, which when read gives all of the valid leaf PTEs in the
partition-scoped radix tree for a radix guest, in human-readable
format.  It is analogous to the existing 'htab' file which dumps
the HPT entries for a HPT guest.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |   1 +
 arch/powerpc/include/asm/kvm_host.h  |   1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c   | 179 +++
 arch/powerpc/kvm/book3s_hv.c |   2 +
 4 files changed, 183 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index dc435a5..af25aaa 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -435,6 +435,7 @@ static inline struct kvm_memslots *kvm_memslots_raw(struct 
kvm *kvm)
 }
 
 extern void kvmppc_mmu_debugfs_init(struct kvm *kvm);
+extern void kvmhv_radix_debugfs_init(struct kvm *kvm);
 
 extern void kvmhv_rm_send_ipi(int cpu);
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 3cd0b9f..a3d4f61 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -291,6 +291,7 @@ struct kvm_arch {
u64 process_table;
struct dentry *debugfs_dir;
struct dentry *htab_dentry;
+   struct dentry *radix_dentry;
struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 933c574..71951b5 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -10,6 +10,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -853,6 +856,182 @@ static void pmd_ctor(void *addr)
memset(addr, 0, RADIX_PMD_TABLE_SIZE);
 }
 
+struct debugfs_radix_state {
+   struct kvm  *kvm;
+   struct mutexmutex;
+   unsigned long   gpa;
+   int chars_left;
+   int buf_index;
+   charbuf[128];
+   u8  hdr;
+};
+
+static int debugfs_radix_open(struct inode *inode, struct file *file)
+{
+   struct kvm *kvm = inode->i_private;
+   struct debugfs_radix_state *p;
+
+   p = kzalloc(sizeof(*p), GFP_KERNEL);
+   if (!p)
+   return -ENOMEM;
+
+   kvm_get_kvm(kvm);
+   p->kvm = kvm;
+   mutex_init(&p->mutex);
+   file->private_data = p;
+
+   return nonseekable_open(inode, file);
+}
+
+static int debugfs_radix_release(struct inode *inode, struct file *file)
+{
+   struct debugfs_radix_state *p = file->private_data;
+
+   kvm_put_kvm(p->kvm);
+   kfree(p);
+   return 0;
+}
+
+static ssize_t debugfs_radix_read(struct file *file, char __user *buf,
+size_t len, loff_t *ppos)
+{
+   struct debugfs_radix_state *p = file->private_data;
+   ssize_t ret, r;
+   unsigned long n;
+   struct kvm *kvm;
+   unsigned long gpa;
+   pgd_t *pgt;
+   pgd_t pgd, *pgdp;
+   pud_t pud, *pudp;
+   pmd_t pmd, *pmdp;
+   pte_t *ptep;
+   int shift;
+   unsigned long pte;
+
+   kvm = p->kvm;
+   if (!kvm_is_radix(kvm))
+   return 0;
+
+   ret = mutex_lock_interruptible(&p->mutex);
+   if (ret)
+   return ret;
+
+   if (p->chars_left) {
+   n = p->chars_left;
+   if (n > len)
+   n = len;
+   r = copy_to_user(buf, p->buf + p->buf_index, n);
+   n -= r;
+   p->chars_left -= n;
+   p->buf_index += n;
+   buf += n;
+   len -= n;
+   ret = n;
+   if (r) {
+   if (!n)
+   ret = -EFAULT;
+   goto out;
+   }
+   }
+
+   gpa = p->gpa;
+   pgt = kvm->arch.pgtable;
+   while (len != 0 && gpa < RADIX_PGTABLE_RANGE) {
+   if (!p->hdr) {
+   n = scnprintf(p->buf, sizeof(p->buf),
+ "pgdir: %lx\n", (unsigned long)pgt);
+   p->hdr = 1;
+   goto copy;
+   }
+
+   pgdp = pgt + pgd_index(gpa);
+   pgd = READ_ONCE(*pgdp);
+   if (!(pgd_val(pgd) & _PAGE_PRESENT)) {
+   gpa = (gpa & PGDIR_MASK) + PGDIR_SIZE;
+   continue;
+   }
+
+   pudp = pud_offset(&pgd, gpa);
+   pud = READ_ONCE(*pudp);
+   if (!(pud_val(pud) & _PAGE_PRESENT)) {
+   gpa = (gpa & PUD_MASK) + PUD_SIZE;
+   continue;
+   }
+   
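
The read handler above uses a common debugfs pattern: format one record
at a time into a small per-open buffer, copy out as much as the caller
asked for, and carry the remainder over to the next read() call.  A
standalone sketch of the carry-over step (hypothetical struct mirroring
debugfs_radix_state; simplified error handling):

	struct dump_state {
		char	buf[128];
		int	buf_index;
		int	chars_left;
	};

	/* Drain bytes left over from a previous read(); returns bytes
	 * copied, or -EFAULT if nothing could be copied at all.
	 */
	static ssize_t flush_leftover(struct dump_state *p,
				      char __user *buf, size_t len)
	{
		size_t n = min_t(size_t, p->chars_left, len);
		/* copy_to_user() returns the number of bytes NOT copied */
		size_t not_copied = copy_to_user(buf, p->buf + p->buf_index, n);

		n -= not_copied;
		p->chars_left -= n;
		p->buf_index += n;
		return (not_copied && !n) ? -EFAULT : n;
	}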

[PATCH v2 09/33] KVM: PPC: Book3S HV: Handle hypervisor instruction faults better

2018-09-28 Thread Paul Mackerras
Currently the code for handling hypervisor instruction page faults
passes 0 for the flags indicating the type of fault, which is OK in
the usual case that the page is not mapped in the partition-scoped
page tables.  However, there are other causes for hypervisor
instruction page faults, such as not being able to update a reference
(R) or change (C) bit.  The cause is indicated in bits in HSRR1,
including a bit which indicates that the fault is due to not being
able to write to a page (for example to update an R or C bit).
Not handling these other kinds of faults correctly can lead to a
loop of continual faults without forward progress in the guest.

In order to handle these faults better, this patch constructs a
"DSISR-like" value from the bits which DSISR and SRR1 (for a HISI)
have in common, and passes it to kvmppc_book3s_hv_page_fault() so
that it knows what caused the fault.
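
Concretely, the exit handler folds the HISI cause bits into a value
the page fault handler already understands; as a helper-ized sketch of
the same logic (the function name is illustrative):

	/* Build a DSISR-like word from the HISI bits in (H)SRR1 */
	static unsigned long hisi_cause_to_dsisr(unsigned long hsrr1)
	{
		unsigned long dsisr = hsrr1 & DSISR_SRR1_MATCH_64S;

		if (hsrr1 & HSRR1_HISI_WRITE)	/* failed write, e.g. R/C update */
			dsisr |= DSISR_ISSTORE;
		return dsisr;
	}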

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/reg.h | 1 +
 arch/powerpc/kvm/book3s_hv.c   | 5 -
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index e5b314e..6fda746 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -766,6 +766,7 @@
 #define SPRN_HSRR0 0x13A   /* Save/Restore Register 0 */
 #define SPRN_HSRR1 0x13B   /* Save/Restore Register 1 */
 #define   HSRR1_DENORM 0x00100000 /* Denorm exception */
+#define   HSRR1_HISI_WRITE 0x00010000 /* HISI bcs couldn't update mem */
 
 #define SPRN_TBCTL 0x35f   /* PA6T Timebase control register */
 #define   TBCTL_FREEZE 0xull /* Freeze all tbs */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d6035b1..4487526 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1188,7 +1188,10 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
break;
case BOOK3S_INTERRUPT_H_INST_STORAGE:
vcpu->arch.fault_dar = kvmppc_get_pc(vcpu);
-   vcpu->arch.fault_dsisr = 0;
+   vcpu->arch.fault_dsisr = vcpu->arch.shregs.msr &
+   DSISR_SRR1_MATCH_64S;
+   if (vcpu->arch.shregs.msr & HSRR1_HISI_WRITE)
+   vcpu->arch.fault_dsisr |= DSISR_ISSTORE;
r = RESUME_PAGE_FAULT;
break;
/*
-- 
2.7.4



[PATCH v2 08/33] KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests

2018-09-28 Thread Paul Mackerras
This creates an alternative guest entry/exit path which is used for
radix guests on POWER9 systems when we have indep_threads_mode=Y.  In
these circumstances there is exactly one vcpu per vcore and there is
no coordination required between vcpus or vcores; the vcpu can enter
the guest without needing to synchronize with anything else.

The new fast path is implemented almost entirely in C in book3s_hv.c
and runs with the MMU on until the guest is entered.  On guest exit
we use the existing path until the point where we are committed to
exiting the guest (as distinct from handling an interrupt in the
low-level code and returning to the guest) and we have pulled the
guest context from the XIVE.  At that point we check a flag in the
stack frame to see whether we came in via the old path or the new
path; if we came in via the new path then we go back to C code to do
the rest of the process of saving the guest context and restoring the
host context.

The C code is split into separate functions for handling the
OS-accessible state and the hypervisor state, with the idea that the
latter can be replaced by a hypercall when we implement nested
virtualization.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/asm-prototypes.h |   2 +
 arch/powerpc/include/asm/kvm_ppc.h|   2 +
 arch/powerpc/kvm/book3s_hv.c  | 425 +-
 arch/powerpc/kvm/book3s_hv_ras.c  |   2 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  95 ++-
 arch/powerpc/kvm/book3s_xive.c|  63 +
 6 files changed, 585 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 0c1a2b0..5c9b00c 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -165,4 +165,6 @@ void kvmhv_load_host_pmu(void);
 void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
 void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu);
 
+int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_POWERPC_ASM_PROTOTYPES_H */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 83d61b8..245e564 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -585,6 +585,7 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 
icpval);
 
 extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
   int level, bool line_status);
+extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
 #else
 static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
   u32 priority) { return -1; }
@@ -607,6 +608,7 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu 
*vcpu, u64 icpval) { retur
 
 static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 
irq,
  int level, bool line_status) { return 
-ENODEV; }
+static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
 #endif /* CONFIG_KVM_XIVE */
 
 /*
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 0e17593..d6035b1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3080,6 +3080,269 @@ static noinline void kvmppc_run_core(struct 
kvmppc_vcore *vc)
 }
 
 /*
+ * Load up hypervisor-mode registers on P9.
+ */
+static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit)
+{
+   struct kvmppc_vcore *vc = vcpu->arch.vcore;
+   s64 hdec;
+   u64 tb, purr, spurr;
+   int trap;
+   unsigned long host_hfscr = mfspr(SPRN_HFSCR);
+   unsigned long host_ciabr = mfspr(SPRN_CIABR);
+   unsigned long host_dawr = mfspr(SPRN_DAWR);
+   unsigned long host_dawrx = mfspr(SPRN_DAWRX);
+   unsigned long host_psscr = mfspr(SPRN_PSSCR);
+   unsigned long host_pidr = mfspr(SPRN_PID);
+
+   hdec = time_limit - mftb();
+   if (hdec < 0)
+   return BOOK3S_INTERRUPT_HV_DECREMENTER;
+   mtspr(SPRN_HDEC, hdec);
+
+   if (vc->tb_offset) {
+   u64 new_tb = mftb() + vc->tb_offset;
+   mtspr(SPRN_TBU40, new_tb);
+   tb = mftb();
+   if ((tb & 0xffffff) < (new_tb & 0xffffff))
+   mtspr(SPRN_TBU40, new_tb + 0x1000000);
+   vc->tb_offset_applied = vc->tb_offset;
+   }
+
+   if (vc->pcr)
+   mtspr(SPRN_PCR, vc->pcr);
+   mtspr(SPRN_DPDES, vc->dpdes);
+   mtspr(SPRN_VTB, vc->vtb);
+
+   local_paca->kvm_hstate.host_purr = mfspr(SPRN_PURR);
+   local_paca->kvm_hstate.host_spurr = mfspr(SPRN_SPURR);
+   mtspr(SPRN_PURR, vcpu->arch.purr);
+   mtspr(SPRN_SPURR, vcpu->arch.spurr);
+
+   if (cpu_has_feature(CPU_FTR_DAWR)) {
+   mtspr(SPRN_DAWR, vcpu->arch.dawr);
+   mtspr(SPRN_DAWRX, vcpu->arch.dawrx);
+   }
+   
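
One detail worth calling out in the fragment above is the timebase
offset application: a TBU40 write only sets the upper 40 bits of the
timebase, so if the low 24 bits carry between the mftb() and the
mtspr(), the timebase can land one TBU40 unit behind the intended
value.  The code re-reads the timebase and compensates; as a sketch:

	u64 new_tb = mftb() + vc->tb_offset;
	mtspr(SPRN_TBU40, new_tb);		/* sets bits 0..39 only */
	if ((mftb() & 0xffffff) < (new_tb & 0xffffff))
		mtspr(SPRN_TBU40, new_tb + (1ul << 24));	/* low bits wrapped */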

[PATCH v2 07/33] KVM: PPC: Book3S HV: Call kvmppc_handle_exit_hv() with vcore unlocked

2018-09-28 Thread Paul Mackerras
Currently kvmppc_handle_exit_hv() is called with the vcore lock held
because it is called within a for_each_runnable_thread loop.
However, we already unlock the vcore within kvmppc_handle_exit_hv()
under certain circumstances, and this is safe because (a) any vcpus
that become runnable and are added to the runnable set by
kvmppc_run_vcpu() have their vcpu->arch.trap == 0 and can't actually
run in the guest (because the vcore state is VCORE_EXITING), and
(b) for_each_runnable_thread is safe against addition or removal
of vcpus from the runnable set.

Therefore, in order to simplify things for following patches, let's
drop the vcore lock in the for_each_runnable_thread loop, so
kvmppc_handle_exit_hv() gets called without the vcore lock held.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 49a686c..0e17593 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1084,7 +1084,6 @@ static int kvmppc_emulate_doorbell_instr(struct kvm_vcpu 
*vcpu)
return RESUME_GUEST;
 }
 
-/* Called with vcpu->arch.vcore->lock held */
 static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
 struct task_struct *tsk)
 {
@@ -1205,10 +1204,7 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
swab32(vcpu->arch.emul_inst) :
vcpu->arch.emul_inst;
if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP) {
-   /* Need vcore unlocked to call kvmppc_get_last_inst */
-   spin_unlock(&vcpu->arch.vcore->lock);
r = kvmppc_emulate_debug_inst(run, vcpu);
-   spin_lock(&vcpu->arch.vcore->lock);
} else {
kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
r = RESUME_GUEST;
@@ -1224,12 +1220,8 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
case BOOK3S_INTERRUPT_H_FAC_UNAVAIL:
r = EMULATE_FAIL;
if (((vcpu->arch.hfscr >> 56) == FSCR_MSGP_LG) &&
-   cpu_has_feature(CPU_FTR_ARCH_300)) {
-   /* Need vcore unlocked to call kvmppc_get_last_inst */
-   spin_unlock(&vcpu->arch.vcore->lock);
+   cpu_has_feature(CPU_FTR_ARCH_300))
r = kvmppc_emulate_doorbell_instr(vcpu);
-   spin_lock(&vcpu->arch.vcore->lock);
-   }
if (r == EMULATE_FAIL) {
kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
r = RESUME_GUEST;
@@ -2599,6 +2591,14 @@ static void post_guest_process(struct kvmppc_vcore *vc, 
bool is_master)
spin_lock(&vc->lock);
now = get_tb();
for_each_runnable_thread(i, vcpu, vc) {
+   /*
+* It's safe to unlock the vcore in the loop here, because
+* for_each_runnable_thread() is safe against removal of
+* the vcpu, and the vcore state is VCORE_EXITING here,
+* so any vcpus becoming runnable will have their arch.trap
+* set to zero and can't actually run in the guest.
+*/
+   spin_unlock(&vc->lock);
/* cancel pending dec exception if dec is positive */
if (now < vcpu->arch.dec_expires &&
kvmppc_core_pending_dec(vcpu))
@@ -2614,6 +2614,7 @@ static void post_guest_process(struct kvmppc_vcore *vc, 
bool is_master)
vcpu->arch.ret = ret;
vcpu->arch.trap = 0;
 
+   spin_lock(&vc->lock);
if (is_kvmppc_resume_guest(vcpu->arch.ret)) {
if (vcpu->arch.pending_exceptions)
kvmppc_core_prepare_to_enter(vcpu);
-- 
2.7.4



[PATCH v2 06/33] KVM: PPC: Book3S: Rework TM save/restore code and make it C-callable

2018-09-28 Thread Paul Mackerras
This adds a parameter to __kvmppc_save_tm and __kvmppc_restore_tm
which allows the caller to indicate whether it wants the nonvolatile
register state to be preserved across the call, as required by the C
calling conventions.  This parameter being non-zero also causes the
MSR bits that enable TM, FP, VMX and VSX to be preserved.  The
condition register and DSCR are now always preserved.

With this, kvmppc_save_tm_hv and kvmppc_restore_tm_hv can be called
from C code provided the 3rd parameter is non-zero.  So that these
functions can be called from modules, they now include code to set
the TOC pointer (r2) on entry, as they can call other built-in C
functions which will assume the TOC to have been set.

Also, the fake suspend code in kvmppc_save_tm_hv is modified here to
assume that treclaim in fake-suspend state does not modify any registers,
which is the case on POWER9.  This enables the code to be simplified
quite a bit.

_kvmppc_save_tm_pr and _kvmppc_restore_tm_pr become much simpler with
this change, since they now only need to save and restore TAR and pass
1 for the 3rd argument to __kvmppc_{save,restore}_tm.
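
With the new prototypes in asm-prototypes.h, a C caller looks roughly
like this (a sketch; the surrounding condition is illustrative, and
preserve_nv must be true from C, since the compiler assumes the
non-volatile registers survive the call):

	if (vcpu->arch.shregs.msr & MSR_TS_MASK)
		kvmppc_save_tm_hv(vcpu, vcpu->arch.shregs.msr, true);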

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/asm-prototypes.h |  10 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  49 +++---
 arch/powerpc/kvm/tm.S | 250 --
 3 files changed, 169 insertions(+), 140 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 024e8fc..0c1a2b0 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -150,6 +150,16 @@ extern s32 patch__memset_nocache, patch__memcpy_nocache;
 
 extern long flush_count_cache;
 
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+void kvmppc_save_tm_hv(struct kvm_vcpu *vcpu, u64 msr, bool preserve_nv);
+void kvmppc_restore_tm_hv(struct kvm_vcpu *vcpu, u64 msr, bool preserve_nv);
+#else
+static inline void kvmppc_save_tm_hv(struct kvm_vcpu *vcpu, u64 msr,
+bool preserve_nv) { }
+static inline void kvmppc_restore_tm_hv(struct kvm_vcpu *vcpu, u64 msr,
+   bool preserve_nv) { }
+#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+
 void kvmhv_save_host_pmu(void);
 void kvmhv_load_host_pmu(void);
 void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 772740d..67a847f 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -759,11 +759,13 @@ BEGIN_FTR_SECTION
b   91f
 END_FTR_SECTION(CPU_FTR_TM | CPU_FTR_P9_TM_HV_ASSIST, 0)
/*
-* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS INCLUDING CR
+* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS (but not CR)
 */
mr  r3, r4
ld  r4, VCPU_MSR(r3)
+   li  r5, 0   /* don't preserve non-vol regs */
bl  kvmppc_restore_tm_hv
+   nop
ld  r4, HSTATE_KVM_VCPU(r13)
 91:
 #endif
@@ -1603,11 +1605,13 @@ BEGIN_FTR_SECTION
b   91f
 END_FTR_SECTION(CPU_FTR_TM | CPU_FTR_P9_TM_HV_ASSIST, 0)
/*
-* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS INCLUDING CR
+* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS (but not CR)
 */
mr  r3, r9
ld  r4, VCPU_MSR(r3)
+   li  r5, 0   /* don't preserve non-vol regs */
bl  kvmppc_save_tm_hv
+   nop
ld  r9, HSTATE_KVM_VCPU(r13)
 91:
 #endif
@@ -2486,11 +2490,13 @@ BEGIN_FTR_SECTION
b   91f
 END_FTR_SECTION(CPU_FTR_TM | CPU_FTR_P9_TM_HV_ASSIST, 0)
/*
-* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS INCLUDING CR
+* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS (but not CR)
 */
ld  r3, HSTATE_KVM_VCPU(r13)
ld  r4, VCPU_MSR(r3)
+   li  r5, 0   /* don't preserve non-vol regs */
bl  kvmppc_save_tm_hv
+   nop
 91:
 #endif
 
@@ -2606,11 +2612,13 @@ BEGIN_FTR_SECTION
b   91f
 END_FTR_SECTION(CPU_FTR_TM | CPU_FTR_P9_TM_HV_ASSIST, 0)
/*
-* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS INCLUDING CR
+* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS (but not CR)
 */
mr  r3, r4
ld  r4, VCPU_MSR(r3)
+   li  r5, 0   /* don't preserve non-vol regs */
bl  kvmppc_restore_tm_hv
+   nop
ld  r4, HSTATE_KVM_VCPU(r13)
 91:
 #endif
@@ -2943,10 +2951,12 @@ END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
  * Save transactional state and TM-related registers.
  * Called with r3 pointing to the vcpu struct and r4 containing
  * the guest MSR value.
- * This can modify all checkpointed registers, but
+ * r5 is non-zero iff non-volatile register state needs to 

[PATCH v2 05/33] KVM: PPC: Book3S HV: Simplify real-mode interrupt handling

2018-09-28 Thread Paul Mackerras
This streamlines the first part of the code that handles a hypervisor
interrupt that occurred in the guest.  With this, all of the real-mode
handling that occurs is done before the "guest_exit_cont" label; once
we get to that label we are committed to exiting to host virtual mode.
Thus the machine check and HMI real-mode handling is moved before that
label.

Also, the code to handle external interrupts is moved out of line, as
is the code that calls kvmppc_realmode_hmi_handler().

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_ras.c|   8 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 220 
 2 files changed, 119 insertions(+), 109 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_ras.c b/arch/powerpc/kvm/book3s_hv_ras.c
index b11043b..ee564b6 100644
--- a/arch/powerpc/kvm/book3s_hv_ras.c
+++ b/arch/powerpc/kvm/book3s_hv_ras.c
@@ -331,5 +331,13 @@ long kvmppc_realmode_hmi_handler(void)
} else {
wait_for_tb_resync();
}
+
+   /*
+* Reset tb_offset_applied so the guest exit code won't try
+* to subtract the previous timebase offset from the timebase.
+*/
+   if (local_paca->kvm_hstate.kvm_vcore)
+   local_paca->kvm_hstate.kvm_vcore->tb_offset_applied = 0;
+
return 0;
 }
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 5b2ae34..772740d 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1018,8 +1018,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
 no_xive:
 #endif /* CONFIG_KVM_XICS */
 
-deliver_guest_interrupt:
-kvmppc_cede_reentry:   /* r4 = vcpu, r13 = paca */
+deliver_guest_interrupt:   /* r4 = vcpu, r13 = paca */
/* Check if we can deliver an external or decrementer interrupt now */
ld  r0, VCPU_PENDING_EXC(r4)
 BEGIN_FTR_SECTION
@@ -1269,18 +1268,26 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
std r3, VCPU_CTR(r9)
std r4, VCPU_XER(r9)
 
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-   /* For softpatch interrupt, go off and do TM instruction emulation */
-   cmpwi   r12, BOOK3S_INTERRUPT_HV_SOFTPATCH
-   beq kvmppc_tm_emul
-#endif
+   /* Save more register state  */
+   mfdar   r6
+   mfdsisr r7
+   std r6, VCPU_DAR(r9)
+   stw r7, VCPU_DSISR(r9)
 
/* If this is a page table miss then see if it's theirs or ours */
cmpwi   r12, BOOK3S_INTERRUPT_H_DATA_STORAGE
beq kvmppc_hdsi
+   std r6, VCPU_FAULT_DAR(r9)
+   stw r7, VCPU_FAULT_DSISR(r9)
cmpwi   r12, BOOK3S_INTERRUPT_H_INST_STORAGE
beq kvmppc_hisi
 
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   /* For softpatch interrupt, go off and do TM instruction emulation */
+   cmpwi   r12, BOOK3S_INTERRUPT_HV_SOFTPATCH
+   beq kvmppc_tm_emul
+#endif
+
/* See if this is a leftover HDEC interrupt */
cmpwi   r12,BOOK3S_INTERRUPT_HV_DECREMENTER
bne 2f
@@ -1303,7 +1310,7 @@ BEGIN_FTR_SECTION
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
lbz r0, HSTATE_HOST_IPI(r13)
cmpwi   r0, 0
-   beq 4f
+   beq maybe_reenter_guest
b   guest_exit_cont
 3:
/* If it's a hypervisor facility unavailable interrupt, save HFSCR */
@@ -1315,82 +1322,16 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 14:
/* External interrupt ? */
cmpwi   r12, BOOK3S_INTERRUPT_EXTERNAL
-   bne+guest_exit_cont
-
-   /* External interrupt, first check for host_ipi. If this is
-* set, we know the host wants us out so let's do it now
-*/
-   bl  kvmppc_read_intr
-
-   /*
-* Restore the active volatile registers after returning from
-* a C function.
-*/
-   ld  r9, HSTATE_KVM_VCPU(r13)
-   li  r12, BOOK3S_INTERRUPT_EXTERNAL
-
-   /*
-* kvmppc_read_intr return codes:
-*
-* Exit to host (r3 > 0)
-*   1 An interrupt is pending that needs to be handled by the host
-* Exit guest and return to host by branching to guest_exit_cont
-*
-*   2 Passthrough that needs completion in the host
-* Exit guest and return to host by branching to guest_exit_cont
-* However, we also set r12 to BOOK3S_INTERRUPT_HV_RM_HARD
-* to indicate to the host to complete handling the interrupt
-*
-* Before returning to guest, we check if any CPU is heading out
-* to the host and if so, we head out also. If no CPUs are heading
-* check return values <= 0.
-*
-* Return to guest (r3 <= 0)
-*  0 No external interrupt is pending
-* -1 A guest wakeup IPI (which has now been cleared)
-*In either case, we return to guest to deliver any pending
-*guest interrupts.
-*
-* -2 A PCI 

[PATCH v2 04/33] KVM: PPC: Book3S HV: Extract PMU save/restore operations as C-callable functions

2018-09-28 Thread Paul Mackerras
This pulls out the assembler code that is responsible for saving and
restoring the PMU state for the host and guest into separate functions
so they can be used from an alternate entry path.  The calling
convention is made compatible with C.
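
With C-compatible calling conventions, an alternate entry path written
in C can bracket the guest run with plain calls; a sketch (pmu_in_use
is an illustrative placeholder for however the caller tracks it):

	bool pmu_in_use = true;		/* illustrative only */

	kvmhv_save_host_pmu();
	kvmhv_load_guest_pmu(vcpu);
	/* ... enter and run the guest ... */
	kvmhv_save_guest_pmu(vcpu, pmu_in_use);
	kvmhv_load_host_pmu();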

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/asm-prototypes.h |   5 +
 arch/powerpc/kvm/book3s_hv_interrupts.S   |  95 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 363 --
 3 files changed, 253 insertions(+), 210 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 1f4691c..024e8fc 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -150,4 +150,9 @@ extern s32 patch__memset_nocache, patch__memcpy_nocache;
 
 extern long flush_count_cache;
 
+void kvmhv_save_host_pmu(void);
+void kvmhv_load_host_pmu(void);
+void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
+void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_POWERPC_ASM_PROTOTYPES_H */
diff --git a/arch/powerpc/kvm/book3s_hv_interrupts.S 
b/arch/powerpc/kvm/book3s_hv_interrupts.S
index 666b91c..a6d1001 100644
--- a/arch/powerpc/kvm/book3s_hv_interrupts.S
+++ b/arch/powerpc/kvm/book3s_hv_interrupts.S
@@ -64,52 +64,7 @@ BEGIN_FTR_SECTION
 END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
 
/* Save host PMU registers */
-BEGIN_FTR_SECTION
-   /* Work around P8 PMAE bug */
-   li  r3, -1
-   clrrdi  r3, r3, 10
-   mfspr   r8, SPRN_MMCR2
-   mtspr   SPRN_MMCR2, r3  /* freeze all counters using MMCR2 */
-   isync
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
-   li  r3, 1
-   sldir3, r3, 31  /* MMCR0_FC (freeze counters) bit */
-   mfspr   r7, SPRN_MMCR0  /* save MMCR0 */
-   mtspr   SPRN_MMCR0, r3  /* freeze all counters, disable 
interrupts */
-   mfspr   r6, SPRN_MMCRA
-   /* Clear MMCRA in order to disable SDAR updates */
-   li  r5, 0
-   mtspr   SPRN_MMCRA, r5
-   isync
-   lbz r5, PACA_PMCINUSE(r13)  /* is the host using the PMU? */
-   cmpwi   r5, 0
-   beq 31f /* skip if not */
-   mfspr   r5, SPRN_MMCR1
-   mfspr   r9, SPRN_SIAR
-   mfspr   r10, SPRN_SDAR
-   std r7, HSTATE_MMCR0(r13)
-   std r5, HSTATE_MMCR1(r13)
-   std r6, HSTATE_MMCRA(r13)
-   std r9, HSTATE_SIAR(r13)
-   std r10, HSTATE_SDAR(r13)
-BEGIN_FTR_SECTION
-   mfspr   r9, SPRN_SIER
-   std r8, HSTATE_MMCR2(r13)
-   std r9, HSTATE_SIER(r13)
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
-   mfspr   r3, SPRN_PMC1
-   mfspr   r5, SPRN_PMC2
-   mfspr   r6, SPRN_PMC3
-   mfspr   r7, SPRN_PMC4
-   mfspr   r8, SPRN_PMC5
-   mfspr   r9, SPRN_PMC6
-   stw r3, HSTATE_PMC1(r13)
-   stw r5, HSTATE_PMC2(r13)
-   stw r6, HSTATE_PMC3(r13)
-   stw r7, HSTATE_PMC4(r13)
-   stw r8, HSTATE_PMC5(r13)
-   stw r9, HSTATE_PMC6(r13)
-31:
+   bl  kvmhv_save_host_pmu
 
/*
 * Put whatever is in the decrementer into the
@@ -161,3 +116,51 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
ld  r0, PPC_LR_STKOFF(r1)
mtlrr0
blr
+
+_GLOBAL(kvmhv_save_host_pmu)
+BEGIN_FTR_SECTION
+   /* Work around P8 PMAE bug */
+   li  r3, -1
+   clrrdi  r3, r3, 10
+   mfspr   r8, SPRN_MMCR2
+   mtspr   SPRN_MMCR2, r3  /* freeze all counters using MMCR2 */
+   isync
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
+   li  r3, 1
+   sldir3, r3, 31  /* MMCR0_FC (freeze counters) bit */
+   mfspr   r7, SPRN_MMCR0  /* save MMCR0 */
+   mtspr   SPRN_MMCR0, r3  /* freeze all counters, disable 
interrupts */
+   mfspr   r6, SPRN_MMCRA
+   /* Clear MMCRA in order to disable SDAR updates */
+   li  r5, 0
+   mtspr   SPRN_MMCRA, r5
+   isync
+   lbz r5, PACA_PMCINUSE(r13)  /* is the host using the PMU? */
+   cmpwi   r5, 0
+   beq 31f /* skip if not */
+   mfspr   r5, SPRN_MMCR1
+   mfspr   r9, SPRN_SIAR
+   mfspr   r10, SPRN_SDAR
+   std r7, HSTATE_MMCR0(r13)
+   std r5, HSTATE_MMCR1(r13)
+   std r6, HSTATE_MMCRA(r13)
+   std r9, HSTATE_SIAR(r13)
+   std r10, HSTATE_SDAR(r13)
+BEGIN_FTR_SECTION
+   mfspr   r9, SPRN_SIER
+   std r8, HSTATE_MMCR2(r13)
+   std r9, HSTATE_SIER(r13)
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
+   mfspr   r3, SPRN_PMC1
+   mfspr   r5, SPRN_PMC2
+   mfspr   r6, SPRN_PMC3
+   mfspr   r7, SPRN_PMC4
+   mfspr   r8, SPRN_PMC5
+   mfspr   r9, SPRN_PMC6
+   stw r3, HSTATE_PMC1(r13)
+   stw r5, HSTATE_PMC2(r13)
+   stw r6, HSTATE_PMC3(r13)
+   stw r7, 

[PATCH v2 01/33] KVM: PPC: Book3S: Simplify external interrupt handling

2018-09-28 Thread Paul Mackerras
Currently we use two bits in the vcpu pending_exceptions bitmap to
indicate that an external interrupt is pending for the guest, one
for "one-shot" interrupts that are cleared when delivered, and one
for interrupts that persist until cleared by an explicit action of
the OS (e.g. an acknowledge to an interrupt controller).  The
BOOK3S_IRQPRIO_EXTERNAL bit is used for one-shot interrupt requests
and BOOK3S_IRQPRIO_EXTERNAL_LEVEL is used for persisting interrupts.

In practice BOOK3S_IRQPRIO_EXTERNAL never gets used, because our
Book3S platforms generally, and pseries in particular, expect
external interrupt requests to persist until they are acknowledged
at the interrupt controller.  That combined with the confusion
introduced by having two bits for what is essentially the same thing
makes it attractive to simplify things by only using one bit.  This
patch does that.

With this patch there is only BOOK3S_IRQPRIO_EXTERNAL, and by default
it has the semantics of a persisting interrupt.  In order to avoid
breaking the ABI, we introduce a new "external_oneshot" flag which
preserves the behaviour of the KVM_INTERRUPT ioctl with the
KVM_INTERRUPT_SET argument.
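
As a sketch of the resulting scheme, the queueing side reduces to a
single priority bit plus the new flag (simplified from the patch):

	/* kvmppc_core_queue_external(), in outline */
	if (irq->irq == KVM_INTERRUPT_SET)
		vcpu->arch.external_oneshot = 1;	/* clear after delivery */
	kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_EXTERNAL);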

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_asm.h |  4 +--
 arch/powerpc/include/asm/kvm_host.h|  1 +
 arch/powerpc/kvm/book3s.c  | 43 --
 arch/powerpc/kvm/book3s_hv_rm_xics.c   |  5 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S|  4 +--
 arch/powerpc/kvm/book3s_pr.c   |  1 -
 arch/powerpc/kvm/book3s_xics.c | 11 +++
 arch/powerpc/kvm/book3s_xive_template.c|  2 +-
 arch/powerpc/kvm/trace_book3s.h|  1 -
 tools/perf/arch/powerpc/util/book3s_hv_exits.h |  1 -
 10 files changed, 44 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index a790d5c..1f32191 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -84,7 +84,6 @@
 #define BOOK3S_INTERRUPT_INST_STORAGE  0x400
 #define BOOK3S_INTERRUPT_INST_SEGMENT  0x480
 #define BOOK3S_INTERRUPT_EXTERNAL  0x500
-#define BOOK3S_INTERRUPT_EXTERNAL_LEVEL    0x501
 #define BOOK3S_INTERRUPT_EXTERNAL_HV   0x502
 #define BOOK3S_INTERRUPT_ALIGNMENT 0x600
 #define BOOK3S_INTERRUPT_PROGRAM   0x700
@@ -134,8 +133,7 @@
 #define BOOK3S_IRQPRIO_EXTERNAL14
 #define BOOK3S_IRQPRIO_DECREMENTER 15
 #define BOOK3S_IRQPRIO_PERFORMANCE_MONITOR 16
-#define BOOK3S_IRQPRIO_EXTERNAL_LEVEL  17
-#define BOOK3S_IRQPRIO_MAX 18
+#define BOOK3S_IRQPRIO_MAX 17
 
 #define BOOK3S_HFLAG_DCBZ32    0x1
 #define BOOK3S_HFLAG_SLB   0x2
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 906bcbdf..3cd0b9f 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -707,6 +707,7 @@ struct kvm_vcpu_arch {
u8 hcall_needed;
u8 epr_flags; /* KVMPPC_EPR_xxx */
u8 epr_needed;
+   u8 external_oneshot;/* clear external irq after delivery */
 
u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 87348e4..66a5521 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -150,7 +150,6 @@ static int kvmppc_book3s_vec2irqprio(unsigned int vec)
case 0x400: prio = BOOK3S_IRQPRIO_INST_STORAGE; break;
case 0x480: prio = BOOK3S_IRQPRIO_INST_SEGMENT; break;
case 0x500: prio = BOOK3S_IRQPRIO_EXTERNAL; break;
-   case 0x501: prio = BOOK3S_IRQPRIO_EXTERNAL_LEVEL;   break;
case 0x600: prio = BOOK3S_IRQPRIO_ALIGNMENT;break;
case 0x700: prio = BOOK3S_IRQPRIO_PROGRAM;  break;
case 0x800: prio = BOOK3S_IRQPRIO_FP_UNAVAIL;   break;
@@ -236,18 +235,35 @@ EXPORT_SYMBOL_GPL(kvmppc_core_dequeue_dec);
 void kvmppc_core_queue_external(struct kvm_vcpu *vcpu,
 struct kvm_interrupt *irq)
 {
-   unsigned int vec = BOOK3S_INTERRUPT_EXTERNAL;
-
-   if (irq->irq == KVM_INTERRUPT_SET_LEVEL)
-   vec = BOOK3S_INTERRUPT_EXTERNAL_LEVEL;
+   /*
+* This case (KVM_INTERRUPT_SET) should never actually arise for
+* a pseries guest (because pseries guests expect their interrupt
+* controllers to continue asserting an external interrupt request
+* until it is acknowledged at the interrupt controller), but is
+* included to avoid ABI breakage and potentially for other
+* sorts of guest.
+*
+* There is a subtlety here: HV KVM does not test the
+* external_oneshot flag in the code that synthesizes
+* external interrupts for the guest 

[PATCH v2 02/33] KVM: PPC: Book3S HV: Remove left-over code in XICS-on-XIVE emulation

2018-09-28 Thread Paul Mackerras
This removes code that clears the external interrupt pending bit in
the pending_exceptions bitmap.  This is left over from an earlier
iteration of the code where this bit was set when an escalation
interrupt arrived in order to wake the vcpu from cede.  Currently
we set the vcpu->arch.irq_pending flag instead for this purpose.
Therefore there is no need to do anything with the pending_exceptions
bitmap.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_xive_template.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive_template.c 
b/arch/powerpc/kvm/book3s_xive_template.c
index 203ea65..033363d 100644
--- a/arch/powerpc/kvm/book3s_xive_template.c
+++ b/arch/powerpc/kvm/book3s_xive_template.c
@@ -280,14 +280,6 @@ X_STATIC unsigned long GLUE(X_PFX,h_xirr)(struct kvm_vcpu 
*vcpu)
/* First collect pending bits from HW */
GLUE(X_PFX,ack_pending)(xc);
 
-   /*
-* Cleanup the old-style bits if needed (they may have been
-* set by pull or an escalation interrupts).
-*/
-   if (test_bit(BOOK3S_IRQPRIO_EXTERNAL, &vcpu->arch.pending_exceptions))
-   clear_bit(BOOK3S_IRQPRIO_EXTERNAL,
- &vcpu->arch.pending_exceptions);
-
pr_devel(" new pending=0x%02x hw_cppr=%d cppr=%d\n",
 xc->pending, xc->hw_cppr, xc->cppr);
 
-- 
2.7.4



[PATCH v2 00/33] KVM: PPC: Book3S HV: Nested HV virtualization

2018-09-28 Thread Paul Mackerras
This patch series implements nested virtualization in the KVM-HV
module for radix guests on POWER9 systems.  Unlike PR KVM, nested
guests are able to run in supervisor mode, meaning that performance is
much better than with PR KVM, and is very close to the performance of
a non-nested guest for most things.

The way this works is that each nested guest is also a guest of the
real hypervisor, also known as the level 0 or L0 hypervisor, which
runs in the CPU's hypervisor mode.  Its guests are at level 1, and
when a L1 system wants to run a nested guest, it performs hypercalls
to L0 to set up a virtual partition table in its (L1's) memory and to
enter the L2 guest.  The L0 hypervisor maintains a shadow
partition-scoped page table for the L2 guest and demand-faults entries
into it by translating the L1 real addresses in the partition-scoped
page table in L1 memory into L0 real addresses and puts them in the
shadow partition-scoped page table for L2.

Essentially what this does is provide L1 with the ability to perform
(some) hypervisor functions, using a mix of instruction emulation and
paravirtualization.

Along the way, this implements a new guest entry/exit path for radix
guests on POWER9 systems which is written almost entirely in C and
does not do any of the inter-thread coordination that the existing
entry/exit path does.  It is only used for radix guests and when
indep_threads_mode=Y (the default).

The limitations of this scheme are:

- Host and all nested hypervisors and their guests must be in radix
  mode.

- Nested hypervisors cannot use indep_threads_mode=N.

- If the host (i.e. the L0 hypervisor) has indep_threads_mode=N then
  only one nested vcpu can be run on any core at any given time; the
  secondary threads will do nothing.

- A nested hypervisor can't use a smaller page size than the base page
  size of the hypervisor(s) above it.

- A nested hypervisor is limited to having at most 1023 guests below
  it, each of which can have at most NR_CPUS virtual CPUs.

Changes in this version since version 1 (the RFC series):

- Rebased onto the kvm tree master branch.

- Added hypercall to do TLB invalidations and code to use it.

- Implemented a different method to ensure the build can succeed when
  CONFIG_PPC_PSERIES=n.

- Fixed bugs relating to interrupt and doorbell handling.

- Reimplemented the rmap code to use much less memory.

- Changed some names, comments and code based on review feedback.

- Handle the case when L0 and L1 are of different endianness.

- More sanitization of the register values provided by L1.

- Fixed bugs that prevented nested guests from successfully running
  guests under them (double nesting).

- Fixed a bug relating to the max_nested_lpid computation.

- Fixed a bug causing continual HDSI interrupts when a page of a page
  table or process table got paged out.

Paul.

 Documentation/virtual/kvm/api.txt  |   15 +
 arch/powerpc/include/asm/asm-prototypes.h  |   21 +
 arch/powerpc/include/asm/book3s/64/mmu-hash.h  |   12 +
 .../powerpc/include/asm/book3s/64/tlbflush-radix.h |1 +
 arch/powerpc/include/asm/hvcall.h  |   41 +
 arch/powerpc/include/asm/kvm_asm.h |4 +-
 arch/powerpc/include/asm/kvm_book3s.h  |   49 +-
 arch/powerpc/include/asm/kvm_book3s_64.h   |  119 +-
 arch/powerpc/include/asm/kvm_book3s_asm.h  |3 +
 arch/powerpc/include/asm/kvm_booke.h   |4 +-
 arch/powerpc/include/asm/kvm_host.h|   16 +-
 arch/powerpc/include/asm/kvm_ppc.h |4 +
 arch/powerpc/include/asm/ppc-opcode.h  |1 +
 arch/powerpc/include/asm/reg.h |3 +
 arch/powerpc/include/uapi/asm/kvm.h|1 +
 arch/powerpc/kernel/asm-offsets.c  |5 +-
 arch/powerpc/kernel/cpu_setup_power.S  |4 +-
 arch/powerpc/kvm/Makefile  |3 +-
 arch/powerpc/kvm/book3s.c  |   43 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c|7 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  720 ---
 arch/powerpc/kvm/book3s_emulate.c  |   13 +-
 arch/powerpc/kvm/book3s_hv.c   |  923 --
 arch/powerpc/kvm/book3s_hv_builtin.c   |   92 +-
 arch/powerpc/kvm/book3s_hv_interrupts.S|   95 +-
 arch/powerpc/kvm/book3s_hv_nested.c| 1318 
 arch/powerpc/kvm/book3s_hv_ras.c   |   10 +
 arch/powerpc/kvm/book3s_hv_rm_xics.c   |   13 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S|  809 ++--
 arch/powerpc/kvm/book3s_hv_tm.c|6 +-
 arch/powerpc/kvm/book3s_hv_tm_builtin.c|5 +-
 arch/powerpc/kvm/book3s_pr.c   |5 +-
 arch/powerpc/kvm/book3s_xics.c |   14 +-
 arch/powerpc/kvm/book3s_xive.c |  

[PATCH v2 03/33] KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C code

2018-09-28 Thread Paul Mackerras
This is based on a patch by Suraj Jitindar Singh.

This moves the code in book3s_hv_rmhandlers.S that generates an
external, decrementer or privileged doorbell interrupt just before
entering the guest to C code in book3s_hv_builtin.c.  This is to
make future maintenance and modification easier.  The algorithm
expressed in the C code is almost identical to the previous
algorithm.
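
One subtlety the C version makes explicit is the decrementer width:
when the large decrementer (LPCR[LD]) is not enabled, DEC is a 32-bit
register, so the value read must be sign-extended before testing
whether it has expired.  In outline:

	long dec = mfspr(SPRN_DEC);
	if (!(lpcr & LPCR_LD))
		dec = (int)dec;		/* 32-bit DEC: sign-extend bit 31 */
	if (dec < 0)
		vec = BOOK3S_INTERRUPT_DECREMENTER;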

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_ppc.h  |  1 +
 arch/powerpc/kvm/book3s_hv.c|  3 +-
 arch/powerpc/kvm/book3s_hv_builtin.c| 48 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 70 -
 4 files changed, 67 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index e991821..83d61b8 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -652,6 +652,7 @@ int kvmppc_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long 
server,
 unsigned long mfrr);
 int kvmppc_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr);
 int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr);
+void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu);
 
 /*
  * Host-side operations we want to set up while running in real
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 3e3a715..49a686c 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -730,8 +730,7 @@ static bool kvmppc_doorbell_pending(struct kvm_vcpu *vcpu)
/*
 * Ensure that the read of vcore->dpdes comes after the read
 * of vcpu->doorbell_request.  This barrier matches the
-* lwsync in book3s_hv_rmhandlers.S just before the
-* fast_guest_return label.
+* smp_wmb() in kvmppc_guest_entry_inject_int().
 */
smp_rmb();
vc = vcpu->arch.vcore;
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index fc6bb96..ccfea5b 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -729,3 +729,51 @@ void kvmhv_p9_restore_lpcr(struct kvm_split_mode *sip)
smp_mb();
local_paca->kvm_hstate.kvm_split_mode = NULL;
 }
+
+/*
+ * Is there a PRIV_DOORBELL pending for the guest (on POWER9)?
+ * Can we inject a Decrementer or an External interrupt?
+ */
+void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu)
+{
+   int ext;
+   unsigned long vec = 0;
+   unsigned long lpcr;
+
+   /* Insert EXTERNAL bit into LPCR at the MER bit position */
+   ext = (vcpu->arch.pending_exceptions >> BOOK3S_IRQPRIO_EXTERNAL) & 1;
+   lpcr = mfspr(SPRN_LPCR);
+   lpcr |= ext << LPCR_MER_SH;
+   mtspr(SPRN_LPCR, lpcr);
+   isync();
+
+   if (vcpu->arch.shregs.msr & MSR_EE) {
+   if (ext) {
+   vec = BOOK3S_INTERRUPT_EXTERNAL;
+   } else {
+   long int dec = mfspr(SPRN_DEC);
+   if (!(lpcr & LPCR_LD))
+   dec = (int) dec;
+   if (dec < 0)
+   vec = BOOK3S_INTERRUPT_DECREMENTER;
+   }
+   }
+   if (vec) {
+   unsigned long msr, old_msr = vcpu->arch.shregs.msr;
+
+   kvmppc_set_srr0(vcpu, kvmppc_get_pc(vcpu));
+   kvmppc_set_srr1(vcpu, old_msr);
+   kvmppc_set_pc(vcpu, vec);
+   msr = vcpu->arch.intr_msr;
+   if (MSR_TM_ACTIVE(old_msr))
+   msr |= MSR_TS_S;
+   vcpu->arch.shregs.msr = msr;
+   }
+
+   if (vcpu->arch.doorbell_request) {
+   mtspr(SPRN_DPDES, 1);
+   vcpu->arch.vcore->dpdes = 1;
+   smp_wmb();
+   vcpu->arch.doorbell_request = 0;
+   }
+}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 77960e6..6752da1 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1101,13 +1101,20 @@ no_xive:
 #endif /* CONFIG_KVM_XICS */
 
 deliver_guest_interrupt:
-   ld  r6, VCPU_CTR(r4)
-   ld  r7, VCPU_XER(r4)
-
-   mtctr   r6
-   mtxer   r7
-
 kvmppc_cede_reentry:   /* r4 = vcpu, r13 = paca */
+   /* Check if we can deliver an external or decrementer interrupt now */
+   ld  r0, VCPU_PENDING_EXC(r4)
+BEGIN_FTR_SECTION
+   /* On POWER9, also check for emulated doorbell interrupt */
+   lbz r3, VCPU_DBELL_REQ(r4)
+   or  r0, r0, r3
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
+   cmpdi   r0, 0
+   beq 71f
+   mr  r3, r4
+   bl  kvmppc_guest_entry_inject_int
+   ld  r4, HSTATE_KVM_VCPU(r13)
+71:
ld  r10, VCPU_PC(r4)
ld  r11, VCPU_MSR(r4)
ld  r6, VCPU_SRR0(r4)
@@ -1120,53 +1127,10 @@ 

[PATCH v2] powerpc: wire up memtest

2018-09-28 Thread Christophe Leroy
Add a call to early_memtest() so that a kernel compiled with
CONFIG_MEMTEST really performs a memtest at startup when requested
via the 'memtest' boot parameter.
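
For example (illustrative invocation): booting with 'memtest' alone
runs every built-in pattern (17 at the time, matching the
"# of tests: 17" log later in this thread), while 'memtest=4' limits
the run to the first 4 patterns.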

Signed-off-by: Christophe Leroy 
---
 v2: moved the test after initmem_init() as PPC64 sets max_low_pfn later than 
PPC32.

 arch/powerpc/kernel/setup-common.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 93fa0c99681e..9ca9db707bcb 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -966,6 +967,8 @@ void __init setup_arch(char **cmdline_p)
 
initmem_init();
 
+   early_memtest(min_low_pfn << PAGE_SHIFT, max_low_pfn << PAGE_SHIFT);
+
 #ifdef CONFIG_DUMMY_CONSOLE
conswitchp = &dummy_con;
 #endif
-- 
2.13.3



[PATCH v2 5/5] soc/fsl_qbman: export coalesce change API

2018-09-28 Thread Madalin Bucur
Export the API required to control the QMan portal interrupt coalescing
settings.
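
A sketch of how a portal consumer might use the exported calls; the
threshold and period values are arbitrary illustrations, not
recommendations:

	u8 thresh;
	u32 period;

	/* read back the current settings */
	qman_dqrr_get_ithresh(portal, &thresh);
	qman_portal_get_iperiod(portal, &period);

	/* coalesce harder: require more DQRR entries, or a longer
	 * period, before an interrupt is raised
	 */
	qman_dqrr_set_ithresh(portal, 8);
	qman_portal_set_iperiod(portal, 2048);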

Signed-off-by: Madalin Bucur 
---
 drivers/soc/fsl/qbman/qman.c | 31 +++
 include/soc/fsl/qman.h   | 27 +++
 2 files changed, 58 insertions(+)

diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
index 99d0f87889b8..8ab75bb44c4d 100644
--- a/drivers/soc/fsl/qbman/qman.c
+++ b/drivers/soc/fsl/qbman/qman.c
@@ -1012,6 +1012,37 @@ static inline void put_affine_portal(void)
 
 static struct workqueue_struct *qm_portal_wq;
 
+void qman_dqrr_set_ithresh(struct qman_portal *portal, u8 ithresh)
+{
+   if (!portal)
+   return;
+
+   qm_dqrr_set_ithresh(&portal->p, ithresh);
+   portal->p.dqrr.ithresh = ithresh;
+}
+EXPORT_SYMBOL(qman_dqrr_set_ithresh);
+
+void qman_dqrr_get_ithresh(struct qman_portal *portal, u8 *ithresh)
+{
+   if (portal && ithresh)
+   *ithresh = portal->p.dqrr.ithresh;
+}
+EXPORT_SYMBOL(qman_dqrr_get_ithresh);
+
+void qman_portal_get_iperiod(struct qman_portal *portal, u32 *iperiod)
+{
+   if (portal && iperiod)
+   *iperiod = qm_in(&portal->p, QM_REG_ITPR);
+}
+EXPORT_SYMBOL(qman_portal_get_iperiod);
+
+void qman_portal_set_iperiod(struct qman_portal *portal, u32 iperiod)
+{
+   if (portal)
+   qm_out(&portal->p, QM_REG_ITPR, iperiod);
+}
+EXPORT_SYMBOL(qman_portal_set_iperiod);
+
 int qman_wq_alloc(void)
 {
qm_portal_wq = alloc_workqueue("qman_portal_wq", 0, 1);
diff --git a/include/soc/fsl/qman.h b/include/soc/fsl/qman.h
index d4dfefdee6c1..42f50eb51529 100644
--- a/include/soc/fsl/qman.h
+++ b/include/soc/fsl/qman.h
@@ -1186,4 +1186,31 @@ int qman_alloc_cgrid_range(u32 *result, u32 count);
  */
 int qman_release_cgrid(u32 id);
 
+/**
+ * qman_dqrr_get_ithresh - Get coalesce interrupt threshold
+ * @portal: portal to get the value for
+ * @ithresh: threshold pointer
+ */
+void qman_dqrr_get_ithresh(struct qman_portal *portal, u8 *ithresh);
+
+/**
+ * qman_dqrr_set_ithresh - Set coalesce interrupt threshold
+ * @portal: portal to set the new value on
+ * @ithresh: new threshold value
+ */
+void qman_dqrr_set_ithresh(struct qman_portal *portal, u8 ithresh);
+
+/**
+ * qman_portal_get_iperiod - Get coalesce interrupt period
+ * @portal: portal to get the value for
+ * @iperiod: period pointer
+ */
+void qman_portal_get_iperiod(struct qman_portal *portal, u32 *iperiod);
+/**
+ * qman_portal_set_iperiod - Set coalesce interrupt period
+ * @portal: portal to set the new value on
+ * @iperiod: new period value
+ */
+void qman_portal_set_iperiod(struct qman_portal *portal, u32 iperiod);
+
 #endif /* __FSL_QMAN_H */
-- 
2.1.0



[PATCH v2 4/5] soc/fsl/qbman: Use last response to determine valid bit

2018-09-28 Thread Madalin Bucur
From: Roy Pledge 

Use the last valid response when determining what valid bit
to use next for management commands. This is needed in the
case that the portal was previously used by other software
like a bootloader or if the kernel is restarted without a
hardware reset.
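
The selection logic added below can be summarised as a small truth
table (verb bytes as read from RR0/RR1 at init time):

	RR0 verb   RR1 verb   meaning                      rridx  CR valid bit
	0          0          no command since reset       1      1
	non-zero   any        last response was in RR0     1      1
	0          non-zero   last response was in RR1     0      0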

Signed-off-by: Roy Pledge 
Signed-off-by: Madalin Bucur 
---
 drivers/soc/fsl/qbman/qman.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
index 0ffe7a1d0eae..99d0f87889b8 100644
--- a/drivers/soc/fsl/qbman/qman.c
+++ b/drivers/soc/fsl/qbman/qman.c
@@ -850,12 +850,24 @@ static inline void qm_mr_set_ithresh(struct qm_portal 
*portal, u8 ithresh)
 
 static inline int qm_mc_init(struct qm_portal *portal)
 {
+   u8 rr0, rr1;
struct qm_mc *mc = &portal->mc;
 
mc->cr = portal->addr.ce + QM_CL_CR;
mc->rr = portal->addr.ce + QM_CL_RR0;
-   mc->rridx = (mc->cr->_ncw_verb & QM_MCC_VERB_VBIT)
-   ? 0 : 1;
+   /*
+* The expected valid bit polarity for the next CR command is 0
+* if RR1 contains a valid response, and is 1 if RR0 contains a
+* valid response. If both RR contain all 0, this indicates that
+* no command has been executed since reset, in which case the
+* expected valid bit polarity is 1.
+*/
+   rr0 = mc->rr->verb;
+   rr1 = (mc->rr+1)->verb;
+   if ((rr0 == 0 && rr1 == 0) || rr0 != 0)
+   mc->rridx = 1;
+   else
+   mc->rridx = 0;
mc->vbit = mc->rridx ? QM_MCC_VERB_VBIT : 0;
 #ifdef CONFIG_FSL_DPAA_CHECKING
mc->state = qman_mc_idle;
-- 
2.1.0



[PATCH v2 3/5] soc/fsl/qbman: Add 64 bit DMA addressing requirement to QBMan

2018-09-28 Thread Madalin Bucur
From: Roy Pledge 

The QBMan block is memory mapped on SoCs above a 32 bit (4 Gigabyte)
boundary so enabling 64 bit DMA addressing is needed for QBMan to
be usable.

Signed-off-by: Roy Pledge 
Signed-off-by: Madalin Bucur 
---
 drivers/soc/fsl/qbman/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/soc/fsl/qbman/Kconfig b/drivers/soc/fsl/qbman/Kconfig
index d570cb5fd381..b0943e541796 100644
--- a/drivers/soc/fsl/qbman/Kconfig
+++ b/drivers/soc/fsl/qbman/Kconfig
@@ -1,6 +1,6 @@
 menuconfig FSL_DPAA
bool "QorIQ DPAA1 framework support"
-   depends on (FSL_SOC_BOOKE || ARCH_LAYERSCAPE)
+   depends on ((FSL_SOC_BOOKE || ARCH_LAYERSCAPE) && ARCH_DMA_ADDR_T_64BIT)
select GENERIC_ALLOCATOR
help
  The Freescale Data Path Acceleration Architecture (DPAA) is a set of
-- 
2.1.0



[PATCH v2 1/5] soc/fsl/qbman: Check if CPU is offline when initializing portals

2018-09-28 Thread Madalin Bucur
From: Roy Pledge 

If the CPU to which the portal interrupt should be affine is offline
at boot time, affine the portal interrupt to another online CPU
instead. If the CPU is later brought online the hotplug handler will
correctly adjust the affinity. Moved the common code into a function.

Signed-off-by: Roy Pledge 
Signed-off-by: Madalin Bucur 
---
 drivers/soc/fsl/qbman/bman.c |  6 ++
 drivers/soc/fsl/qbman/dpaa_sys.h | 20 
 drivers/soc/fsl/qbman/qman.c |  6 ++
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/soc/fsl/qbman/bman.c b/drivers/soc/fsl/qbman/bman.c
index f9485cedc648..f84ab596bde8 100644
--- a/drivers/soc/fsl/qbman/bman.c
+++ b/drivers/soc/fsl/qbman/bman.c
@@ -562,11 +562,9 @@ static int bman_create_portal(struct bman_portal *portal,
dev_err(c->dev, "request_irq() failed\n");
goto fail_irq;
}
-   if (c->cpu != -1 && irq_can_set_affinity(c->irq) &&
-   irq_set_affinity(c->irq, cpumask_of(c->cpu))) {
-   dev_err(c->dev, "irq_set_affinity() failed\n");
+
+   if (dpaa_set_portal_irq_affinity(c->dev, c->irq, c->cpu))
goto fail_affinity;
-   }
 
/* Need RCR to be empty before continuing */
ret = bm_rcr_get_fill(p);
diff --git a/drivers/soc/fsl/qbman/dpaa_sys.h b/drivers/soc/fsl/qbman/dpaa_sys.h
index 9f379000da85..ae8afa552b1e 100644
--- a/drivers/soc/fsl/qbman/dpaa_sys.h
+++ b/drivers/soc/fsl/qbman/dpaa_sys.h
@@ -111,4 +111,24 @@ int qbman_init_private_mem(struct device *dev, int idx, 
dma_addr_t *addr,
 #define QBMAN_MEMREMAP_ATTR    MEMREMAP_WC
 #endif
 
+static inline int dpaa_set_portal_irq_affinity(struct device *dev,
+  int irq, int cpu)
+{
+   int ret = 0;
+
+   if (!irq_can_set_affinity(irq)) {
+   dev_err(dev, "unable to set IRQ affinity\n");
+   return -EINVAL;
+   }
+
+   if (cpu == -1 || !cpu_online(cpu))
+   cpu = cpumask_any(cpu_online_mask);
+
+   ret = irq_set_affinity(irq, cpumask_of(cpu));
+   if (ret)
+   dev_err(dev, "irq_set_affinity() on CPU %d failed\n", cpu);
+
+   return ret;
+}
+
 #endif /* __DPAA_SYS_H */
diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
index ecb22749df0b..0ffe7a1d0eae 100644
--- a/drivers/soc/fsl/qbman/qman.c
+++ b/drivers/soc/fsl/qbman/qman.c
@@ -1210,11 +1210,9 @@ static int qman_create_portal(struct qman_portal *portal,
dev_err(c->dev, "request_irq() failed\n");
goto fail_irq;
}
-   if (c->cpu != -1 && irq_can_set_affinity(c->irq) &&
-   irq_set_affinity(c->irq, cpumask_of(c->cpu))) {
-   dev_err(c->dev, "irq_set_affinity() failed\n");
+
+   if (dpaa_set_portal_irq_affinity(c->dev, c->irq, c->cpu))
goto fail_affinity;
-   }
 
/* Need EQCR to be empty before continuing */
isdr &= ~QM_PIRQ_EQCI;
-- 
2.1.0



[PATCH v2 2/5] soc/fsl/qbman: replace CPU 0 with any online CPU in hotplug handlers

2018-09-28 Thread Madalin Bucur
The existing code sets the portal IRQ affinity to CPU 0 in the
offline hotplug handler. If CPU 0 is itself offline, this is invalid.
Use a different online CPU instead.

Signed-off-by: Madalin Bucur 
---
 drivers/soc/fsl/qbman/bman_portal.c | 4 +++-
 drivers/soc/fsl/qbman/qman_portal.c | 6 --
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/soc/fsl/qbman/bman_portal.c b/drivers/soc/fsl/qbman/bman_portal.c
index 2f71f7df3465..088cdfa7c034 100644
--- a/drivers/soc/fsl/qbman/bman_portal.c
+++ b/drivers/soc/fsl/qbman/bman_portal.c
@@ -65,7 +65,9 @@ static int bman_offline_cpu(unsigned int cpu)
if (!pcfg)
return 0;
 
-   irq_set_affinity(pcfg->irq, cpumask_of(0));
+   /* use any other online CPU */
+   cpu = cpumask_any_but(cpu_online_mask, cpu);
+   irq_set_affinity(pcfg->irq, cpumask_of(cpu));
return 0;
 }
 
diff --git a/drivers/soc/fsl/qbman/qman_portal.c b/drivers/soc/fsl/qbman/qman_portal.c
index a120002b630e..4efd6ea598b1 100644
--- a/drivers/soc/fsl/qbman/qman_portal.c
+++ b/drivers/soc/fsl/qbman/qman_portal.c
@@ -195,8 +195,10 @@ static int qman_offline_cpu(unsigned int cpu)
if (p) {
pcfg = qman_get_qm_portal_config(p);
if (pcfg) {
-   irq_set_affinity(pcfg->irq, cpumask_of(0));
-   qman_portal_update_sdest(pcfg, 0);
+   /* select any other online CPU */
+   cpu = cpumask_any_but(cpu_online_mask, cpu);
+   irq_set_affinity(pcfg->irq, cpumask_of(cpu));
+   qman_portal_update_sdest(pcfg, cpu);
}
}
return 0;
-- 
2.1.0
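
One detail worth noting when reading the hunks above: cpumask_any_but()
returns a value >= nr_cpu_ids when the mask contains no CPU other than
the excluded one (e.g. the last online CPU going down). A defensive
sketch, reusing the pcfg/cpu names from the hunk and stricter than what
this patch strictly needs:

	unsigned int target = cpumask_any_but(cpu_online_mask, cpu);

	/* No other CPU online: leave the affinity on the outgoing CPU */
	if (target >= nr_cpu_ids)
		target = cpu;

	irq_set_affinity(pcfg->irq, cpumask_of(target));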



Re: [PATCH] powerpc: wire up memtest

2018-09-28 Thread Christophe LEROY




On 28/09/2018 at 05:41, Michael Ellerman wrote:

Christophe Leroy  writes:

Add a call to early_memtest() so that a kernel compiled with
CONFIG_MEMTEST actually performs the memory test at startup when
requested via the 'memtest' boot parameter.

Signed-off-by: Christophe Leroy 
---
  arch/powerpc/kernel/setup-common.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 93fa0c99681e..904b728eb20d 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -33,6 +33,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -917,6 +918,8 @@ void __init setup_arch(char **cmdline_p)
/* Parse memory topology */
mem_topology_setup();
  
+	early_memtest(min_low_pfn << PAGE_SHIFT, max_low_pfn << PAGE_SHIFT);


On a ppc64le VM this boils down to early_memtest(0, 0) for me.

I think it's too early, we don't set up max_low_pfn until
initmem_init().

If I move it after initmem_init() then it does something more useful:


Ok. On my 8xx max_low_pfn is set in mem_topology_setup().

Moving the test after initmem_init() still works on the 8xx, so I'll do that.

Thanks for testing.

Christophe



early_memtest: # of tests: 17
   0x01450580 - 0x01450800 pattern 4c494e5558726c7a
   0x01450c00 - 0x0360 pattern 4c494e5558726c7a
   0x047c - 0x2fff pattern 4c494e5558726c7a
   0x3000 - 0x3ff24000 pattern 4c494e5558726c7a
   0x3fff4000 - 0x3fff4c00 pattern 4c494e5558726c7a
   0x3fff5000 - 0x3fff5300 pattern 4c494e5558726c7a
   0x3fff5c00 - 0x3fff5f00 pattern 4c494e5558726c7a
   0x3fff6800 - 0x3fff6b00 pattern 4c494e5558726c7a
   0x3fff7400 - 0x3fff7700 pattern 4c494e5558726c7a
   0x3fff8000 - 0x3fff8300 pattern 4c494e5558726c7a
   0x3fff8c00 - 0x3fff8f00 pattern 4c494e5558726c7a
   0x3fff9800 - 0x3fff9b00 pattern 4c494e5558726c7a
   0x3fffa400 - 0x3fffa700 pattern 4c494e5558726c7a
   0x3fffb000 - 0x3fffb300 pattern 4c494e5558726c7a
   0x3fffbc00 - 0x3fffbf00 pattern 4c494e5558726c7a
   0x3fffc800 - 0x3fffcb00 pattern 4c494e5558726c7a
   0x3fffd400 - 0x3fffd700 pattern 4c494e5558726c7a
   0x3fffe000 - 0x3fffe100 pattern 4c494e5558726c7a
   0x4000 - 0xffc1 pattern 4c494e5558726c7a
   0xfffa - 0xfffa5b00 pattern 4c494e5558726c7a
   0x0001 - 0x0001ffbe pattern 4c494e5558726c7a
   0x0001fff6 - 0x0001fff61b00 pattern 4c494e5558726c7a
   0x0001fffec000 - 0x0001fffec4b8 pattern 4c494e5558726c7a
   0x0001fffec524 - 0x0001fffec528 pattern 4c494e5558726c7a


cheers
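
Putting the conclusion into code, the agreed fix is to move the call
after initmem_init(); a rough sketch of the intended setup_arch()
ordering, not the final v2 patch:

	/* Parse memory topology */
	mem_topology_setup();

	/* ... */

	/* max_low_pfn is only final after this on some platforms */
	initmem_init();

	/* now [min_low_pfn, max_low_pfn) is meaningful everywhere */
	early_memtest(min_low_pfn << PAGE_SHIFT, max_low_pfn << PAGE_SHIFT);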



Re: [PATCH kernel] cxl: Remove unused include

2018-09-28 Thread Andrew Donnellan

On 28/9/18 4:38 pm, Alexey Kardashevskiy wrote:

The included opal.h gives the wrong idea that CXL makes PPC OPAL calls
while it does not, so let's remove it.

Signed-off-by: Alexey Kardashevskiy 


Thanks for catching this

Acked-by: Andrew Donnellan 


---
  drivers/misc/cxl/pci.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index b66d832..8cbcbb7 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -17,7 +17,6 @@
  #include 
  #include 
  #include 
-#include 
  #include 
  #include 
  #include 



--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH] powerpc/rtas: Fix a potential race between CPU-Offline & Migration

2018-09-28 Thread Gautham R Shenoy
Hi Nathan,

On Thu, Sep 27, 2018 at 12:31:34PM -0500, Nathan Fontenot wrote:
> On 09/27/2018 11:51 AM, Gautham R. Shenoy wrote:
> > From: "Gautham R. Shenoy" 
> > 
> > Live Partition Migrations require all the present CPUs to execute the
> > H_JOIN call, and hence rtas_ibm_suspend_me() onlines any offline CPUs
> > before initiating the migration for this purpose.
> > 
> > The commit 85a88cabad57
> > ("powerpc/pseries: Disable CPU hotplug across migrations")
> > disables any CPU-hotplug operations once all the offline CPUs are
> > brought online to prevent any further state change. Once the
> > CPU-Hotplug operation is disabled, the code assumes that all the CPUs
> > are online.
> > 
> > However, there is a minor window in rtas_ibm_suspend_me() between
> > onlining the offline CPUs and disabling CPU-Hotplug when a concurrent
> > CPU-offline operation initiated by userspace can succeed, thereby
> > nullifying the aforementioned assumption. In this unlikely case
> > these offlined CPUs will not call H_JOIN, resulting in a system hang.
> > 
> > Fix this by verifying that all the present CPUs are actually online
> > after CPU-Hotplug has been disabled, failing which we return from
> > rtas_ibm_suspend_me() with -EBUSY.
> 
> Would we also want to have the ability to re-try onlining all of the
> CPUs before failing the migration?

Given that we haven't been able to hit the issue in practice after your
fix to disable CPU hotplug across migrations, it indicates that the
race-window, if it is not merely a theoretical one, is extremely
narrow. So, the current patch addresses the safety aspect: should
someone manage to exploit this narrow race-window, it ensures that the
system doesn't hang.

Having the ability to retry onlining all the CPUs is only required for
progress of the LPM in this rarest of cases. We should add the code to
retry onlining the CPUs if the consequence of failing an LPM is high,
even in this rarest of cases. Otherwise IMHO we should be OK not adding
the additional code.

> 
> This would involve a bigger code change as the current code to online all
> CPUs would work in its current form.
> 
> -Nathan
> 
> > 
> > Cc: Nathan Fontenot 
> > Cc: Tyrel Datwyler 
> > Suggested-by: Michael Ellerman 
> > Signed-off-by: Gautham R. Shenoy 
> > ---
> >  arch/powerpc/kernel/rtas.c | 10 ++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> > index 2c7ed31..27f6fd3 100644
> > --- a/arch/powerpc/kernel/rtas.c
> > +++ b/arch/powerpc/kernel/rtas.c
> > @@ -982,6 +982,16 @@ int rtas_ibm_suspend_me(u64 handle)
> > }
> > 
> > cpu_hotplug_disable();
> > +
> > +   /* Check if we raced with a CPU-Offline Operation */
> > +   if (unlikely(!cpumask_equal(cpu_present_mask, cpu_online_mask))) {
> > +   pr_err("%s: Raced against a concurrent CPU-Offline\n",
> > +  __func__);
> > > +   atomic_set(&data.error, -EBUSY);
> > +   cpu_hotplug_enable();
> > +   goto out;
> > +   }
> > +
> > stop_topology_update();
> > 
> > /* Call function on all CPUs.  One of us will make the
> > 



[PATCH kernel] powerpc/powernv/npu: Remove unused headers and a macro.

2018-09-28 Thread Alexey Kardashevskiy
The macro and a few headers are not used, so remove them.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/powernv/npu-dma.c | 14 --
 1 file changed, 14 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 8006c54..3a5c4ed 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -9,32 +9,18 @@
  * License as published by the Free Software Foundation.
  */
 
-#include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
-#include 
-#include 
 
 #include 
-#include 
 #include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
 #include 
 
-#include "powernv.h"
 #include "pci.h"
 
-#define npu_to_phb(x) container_of(x, struct pnv_phb, npu)
-
 /*
  * spinlock to protect initialisation of an npu_context for a particular
  * mm_struct.
-- 
2.11.0



[PATCH kernel] KVM: PPC: Optimize clearing TCEs for sparse tables

2018-09-28 Thread Alexey Kardashevskiy
The powernv platform maintains 2 TCE tables for VFIO - a hardware TCE
table and a table with userspace addresses. These tables are radix trees;
we allocate indirect levels when they are written to. Since
memory allocation is problematic in real mode, we have 2 accessors
to the entries:
- for virtual mode: it allocates the memory and is always expected
to return non-NULL;
- for real mode: it does not allocate and can return NULL.

Also, DMA windows can span up to 55 bits of the address space, and since
we never have this much RAM, such windows are sparse. However, currently
the SPAPR TCE IOMMU driver walks through all TCEs to unpin DMA memory.

Since we maintain a userspace addresses table for VFIO which is a mirror
of the hardware table, we can use it to know which parts of the DMA
window have not been mapped and skip those; that is what this patch does.

The bare metal systems do not have this problem as they use a bypass mode
of a PHB which maps RAM directly.

This helps a lot with sparse DMA windows, reducing the shutdown time from
about 3 minutes per 1 billion TCEs to a few seconds for a 32GB sparse guest.
Just skipping the last level seems to be good enough.

As non-allocating accessor is used now in virtual mode as well, rename it
from IOMMU_TABLE_USERSPACE_ENTRY_RM (real mode) to _RO (read only).

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/iommu.h|  2 +-
 arch/powerpc/kvm/book3s_64_vio.c|  5 ++---
 arch/powerpc/kvm/book3s_64_vio_hv.c |  6 +++---
 drivers/vfio/vfio_iommu_spapr_tce.c | 22 --
 4 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 3d4b88c..35db0cb 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -126,7 +126,7 @@ struct iommu_table {
int it_nid;
 };
 
-#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \
+#define IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry) \
((tbl)->it_ops->useraddrptr((tbl), (entry), false))
 #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
((tbl)->it_ops->useraddrptr((tbl), (entry), true))
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index c0c64d1..62a8d03 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -410,11 +410,10 @@ static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
 {
struct mm_iommu_table_group_mem_t *mem = NULL;
const unsigned long pgsize = 1ULL << tbl->it_page_shift;
-   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry);
 
if (!pua)
-   /* it_userspace allocation might be delayed */
-   return H_TOO_HARD;
+   return H_SUCCESS;
 
mem = mm_iommu_lookup(kvm->mm, be64_to_cpu(*pua), pgsize);
if (!mem)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 389dac1..583031d 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -212,7 +212,7 @@ static long iommu_tce_xchg_rm(struct mm_struct *mm, struct iommu_table *tbl,
 
if (!ret && ((*direction == DMA_FROM_DEVICE) ||
(*direction == DMA_BIDIRECTIONAL))) {
-   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry);
+   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry);
/*
 * kvmppc_rm_tce_iommu_do_map() updates the UA cache after
 * calling this so we still get here a valid UA.
@@ -238,7 +238,7 @@ static long kvmppc_rm_tce_iommu_mapped_dec(struct kvm *kvm,
 {
struct mm_iommu_table_group_mem_t *mem = NULL;
const unsigned long pgsize = 1ULL << tbl->it_page_shift;
-   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry);
+   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry);
 
if (!pua)
/* it_userspace allocation might be delayed */
@@ -302,7 +302,7 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
 {
long ret;
unsigned long hpa = 0;
-   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry);
+   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry);
struct mm_iommu_table_group_mem_t *mem;
 
if (!pua)
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 96721b1..b1a8ab3 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -444,7 +444,7 @@ static void tce_iommu_unuse_page_v2(struct tce_container *container,
struct mm_iommu_table_group_mem_t *mem = NULL;
int ret;
unsigned long hpa = 0;
-   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry);
+   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry);
 
if (!pua)
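
In rough terms, the clearing loop this change enables looks like the
sketch below - simplified, since the actual patch also steps over whole
unallocated last-level blocks instead of testing one TCE at a time:

	unsigned long entry;

	for (entry = 0; entry < tbl->it_size; ++entry) {
		__be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry);

		/* no userspace-cache level allocated: never mapped, skip */
		if (!pua)
			continue;

		tce_iommu_unuse_page_v2(container, tbl, entry);
	}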

[PATCH kernel] powerpc/powernv/ioda: Allocate indirect TCE levels of cached userspace addresses on demand

2018-09-28 Thread Alexey Kardashevskiy
The powernv platform maintains 2 TCE tables for VFIO - a hardware TCE
table and a table with userspace addresses; the latter is used for
marking pages dirty when corresponding TCEs are unmapped from
the hardware table.

a68bd1267b72 ("powerpc/powernv/ioda: Allocate indirect TCE levels
on demand") enabled on-demand allocation of the hardware table,
but it missed the other table, which was still being fully allocated
at boot time. Fix this by allocating a single level on demand,
just like we do for the hardware table.

Fixes: a68bd1267b72 ("powerpc/powernv/ioda: Allocate indirect TCE levels on demand")
Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index fe96910..7639b21 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -299,7 +299,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
if (alloc_userspace_copy) {
offset = 0;
uas = pnv_pci_ioda2_table_do_alloc_pages(nid, level_shift,
-   levels, tce_table_size, &offset,
+   tmplevels, tce_table_size, &offset,
    &total_allocated_uas);
if (!uas)
goto free_tces_exit;
-- 
2.11.0



[PATCH kernel] cxl: Remove unused include

2018-09-28 Thread Alexey Kardashevskiy
The included opal.h gives the wrong idea that CXL makes PPC OPAL calls
while it does not, so let's remove it.

Signed-off-by: Alexey Kardashevskiy 
---
 drivers/misc/cxl/pci.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index b66d832..8cbcbb7 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -17,7 +17,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
2.11.0