Re: [PATCH v2 39/45] drivers: tty: serial: efm32-uart: use devm_* functions

2019-03-15 Thread Uwe Kleine-König
Hello Enrico,

On Thu, Mar 14, 2019 at 11:34:09PM +0100, Enrico Weigelt, metux IT consult 
wrote:
> Use the safer devm versions of memory mapping functions.

In which aspect is devm_ioremap safer than ioremap?

The only upside I'm aware of is that the memory is automatically
unmapped on device unbind. But we don't benefit from this because a
UART port is "released" before the device is unbound and we call
devm_iounmap() then anyhow. So this patch just adds a memory allocation
(side note: on a platform that is quite tight on RAM) with no added
benefit.

I didn't look at the other patches in this series, but assuming that
they are similar in spirit, the same question applies for them.

Am I missing anything?

Best regards
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | http://www.pengutronix.de/  |


[PATCH kernel RFC 1/2] vfio_pci: Allow device specific error handlers

2019-03-15 Thread Alexey Kardashevskiy
PCI device drivers can define their own pci_error_handlers which are called
on errors or before/after reset. The VFIO PCI driver defines one as well.

This adds a vfio_pci_error_handlers struct for VFIO PCI which is a wrapper
on top of vfio_err_handlers. At the moment it defines reset_done() -
this hook is called right after the device reset and it can be used to do
some device tweaking before the userspace gets a chance to use the device.

Signed-off-by: Alexey Kardashevskiy 
---
 drivers/vfio/pci/vfio_pci_private.h |  5 +
 drivers/vfio/pci/vfio_pci.c | 17 +
 2 files changed, 22 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index 1812cf22fc4f..aff96fa28726 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -87,8 +87,13 @@ struct vfio_pci_reflck {
struct mutexlock;
 };
 
+struct vfio_pci_error_handlers {
+   void (*reset_done)(struct vfio_pci_device *vdev);
+};
+
 struct vfio_pci_device {
struct pci_dev  *pdev;
+   struct vfio_pci_error_handlers *error_handlers;
void __iomem*barmap[PCI_STD_RESOURCE_END + 1];
boolbar_mmap_supported[PCI_STD_RESOURCE_END + 1];
u8  *pci_config_map;
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 5bd97fa632d3..6ebc441d91c3 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1434,8 +1434,25 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct 
pci_dev *pdev,
return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
+static void vfio_pci_reset_done(struct pci_dev *dev)
+{
+   struct vfio_pci_device *vdev;
+   struct vfio_device *device;
+
+   device = vfio_device_get_from_dev(&dev->dev);
+   if (device == NULL)
+   return;
+
+   vdev = vfio_device_data(device);
+   if (vdev && vdev->error_handlers && vdev->error_handlers->reset_done)
+   vdev->error_handlers->reset_done(vdev);
+
+   vfio_device_put(device);
+}
+
 static const struct pci_error_handlers vfio_err_handlers = {
.error_detected = vfio_pci_aer_err_detected,
+   .reset_done = vfio_pci_reset_done,
 };
 
 static struct pci_driver vfio_pci_driver = {
-- 
2.17.1



[PATCH kernel RFC 0/2] vfio, powerpc/powernv: Isolate GV100GL

2019-03-15 Thread Alexey Kardashevskiy


Here is an attempt to isolate the NVLink interconnects between GPUs so
that they can be passed through individually.

At the moment I mostly wonder about the sanity of the approach.

Please comment. Thanks.



Alexey Kardashevskiy (2):
  vfio_pci: Allow device specific error handlers
  vfio-pci-nvlink2: Implement interconnect isolation

 drivers/vfio/pci/vfio_pci_private.h  |  5 ++
 arch/powerpc/platforms/powernv/npu-dma.c | 24 +-
 drivers/vfio/pci/vfio_pci.c  | 17 
 drivers/vfio/pci/vfio_pci_nvlink2.c  | 98 
 4 files changed, 142 insertions(+), 2 deletions(-)

-- 
2.17.1



[PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation

2019-03-15 Thread Alexey Kardashevskiy
The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
(on POWER9) NVLinks. In addition to that, GPUs themselves have direct
peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
platform puts all interconnected GPUs into the same IOMMU group.

However, the user may want to pass individual GPUs to userspace, so we
need to put them into separate IOMMU groups and cut off the
interconnects.

Thankfully, V100 GPUs implement an interface to do so, by programming a
link-disabling mask into BAR0 of a GPU. Once a link is disabled in a GPU
using this interface, it cannot be re-enabled until a secondary bus reset
is issued to the GPU.

This defines a reset_done() handler for the V100 NVLink2 device which
determines which links need to be disabled. This relies on the presence
of the new "ibm,nvlink-peers" device tree property of a GPU telling which
PCI peers it is connected to (which includes NVLink bridges or peer GPUs).

This does not change the existing behaviour and instead adds
a new "isolate_nvlink" kernel parameter to allow such isolation.

The alternative approaches would be:

1. do this in the system firmware (skiboot) but for that we would need
to tell skiboot via an additional OPAL call whether or not we want this
isolation - skiboot is unaware of IOMMU groups.

2. do this in the secondary bus reset handler in the POWERNV platform -
the problem with that is that at that point the device is not enabled,
i.e. config space is not restored, so we would need to enable the device
(i.e. set the MMIO bit in the CMD register and program a valid address
into BAR0) in order to disable links, and then perhaps undo all this
initialization to bring the device back to the state where
pci_try_reset_function() expects it to be.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/powernv/npu-dma.c | 24 +-
 drivers/vfio/pci/vfio_pci_nvlink2.c  | 98 
 2 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 3a102378c8dc..6f5c769b6fc8 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -441,6 +441,23 @@ static void pnv_comp_attach_table_group(struct npu_comp 
*npucomp,
++npucomp->pe_num;
 }
 
+static bool isolate_nvlink;
+
+static int __init parse_isolate_nvlink(char *p)
+{
+   bool val;
+
+   if (!p)
+   val = true;
+   else if (kstrtobool(p, &val))
+   return -EINVAL;
+
+   isolate_nvlink = val;
+
+   return 0;
+}
+early_param("isolate_nvlink", parse_isolate_nvlink);
+
 struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
 {
struct iommu_table_group *table_group;
@@ -463,7 +480,7 @@ struct iommu_table_group 
*pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
hose = pci_bus_to_host(npdev->bus);
phb = hose->private_data;
 
-   if (hose->npu) {
+   if (hose->npu && !isolate_nvlink) {
if (!phb->npucomp) {
phb->npucomp = kzalloc(sizeof(struct npu_comp),
GFP_KERNEL);
@@ -477,7 +494,10 @@ struct iommu_table_group 
*pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
pe->pe_number);
}
} else {
-   /* Create a group for 1 GPU and attached NPUs for POWER8 */
+   /*
+* Create a group for 1 GPU and attached NPUs for
+* POWER8 (always) or POWER9 (when isolate_nvlink).
+*/
pe->npucomp = kzalloc(sizeof(*pe->npucomp), GFP_KERNEL);
table_group = &pe->npucomp->table_group;
table_group->ops = &pnv_npu_peers_ops;
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
b/drivers/vfio/pci/vfio_pci_nvlink2.c
index 32f695ffe128..bb6bba762f46 100644
--- a/drivers/vfio/pci/vfio_pci_nvlink2.c
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -206,6 +206,102 @@ static int vfio_pci_nvgpu_group_notifier(struct 
notifier_block *nb,
return NOTIFY_OK;
 }
 
+static int vfio_pci_nvdia_v100_is_ph_in_group(struct device *dev, void *data)
+{
+   return dev->of_node->phandle == *(phandle *) data;
+}
+
+static u32 vfio_pci_nvdia_v100_get_disable_mask(struct device *dev)
+{
+   int npu, peer;
+   u32 mask;
+   struct device_node *dn;
+   struct iommu_group *group;
+
+   dn = dev->of_node;
+   if (!of_find_property(dn, "ibm,nvlink-peers", NULL))
+   return 0;
+
+   group = iommu_group_get(dev);
+   if (!group)
+   return 0;
+
+   /*
+* Collect links to keep which includes links to NPU and links to
+* other GPUs in the same IOMMU group.
+*/
+   for (npu = 0, mask = 0; ; ++npu) {
+   u32 npuph = 0;
+
+   if (of_property_read_u32_index(dn, "ibm,npu", npu, &npuph))
+ 

Re: [PATCH v2 02/45] drivers: tty: serial: 8250_dw: use devm_ioremap_resource()

2019-03-15 Thread Andy Shevchenko
On Fri, Mar 15, 2019 at 12:41 AM Enrico Weigelt, metux IT consult
 wrote:
>
> Instead of fetching out data from a struct resource for passing
> it to devm_ioremap(), directly use devm_ioremap_resource()

I don't see any advantage of this change.
See also below.

> --- a/drivers/tty/serial/8250/8250_dw.c
> +++ b/drivers/tty/serial/8250/8250_dw.c
> @@ -526,7 +526,7 @@ static int dw8250_probe(struct platform_device *pdev)
> p->set_ldisc= dw8250_set_ldisc;
> p->set_termios  = dw8250_set_termios;
>
> -   p->membase = devm_ioremap(dev, regs->start, resource_size(regs));
> +   p->membase = devm_ioremap_resource(dev, regs);
> if (!p->membase)

And how did you test this? devm_ioremap_resource() returns an error
pointer in case of error.

> return -ENOMEM;

-- 
With Best Regards,
Andy Shevchenko


Re: [PATCH v2 45/45] drivers: tty: serial: mux: use devm_* functions

2019-03-15 Thread Andy Shevchenko
On Fri, Mar 15, 2019 at 12:37 AM Enrico Weigelt, metux IT consult
 wrote:
>
> Use the safer devm versions of memory mapping functions.

If you are going to use devm_*_free(), what's the point of this
change in the first place?

P.S. Not to mention that this is an untested series...

> --- a/drivers/tty/serial/mux.c
> +++ b/drivers/tty/serial/mux.c
> @@ -456,8 +456,9 @@ static int __init mux_probe(struct parisc_device *dev)
> printk(KERN_INFO "Serial mux driver (%d ports) Revision: 0.6\n", 
> port_count);
>
> dev_set_drvdata(&dev->dev, (void *)(long)port_count);
> -   request_mem_region(dev->hpa.start + MUX_OFFSET,
> -   port_count * MUX_LINE_OFFSET, "Mux");

> +   devm_request_mem_region(&dev->dev,
> +   dev->hpa.start + MUX_OFFSET,
> +   port_count * MUX_LINE_OFFSET, "Mux");

...and on top of this, where is the error checking?

>
> if(!port_cnt) {
> mux_driver.cons = MUX_CONSOLE;
> @@ -474,7 +475,9 @@ static int __init mux_probe(struct parisc_device *dev)
> port->iobase= 0;
> port->mapbase   = dev->hpa.start + MUX_OFFSET +
> (i * MUX_LINE_OFFSET);
> -   port->membase   = ioremap_nocache(port->mapbase, 
> MUX_LINE_OFFSET);
> +   port->membase   = devm_ioremap_nocache(port->dev,
> +  port->mapbase,
> +  MUX_LINE_OFFSET);
> port->iotype= UPIO_MEM;
> port->type  = PORT_MUX;
> port->irq   = 0;
> @@ -517,10 +520,12 @@ static int __exit mux_remove(struct parisc_device *dev)
>
> uart_remove_one_port(&mux_driver, port);
> if(port->membase)
> -   iounmap(port->membase);
> +   devm_iounmap(port->dev, port->membase);
> }
>
> -   release_mem_region(dev->hpa.start + MUX_OFFSET, port_count * 
> MUX_LINE_OFFSET);
> +   devm_release_mem_region(&dev->dev,
> +   dev->hpa.start + MUX_OFFSET,
> +   port_count * MUX_LINE_OFFSET);
> return 0;
>  }


-- 
With Best Regards,
Andy Shevchenko


Re: serial driver cleanups v2

2019-03-15 Thread Andy Shevchenko
On Fri, Mar 15, 2019 at 12:40 AM Enrico Weigelt, metux IT consult
 wrote:

> here's v2 of my serial cleanups queue - part I:
>
> essentially using helpers to code more compact and switching to
> devm_*() functions for mmio management.
>
> Part II will be about moving the mmio range from mapbase and
> mapsize (which are used quite inconsistently) to a struct resource
> and using helpers for that. But this one isn't finished yet.
> (if somebody likes to have a look at it, I can send it, too)

Let's do it this way: you prepare a branch somewhere and announce it
here as an RFC, since this was neither tested nor correct.
And the selling point for many of them is not true: it doesn't make any
difference in code size, but increases run time
(devm_ioremap_resource() does more than a plain devm_ioremap() call).

-- 
With Best Regards,
Andy Shevchenko


Re: serial driver cleanups v2

2019-03-15 Thread Andy Shevchenko
On Fri, Mar 15, 2019 at 11:12 AM Andy Shevchenko
 wrote:
> On Fri, Mar 15, 2019 at 12:40 AM Enrico Weigelt, metux IT consult
>  wrote:
>
> > here's v2 of my serial cleanups queue - part I:
> >
> > essentially using helpers to code more compact and switching to
> > devm_*() functions for mmio management.
> >
> > Part II will be about moving the mmio range from mapbase and
> > mapsize (which are used quite inconsistently) to a struct resource
> > and using helpers for that. But this one isn't finished yet.
> > (if somebody likes to have a look at it, I can send it, too)
>
> Let's do it this way: you prepare a branch somewhere and announce it
> here as an RFC, since this was neither tested nor correct.
> And the selling point for many of them is not true: it doesn't make any
> difference in code size, but increases run time
> (devm_ioremap_resource() does more than a plain devm_ioremap() call).

And one more thing: perhaps you can run existing coccinelle scripts
and / or contribute new ones, since this is all scriptable, and
maintainers can decide if this or that coccinelle script is useful.
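A semantic patch for this particular conversion might look roughly like the following (an untested sketch; the follow-up error-handling change would still need manual attention, since devm_ioremap_resource() returns an error pointer rather than NULL):

```coccinelle
@@
expression dev, res;
@@
- devm_ioremap(dev, res->start, resource_size(res))
+ devm_ioremap_resource(dev, res)
```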

-- 
With Best Regards,
Andy Shevchenko


Re: [PATCH v2 10/45] drivers: tty: serial: zs: use devm_* functions

2019-03-15 Thread Enrico Weigelt, metux IT consult
On 14.03.19 23:52, Greg KH wrote:
> On Thu, Mar 14, 2019 at 11:33:40PM +0100, Enrico Weigelt, metux IT consult 
> wrote:
>> Use the safer devm versions of memory mapping functions.
> 
> What is "safer" about them?

Garbage collection :)

Several drivers didn't seem to clean up properly (maybe they're only
used compiled-in, so nobody noticed yet).

In general, I think devm_* should be the standard case, unless there's
a really good reason to do otherwise.



> Isn't the whole goal of the devm* functions such that you are not
> required to call "release" on them?

Looks like that one slipped through when I was doing that big bulk change
in the middle of the night without enough coffee ;-)

One problem here is that many drivers do this stuff in request/release
port, instead of probe/remove. I'm not sure yet whether we should
rewrite that. There are also cases which do request/release but no
ioremap (doing hardcoded register accesses), and others that do ioremap
w/o request/release.

IMHO, we should have a closer look at those and check whether that's
really okay (just adding request/release blindly could cause trouble)

> And also, why make the change, you aren't changing any functionality for
> these old drivers at all from what I can tell (for the devm calls).
> What am I missing here?

Okay, there's a bigger story behind this that you can't know yet.
Finally, I'd like to move everything to using struct resource and
corresponding helpers consistently, so most of the drivers would be
pretty simple at that point. (there are, of course, special cases, like
devices w/ multiple register spaces, etc)

Here's my wip branch:

https://github.com/metux/linux/commits/wip/serial-res

In this consolidation process, I'm trying to move everything to
devm_*, to have it more generic (for now, I still need two versions
of the request/release/ioremap/iounmap helpers - one w/ and one
w/o devm).

My idea was moving to devm first, so it can be reviewed/tested
independently, before moving forward. Smaller, easily digestable
pieces should minimize the risk of breaking anything. But if you
prefer having these things squashed together, just let me know.

In the queue are also other minor cleanups, like using dev_err()
instead of printk(), etc. Should I send these separately?

By the way: do you have some public branch where you're collecting
accepted patches, which I could base mine on? (tty.git/tty-next?)


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287


Re: [PATCH v2 02/45] drivers: tty: serial: 8250_dw: use devm_ioremap_resource()

2019-03-15 Thread Enrico Weigelt, metux IT consult
On 15.03.19 10:04, Andy Shevchenko wrote:
> On Fri, Mar 15, 2019 at 12:41 AM Enrico Weigelt, metux IT consult
>  wrote:
>>
>> Instead of fetching out data from a struct resource for passing
>> it to devm_ioremap(), directly use devm_ioremap_resource()
> 
> I don't see any advantage of this change.
> See also below.

I see that the whole story wasn't clear. Please see my reply to Greg,
hope that clears it up a little bit.

>> --- a/drivers/tty/serial/8250/8250_dw.c
>> +++ b/drivers/tty/serial/8250/8250_dw.c
>> @@ -526,7 +526,7 @@ static int dw8250_probe(struct platform_device *pdev)
>> p->set_ldisc= dw8250_set_ldisc;
>> p->set_termios  = dw8250_set_termios;
>>
>> -   p->membase = devm_ioremap(dev, regs->start, resource_size(regs));
>> +   p->membase = devm_ioremap_resource(dev, regs);
>> if (!p->membase)
> 
> And how did you test this? devm_ioremap_resource() returns error
> pointer in case of error.

Hmm, devm_ioremap_resource() does so, but devm_ioremap() does not?
I really didn't expect that. Thanks for pointing that out.


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287


[PATCH] arch/powerpc/dax: Add MAP_SYNC mmap flag

2019-03-15 Thread Aneesh Kumar K.V
This enables support for synchronous DAX faults on powerpc.

The generic changes are added as part of
commit b6fb293f2497 ("mm: Define MAP_SYNC and VM_SYNC flags")

Without this, mmap returns EOPNOTSUPP for MAP_SYNC with MAP_SHARED_VALIDATE

Fixes: b5beae5e224f ("powerpc/pseries: Add driver for PAPR SCM regions")
Signed-off-by: Vaibhav Jain 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/uapi/asm/mman.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/uapi/asm/mman.h 
b/arch/powerpc/include/uapi/asm/mman.h
index 65065ce32814..e08cc5fc9d2f 100644
--- a/arch/powerpc/include/uapi/asm/mman.h
+++ b/arch/powerpc/include/uapi/asm/mman.h
@@ -29,6 +29,7 @@
 #define MAP_NONBLOCK   0x1 /* do not block on IO */
 #define MAP_STACK  0x2 /* give out an address that is best 
suited for process/thread stacks */
 #define MAP_HUGETLB0x4 /* create a huge page mapping */
+#define MAP_SYNC   0x8 /* perform synchronous page faults for 
the mapping */
 
 /* Override any generic PKEY permission defines */
 #define PKEY_DISABLE_EXECUTE   0x4
-- 
2.20.1



Re: serial driver cleanups v2

2019-03-15 Thread Enrico Weigelt, metux IT consult
On 15.03.19 10:12, Andy Shevchenko wrote:

>> Part II will be about moving the mmio range from mapbase and
>> mapsize (which are used quite inconsistently) to a struct resource
>> and using helpers for that. But this one isn't finished yet.
>> (if somebody likes to have a look at it, I can send it, too)
> 
> Let's do it this way: you prepare a branch somewhere and announce it
> here as an RFC, since this was neither tested nor correct.

Okay, here it is:

I. https://github.com/metux/linux/tree/submit/serial-clean-v3
   --> general cleanups, as basis for II

II. https://github.com/metux/linux/tree/wip/serial-res
   --> moving towards using struct resource consistently

III. https://github.com/metux/linux/tree/hack/serial
--> the final steps, which are yet completely broken
(more a notepad for things still to do :o)

The actual goal is generalizing the whole iomem handling, so individual
drivers usually just need to call some helpers that do most of the work.
Finally, I also wanted to have all io region information consolidated
in struct resource.

Meanwhile I've learned that I probably was a bit too eager w/ that.
Guess I'll have to rethink my strategy.

> And the selling point for many of them is not true: it doesn't make any
> difference in code size, but increases run time
> (devm_ioremap_resource() does more than a plain devm_ioremap() call).

Okay, just seen it. Does the runtime overhead cause any problems?


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287


Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-03-15 Thread Laurent Vivier
On 04/03/2019 20:59, Laurent Vivier wrote:
> When we hotplug a CPU in a memoryless/cpuless node,
> the kernel crashes when it rebuilds the sched_domains data.
> 
> I reproduce this problem on POWER and with a pseries VM, with the following
> QEMU parameters:
> 
>   -machine pseries -enable-kvm -m 8192 \
>   -smp 2,maxcpus=8,sockets=4,cores=2,threads=1 \
>   -numa node,nodeid=0,cpus=0-1,mem=0 \
>   -numa node,nodeid=1,cpus=2-3,mem=8192 \
>   -numa node,nodeid=2,cpus=4-5,mem=0 \
>   -numa node,nodeid=3,cpus=6-7,mem=0
> 
> Then I can trigger the crash by hotplugging a CPU on node-id 3:
> 
>   (qemu) device_add host-spapr-cpu-core,core-id=7,node-id=3
> 
> Built 2 zonelists, mobility grouping on.  Total pages: 130162
> Policy zone: Normal
> WARNING: workqueue cpumask: online intersect > possible intersect
> BUG: Kernel NULL pointer dereference at 0x0400
> Faulting instruction address: 0xc0170edc
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT 
> nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute 
> bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security 
> ip6table_raw iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 
> nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter 
> ebtables ip6table_filter ip6_tables iptable_filter xts vmx_crypto ip_tables 
> xfs libcrc32c virtio_net net_failover failover virtio_blk virtio_pci 
> virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
> CPU: 2 PID: 5661 Comm: kworker/2:0 Not tainted 5.0.0-rc6+ #20
> Workqueue: events cpuset_hotplug_workfn
> NIP:  c0170edc LR: c0170f98 CTR: 
> REGS: c3e931a0 TRAP: 0380   Not tainted  (5.0.0-rc6+)
> MSR:  80009033   CR: 22284028  XER: 
> CFAR: c0170f20 IRQMASK: 0
> GPR00: c0170f98 c3e93430 c11ac500 c001efe22000
> GPR04: 0001   0010
> GPR08: 0001 0400  
> GPR12: 8800 c0003fffd680 c001f14b c11e1bf0
> GPR16: c11e61f4 c001efe22200 c001efe22020 c001fba8
> GPR20: c001ff567a80 0001 c0e27a80 e830
> GPR24: ec30 102f 102f c001efca1000
> GPR28: c001efca0400 c001efe22000 c001efe23bff c001efe22a00
> NIP [c0170edc] free_sched_groups+0x5c/0xf0
> LR [c0170f98] destroy_sched_domain+0x28/0x90
> Call Trace:
> [c3e93430] [102f] 0x102f (unreliable)
> [c3e93470] [c0170f98] destroy_sched_domain+0x28/0x90
> [c3e934a0] [c01716e0] cpu_attach_domain+0x100/0x920
> [c3e93600] [c0173128] build_sched_domains+0x1228/0x1370
> [c3e93740] [c017429c] partition_sched_domains+0x23c/0x400
> [c3e937e0] [c01f5ec8] 
> rebuild_sched_domains_locked+0x78/0xe0
> [c3e93820] [c01f9ff0] rebuild_sched_domains+0x30/0x50
> [c3e93850] [c01fa1c0] cpuset_hotplug_workfn+0x1b0/0xb70
> [c3e93c80] [c012e5a0] process_one_work+0x1b0/0x480
> [c3e93d20] [c012e8f8] worker_thread+0x88/0x540
> [c3e93db0] [c013714c] kthread+0x15c/0x1a0
> [c3e93e20] [c000b55c] ret_from_kernel_thread+0x5c/0x80
> Instruction dump:
> 2e24 f8010010 f821ffc1 409e0014 4880 7fbdf040 7fdff378 419e0074
> ebdf 4192002c e93f0010 7c0004ac <7d404828> 314a 7d40492d 40c2fff4
> ---[ end trace f992c4a7d47d602a ]---
> 
> Kernel panic - not syncing: Fatal exception
> 
> This happens in free_sched_groups() because the linked list of the
> sched_groups is corrupted. Here is what happens when we hotplug the CPU:
> 
>  - build_sched_groups() builds a sched_groups linked list for
>sched_domain D1, with only one entry A, refcount=1
> 
>D1: A(ref=1)
> 
>  - build_sched_groups() builds a sched_groups linked list for
>sched_domain D2, with the same entry A
> 
>D2: A(ref=2)
> 
>  - build_sched_groups() builds a sched_groups linked list for
>sched_domain D3, with the same entry A and a new entry B:
> 
>D3: A(ref=3) -> B(ref=1)
> 
>  - destroy_sched_domain() is called for D1:
> 
>D1: A(ref=3) -> B(ref=1) and as B's ref is 1, the memory of B is released,
>  but A->next still points to B
> 
>  - destroy_sched_domain() is called for D3:
> 
>D3: A(ref=2) -> B(ref=0)
> 
> The kernel crashes when it tries to use data inside B: the memory has been
> corrupted since it was freed, and the linked list (next) is broken too.
> 
> This problem appears with commit 051f3ca02e46
> (

[PATCH v3 00/17] KVM: PPC: Book3S HV: add XIVE native exploitation mode

2019-03-15 Thread Cédric Le Goater
Hello,

On the POWER9 processor, the XIVE interrupt controller can control
interrupt sources using MMIOs to trigger events, to EOI or to turn off
the sources. Priority management and interrupt acknowledgment is also
controlled by MMIO in the CPU presenter sub-engine.

PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need
special support from the hypervisor to do the same. This is called the
XIVE native exploitation mode and today, it can be activated under the
PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support
and still offers the old interrupt mode interface using a KVM device
implementing the XICS hcalls over XIVE.

The following series is a proposal to add the same support under KVM.

A new KVM device is introduced for the XIVE native exploitation
mode. It reuses most of the XICS-over-XIVE glue implementation
structures which are internal to KVM but has a completely different
interface. A set of KVM device ioctls provide support for the
hypervisor calls, all handled in QEMU, to configure the sources and
the event queues. From there, all interrupt control is transferred to
the guest which can use MMIOs.

These MMIO regions (ESB and TIMA) are exposed to guests in QEMU,
similarly to VFIO, and the associated VMAs are populated dynamically
with the appropriate pages using a fault handler. These are now
implemented using mmap()s of the KVM device fd.

Migration has its own specific needs regarding memory. The patchset
provides a specific control to quiesce XIVE before capturing the
memory. The save and restore of the internal state is based on the
same ioctls used for the hcalls.

On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
negotiation process determines whether the guest operates with an
interrupt controller using the XICS legacy model, as found on POWER8,
or in XIVE exploitation mode. This means that the KVM interrupt
device should be created at run-time, after the machine has started.
This requires extra support from KVM to destroy KVM devices. It is
introduced at the end of the patchset as it still requires some
attention, and a XIVE-only VM would not need it.

This is 5.2 material hopefully. The OPAL patches have not yet been
merged.


GitHub trees available here :
 
QEMU sPAPR:

  https://github.com/legoater/qemu/commits/xive-next
  
Linux/KVM:

  https://github.com/legoater/linux/commits/xive-5.0

OPAL:

  https://github.com/legoater/skiboot/commits/xive

Thanks,

C.

Caveats :

 - We should introduce a set of definitions common to XIVE and XICS
 - The XICS-over-XIVE device file book3s_xive.c could be renamed to
   book3s_xics_on_xive.c or book3s_xics_p9.c
 - The XICS-over-XIVE device has locking issues in the setup. 

Changes since v2:

 - removed extra OPAL call definitions
 - removed ->q_order setting. Only useful in the XICS-on-XIVE KVM
   device which allocates the EQs on behalf of the guest.
 - returned -ENXIO when VP base is invalid
 - made use of the xive_vp() macro to compute VP identifiers
 - reworked locking in kvmppc_xive_native_connect_vcpu() to fix races 
 - stop advertising KVM_CAP_PPC_IRQ_XIVE as support is not fully
   available yet
 - fixed comment on XIVE IRQ number space
 - removed usage of the __x_* macros
 - fixed locking on source block
 - fixed comments on the KVM device attribute definitions
 - handled MASKED EAS configuration
 - fixed check on supported EQ size to restrict to 64K pages
 - checked kvm_eq.flags that need to be zero
 - removed the OPAL call when EQ qtoggle bit and index are zero. 
 - reduced the size of kvmppc_one_reg timaval attribute to two u64s
 - stopped returning of the OS CAM line value
 
Changes since v1:

 - Better documentation (was missing)
 - Nested support. XIVE not advertised on non PowerNV platforms. This
   is a good way to test the fallback on QEMU emulated devices.
 - ESB and TIMA special mapping done using the KVM device fd
 - All hcalls moved to QEMU. Dropped the patch moving the hcall flags.
 - Reworked of the KVM device ioctl controls to support hcalls and
   migration needs to capture/save states
 - Merged the control syncing XIVE and marking the EQ page dirty
 - Fixed passthrough support using the KVM device file address_space
   to clear the ESB pages from the mapping
 - Misc enhancements and fixes 

Cédric Le Goater (17):
  powerpc/xive: add OPAL extensions for the XIVE native exploitation
support
  KVM: PPC: Book3S HV: add a new KVM device for the XIVE native
exploitation mode
  KVM: PPC: Book3S HV: XIVE: introduce a new capability
KVM_CAP_PPC_IRQ_XIVE
  KVM: PPC: Book3S HV: XIVE: add a control to initialize a source
  KVM: PPC: Book3S HV: XIVE: add a control to configure a source
  KVM: PPC: Book3S HV: XIVE: add controls for the EQ configuration
  KVM: PPC: Book3S HV: XIVE: add a global reset control
  KVM: PPC: Book3S HV: XIVE: add a control to sync the sources
  KVM: PPC: Book3S HV: XIVE: add a control to dirty the XIVE EQ pages
  KVM: PPC: Book3S HV: XIVE: add get/set ac

[PATCH v3 06/17] KVM: PPC: Book3S HV: XIVE: add controls for the EQ configuration

2019-03-15 Thread Cédric Le Goater
These controls will be used by the H_INT_SET_QUEUE_CONFIG and
H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying
Event Queue in the XIVE IC. They will also be used to restore the
configuration of the XIVE EQs and to capture the internal run-time
state of the EQs. Both 'get' and 'set' rely on an OPAL call to access
the EQ toggle bit and EQ index which are updated by the XIVE IC when
event notifications are enqueued in the EQ.

The value of the guest physical address of the event queue is saved in
the XIVE internal xive_q structure for later use. That is when
migration needs to mark the EQ pages dirty to capture a consistent
memory state of the VM.

Note that H_INT_SET_QUEUE_CONFIG does not require the extra OPAL
call setting the EQ toggle bit and EQ index to configure the EQ,
but restoring the EQ state does.

Signed-off-by: Cédric Le Goater 
---

 Changes since v2 :
 
 - fixed comments on the KVM device attribute definitions
 - fixed check on supported EQ size to restrict to 64K pages
 - checked kvm_eq.flags that need to be zero
 - removed the OPAL call when EQ qtoggle bit and index are zero. 

 arch/powerpc/include/asm/xive.h|   2 +
 arch/powerpc/include/uapi/asm/kvm.h|  21 ++
 arch/powerpc/kvm/book3s_xive.h |   2 +
 arch/powerpc/kvm/book3s_xive.c |  15 +-
 arch/powerpc/kvm/book3s_xive_native.c  | 232 +
 Documentation/virtual/kvm/devices/xive.txt |  31 +++
 6 files changed, 297 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index b579a943407b..46891f321606 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -73,6 +73,8 @@ struct xive_q {
u32 esc_irq;
atomic_tcount;
atomic_tpending_count;
+   u64 guest_qpage;
+   u32 guest_qsize;
 };
 
 /* Global enable flags for the XIVE support */
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 12bb01baf0ae..1cd728c87d7c 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -679,6 +679,7 @@ struct kvm_ppc_cpu_char {
 #define KVM_DEV_XIVE_GRP_CTRL  1
 #define KVM_DEV_XIVE_GRP_SOURCE2   /* 64-bit source 
identifier */
 #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3   /* 64-bit source identifier */
+#define KVM_DEV_XIVE_GRP_EQ_CONFIG 4   /* 64-bit EQ identifier */
 
 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE   (1ULL << 0)
@@ -694,4 +695,24 @@ struct kvm_ppc_cpu_char {
 #define KVM_XIVE_SOURCE_EISN_SHIFT 33
 #define KVM_XIVE_SOURCE_EISN_MASK  0xfffffffe00000000ULL
 
+/* Layout of 64-bit EQ identifier */
+#define KVM_XIVE_EQ_PRIORITY_SHIFT 0
+#define KVM_XIVE_EQ_PRIORITY_MASK  0x7
+#define KVM_XIVE_EQ_SERVER_SHIFT   3
+#define KVM_XIVE_EQ_SERVER_MASK    0xfffffff8ULL
+
+/* Layout of EQ configuration values (64 bytes) */
+struct kvm_ppc_xive_eq {
+   __u32 flags;
+   __u32 qsize;
+   __u64 qpage;
+   __u32 qtoggle;
+   __u32 qindex;
+   __u8  pad[40];
+};
+
+#define KVM_XIVE_EQ_FLAG_ENABLED   0x0001
+#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY 0x0002
+#define KVM_XIVE_EQ_FLAG_ESCALATE  0x0004
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index ae26fe653d98..622f594d93e1 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -272,6 +272,8 @@ struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
struct kvmppc_xive *xive, int irq);
 void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
 int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio);
+int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
+ bool single_escalation);
 
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index e09f3addffe5..c1b7aa7dbc28 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -166,7 +166,8 @@ static irqreturn_t xive_esc_irq(int irq, void *data)
return IRQ_HANDLED;
 }
 
-static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio)
+int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio,
+ bool single_escalation)
 {
struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
struct xive_q *q = &xc->queues[prio];
@@ -185,7 +186,7 @@ static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio)
return -EIO;
}
 
-   if (xc->xive->single_escalation)
+   if (single_escalation)
name = kasprintf(GFP_KERNEL, "kvm-%d-%d",
 

[PATCH v3 07/17] KVM: PPC: Book3S HV: XIVE: add a global reset control

2019-03-15 Thread Cédric Le Goater
This control is to be used by the H_INT_RESET hcall from QEMU. Its
purpose is to clear all configuration of the sources and EQs. This is
necessary in case of a kexec (for a kdump kernel for instance) to make
sure that no remaining configuration is left from the previous boot
setup so that the new kernel can start safely from a clean state.

Queue 7 is ignored when the XIVE device is configured to run in
single escalation mode, as priority 7 is then used for escalations.

The XIVE VP is kept enabled as the vCPU is still active and connected
to the XIVE device.
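
From the QEMU side, the reset is a plain KVM_SET_DEVICE_ATTR call on
the XIVE device fd with the control group and the new attribute. A
hedged sketch, with the attribute structure mirrored locally for
illustration (the real one comes from <linux/kvm.h>):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal local mirror of the uapi definitions used here. */
#define KVM_DEV_XIVE_GRP_CTRL 1
#define KVM_DEV_XIVE_RESET    1

struct kvm_device_attr_sketch {
        uint32_t flags;
        uint32_t group;
        uint64_t attr;
        uint64_t addr;
};

/* Build the attribute for the H_INT_RESET path; the actual call
 * would be ioctl(xive_fd, KVM_SET_DEVICE_ATTR, &attr). */
static inline struct kvm_device_attr_sketch kvm_xive_reset_attr(void)
{
        struct kvm_device_attr_sketch a = {
                .group = KVM_DEV_XIVE_GRP_CTRL,
                .attr  = KVM_DEV_XIVE_RESET,
        };
        return a;
}
```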

Signed-off-by: Cédric Le Goater 
---

 Changes since v2 :

 - fixed locking on source block

 arch/powerpc/include/uapi/asm/kvm.h|  1 +
 arch/powerpc/kvm/book3s_xive_native.c  | 85 ++
 Documentation/virtual/kvm/devices/xive.txt |  5 ++
 3 files changed, 91 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 1cd728c87d7c..95e82ab57c03 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -677,6 +677,7 @@ struct kvm_ppc_cpu_char {
 
 /* POWER9 XIVE Native Interrupt Controller */
 #define KVM_DEV_XIVE_GRP_CTRL  1
+#define   KVM_DEV_XIVE_RESET   1
 #define KVM_DEV_XIVE_GRP_SOURCE   2   /* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3   /* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_EQ_CONFIG 4   /* 64-bit EQ identifier */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 42e824658a30..3385c336fd89 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -560,6 +560,83 @@ static int kvmppc_xive_native_get_queue_config(struct kvmppc_xive *xive,
return 0;
 }
 
+static void kvmppc_xive_reset_sources(struct kvmppc_xive_src_block *sb)
+{
+   int i;
+
+   for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) {
+   struct kvmppc_xive_irq_state *state = &sb->irq_state[i];
+
+   if (!state->valid)
+   continue;
+
+   if (state->act_priority == MASKED)
+   continue;
+
+   state->eisn = 0;
+   state->act_server = 0;
+   state->act_priority = MASKED;
+   xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01);
+   xive_native_configure_irq(state->ipi_number, 0, MASKED, 0);
+   if (state->pt_number) {
+   xive_vm_esb_load(state->pt_data, XIVE_ESB_SET_PQ_01);
+   xive_native_configure_irq(state->pt_number,
+ 0, MASKED, 0);
+   }
+   }
+}
+
+static int kvmppc_xive_reset(struct kvmppc_xive *xive)
+{
+   struct kvm *kvm = xive->kvm;
+   struct kvm_vcpu *vcpu;
+   unsigned int i;
+
+   pr_devel("%s\n", __func__);
+
+   mutex_lock(&kvm->lock);
+
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+   unsigned int prio;
+
+   if (!xc)
+   continue;
+
+   kvmppc_xive_disable_vcpu_interrupts(vcpu);
+
+   for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) {
+
+   /* Single escalation, no queue 7 */
+   if (prio == 7 && xive->single_escalation)
+   break;
+
+   if (xc->esc_virq[prio]) {
+   free_irq(xc->esc_virq[prio], vcpu);
+   irq_dispose_mapping(xc->esc_virq[prio]);
+   kfree(xc->esc_virq_names[prio]);
+   xc->esc_virq[prio] = 0;
+   }
+
+   kvmppc_xive_native_cleanup_queue(vcpu, prio);
+   }
+   }
+
+   for (i = 0; i <= xive->max_sbid; i++) {
+   struct kvmppc_xive_src_block *sb = xive->src_blocks[i];
+
+   if (sb) {
+   arch_spin_lock(&sb->lock);
+   kvmppc_xive_reset_sources(sb);
+   arch_spin_unlock(&sb->lock);
+   }
+   }
+
+   mutex_unlock(&kvm->lock);
+
+   return 0;
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
   struct kvm_device_attr *attr)
 {
@@ -567,6 +644,10 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 
switch (attr->group) {
case KVM_DEV_XIVE_GRP_CTRL:
+   switch (attr->attr) {
+   case KVM_DEV_XIVE_RESET:
+   return kvmppc_xive_reset(xive);
+   }
break;
case KVM_DEV_XIVE_GRP_SOURCE:
return kvmppc_xive_native_set_source(xive, attr->attr,
@@ -599,6 +680,10 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 {
   

[PATCH v3 11/17] KVM: introduce a 'mmap' method for KVM devices

2019-03-15 Thread Cédric Le Goater
Some KVM devices will want to handle special mappings related to the
underlying HW. For instance, the XIVE interrupt controller of the
POWER9 processor has MMIO pages for thread interrupt management and
for interrupt source control that need to be exposed to the guest when
the OS has the required support.
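
The dispatch added in kvm_main.c is straightforward: if the device
provides a ->mmap handler it is forwarded, otherwise -ENODEV is
returned. A self-contained sketch of that pattern, with the types
simplified for illustration (struct names are ours):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct vm_area;              /* stand-in for struct vm_area_struct */
struct dev_sketch;

struct dev_ops_sketch {
        int (*mmap)(struct dev_sketch *dev, struct vm_area *vma);
};

struct dev_sketch {
        const struct dev_ops_sketch *ops;
};

/* Mirrors kvm_device_mmap(): forward to the device-specific handler
 * when one exists, fail with -ENODEV otherwise. */
static int device_mmap(struct dev_sketch *dev, struct vm_area *vma)
{
        if (dev->ops->mmap)
                return dev->ops->mmap(dev, vma);
        return -ENODEV;
}

static int ok_mmap(struct dev_sketch *dev, struct vm_area *vma)
{
        (void)dev; (void)vma;
        return 0;               /* pretend the mapping succeeded */
}

static const struct dev_ops_sketch with_mmap = { .mmap = ok_mmap };
static const struct dev_ops_sketch without_mmap = { .mmap = NULL };
```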

Cc: Paolo Bonzini 
Signed-off-by: Cédric Le Goater 
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c  | 11 +++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c38cc5eb7e73..cbf81487b69f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1223,6 +1223,7 @@ struct kvm_device_ops {
int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
  unsigned long arg);
+   int (*mmap)(struct kvm_device *dev, struct vm_area_struct *vma);
 };
 
 void kvm_device_get(struct kvm_device *dev);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 076bc38963bf..e4881a8c2a6f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2878,6 +2878,16 @@ static long kvm_vcpu_compat_ioctl(struct file *filp,
 }
 #endif
 
+static int kvm_device_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+   struct kvm_device *dev = filp->private_data;
+
+   if (dev->ops->mmap)
+   return dev->ops->mmap(dev, vma);
+
+   return -ENODEV;
+}
+
 static int kvm_device_ioctl_attr(struct kvm_device *dev,
 int (*accessor)(struct kvm_device *dev,
 struct kvm_device_attr *attr),
@@ -2927,6 +2937,7 @@ static const struct file_operations kvm_device_fops = {
.unlocked_ioctl = kvm_device_ioctl,
.release = kvm_device_release,
KVM_COMPAT(kvm_device_ioctl),
+   .mmap = kvm_device_mmap,
 };
 
 struct kvm_device *kvm_device_from_filp(struct file *filp)
-- 
2.20.1



[PATCH v3 15/17] KVM: PPC: Book3S HV: XIVE: activate XIVE exploitation mode

2019-03-15 Thread Cédric Le Goater
Full support for the XIVE native exploitation mode is now available,
advertise the capability KVM_CAP_PPC_IRQ_XIVE for guests running on
PowerNV KVM Hypervisors only. Support for nested guests (pseries KVM
Hypervisor) is not yet available. XIVE should also be activated,
which is the default setting on POWER9 systems running a recent Linux
kernel.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/kvm/powerpc.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index bb51faf29162..d70b19f8725b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -573,10 +573,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_KVM_XIVE
case KVM_CAP_PPC_IRQ_XIVE:
/*
-* Return false until all the XIVE infrastructure is
-* in place including support for migration.
+* We need XIVE to be enabled on the platform (implies
+* a POWER9 processor) and the PowerNV platform, as
+* nested is not yet supported.
 */
-   r = 0;
+   r = xive_enabled() && !!cpu_has_feature(CPU_FTR_HVMODE);
break;
 #endif
 
-- 
2.20.1



[PATCH v3 04/17] KVM: PPC: Book3S HV: XIVE: add a control to initialize a source

2019-03-15 Thread Cédric Le Goater
The XIVE KVM device maintains a list of interrupt sources for the VM
which are allocated in the pool of generic interrupts (IPIs) of the
main XIVE IC controller. These are used for the CPU IPIs as well as
for virtual device interrupts. The IRQ number space is defined by
QEMU.

The XIVE device reuses the source structures of the XICS-on-XIVE
device for the source blocks (2-level tree) and for the source
interrupts. Under XIVE native, the source interrupt caches mostly
configuration information and is less used than under the XICS-on-XIVE
device in which hcalls are still necessary at run-time.

When a source is initialized in KVM, an IPI interrupt source is simply
allocated at the OPAL level and then MASKED. KVM only needs to know
about its type: LSI or MSI.
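
The 64-bit attribute value carried by KVM_DEV_XIVE_GRP_SOURCE thus only
needs to convey the source type, plus the initial level for an LSI. A
sketch of building it (the helper name is ours, not part of the uapi):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit definitions from the uapi header added by this patch. */
#define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0)
#define KVM_XIVE_LEVEL_ASSERTED  (1ULL << 1)

/* Hypothetical helper: MSIs use 0, LSIs set the level-sensitive bit
 * and optionally record an already-asserted level. */
static inline uint64_t kvm_xive_source_attr(bool lsi, bool asserted)
{
        uint64_t val = 0;

        if (lsi) {
                val |= KVM_XIVE_LEVEL_SENSITIVE;
                if (asserted)
                        val |= KVM_XIVE_LEVEL_ASSERTED;
        }
        return val;
}
```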

Signed-off-by: Cédric Le Goater 
---

 Changes since v2:

 - extra documentation in commit log
 - fixed comments on XIVE IRQ number space
 - removed usage of the __x_* macros
 - fixed locking on source block

 arch/powerpc/include/uapi/asm/kvm.h|   5 +
 arch/powerpc/kvm/book3s_xive.h |  10 ++
 arch/powerpc/kvm/book3s_xive.c |   8 +-
 arch/powerpc/kvm/book3s_xive_native.c  | 106 +
 Documentation/virtual/kvm/devices/xive.txt |  15 +++
 5 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index b002c0c67787..11985148073f 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -677,5 +677,10 @@ struct kvm_ppc_cpu_char {
 
 /* POWER9 XIVE Native Interrupt Controller */
 #define KVM_DEV_XIVE_GRP_CTRL  1
+#define KVM_DEV_XIVE_GRP_SOURCE   2   /* 64-bit source identifier */
+
+/* Layout of 64-bit XIVE source attribute values */
+#define KVM_XIVE_LEVEL_SENSITIVE   (1ULL << 0)
+#define KVM_XIVE_LEVEL_ASSERTED(1ULL << 1)
 
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index d366df69b9cb..1be921cb5dcb 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -12,6 +12,13 @@
 #ifdef CONFIG_KVM_XICS
 #include "book3s_xics.h"
 
+/*
+ * The XIVE Interrupt source numbers are within the range 0 to
+ * KVMPPC_XICS_NR_IRQS.
+ */
+#define KVMPPC_XIVE_FIRST_IRQ  0
+#define KVMPPC_XIVE_NR_IRQSKVMPPC_XICS_NR_IRQS
+
 /*
  * State for one guest irq source.
  *
@@ -258,6 +265,9 @@ extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr);
  */
 void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu);
 int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu);
+struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
+   struct kvmppc_xive *xive, int irq);
+void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
 
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index e7f1ada1c3de..6c9f9fd0855f 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1480,8 +1480,8 @@ static int xive_get_source(struct kvmppc_xive *xive, long irq, u64 addr)
return 0;
 }
 
-static struct kvmppc_xive_src_block *xive_create_src_block(struct kvmppc_xive *xive,
-  int irq)
+struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
+   struct kvmppc_xive *xive, int irq)
 {
struct kvm *kvm = xive->kvm;
struct kvmppc_xive_src_block *sb;
@@ -1560,7 +1560,7 @@ static int xive_set_source(struct kvmppc_xive *xive, long irq, u64 addr)
sb = kvmppc_xive_find_source(xive, irq, &idx);
if (!sb) {
pr_devel("No source, creating source block...\n");
-   sb = xive_create_src_block(xive, irq);
+   sb = kvmppc_xive_create_src_block(xive, irq);
if (!sb) {
pr_devel("Failed to create block...\n");
return -ENOMEM;
@@ -1784,7 +1784,7 @@ static void kvmppc_xive_cleanup_irq(u32 hw_num, struct xive_irq_data *xd)
xive_cleanup_irq_data(xd);
 }
 
-static void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb)
+void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb)
 {
int i;
 
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index a078f99bc156..99c04d5c5566 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -31,6 +31,17 @@
 
 #include "book3s_xive.h"
 
+static u8 xive_vm_esb_load(struct xive_irq_data *xd, u32 offset)
+{
+   u64 val;
+
+   if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
+   offset |= offset << 4;
+
+   val = in_be64(xd->eoi_mmio + offset);
+   return (u8)val;
+}
+
 static void kvmppc_xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
 {
struct kvmppc_xive_vcp

[PATCH v3 02/17] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode

2019-03-15 Thread Cédric Le Goater
This is the basic framework for the new KVM device supporting the
XIVE native exploitation mode. The user interface exposes a new KVM
device to be created by QEMU, available only when running on an L0
hypervisor. Support for nested guests is not available yet.

The XIVE device reuses the device structure of the XICS-on-XIVE device
as they have a lot in common. That could possibly change in the future
if the need arises.
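
To make the framework concrete, QEMU would instantiate the device with
the KVM_CREATE_DEVICE ioctl and the new KVM_DEV_TYPE_XIVE type. A
hedged sketch, with the uapi pieces mirrored locally for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of struct kvm_create_device (illustrative only). */
struct kvm_create_device_sketch {
        uint32_t type;   /* in: device type, KVM_DEV_TYPE_XIVE here */
        uint32_t fd;     /* out: fd of the new device */
        uint32_t flags;  /* in: KVM_CREATE_DEVICE_TEST or 0 */
};

/* KVM_DEV_TYPE_XIVE is appended right after KVM_DEV_TYPE_ARM_VGIC_ITS
 * in the enum, so it takes the next value (9) with this patch. */
#define KVM_DEV_TYPE_XIVE_SKETCH 9

static inline struct kvm_create_device_sketch kvm_xive_create_args(void)
{
        struct kvm_create_device_sketch cd = {
                .type = KVM_DEV_TYPE_XIVE_SKETCH,
        };
        /* QEMU would then do: ioctl(vm_fd, KVM_CREATE_DEVICE, &cd); */
        return cd;
}
```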

Signed-off-by: Cédric Le Goater 
---

 Changes since v2:

 - removed ->q_order setting. Only useful in the XICS-on-XIVE KVM
   device which allocates the EQs on behalf of the guest.
 - returned -ENXIO when VP base is invalid

 arch/powerpc/include/asm/kvm_host.h|   1 +
 arch/powerpc/include/asm/kvm_ppc.h |   8 +
 arch/powerpc/include/uapi/asm/kvm.h|   3 +
 include/uapi/linux/kvm.h   |   2 +
 arch/powerpc/kvm/book3s.c  |   7 +-
 arch/powerpc/kvm/book3s_xive_native.c  | 184 +
 Documentation/virtual/kvm/devices/xive.txt |  19 +++
 arch/powerpc/kvm/Makefile  |   2 +-
 8 files changed, 224 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_xive_native.c
 create mode 100644 Documentation/virtual/kvm/devices/xive.txt

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 091430339db1..9f75a75a07f2 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -220,6 +220,7 @@ extern struct kvm_device_ops kvm_xics_ops;
 struct kvmppc_xive;
 struct kvmppc_xive_vcpu;
 extern struct kvm_device_ops kvm_xive_ops;
+extern struct kvm_device_ops kvm_xive_native_ops;
 
 struct kvmppc_passthru_irqmap;
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index b3bf4f61b30c..4b72ddde7dc1 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -593,6 +593,10 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
 extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
   int level, bool line_status);
 extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
+
+extern void kvmppc_xive_native_init_module(void);
+extern void kvmppc_xive_native_exit_module(void);
+
 #else
 static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
   u32 priority) { return -1; }
@@ -616,6 +620,10 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu 
*vcpu, u64 icpval) { retur
 static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
  int level, bool line_status) { return -ENODEV; }
 static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
+
+static inline void kvmppc_xive_native_init_module(void) { }
+static inline void kvmppc_xive_native_exit_module(void) { }
+
 #endif /* CONFIG_KVM_XIVE */
 
 #ifdef CONFIG_PPC_POWERNV
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 8c876c166ef2..b002c0c67787 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -675,4 +675,7 @@ struct kvm_ppc_cpu_char {
 #define  KVM_XICS_PRESENTED(1ULL << 43)
 #define  KVM_XICS_QUEUED   (1ULL << 44)
 
+/* POWER9 XIVE Native Interrupt Controller */
+#define KVM_DEV_XIVE_GRP_CTRL  1
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6d4ea4b6c922..e6368163d3a0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1211,6 +1211,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_VGIC_V3   KVM_DEV_TYPE_ARM_VGIC_V3
KVM_DEV_TYPE_ARM_VGIC_ITS,
 #define KVM_DEV_TYPE_ARM_VGIC_ITS  KVM_DEV_TYPE_ARM_VGIC_ITS
+   KVM_DEV_TYPE_XIVE,
+#define KVM_DEV_TYPE_XIVE  KVM_DEV_TYPE_XIVE
KVM_DEV_TYPE_MAX,
 };
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 601c094f15ab..96d43f091255 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -1040,6 +1040,9 @@ static int kvmppc_book3s_init(void)
if (xics_on_xive()) {
kvmppc_xive_init_module();
kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
+   kvmppc_xive_native_init_module();
+   kvm_register_device_ops(&kvm_xive_native_ops,
+   KVM_DEV_TYPE_XIVE);
} else
 #endif
kvm_register_device_ops(&kvm_xics_ops, KVM_DEV_TYPE_XICS);
@@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
 static void kvmppc_book3s_exit(void)
 {
 #ifdef CONFIG_KVM_XICS
-   if (xics_on_xive())
+   if (xics_on_xive()) {
kvmppc_xive_exit_module();
+   kvmppc_xive_native_exit_module();
+   }
 #endif
 #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
kvmppc_book3s_exit_pr(

[PATCH v3 13/17] KVM: PPC: Book3S HV: XIVE: add a mapping for the source ESB pages

2019-03-15 Thread Cédric Le Goater
Each source is associated with an Event State Buffer (ESB) with an
even/odd pair of pages which provides commands to manage the source:
to trigger, to EOI, or to turn off the source, for instance.

The custom VM fault handler will deduce the guest IRQ number from the
offset of the fault, and the ESB page of the associated XIVE interrupt
will be inserted into the VMA using the internal structure caching
information on the interrupts.
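
The fault handler's arithmetic is worth spelling out: each interrupt
owns two consecutive pages in the mapping, the even one for trigger and
the odd one for EOI and management. A sketch of the offset decoding,
mirroring xive_native_esb_fault() (the names here are ours):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decode a page offset within the ESB mapping into the guest IRQ
 * number and the page kind, as the fault handler does. */
struct esb_page {
        uint64_t irq;
        bool     is_eoi;   /* false: trigger page, true: EOI page */
};

static inline struct esb_page esb_decode(uint64_t page_offset)
{
        struct esb_page p = {
                .irq    = page_offset / 2,      /* two pages per IRQ */
                .is_eoi = (page_offset % 2) != 0,
        };
        return p;
}
```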

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/uapi/asm/kvm.h|  1 +
 arch/powerpc/kvm/book3s_xive_native.c  | 57 ++
 Documentation/virtual/kvm/devices/xive.txt |  7 +++
 3 files changed, 65 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 6836d38a517c..76458d18e479 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -721,5 +721,6 @@ struct kvm_ppc_xive_eq {
 #define KVM_XIVE_EQ_FLAG_ESCALATE  0x0004
 
 #define KVM_XIVE_TIMA_PAGE_OFFSET  0
+#define KVM_XIVE_ESB_PAGE_OFFSET   4
 
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index f10087dbcac2..e465d4c53f5c 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -170,6 +170,59 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
return rc;
 }
 
+static int xive_native_esb_fault(struct vm_fault *vmf)
+{
+   struct vm_area_struct *vma = vmf->vma;
+   struct kvm_device *dev = vma->vm_file->private_data;
+   struct kvmppc_xive *xive = dev->private;
+   struct kvmppc_xive_src_block *sb;
+   struct kvmppc_xive_irq_state *state;
+   struct xive_irq_data *xd;
+   u32 hw_num;
+   u16 src;
+   u64 page;
+   unsigned long irq;
+   u64 page_offset;
+
+   /*
+* Linux/KVM uses a two pages ESB setting, one for trigger and
+* one for EOI
+*/
+   page_offset = vmf->pgoff - vma->vm_pgoff;
+   irq = page_offset / 2;
+
+   sb = kvmppc_xive_find_source(xive, irq, &src);
+   if (!sb) {
+   pr_devel("%s: source %lx not found !\n", __func__, irq);
+   return VM_FAULT_SIGBUS;
+   }
+
+   state = &sb->irq_state[src];
+   kvmppc_xive_select_irq(state, &hw_num, &xd);
+
+   arch_spin_lock(&sb->lock);
+
+   /*
+* first/even page is for trigger
+* second/odd page is for EOI and management.
+*/
+   page = page_offset % 2 ? xd->eoi_page : xd->trig_page;
+   arch_spin_unlock(&sb->lock);
+
+   if (WARN_ON(!page)) {
+   pr_err("%s: accessing invalid ESB page for source %lx !\n",
+  __func__, irq);
+   return VM_FAULT_SIGBUS;
+   }
+
+   vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT);
+   return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct xive_native_esb_vmops = {
+   .fault = xive_native_esb_fault,
+};
+
 static int xive_native_tima_fault(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
@@ -199,6 +252,10 @@ static int kvmppc_xive_native_mmap(struct kvm_device *dev,
if (vma_pages(vma) > 4)
return -EINVAL;
vma->vm_ops = &xive_native_tima_vmops;
+   } else if (vma->vm_pgoff == KVM_XIVE_ESB_PAGE_OFFSET) {
+   if (vma_pages(vma) > KVMPPC_XIVE_NR_IRQS * 2)
+   return -EINVAL;
+   vma->vm_ops = &xive_native_esb_vmops;
} else {
return -EINVAL;
}
diff --git a/Documentation/virtual/kvm/devices/xive.txt b/Documentation/virtual/kvm/devices/xive.txt
index fbb51c4798fe..686cca450f9f 100644
--- a/Documentation/virtual/kvm/devices/xive.txt
+++ b/Documentation/virtual/kvm/devices/xive.txt
@@ -36,6 +36,13 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
   third (operating system) and the fourth (user level) are exposed the
   guest.
 
+  2. Event State Buffer (ESB)
+
+  Each source is associated with an Event State Buffer (ESB) with
+  an even/odd pair of pages which provides commands to
+  manage the source: to trigger, to EOI, to turn off the source for
+  instance.
+
 * Groups:
 
   1. KVM_DEV_XIVE_GRP_CTRL
-- 
2.20.1



[PATCH v3 09/17] KVM: PPC: Book3S HV: XIVE: add a control to dirty the XIVE EQ pages

2019-03-15 Thread Cédric Le Goater
When migration of a VM is initiated, a first copy of the RAM is
transferred to the destination before the VM is stopped, but there is
no guarantee that the EQ pages in which the event notifications are
queued have not been modified.

To make sure migration will capture a consistent memory state, the
XIVE device should perform a XIVE quiesce sequence to stop the flow of
event notifications and stabilize the EQs. This is the purpose of the
KVM_DEV_XIVE_EQ_SYNC control, which will also mark the EQ pages dirty
to force their transfer.
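
A sketch of the userspace side of that sequence, with the attribute
structure mirrored locally for illustration (the real definitions live
in <linux/kvm.h>):

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the uapi definitions used here. */
#define KVM_DEV_XIVE_GRP_CTRL 1
#define KVM_DEV_XIVE_EQ_SYNC  2

struct kvm_device_attr_sketch {
        uint32_t flags, group;
        uint64_t attr, addr;
};

/* Migration sequence sketch (userspace side):
 *  1. mask all sources (PQ set to '-Q');
 *  2. KVM_SET_DEVICE_ATTR with the attribute below to sync the
 *     sources/queues and dirty the EQ pages;
 *  3. transfer the remaining dirty RAM. */
static inline struct kvm_device_attr_sketch kvm_xive_eq_sync_attr(void)
{
        struct kvm_device_attr_sketch a = {
                .group = KVM_DEV_XIVE_GRP_CTRL,
                .attr  = KVM_DEV_XIVE_EQ_SYNC,
        };
        return a;
}
```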

Signed-off-by: Cédric Le Goater 
---

 Changes since v2 :

 - Extra comments
 - fixed locking on source block

 arch/powerpc/include/uapi/asm/kvm.h|  1 +
 arch/powerpc/kvm/book3s_xive_native.c  | 85 ++
 Documentation/virtual/kvm/devices/xive.txt | 29 
 3 files changed, 115 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index fc9211dbfec8..caf52be89494 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -678,6 +678,7 @@ struct kvm_ppc_cpu_char {
 /* POWER9 XIVE Native Interrupt Controller */
 #define KVM_DEV_XIVE_GRP_CTRL  1
 #define   KVM_DEV_XIVE_RESET   1
+#define   KVM_DEV_XIVE_EQ_SYNC 2
 #define KVM_DEV_XIVE_GRP_SOURCE   2   /* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3   /* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_EQ_CONFIG 4   /* 64-bit EQ identifier */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 26ac3c505cd2..ea091c0a8fb6 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -669,6 +669,88 @@ static int kvmppc_xive_reset(struct kvmppc_xive *xive)
return 0;
 }
 
+static void kvmppc_xive_native_sync_sources(struct kvmppc_xive_src_block *sb)
+{
+   int j;
+
+   for (j = 0; j < KVMPPC_XICS_IRQ_PER_ICS; j++) {
+   struct kvmppc_xive_irq_state *state = &sb->irq_state[j];
+   struct xive_irq_data *xd;
+   u32 hw_num;
+
+   if (!state->valid)
+   continue;
+
+   /*
+* The struct kvmppc_xive_irq_state reflects the state
+* of the EAS configuration and not the state of the
+* source. The source is masked setting the PQ bits to
+* '-Q', which is what is being done before calling
+* the KVM_DEV_XIVE_EQ_SYNC control.
+*
+* If a source EAS is configured, OPAL syncs the XIVE
+* IC of the source and the XIVE IC of the previous
+* target if any.
+*
+* So it should be fine ignoring MASKED sources as
+* they have been synced already.
+*/
+   if (state->act_priority == MASKED)
+   continue;
+
+   kvmppc_xive_select_irq(state, &hw_num, &xd);
+   xive_native_sync_source(hw_num);
+   xive_native_sync_queue(hw_num);
+   }
+}
+
+static int kvmppc_xive_native_vcpu_eq_sync(struct kvm_vcpu *vcpu)
+{
+   struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+   unsigned int prio;
+
+   if (!xc)
+   return -ENOENT;
+
+   for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) {
+   struct xive_q *q = &xc->queues[prio];
+
+   if (!q->qpage)
+   continue;
+
+   /* Mark EQ page dirty for migration */
+   mark_page_dirty(vcpu->kvm, gpa_to_gfn(q->guest_qpage));
+   }
+   return 0;
+}
+
+static int kvmppc_xive_native_eq_sync(struct kvmppc_xive *xive)
+{
+   struct kvm *kvm = xive->kvm;
+   struct kvm_vcpu *vcpu;
+   unsigned int i;
+
+   pr_devel("%s\n", __func__);
+
+   mutex_lock(&kvm->lock);
+   for (i = 0; i <= xive->max_sbid; i++) {
+   struct kvmppc_xive_src_block *sb = xive->src_blocks[i];
+
+   if (sb) {
+   arch_spin_lock(&sb->lock);
+   kvmppc_xive_native_sync_sources(sb);
+   arch_spin_unlock(&sb->lock);
+   }
+   }
+
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   kvmppc_xive_native_vcpu_eq_sync(vcpu);
+   }
+   mutex_unlock(&kvm->lock);
+
+   return 0;
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
   struct kvm_device_attr *attr)
 {
@@ -679,6 +761,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
switch (attr->attr) {
case KVM_DEV_XIVE_RESET:
return kvmppc_xive_reset(xive);
+   case KVM_DEV_XIVE_EQ_SYNC:
+   return kvmppc_xive_native_eq_sync(xive);
}
break;
case

[PATCH v3 17/17] KVM: PPC: Book3S HV: XIVE: clear the vCPU interrupt presenters

2019-03-15 Thread Cédric Le Goater
When the VM boots, the CAS negotiation process determines which
interrupt mode to use and invokes a machine reset. At that time, the
previous KVM interrupt device is 'destroyed' before the chosen one is
created. Upon destruction, the vCPU interrupt presenters using the KVM
device should be cleared first, the machine will reconnect them later
to the new device after it is created.

Signed-off-by: Cédric Le Goater 
---

 Changes since v2 :

 - removed comments on possible race in kvmppc_native_connect_vcpu()
   for the XIVE KVM device. This is still an issue in the
   XICS-over-XIVE device.

 arch/powerpc/kvm/book3s_xics.c| 19 +
 arch/powerpc/kvm/book3s_xive.c| 39 +--
 arch/powerpc/kvm/book3s_xive_native.c | 12 +
 3 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index f27ee57ab46e..81cdabf4295f 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -1342,6 +1342,25 @@ static void kvmppc_xics_free(struct kvm_device *dev)
struct kvmppc_xics *xics = dev->private;
int i;
struct kvm *kvm = xics->kvm;
+   struct kvm_vcpu *vcpu;
+
+   /*
+* When destroying the VM, the vCPUs are destroyed first and
+* the vCPU list should be empty. If this is not the case,
+* then we are simply destroying the device and we should
+* clean up the vCPU interrupt presenters first.
+*/
+   if (atomic_read(&kvm->online_vcpus) != 0) {
+   /*
+* call kick_all_cpus_sync() to ensure that all CPUs
+* have executed any pending interrupts
+*/
+   if (is_kvmppc_hv_enabled(kvm))
+   kick_all_cpus_sync();
+
+   kvm_for_each_vcpu(i, vcpu, kvm)
+   kvmppc_xics_free_icp(vcpu);
+   }
 
debugfs_remove(xics->dentry);
 
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 480a3fc6b9fd..cf6a4c6c5a28 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1100,11 +1100,19 @@ void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
 void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 {
struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
-   struct kvmppc_xive *xive = xc->xive;
+   struct kvmppc_xive *xive;
int i;
 
+   if (!kvmppc_xics_enabled(vcpu))
+   return;
+
+   if (!xc)
+   return;
+
pr_devel("cleanup_vcpu(cpu=%d)\n", xc->server_num);
 
+   xive = xc->xive;
+
/* Ensure no interrupt is still routed to that VP */
xc->valid = false;
kvmppc_xive_disable_vcpu_interrupts(vcpu);
@@ -1141,6 +1149,10 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
}
/* Free the VP */
kfree(xc);
+
+   /* Cleanup the vcpu */
+   vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
+   vcpu->arch.xive_vcpu = NULL;
 }
 
 int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
@@ -1158,7 +1170,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
}
if (xive->kvm != vcpu->kvm)
return -EPERM;
-   if (vcpu->arch.irq_type)
+   if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
return -EBUSY;
if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
pr_devel("Duplicate !\n");
@@ -1828,8 +1840,31 @@ static void kvmppc_xive_free(struct kvm_device *dev)
 {
struct kvmppc_xive *xive = dev->private;
struct kvm *kvm = xive->kvm;
+   struct kvm_vcpu *vcpu;
int i;
 
+   /*
+* When destroying the VM, the vCPUs are destroyed first and
+* the vCPU list should be empty. If this is not the case,
+* then we are simply destroying the device and we should
+* clean up the vCPU interrupt presenters first.
+*/
+   if (atomic_read(&kvm->online_vcpus) != 0) {
+   /*
+* call kick_all_cpus_sync() to ensure that all CPUs
+* have executed any pending interrupts
+*/
+   if (is_kvmppc_hv_enabled(kvm))
+   kick_all_cpus_sync();
+
+   /*
+* TODO: There is still a race window with the early
+* checks in kvmppc_native_connect_vcpu()
+*/
+   kvm_for_each_vcpu(i, vcpu, kvm)
+   kvmppc_xive_cleanup_vcpu(vcpu);
+   }
+
debugfs_remove(xive->dentry);
 
if (kvm)
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 67a1bb26a4cc..8f7be5e23177 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -956,8 +956,20 @@ static void kvmppc_xive_native_free(struct kvm_device *dev)
 {
struct kvmppc_xive *xive = dev->private;
struc

[PATCH v3 10/17] KVM: PPC: Book3S HV: XIVE: add get/set accessors for the VP XIVE state

2019-03-15 Thread Cédric Le Goater
The state of the thread interrupt management registers needs to be
collected for migration. These registers are cached under the
'xive_saved_state.w01' field of the VCPU when the VCPU context is
pulled from the HW thread. An OPAL call retrieves the backup of the
IPB register in the underlying XIVE NVT structure and merges it in the
KVM state.
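
The new xive_timaval member keeps the one_reg payload at two 64-bit
words, holding word0/word1 of the thread interrupt management state. A
quick sketch checking that layout assumption (the union is a local
mirror, not the real kvmppc_one_reg):

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the union member added by this patch: the saved
 * thread interrupt management state fits in two 64-bit words. */
union one_reg_sketch {
        uint64_t xive_timaval[2];
        uint8_t  bytes[16];
};
```

Userspace would move this state with the KVM_GET_ONE_REG /
KVM_SET_ONE_REG ioctls during migration.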

Signed-off-by: Cédric Le Goater 
---
 
 Changes since v2 :

 - reduced the size of kvmppc_one_reg timaval attribute to two u64s
 - stopped returning the OS CAM line value

 arch/powerpc/include/asm/kvm_ppc.h | 11 
 arch/powerpc/include/uapi/asm/kvm.h|  2 +
 arch/powerpc/kvm/book3s.c  | 24 +++
 arch/powerpc/kvm/book3s_xive_native.c  | 76 ++
 Documentation/virtual/kvm/devices/xive.txt | 19 ++
 5 files changed, 132 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 1e61877fe147..37c61a64f68d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -272,6 +272,7 @@ union kvmppc_one_reg {
u64 addr;
u64 length;
}   vpaval;
+   u64 xive_timaval[2];
 };
 
 struct kvmppc_ops {
@@ -604,6 +605,10 @@ extern int kvmppc_xive_native_connect_vcpu(struct 
kvm_device *dev,
 extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
 extern void kvmppc_xive_native_init_module(void);
 extern void kvmppc_xive_native_exit_module(void);
+extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu,
+union kvmppc_one_reg *val);
+extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu,
+union kvmppc_one_reg *val);
 
 #else
 static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
@@ -636,6 +641,12 @@ static inline int kvmppc_xive_native_connect_vcpu(struct 
kvm_device *dev,
 static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
 static inline void kvmppc_xive_native_init_module(void) { }
 static inline void kvmppc_xive_native_exit_module(void) { }
+static inline int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu,
+   union kvmppc_one_reg *val)
+{ return 0; }
+static inline int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu,
+   union kvmppc_one_reg *val)
+{ return -ENOENT; }
 
 #endif /* CONFIG_KVM_XIVE */
 
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index caf52be89494..3de0d1395c01 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -480,6 +480,8 @@ struct kvm_ppc_cpu_char {
 #define  KVM_REG_PPC_ICP_PPRI_SHIFT16  /* pending irq priority */
 #define  KVM_REG_PPC_ICP_PPRI_MASK 0xff
 
+#define KVM_REG_PPC_VP_STATE   (KVM_REG_PPC | KVM_REG_SIZE_U128 | 0x8d)
+
 /* Device control API: PPC-specific devices */
 #define KVM_DEV_MPIC_GRP_MISC  1
 #define   KVM_DEV_MPIC_BASE_ADDR   0   /* 64-bit */
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 96d43f091255..f85a9211f30c 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -641,6 +641,18 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
		*val = get_reg_val(id, kvmppc_xics_get_icp(vcpu));
break;
 #endif /* CONFIG_KVM_XICS */
+#ifdef CONFIG_KVM_XIVE
+   case KVM_REG_PPC_VP_STATE:
+   if (!vcpu->arch.xive_vcpu) {
+   r = -ENXIO;
+   break;
+   }
+   if (xive_enabled())
+   r = kvmppc_xive_native_get_vp(vcpu, val);
+   else
+   r = -ENXIO;
+   break;
+#endif /* CONFIG_KVM_XIVE */
case KVM_REG_PPC_FSCR:
*val = get_reg_val(id, vcpu->arch.fscr);
break;
@@ -714,6 +726,18 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
		r = kvmppc_xics_set_icp(vcpu, set_reg_val(id, *val));
break;
 #endif /* CONFIG_KVM_XICS */
+#ifdef CONFIG_KVM_XIVE
+   case KVM_REG_PPC_VP_STATE:
+   if (!vcpu->arch.xive_vcpu) {
+   r = -ENXIO;
+   break;
+   }
+   if (xive_enabled())
+   r = kvmppc_xive_native_set_vp(vcpu, val);
+   else
+   r = -ENXIO;
+   break;
+#endif /* CONFIG_KVM_XIVE */
case KVM_REG_PPC_FSCR:
vcpu->arch.fscr = set_reg_val(id, *val);
break;
diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch

[PATCH v3 05/17] KVM: PPC: Book3S HV: XIVE: add a control to configure a source

2019-03-15 Thread Cédric Le Goater
This control will be used by the H_INT_SET_SOURCE_CONFIG hcall from
QEMU to configure the target of a source and also to restore the
configuration of a source when migrating the VM.

The XIVE source interrupt structure is extended with the value of the
Effective Interrupt Source Number. The EISN is the interrupt number
pushed into the event queue, which the guest OS uses to dispatch
events internally. Caching the EISN value in KVM simplifies checking
whether a reconfiguration is needed.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v2:

 - fixed comments on the KVM device attribute definitions
 - handled MASKED EAS configuration
 - fixed locking on source block
 
 arch/powerpc/include/uapi/asm/kvm.h| 11 +++
 arch/powerpc/kvm/book3s_xive.h |  4 +
 arch/powerpc/kvm/book3s_xive.c |  5 +-
 arch/powerpc/kvm/book3s_xive_native.c  | 97 ++
 Documentation/virtual/kvm/devices/xive.txt | 21 +
 5 files changed, 136 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 11985148073f..12bb01baf0ae 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -678,9 +678,20 @@ struct kvm_ppc_cpu_char {
 /* POWER9 XIVE Native Interrupt Controller */
 #define KVM_DEV_XIVE_GRP_CTRL  1
 #define KVM_DEV_XIVE_GRP_SOURCE		2	/* 64-bit source identifier */
+#define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3   /* 64-bit source identifier */
 
 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE   (1ULL << 0)
 #define KVM_XIVE_LEVEL_ASSERTED(1ULL << 1)
 
+/* Layout of 64-bit XIVE source configuration attribute values */
+#define KVM_XIVE_SOURCE_PRIORITY_SHIFT 0
+#define KVM_XIVE_SOURCE_PRIORITY_MASK  0x7
+#define KVM_XIVE_SOURCE_SERVER_SHIFT   3
+#define KVM_XIVE_SOURCE_SERVER_MASK	0xfffffff8ULL
+#define KVM_XIVE_SOURCE_MASKED_SHIFT	32
+#define KVM_XIVE_SOURCE_MASKED_MASK	0x100000000ULL
+#define KVM_XIVE_SOURCE_EISN_SHIFT	33
+#define KVM_XIVE_SOURCE_EISN_MASK	0xfffffffe00000000ULL
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 1be921cb5dcb..ae26fe653d98 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -61,6 +61,9 @@ struct kvmppc_xive_irq_state {
bool saved_p;
bool saved_q;
u8 saved_scan_prio;
+
+   /* Xive native */
+   u32 eisn;   /* Guest Effective IRQ number */
 };
 
 /* Select the "right" interrupt (IPI vs. passthrough) */
@@ -268,6 +271,7 @@ int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu);
 struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
struct kvmppc_xive *xive, int irq);
 void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
+int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio);
 
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 6c9f9fd0855f..e09f3addffe5 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -342,7 +342,7 @@ static int xive_try_pick_queue(struct kvm_vcpu *vcpu, u8 prio)
return atomic_add_unless(&q->count, 1, max) ? 0 : -EBUSY;
 }
 
-static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
+int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
 {
struct kvm_vcpu *vcpu;
int i, rc;
@@ -530,7 +530,7 @@ static int xive_target_interrupt(struct kvm *kvm,
 * priority. The count for that new target will have
 * already been incremented.
 */
-   rc = xive_select_target(kvm, &server, prio);
+   rc = kvmppc_xive_select_target(kvm, &server, prio);
 
/*
 * We failed to find a target ? Not much we can do
@@ -1504,6 +1504,7 @@ struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
 
for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) {
sb->irq_state[i].number = (bid << KVMPPC_XICS_ICS_SHIFT) | i;
+   sb->irq_state[i].eisn = 0;
sb->irq_state[i].guest_priority = MASKED;
sb->irq_state[i].saved_priority = MASKED;
sb->irq_state[i].act_priority = MASKED;
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 99c04d5c5566..b841d339f674 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -247,6 +247,99 @@ static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
return rc;
 }
 
+static int kvmppc_xive_native_update_source_config(struct kvmppc_xive *xive,
+   struct kvmppc_xive_src_block *sb,
+   struct kvmp

[PATCH v3 03/17] KVM: PPC: Book3S HV: XIVE: introduce a new capability KVM_CAP_PPC_IRQ_XIVE

2019-03-15 Thread Cédric Le Goater
The user interface exposes a new capability KVM_CAP_PPC_IRQ_XIVE to
let QEMU connect the vCPU presenters to the XIVE KVM device if
required. The capability is not advertised for now as the full support
for the XIVE native exploitation mode is not yet available. When this
is the case, the capability will be advertised on PowerNV Hypervisors
only. Nested guests (pseries KVM Hypervisor) are not supported.

Internally, the interface to the new KVM device is protected with a
new interrupt mode: KVMPPC_IRQ_XIVE.

Signed-off-by: Cédric Le Goater 
---

 Changes since v2:

 - made use of the xive_vp() macro to compute VP identifiers
 - reworked locking in kvmppc_xive_native_connect_vcpu() to fix races 
 - stop advertising KVM_CAP_PPC_IRQ_XIVE as support is not fully
   available yet 
 
 arch/powerpc/include/asm/kvm_host.h   |   1 +
 arch/powerpc/include/asm/kvm_ppc.h|  13 +++
 arch/powerpc/kvm/book3s_xive.h|  11 ++
 include/uapi/linux/kvm.h  |   1 +
 arch/powerpc/kvm/book3s_xive.c|  88 ---
 arch/powerpc/kvm/book3s_xive_native.c | 150 ++
 arch/powerpc/kvm/powerpc.c|  36 +++
 Documentation/virtual/kvm/api.txt |   9 ++
 8 files changed, 268 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 9f75a75a07f2..eb8581be0ee8 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -448,6 +448,7 @@ struct kvmppc_passthru_irqmap {
 #define KVMPPC_IRQ_DEFAULT 0
 #define KVMPPC_IRQ_MPIC1
 #define KVMPPC_IRQ_XICS2 /* Includes a XIVE option */
+#define KVMPPC_IRQ_XIVE3 /* XIVE native exploitation mode */
 
 #define MMIO_HPTE_CACHE_SIZE   4
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 4b72ddde7dc1..1e61877fe147 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -594,6 +594,14 @@ extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
   int level, bool line_status);
 extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
 
+static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE;
+}
+
+extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
+  struct kvm_vcpu *vcpu, u32 cpu);
+extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
 extern void kvmppc_xive_native_init_module(void);
 extern void kvmppc_xive_native_exit_module(void);
 
@@ -621,6 +629,11 @@ static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
			  int level, bool line_status) { return -ENODEV; }
 static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
 
+static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
+   { return 0; }
+static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
+ struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; }
+static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
 static inline void kvmppc_xive_native_init_module(void) { }
 static inline void kvmppc_xive_native_exit_module(void) { }
 
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index a08ae6fd4c51..d366df69b9cb 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -198,6 +198,11 @@ static inline struct kvmppc_xive_src_block *kvmppc_xive_find_source(struct kvmpp
return xive->src_blocks[bid];
 }
 
+static inline u32 kvmppc_xive_vp(struct kvmppc_xive *xive, u32 server)
+{
+   return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
+}
+
 /*
  * Mapping between guest priorities and host priorities
  * is as follow.
@@ -248,5 +253,11 @@ extern int (*__xive_vm_h_ipi)(struct kvm_vcpu *vcpu, unsigned long server,
 extern int (*__xive_vm_h_cppr)(struct kvm_vcpu *vcpu, unsigned long cppr);
 extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr);
 
+/*
+ * Common Xive routines for XICS-over-XIVE and XIVE native
+ */
+void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu);
+int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu);
+
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e6368163d3a0..52bf74a1616e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_VM_IPA_SIZE 165
 #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
 #define KVM_CAP_HYPERV_CPUID 167
+#define KVM_CAP_PPC_IRQ_XIVE 168
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index f78d002f0fe0..e7f1ada1c3de 100644
--- a/arch/powerpc/kvm/

[PATCH v3 08/17] KVM: PPC: Book3S HV: XIVE: add a control to sync the sources

2019-03-15 Thread Cédric Le Goater
This control will be used by the H_INT_SYNC hcall from QEMU to flush
event notifications on the XIVE IC owning the source.

Signed-off-by: Cédric Le Goater 
---

 Changes since v2 :

 - fixed locking on source block

 arch/powerpc/include/uapi/asm/kvm.h|  1 +
 arch/powerpc/kvm/book3s_xive_native.c  | 36 ++
 Documentation/virtual/kvm/devices/xive.txt |  8 +
 3 files changed, 45 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 95e82ab57c03..fc9211dbfec8 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -681,6 +681,7 @@ struct kvm_ppc_cpu_char {
 #define KVM_DEV_XIVE_GRP_SOURCE		2	/* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_SOURCE_CONFIG 3   /* 64-bit source identifier */
 #define KVM_DEV_XIVE_GRP_EQ_CONFIG 4   /* 64-bit EQ identifier */
+#define KVM_DEV_XIVE_GRP_SOURCE_SYNC   5   /* 64-bit source identifier */
 
 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE   (1ULL << 0)
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 3385c336fd89..26ac3c505cd2 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -340,6 +340,38 @@ static int kvmppc_xive_native_set_source_config(struct kvmppc_xive *xive,
   priority, masked, eisn);
 }
 
+static int kvmppc_xive_native_sync_source(struct kvmppc_xive *xive,
+ long irq, u64 addr)
+{
+   struct kvmppc_xive_src_block *sb;
+   struct kvmppc_xive_irq_state *state;
+   struct xive_irq_data *xd;
+   u32 hw_num;
+   u16 src;
+   int rc = 0;
+
+   pr_devel("%s irq=0x%lx", __func__, irq);
+
+   sb = kvmppc_xive_find_source(xive, irq, &src);
+   if (!sb)
+   return -ENOENT;
+
+   state = &sb->irq_state[src];
+
+   rc = -EINVAL;
+
+   arch_spin_lock(&sb->lock);
+
+   if (state->valid) {
+   kvmppc_xive_select_irq(state, &hw_num, &xd);
+   xive_native_sync_source(hw_num);
+   rc = 0;
+   }
+
+   arch_spin_unlock(&sb->lock);
+   return rc;
+}
+
 static int xive_native_validate_queue_size(u32 qsize)
 {
/*
@@ -658,6 +690,9 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
case KVM_DEV_XIVE_GRP_EQ_CONFIG:
return kvmppc_xive_native_set_queue_config(xive, attr->attr,
   attr->addr);
+   case KVM_DEV_XIVE_GRP_SOURCE_SYNC:
+   return kvmppc_xive_native_sync_source(xive, attr->attr,
+ attr->addr);
}
return -ENXIO;
 }
@@ -687,6 +722,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
break;
case KVM_DEV_XIVE_GRP_SOURCE:
case KVM_DEV_XIVE_GRP_SOURCE_CONFIG:
+   case KVM_DEV_XIVE_GRP_SOURCE_SYNC:
if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
attr->attr < KVMPPC_XIVE_NR_IRQS)
return 0;
diff --git a/Documentation/virtual/kvm/devices/xive.txt b/Documentation/virtual/kvm/devices/xive.txt
index e1893d303ab7..055aed0c2abb 100644
--- a/Documentation/virtual/kvm/devices/xive.txt
+++ b/Documentation/virtual/kvm/devices/xive.txt
@@ -89,3 +89,11 @@ the legacy interrupt mode, referred as XICS (POWER7/8).
 -EINVAL: Invalid queue address
 -EFAULT: Invalid user pointer for attr->addr.
 -EIO:Configuration of the underlying HW failed
+
+  5. KVM_DEV_XIVE_GRP_SOURCE_SYNC (write only)
+  Synchronize the source to flush event notifications
+  Attributes:
+Interrupt source number  (64-bit)
+  Errors:
+-ENOENT: Unknown source number
+-EINVAL: Not initialized source number
-- 
2.20.1



Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-03-15 Thread Peter Zijlstra
On Fri, Mar 15, 2019 at 12:12:45PM +0100, Laurent Vivier wrote:

> Another way to avoid the nodes overlapping for the offline nodes at
> startup is to ensure the default values don't define a distance that
> merge all offline nodes into node 0.
> 
> A powerpc specific patch can workaround the kernel crash by doing this:
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 87f0dd0..3ba29bb 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
> struct device_node *memory;
> int default_nid = 0;
> unsigned long i;
> +   int nid, dist;
> 
> if (numa_enabled == 0) {
> printk(KERN_WARNING "NUMA disabled by user\n");
> @@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)
> 
> dbg("NUMA associativity depth for CPU/Memory: %d\n",
> min_common_depth);
> 
> +   for (nid = 0; nid < MAX_NUMNODES; nid ++)
> +   for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
> +   distance_lookup_table[nid][dist] = nid;
> +
> /*
>  * Even though we connect cpus to numa domains later in SMP
>  * init, we need to know the node ids now. This is because

What does that actually do? That is, what does it make the distance
table look like before and after you bring up the CPUs?

> Any comment?

Well, I had a few questions here:

  20190305115952.gh32...@hirez.programming.kicks-ass.net

that I've not yet seen answers to.


[PATCH v3 16/17] KVM: introduce a KVM_DESTROY_DEVICE ioctl

2019-03-15 Thread Cédric Le Goater
The 'destroy' method is currently used to destroy all devices when the
VM is destroyed after the vCPUs have been freed.

This new KVM ioctl exposes the same KVM device method. It acts as a
software reset of the VM to 'destroy' selected devices when necessary
and performs the required cleanups on the vCPUs. It is called with the
kvm->lock held.

The 'destroy' method could be improved by returning an error code.

Cc: Paolo Bonzini 
Signed-off-by: Cédric Le Goater 
---

  Changes since v2 :

 - checked that device is owned by VM
 
 include/uapi/linux/kvm.h  |  7 ++
 virt/kvm/kvm_main.c   | 42 +++
 Documentation/virtual/kvm/api.txt | 20 +++
 3 files changed, 69 insertions(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 52bf74a1616e..d78fafa54274 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1183,6 +1183,11 @@ struct kvm_create_device {
__u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
 };
 
+struct kvm_destroy_device {
+   __u32   fd; /* in: device handle */
+   __u32   flags;  /* in: unused */
+};
+
 struct kvm_device_attr {
__u32   flags;  /* no flags currently defined */
__u32   group;  /* device-defined */
@@ -1331,6 +1336,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
 
+#define KVM_DESTROY_DEVICE   _IOWR(KVMIO,  0xf0, struct kvm_destroy_device)
+
 /*
  * ioctls for vcpu fds
  */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e4881a8c2a6f..7b616a1d48cf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3026,6 +3026,34 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
return 0;
 }
 
+static int kvm_ioctl_destroy_device(struct kvm *kvm,
+   struct kvm_destroy_device *dd)
+{
+   struct fd f;
+   struct kvm_device *dev;
+
+   f = fdget(dd->fd);
+   if (!f.file)
+   return -EBADF;
+
+   dev = kvm_device_from_filp(f.file);
+   fdput(f);
+
+   if (!dev)
+   return -ENODEV;
+
+   if (dev->kvm != kvm)
+   return -EPERM;
+
+   mutex_lock(&kvm->lock);
+   list_del(&dev->vm_node);
+   dev->ops->destroy(dev);
+   mutex_unlock(&kvm->lock);
+
+   /* TODO: kvm_put_kvm() crashes the host on some occasion ? */
+   return 0;
+}
+
 static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 {
switch (arg) {
@@ -3270,6 +3298,20 @@ static long kvm_vm_ioctl(struct file *filp,
r = 0;
break;
}
+   case KVM_DESTROY_DEVICE: {
+   struct kvm_destroy_device dd;
+
+   r = -EFAULT;
+   if (copy_from_user(&dd, argp, sizeof(dd)))
+   goto out;
+
+   r = kvm_ioctl_destroy_device(kvm, &dd);
+   if (r)
+   goto out;
+
+   r = 0;
+   break;
+   }
case KVM_CHECK_EXTENSION:
r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
break;
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 1db1435769b4..914471494602 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3857,6 +3857,26 @@ number of valid entries in the 'entries' array, which is then filled.
 'index' and 'flags' fields in 'struct kvm_cpuid_entry2' are currently reserved,
 userspace should not expect to get any particular value there.
 
+4.119 KVM_DESTROY_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_destroy_device (in)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EPERM: The device does not belong to the VM
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
+Destroys an emulated device in the kernel.
+
+struct kvm_destroy_device {
+   __u32   fd; /* in: device handle */
+   __u32   flags;  /* unused */
+};
+
 5. The kvm_run structure
 
 
-- 
2.20.1



[PATCH v3 01/17] powerpc/xive: add OPAL extensions for the XIVE native exploitation support

2019-03-15 Thread Cédric Le Goater
The support for XIVE native exploitation mode in Linux/KVM needs a
couple more OPAL calls to get and set the state of the XIVE internal
structures being used by a sPAPR guest.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v2:
 
 - remove extra OPAL call definitions

 arch/powerpc/include/asm/opal-api.h   |  7 +-
 arch/powerpc/include/asm/opal.h   |  7 ++
 arch/powerpc/include/asm/xive.h   | 14 +++
 arch/powerpc/sysdev/xive/native.c | 99 +++
 .../powerpc/platforms/powernv/opal-wrappers.S |  3 +
 5 files changed, 127 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 870fb7b239ea..e1d118ac61dc 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -186,8 +186,8 @@
 #define OPAL_XIVE_FREE_IRQ 140
 #define OPAL_XIVE_SYNC 141
 #define OPAL_XIVE_DUMP 142
-#define OPAL_XIVE_RESERVED3143
-#define OPAL_XIVE_RESERVED4144
+#define OPAL_XIVE_GET_QUEUE_STATE  143
+#define OPAL_XIVE_SET_QUEUE_STATE  144
 #define OPAL_SIGNAL_SYSTEM_RESET   145
 #define OPAL_NPU_INIT_CONTEXT  146
 #define OPAL_NPU_DESTROY_CONTEXT   147
@@ -210,7 +210,8 @@
 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR   164
 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR   165
 #define OPAL_NX_COPROC_INIT167
-#define OPAL_LAST  167
+#define OPAL_XIVE_GET_VP_STATE 170
+#define OPAL_LAST  170
 
 #define QUIESCE_HOLD   1 /* Spin all calls at entry */
 #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index a55b01c90bb1..4e978d4dea5c 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -279,6 +279,13 @@ int64_t opal_xive_allocate_irq(uint32_t chip_id);
 int64_t opal_xive_free_irq(uint32_t girq);
 int64_t opal_xive_sync(uint32_t type, uint32_t id);
 int64_t opal_xive_dump(uint32_t type, uint32_t id);
+int64_t opal_xive_get_queue_state(uint64_t vp, uint32_t prio,
+ __be32 *out_qtoggle,
+ __be32 *out_qindex);
+int64_t opal_xive_set_queue_state(uint64_t vp, uint32_t prio,
+ uint32_t qtoggle,
+ uint32_t qindex);
+int64_t opal_xive_get_vp_state(uint64_t vp, __be64 *out_w01);
 int64_t opal_pci_set_p2p(uint64_t phb_init, uint64_t phb_target,
uint64_t desc, uint16_t pe_number);
 
diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 3c704f5dd3ae..b579a943407b 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -109,12 +109,26 @@ extern int xive_native_configure_queue(u32 vp_id, struct 
xive_q *q, u8 prio,
 extern void xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio);
 
 extern void xive_native_sync_source(u32 hw_irq);
+extern void xive_native_sync_queue(u32 hw_irq);
 extern bool is_xive_irq(struct irq_chip *chip);
 extern int xive_native_enable_vp(u32 vp_id, bool single_escalation);
 extern int xive_native_disable_vp(u32 vp_id);
 extern int xive_native_get_vp_info(u32 vp_id, u32 *out_cam_id, u32 
*out_chip_id);
 extern bool xive_native_has_single_escalation(void);
 
+extern int xive_native_get_queue_info(u32 vp_id, uint32_t prio,
+ u64 *out_qpage,
+ u64 *out_qsize,
+ u64 *out_qeoi_page,
+ u32 *out_escalate_irq,
+ u64 *out_qflags);
+
+extern int xive_native_get_queue_state(u32 vp_id, uint32_t prio, u32 *qtoggle,
+  u32 *qindex);
+extern int xive_native_set_queue_state(u32 vp_id, uint32_t prio, u32 qtoggle,
+  u32 qindex);
+extern int xive_native_get_vp_state(u32 vp_id, u64 *out_state);
+
 #else
 
 static inline bool xive_enabled(void) { return false; }
diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index 1ca127d052a6..0c037e933e55 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -437,6 +437,12 @@ void xive_native_sync_source(u32 hw_irq)
 }
 EXPORT_SYMBOL_GPL(xive_native_sync_source);
 
+void xive_native_sync_queue(u32 hw_irq)
+{
+   opal_xive_sync(XIVE_SYNC_QUEUE, hw_irq);
+}
+EXPORT_SYMBOL_GPL(xive_native_sync_queue);
+
 static const struct xive_ops xive_native_ops = {
.populate_irq_data  = xive_native_populate_irq_data,
.configure_irq  = xive_native_configure_irq,

Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-03-15 Thread Laurent Vivier
On 15/03/2019 13:25, Peter Zijlstra wrote:
> On Fri, Mar 15, 2019 at 12:12:45PM +0100, Laurent Vivier wrote:
> 
>> Another way to avoid the nodes overlapping for the offline nodes at
>> startup is to ensure the default values don't define a distance that
>> merge all offline nodes into node 0.
>>
>> A powerpc specific patch can workaround the kernel crash by doing this:
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 87f0dd0..3ba29bb 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
>> struct device_node *memory;
>> int default_nid = 0;
>> unsigned long i;
>> +   int nid, dist;
>>
>> if (numa_enabled == 0) {
>> printk(KERN_WARNING "NUMA disabled by user\n");
>> @@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)
>>
>> dbg("NUMA associativity depth for CPU/Memory: %d\n",
>> min_common_depth);
>>
>> +   for (nid = 0; nid < MAX_NUMNODES; nid ++)
>> +   for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
>> +   distance_lookup_table[nid][dist] = nid;
>> +
>> /*
>>  * Even though we connect cpus to numa domains later in SMP
>>  * init, we need to know the node ids now. This is because
> 
> What does that actually do? That is, what does it make the distance
> table look like before and after you bring up the CPUs?

By default the table is full of 0. When a CPU is brought up, the value
is read from the device tree and the table is updated. What I've seen
is that this value is common to two nodes at a given level if they
share that level. So, as the table is initialized with 0, all offline
nodes (no memory, no CPU) are merged with node 0.
My fix initializes the table with unique values for each node, so by
default no nodes are mixed.

> 
>> Any comment?
> 
> Well, I had a few questions here:
> 
>   20190305115952.gh32...@hirez.programming.kicks-ass.net
> 
> that I've not yet seen answers to.

I didn't answer because:

- I thought this was not the right way to fix the problem, as you said "it
seems very fragile and unfortunate",

- I don't have the answers; I'd really like someone from IBM who knows
the NUMA part of powerpc well to answer these questions... and perhaps
find a better solution.

Thanks,
Laurent





Re: bpf jit PPC64 (BE) test_verifier PTR_TO_STACK store/load failure

2019-03-15 Thread Naveen N. Rao

Segher Boessenkool wrote:

Hi!

On Wed, Mar 13, 2019 at 12:54:16PM +0200, Yauheni Kaliuta wrote:

This is because of the handling of the +2 offset.


The low two bits of instructions with primary opcodes 58 and 62 are part
of the opcode, not the offset.  These instructions can not have offsets
with the low two bits non-zero.


For stores it is:
#define PPC_STD(r, base, i) EMIT(PPC_INST_STD | ___PPC_RS(r) |\
 ___PPC_RA(base) | ((i) & 0xfffc))

and for loads
#define PPC_LD(r, base, i)  EMIT(PPC_INST_LD | ___PPC_RT(r) | \
 ___PPC_RA(base) | IMM_L(i))
#define IMM_L(i)((uintptr_t)(i) & 0xffff)

So, in the load case the offset +2 (immediate value) is not
masked, and turns the instruction into lwa instead of ld.

Would it be correct to & 0xfffc the immediate value as well?


That is only part of it.  The other thing is you have to make sure those
low bits are zero *already* (and then you do not need the mask anymore).
For example, if the low two bits are not zero load the offset into a
register instead (and then do ldx or lwax).


Thanks for pointing that out, Segher. That is a detail that is easily 
missed.


- Naveen




Re: [PATCH] powerpc/6xx: fix setup and use of SPRN_PGDIR for hash32

2019-03-15 Thread Christophe Leroy

Michael,

Are you able to get this merged before 5.1-rc1 comes out?

Thanks
Christophe

Le 08/03/2019 à 17:06, Christophe Leroy a écrit :



Le 08/03/2019 à 17:03, Segher Boessenkool a écrit :

On Fri, Mar 08, 2019 at 07:05:22AM +, Christophe Leroy wrote:

Not only the 603 but all 6xx need SPRN_PGDIR to be initialised at
startup. This patch move it from __setup_cpu_603() to start_here()
and __secondary_start(), close to the initialisation of SPRN_THREAD.


I thought you meant an SPR I did not know about.  But you just misspelled
SPRN_SPRG_PGDIR :-)



Oops.

Michael, can you fix the commit text (and subject) when applying ?






[PATCH v3 12/17] KVM: PPC: Book3S HV: XIVE: add a TIMA mapping

2019-03-15 Thread Cédric Le Goater
Each thread has an associated Thread Interrupt Management context
composed of a set of registers. These registers let the thread handle
priority management and interrupt acknowledgment. The most important
are:

- Interrupt Pending Buffer (IPB)
- Current Processor Priority   (CPPR)
- Notification Source Register (NSR)

They are exposed to software in four different pages, each offering a
view with a different privilege level. The first page is for the
physical thread context and the second for the hypervisor. Only the
third (operating system) and the fourth (user level) are exposed to
the guest.

A custom VM fault handler will populate the VMA with the appropriate
pages, which should only be the OS page for now.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/xive.h|  1 +
 arch/powerpc/include/uapi/asm/kvm.h|  2 ++
 arch/powerpc/kvm/book3s_xive_native.c  | 39 ++
 arch/powerpc/sysdev/xive/native.c  | 11 ++
 Documentation/virtual/kvm/devices/xive.txt | 23 +
 5 files changed, 76 insertions(+)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 46891f321606..eb6d302082da 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -23,6 +23,7 @@
  * same offset regardless of where the code is executing
  */
 extern void __iomem *xive_tima;
+extern unsigned long xive_tima_os;
 
 /*
  * Offset in the TM area of our current execution level (provided by
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 3de0d1395c01..6836d38a517c 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -720,4 +720,6 @@ struct kvm_ppc_xive_eq {
 #define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY 0x0002
 #define KVM_XIVE_EQ_FLAG_ESCALATE  0x0004
 
+#define KVM_XIVE_TIMA_PAGE_OFFSET  0
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index 675c209cf570..f10087dbcac2 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -170,6 +170,44 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
return rc;
 }
 
+static int xive_native_tima_fault(struct vm_fault *vmf)
+{
+   struct vm_area_struct *vma = vmf->vma;
+
+   switch (vmf->pgoff - vma->vm_pgoff) {
+   case 0: /* HW - forbid access */
+   case 1: /* HV - forbid access */
+   return VM_FAULT_SIGBUS;
+   case 2: /* OS */
+   vmf_insert_pfn(vma, vmf->address, xive_tima_os >> PAGE_SHIFT);
+   return VM_FAULT_NOPAGE;
+   case 3: /* USER - TODO */
+   default:
+   return VM_FAULT_SIGBUS;
+   }
+}
+
+static const struct vm_operations_struct xive_native_tima_vmops = {
+   .fault = xive_native_tima_fault,
+};
+
+static int kvmppc_xive_native_mmap(struct kvm_device *dev,
+  struct vm_area_struct *vma)
+{
+   /* We only allow mappings at fixed offset for now */
+   if (vma->vm_pgoff == KVM_XIVE_TIMA_PAGE_OFFSET) {
+   if (vma_pages(vma) > 4)
+   return -EINVAL;
+   vma->vm_ops = &xive_native_tima_vmops;
+   } else {
+   return -EINVAL;
+   }
+
+   vma->vm_flags |= VM_IO | VM_PFNMAP;
+   vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot);
+   return 0;
+}
+
 static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
 u64 addr)
 {
@@ -1038,6 +1076,7 @@ struct kvm_device_ops kvm_xive_native_ops = {
.set_attr = kvmppc_xive_native_set_attr,
.get_attr = kvmppc_xive_native_get_attr,
.has_attr = kvmppc_xive_native_has_attr,
+   .mmap = kvmppc_xive_native_mmap,
 };
 
 void kvmppc_xive_native_init_module(void)
diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index 0c037e933e55..7782201e5fe8 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -521,6 +521,9 @@ u32 xive_native_default_eq_shift(void)
 }
 EXPORT_SYMBOL_GPL(xive_native_default_eq_shift);
 
+unsigned long xive_tima_os;
+EXPORT_SYMBOL_GPL(xive_tima_os);
+
 bool __init xive_native_init(void)
 {
struct device_node *np;
@@ -573,6 +576,14 @@ bool __init xive_native_init(void)
for_each_possible_cpu(cpu)
kvmppc_set_xive_tima(cpu, r.start, tima);
 
+   /* Resource 2 is OS window */
+   if (of_address_to_resource(np, 2, &r)) {
+   pr_err("Failed to get thread mgmnt area resource\n");
+   return false;
+   }
+
+   xive_tima_os = r.start;
+
/* Grab size of provisionning pages */
xive_parse_provisioning(np);
 
diff --git a/Documentation/virtual/kvm/devices/xive.txt 
b/Documentation/virtual/kvm/devices/xive.txt
index 

[PATCH v3 14/17] KVM: PPC: Book3S HV: XIVE: add passthrough support

2019-03-15 Thread Cédric Le Goater
The KVM XICS-over-XIVE device and the proposed KVM XIVE native device
implement an IRQ space for the guest using the generic IPI interrupts
of the XIVE IC controller. These interrupts are allocated at the OPAL
level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
Interrupt management is performed in the XIVE way: using loads and
stores on the addresses of the XIVE IPI interrupt ESB pages.

Both KVM devices share the same internal structure caching information
on the interrupts, among which the xive_irq_data struct containing the
addresses of the IPI ESB pages and an extra one in case of pass-through.
The latter contains the addresses of the ESB pages of the underlying HW
controller interrupts, PHB4 in all cases for now.

A guest, when running in the XICS legacy interrupt mode, lets the KVM
XICS-over-XIVE device "handle" interrupt management, that is to
perform the loads and stores on the addresses of the ESB pages of the
guest interrupts. However, when running in XIVE native exploitation
mode, the KVM XIVE native device exposes the interrupt ESB pages to
the guest and lets the guest perform directly the loads and stores.

The VMA exposing the ESB pages makes use of a custom VM fault handler
whose role is to populate the VMA with appropriate pages. When a fault
occurs, the guest IRQ number is deduced from the offset, and the ESB
pages of the associated XIVE IPI interrupt are inserted in the VMA (using
the internal structure caching information on the interrupts).

Supporting device passthrough in the guest running in XIVE native
exploitation mode adds some extra refinements because the ESB pages
of a different HW controller (PHB4) need to be exposed to the guest
along with the initial IPI ESB pages of the XIVE IC controller. But
the overall mechanism is the same.

When the device HW irqs are mapped into or unmapped from the guest
IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
and kvmppc_xive_clr_mapped(), are called to record or clear the
passthrough interrupt information and to perform the switch.

The approach taken by this patch is to clear the ESB pages of the
guest IRQ number being mapped and let the VM fault handler repopulate.
The handler will insert the ESB page corresponding to the HW interrupt
of the device being passed-through or the initial IPI ESB page if the
device is being removed.

Signed-off-by: Cédric Le Goater 
---

 Changes since v2 :

 - extra comment in documentation

 arch/powerpc/kvm/book3s_xive.h |  9 +
 arch/powerpc/kvm/book3s_xive.c | 15 
 arch/powerpc/kvm/book3s_xive_native.c  | 41 ++
 Documentation/virtual/kvm/devices/xive.txt | 19 ++
 4 files changed, 84 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 622f594d93e1..e011622dc038 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -94,6 +94,11 @@ struct kvmppc_xive_src_block {
struct kvmppc_xive_irq_state irq_state[KVMPPC_XICS_IRQ_PER_ICS];
 };
 
+struct kvmppc_xive;
+
+struct kvmppc_xive_ops {
+   int (*reset_mapped)(struct kvm *kvm, unsigned long guest_irq);
+};
 
 struct kvmppc_xive {
struct kvm *kvm;
@@ -132,6 +137,10 @@ struct kvmppc_xive {
 
/* Flags */
u8  single_escalation;
+
+   struct kvmppc_xive_ops *ops;
+   struct address_space   *mapping;
+   struct mutex mapping_lock;
 };
 
 #define KVMPPC_XIVE_Q_COUNT8
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index c1b7aa7dbc28..480a3fc6b9fd 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -937,6 +937,13 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long 
guest_irq,
/* Turn the IPI hard off */
xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01);
 
+   /*
+* Reset ESB guest mapping. Needed when ESB pages are exposed
+* to the guest in XIVE native mode
+*/
+   if (xive->ops && xive->ops->reset_mapped)
+   xive->ops->reset_mapped(kvm, guest_irq);
+
/* Grab info about irq */
state->pt_number = hw_irq;
state->pt_data = irq_data_get_irq_handler_data(host_data);
@@ -1022,6 +1029,14 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned 
long guest_irq,
state->pt_number = 0;
state->pt_data = NULL;
 
+   /*
+* Reset ESB guest mapping. Needed when ESB pages are exposed
+* to the guest in XIVE native mode
+*/
+   if (xive->ops && xive->ops->reset_mapped) {
+   xive->ops->reset_mapped(kvm, guest_irq);
+   }
+
/* Reconfigure the IPI */
xive_native_configure_irq(state->ipi_number,
  kvmppc_xive_vp(xive, state->act_server),
diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
b/arch/powerpc/kvm/book3s_xive_native.c
index e465d4c53f5c..67a1bb26a4cc 100644
--- a/arch/powerpc/kvm/book3s_

Re: [PATCH 6/7] ocxl: afu_irq only deals with IRQ IDs, not offsets

2019-03-15 Thread Greg Kurz
On Wed, 13 Mar 2019 15:15:21 +1100
"Alastair D'Silva"  wrote:

> From: Alastair D'Silva 
> 
> The use of offsets is required only in the frontend, so alter
> the IRQ API to only work with IRQ IDs in the backend.
> 
> Signed-off-by: Alastair D'Silva 
> ---
>  drivers/misc/ocxl/afu_irq.c   | 31 +--
>  drivers/misc/ocxl/context.c   |  7 +--
>  drivers/misc/ocxl/file.c  | 13 -
>  drivers/misc/ocxl/ocxl_internal.h | 10 ++
>  drivers/misc/ocxl/trace.h | 12 
>  5 files changed, 36 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/misc/ocxl/afu_irq.c b/drivers/misc/ocxl/afu_irq.c
> index 11ab996657a2..1885c472df58 100644
> --- a/drivers/misc/ocxl/afu_irq.c
> +++ b/drivers/misc/ocxl/afu_irq.c
> @@ -14,14 +14,14 @@ struct afu_irq {
>   struct eventfd_ctx *ev_ctx;
>  };
>  
> -static int irq_offset_to_id(struct ocxl_context *ctx, u64 offset)
> +int ocxl_irq_offset_to_id(struct ocxl_context *ctx, u64 offset)
>  {
>   return (offset - ctx->afu->irq_base_offset) >> PAGE_SHIFT;
>  }
>  
> -static u64 irq_id_to_offset(struct ocxl_context *ctx, int id)
> +u64 ocxl_irq_id_to_offset(struct ocxl_context *ctx, int irq_id)
>  {
> - return ctx->afu->irq_base_offset + (id << PAGE_SHIFT);
> + return ctx->afu->irq_base_offset + (irq_id << PAGE_SHIFT);
>  }
>  
>  static irqreturn_t afu_irq_handler(int virq, void *data)
> @@ -69,7 +69,7 @@ static void release_afu_irq(struct afu_irq *irq)
>   kfree(irq->name);
>  }
>  
> -int ocxl_afu_irq_alloc(struct ocxl_context *ctx, u64 *irq_offset)
> +int ocxl_afu_irq_alloc(struct ocxl_context *ctx, int *irq_id)
>  {
>   struct afu_irq *irq;
>   int rc;
> @@ -101,10 +101,7 @@ int ocxl_afu_irq_alloc(struct ocxl_context *ctx, u64 
> *irq_offset)
>   if (rc)
>   goto err_alloc;
>  
> - *irq_offset = irq_id_to_offset(ctx, irq->id);

This should be replaced by:

*irq_id = irq->id;

> -
> - trace_ocxl_afu_irq_alloc(ctx->pasid, irq->id, irq->virq, irq->hw_irq,
> - *irq_offset);
> + trace_ocxl_afu_irq_alloc(ctx->pasid, irq->id, irq->virq, irq->hw_irq);
>   mutex_unlock(&ctx->irq_lock);
>   return 0;
>  
> @@ -123,7 +120,7 @@ static void afu_irq_free(struct afu_irq *irq, struct 
> ocxl_context *ctx)
>   trace_ocxl_afu_irq_free(ctx->pasid, irq->id);
>   if (ctx->mapping)
>   unmap_mapping_range(ctx->mapping,
> - irq_id_to_offset(ctx, irq->id),
> + ocxl_irq_id_to_offset(ctx, irq->id),
>   1 << PAGE_SHIFT, 1);
>   release_afu_irq(irq);
>   if (irq->ev_ctx)
> @@ -132,14 +129,13 @@ static void afu_irq_free(struct afu_irq *irq, struct 
> ocxl_context *ctx)
>   kfree(irq);
>  }
>  
> -int ocxl_afu_irq_free(struct ocxl_context *ctx, u64 irq_offset)
> +int ocxl_afu_irq_free(struct ocxl_context *ctx, int irq_id)
>  {
>   struct afu_irq *irq;
> - int id = irq_offset_to_id(ctx, irq_offset);
>  
>   mutex_lock(&ctx->irq_lock);
>  
> - irq = idr_find(&ctx->irq_idr, id);
> + irq = idr_find(&ctx->irq_idr, irq_id);
>   if (!irq) {
>   mutex_unlock(&ctx->irq_lock);
>   return -EINVAL;
> @@ -161,14 +157,14 @@ void ocxl_afu_irq_free_all(struct ocxl_context *ctx)
>   mutex_unlock(&ctx->irq_lock);
>  }
>  
> -int ocxl_afu_irq_set_fd(struct ocxl_context *ctx, u64 irq_offset, int 
> eventfd)
> +int ocxl_afu_irq_set_fd(struct ocxl_context *ctx, int irq_id, int eventfd)
>  {
>   struct afu_irq *irq;
>   struct eventfd_ctx *ev_ctx;
> - int rc = 0, id = irq_offset_to_id(ctx, irq_offset);
> + int rc = 0;
>  
>   mutex_lock(&ctx->irq_lock);
> - irq = idr_find(&ctx->irq_idr, id);
> + irq = idr_find(&ctx->irq_idr, irq_id);
>   if (!irq) {
>   rc = -EINVAL;
>   goto unlock;
> @@ -186,14 +182,13 @@ int ocxl_afu_irq_set_fd(struct ocxl_context *ctx, u64 
> irq_offset, int eventfd)
>   return rc;
>  }
>  
> -u64 ocxl_afu_irq_get_addr(struct ocxl_context *ctx, u64 irq_offset)
> +u64 ocxl_afu_irq_get_addr(struct ocxl_context *ctx, int irq_id)
>  {
>   struct afu_irq *irq;
> - int id = irq_offset_to_id(ctx, irq_offset);
>   u64 addr = 0;
>  
>   mutex_lock(&ctx->irq_lock);
> - irq = idr_find(&ctx->irq_idr, id);
> + irq = idr_find(&ctx->irq_idr, irq_id);
>   if (irq)
>   addr = irq->trigger_page;
>   mutex_unlock(&ctx->irq_lock);
> diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c
> index 9a37e9632cd9..c04887591837 100644
> --- a/drivers/misc/ocxl/context.c
> +++ b/drivers/misc/ocxl/context.c
> @@ -93,8 +93,9 @@ static vm_fault_t map_afu_irq(struct vm_area_struct *vma, 
> unsigned long address,
>   u64 offset, struct ocxl_context *ctx)
>  {
>   u64 trigger_addr;
> + int irq_id = ocxl_irq_offset_to_id(ctx, offset);
>  
> - trigger_addr = ocxl_

Mac Mini G4 hang on boot with git master

2019-03-15 Thread Mark Cave-Ayland
Hi all,

I've just done a git pull and rebuilt master on my Mac Mini G4 in order to test
Michael's merge of my KVM PR fix, and unfortunately my kernel now hangs on
boot :(

My last working git checkout was somewhere around the 5.0-rc stage, so I
suspect it's something that's been merged for 5.1.

The hang occurs just after the boot console is disabled which makes me wonder
if something is going wrong during PCI bus enumeration. Does anyone have an
idea as to what may be causing this? I can obviously bisect it down, but on
slow hardware it can take some time...


ATB,

Mark.


Re: [PATCH v2 10/45] drivers: tty: serial: zs: use devm_* functions

2019-03-15 Thread Greg KH
On Fri, Mar 15, 2019 at 10:06:16AM +0100, Enrico Weigelt, metux IT consult 
wrote:
> On 14.03.19 23:52, Greg KH wrote:
> > On Thu, Mar 14, 2019 at 11:33:40PM +0100, Enrico Weigelt, metux IT consult 
> > wrote:
> >> Use the safer devm versions of memory mapping functions.
> > 
> > What is "safer" about them?
> 
> Garbage collection :)

At times, but not when you do not use the api correctly :)

> Several drivers didn't seem to clean up properly (maybe they're just
> used compiled-in, so nobody noticed yet).

Yes, there are lots of drivers for devices that are never unloaded or
removed from the system.  The fact that no one has reported any problems
with them means that they are never used in situations like this.

> In general, I think devm_* should be the standard case, unless there's
> a really good reason to do otherwise.

No, you need to have a good reason why it needs to be changed, when you
can not verify that this actually resolves a problem.  As this patch
shows, you just replaced one api call with another, so nothing changed
at all, except you actually took up more memory and logic to do the same
thing :(
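
[Editor's note: the two calls under discussion differ only in who performs
the unmap. A hypothetical probe() fragment, names invented for illustration:]

```c
/* plain API: the driver must call iounmap() itself, in remove() and in
 * every error path past this point */
port->membase = ioremap(res->start, resource_size(res));

/* devm API: the mapping is torn down automatically on device unbind, at
 * the cost of an extra devres allocation that tracks it -- the extra
 * memory and logic referred to above, with no benefit when the driver
 * already unmaps in its own release hook */
port->membase = devm_ioremap(&pdev->dev, res->start, resource_size(res));
```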

> > Isn't the whole goal of the devm* functions such that you are not
> > required to call "release" on them?
> 
> Looks like that one slipped through, when I was doing that big bulk change
> in the middle of the night without enough coffee ;-)
> 
> One problem here is that many drivers do this stuff in request/release
> port, instead of probe/remove. I'm not sure yet, whether we should
> rewrite that. There're also cases which do request/release, but no
> ioremap (doing hardcoded register accesses), others do ioremap w/o
> request/release.
> 
> IMHO, we should have a closer look at those and check whether that's
> really okay (just adding request/release blindly could cause trouble)

I agree, please feel free to audit these to verify they are all correct.
But that's not what you did here, so that's not a valid justification
for these patches to be accepted.

> > And also, why make the change, you aren't changing any functionality for
> > these old drivers at all from what I can tell (for the devm calls).
> > What am I missing here?
> 
> Okay, there's a bigger story behind this that you can't know yet.
> Finally, I'd like to move everything to using struct resource and
> corresponding helpers consistently, so most of the drivers would be
> pretty simple at that point. (there are of course special cases, like
> devices w/ multiple register spaces, etc)
> 
> Here's my wip branch:
> 
> https://github.com/metux/linux/commits/wip/serial-res
> 
> In this consolidation process, I'm trying to move everything to
> devm_*, to have it more generic (for now, I still need two versions
> of the request/release/ioremap/iounmap helpers - one w/ and one
> w/o devm).

Move everything in what part of the kernel?  The whole kernel or just
one subsystem?

> My idea was moving to devm first, so it can be reviewed/tested
> independently, before moving forward. Smaller, easily digestable
> pieces should minimize the risk of breaking anything. But if you
> prefer having this things squashed together, just let me know.

Small pieces are required, that's fine, but those pieces need to have a
justification for why they should be accepted at all points along the
way.

> In the queue are also other minor cleanups like using dev_err()
> instead of printk(), etc. Should I send these separately?

Of course.

> By the way: do you have some public branch where you're collecting
> accepted patches, which I could base mine on? (tty.git/tty-next?)

Yes, that is the tree and branch, but remember that none of my trees can
open up until after -rc1 is out.

thanks,

greg k-h


[PATCH] powerpc: bpf: Fix generation of load/store DW instructions

2019-03-15 Thread Naveen N. Rao
Yauheni Kaliuta pointed out that the PTR_TO_STACK store/load verifier test
was failing on powerpc64 BE, and rightfully indicated that the PPC_LD()
macro is not masking away the last two bits of the offset per the ISA,
resulting in the generation of an 'lwa' instruction instead of the intended
'ld' instruction.

Segher also pointed out that we can't simply mask away the last two bits
as that will result in loading/storing from/to a memory location that
was not intended.

This patch addresses this by using ldx/stdx if the offset is not
word-aligned. We load the offset into a temporary register (TMP_REG_2)
and use that as the index register in a subsequent ldx/stdx. We fix
PPC_LD() macro to mask off the last two bits, but enhance PPC_BPF_LL()
and PPC_BPF_STL() to factor in the offset value and generate the proper
instruction sequence. We also convert all existing users of PPC_LD() and
PPC_STD() to use these macros. All existing uses of these macros have
been audited to ensure that TMP_REG_2 can be clobbered.

Fixes: 156d0e290e96 ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF")
Cc: sta...@vger.kernel.org # v4.9+

Reported-by: Yauheni Kaliuta 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/ppc-opcode.h |  2 ++
 arch/powerpc/net/bpf_jit.h| 17 +
 arch/powerpc/net/bpf_jit32.h  |  4 
 arch/powerpc/net/bpf_jit64.h  | 20 
 arch/powerpc/net/bpf_jit_comp64.c | 12 ++--
 5 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index c5698a523bb1..23f7ed796f38 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -302,6 +302,7 @@
 /* Misc instructions for BPF compiler */
 #define PPC_INST_LBZ   0x8800
 #define PPC_INST_LD0xe800
+#define PPC_INST_LDX   0x7c2a
 #define PPC_INST_LHZ   0xa000
 #define PPC_INST_LWZ   0x8000
 #define PPC_INST_LHBRX 0x7c00062c
@@ -309,6 +310,7 @@
 #define PPC_INST_STB   0x9800
 #define PPC_INST_STH   0xb000
 #define PPC_INST_STD   0xf800
+#define PPC_INST_STDX  0x7c00012a
 #define PPC_INST_STDU  0xf801
 #define PPC_INST_STW   0x9000
 #define PPC_INST_STWU  0x9400
diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 549e9490ff2a..dcac37745b05 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -51,6 +51,8 @@
 #define PPC_LIS(r, i)  PPC_ADDIS(r, 0, i)
 #define PPC_STD(r, base, i)EMIT(PPC_INST_STD | ___PPC_RS(r) |\
 ___PPC_RA(base) | ((i) & 0xfffc))
+#define PPC_STDX(r, base, b)   EMIT(PPC_INST_STDX | ___PPC_RS(r) |   \
+___PPC_RA(base) | ___PPC_RB(b))
 #define PPC_STDU(r, base, i)   EMIT(PPC_INST_STDU | ___PPC_RS(r) |   \
 ___PPC_RA(base) | ((i) & 0xfffc))
 #define PPC_STW(r, base, i)EMIT(PPC_INST_STW | ___PPC_RS(r) |\
@@ -65,7 +67,9 @@
 #define PPC_LBZ(r, base, i)EMIT(PPC_INST_LBZ | ___PPC_RT(r) |\
 ___PPC_RA(base) | IMM_L(i))
 #define PPC_LD(r, base, i) EMIT(PPC_INST_LD | ___PPC_RT(r) | \
-___PPC_RA(base) | IMM_L(i))
+___PPC_RA(base) | ((i) & 0xfffc))
+#define PPC_LDX(r, base, b)EMIT(PPC_INST_LDX | ___PPC_RT(r) |\
+___PPC_RA(base) | ___PPC_RB(b))
 #define PPC_LWZ(r, base, i)EMIT(PPC_INST_LWZ | ___PPC_RT(r) |\
 ___PPC_RA(base) | IMM_L(i))
 #define PPC_LHZ(r, base, i)EMIT(PPC_INST_LHZ | ___PPC_RT(r) |\
@@ -85,17 +89,6 @@
___PPC_RA(a) | ___PPC_RB(b))
 #define PPC_BPF_STDCX(s, a, b) EMIT(PPC_INST_STDCX | ___PPC_RS(s) |  \
___PPC_RA(a) | ___PPC_RB(b))
-
-#ifdef CONFIG_PPC64
-#define PPC_BPF_LL(r, base, i) do { PPC_LD(r, base, i); } while(0)
-#define PPC_BPF_STL(r, base, i) do { PPC_STD(r, base, i); } while(0)
-#define PPC_BPF_STLU(r, base, i) do { PPC_STDU(r, base, i); } while(0)
-#else
-#define PPC_BPF_LL(r, base, i) do { PPC_LWZ(r, base, i); } while(0)
-#define PPC_BPF_STL(r, base, i) do { PPC_STW(r, base, i); } while(0)
-#define PPC_BPF_STLU(r, base, i) do { PPC_STWU(r, base, i); } while(0)
-#endif
-
 #define PPC_CMPWI(a, i)EMIT(PPC_INST_CMPWI | ___PPC_RA(a) | IMM_L(i))
 #define PPC_CMPDI(a, i)EMIT(PPC_INST_CMPDI | ___PPC_RA(a) | IMM_L(i))
 #define PPC_CMPW(a, b) EMIT(PPC_INST_CMPW | ___PPC_RA(a) |   \
diff --git a/arch/pow

Re: [PATCH] powerpc: use $(origin ARCH) to select KBUILD_DEFCONFIG

2019-03-15 Thread Masahiro Yamada
On Thu, Mar 14, 2019 at 11:27 AM Michael Ellerman  wrote:
>
> Mathieu Malaterre  writes:
> > On Sat, Feb 16, 2019 at 3:26 AM Masahiro Yamada
> >  wrote:
> >>
> >> On Sat, Feb 16, 2019 at 1:11 AM Mathieu Malaterre  wrote:
> >> >
> >> > On Fri, Feb 15, 2019 at 10:41 AM Masahiro Yamada
> >> >  wrote:
> >> > >
> >> > > I often test all Kconfig commands for all architectures. To ease my
> >> > > workflow, I want 'make defconfig' at least working without any cross
> >> > > compiler.
> >> > >
> >> > > Currently, arch/powerpc/Makefile checks CROSS_COMPILE to decide the
> >> > > default defconfig source.
> >> > >
> >> > > If CROSS_COMPILE is unset, it is likely to be the native build, so
> >> > > 'uname -m' is useful to choose the defconfig. If CROSS_COMPILE is set,
> >> > > the user is cross-building (i.e. 'uname -m' is probably x86_64), so
> >> > > it falls back to ppc64_defconfig. Yup, make sense.
> >> > >
> >> > > However, I want to run 'make ARCH=* defconfig' without setting
> >> > > CROSS_COMPILE for each architecture.
> >> > >
> >> > > My suggestion is to check $(origin ARCH).
> >> > >
> >> > > When you cross-compile the kernel, you need to set ARCH from your
> >> > > environment or from the command line.
> >> > >
> >> > > For the native build, you do not need to set ARCH. The default in
> >> > > the top Makefile is used:
> >> > >
> >> > >   ARCH?= $(SUBARCH)
> >> > >
> >> > > Hence, $(origin ARCH) returns 'file'.
> >> > >
> >> > > Before this commit, 'make ARCH=powerpc defconfig' failed:
> >> >
> >> > In case you have not seen it, please check:
> >> >
> >> > http://patchwork.ozlabs.org/patch/1037835/
> >>
> >> I did not know that because I do not subscribe to ppc ML.
> >>
> >> Michael's patch looks good to me.
> >
> > OK
> >
> >>
> >> If you mimic x86, the following will work:
> >>
> >
> > Nice! Michael do you have a preference ?
>
> Yeah I don't like playing games with ARCH. Doing so means auto builders
> and other build scripts need to learn about the special rules for ARCH,
> which is a pain.
>
> So I'll merge my patch, which I think will also work for Masahiro's
> case.
>
> cheers


Yes, works for me.

Thanks!




-- 
Best Regards
Masahiro Yamada


Re: Mac Mini G4 hang on boot with git master

2019-03-15 Thread Mathieu Malaterre
Mark,

On Fri, Mar 15, 2019 at 3:08 PM Mark Cave-Ayland
 wrote:
>
> Hi all,
>
> I've just done a git pull and rebuilt master on my Mac Mini G4 in order to
> test Michael's merge of my KVM PR fix, and unfortunately my kernel now
> hangs on boot :(

Ouch :(

> My last working git checkout was somewhere around the 5.0-rc stage, so I
> suspect it's something that's been merged for 5.1.

OK. My last kernel is also somewhere around here on same hardware.

> The hang occurs just after the boot console is disabled which makes me
> wonder if something is going wrong during PCI bus enumeration. Does anyone
> have an idea as to what may be causing this? I can obviously bisect it
> down, but on slow hardware it can take some time...

When doing git bisect I compile from my amd64 machine using:

make O=g4 ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu- my_defconfig
make -j8 O=g4 ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu- bindeb-pkg
scp *image*.deb macminig4:

On Debian simply install:

# apt-get install crossbuild-essential-powerpc

>
> ATB,
>
> Mark.


Re: [RFC/WIP] powerpc: Fix 32-bit handling of MSR_EE on exceptions

2019-03-15 Thread Christophe Leroy




On 02/05/2019 10:10 AM, Michael Ellerman wrote:

Christophe Leroy  writes:

On 20/12/2018 at 23:35, Benjamin Herrenschmidt wrote:



/*
 * MSR_KERNEL is > 0x1 on 4xx/Book-E since it include MSR_CE.
@@ -205,20 +208,46 @@ transfer_to_handler_cont:
mflrr9
lwz r11,0(r9)   /* virtual address of handler */
lwz r9,4(r9)/* where to go when done */
+#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PERF_EVENTS)
+   mtspr   SPRN_NRI, r0
+#endif


That's not part of your patch, it's already in the tree.


Yup rebase glitch.

   .../...


I tested it on the 8xx with the below changes in addition. No issue seen
so far.


Thanks !

I'll merge that in.


I'm currently working on a refactorisation and simplification of
exception and syscall entry on ppc32.

I plan to take your patch in my series as it helps quite a bit. I hope
you don't mind. I expect to come out with a series this week.


Ben's AFK so go ahead and pull it in to your series if that helps you.
  

The main obscure area is that business with the irqsoff tracer and thus
the need to create stack frames around calls to trace_hardirqs_* ... we
do it in some places and not others, but I've not managed to make it
crash either. I need to get to the bottom of that, and possibly provide
proper macro helpers like ppc64 has to do it.


I can't see anything special around this in ppc32 code. As far as I
understand, a stack frame is put in place when there is a need to
save and restore some volatile registers. At the places where nothing
needs to be saved, nothing is done. I think that's the normal way for
any function call, isn't it ?


The concern was that the irqsoff tracer was doing
__builtin_return_address(1) (or some number > 0) and that crashes if
there aren't sufficiently many stack frames available.

See ftrace_return_address.

Possibly the answer is that we don't have CONFIG_FRAME_POINTER and so we
get the empty version of that.



Yes indeed, ftrace_return_address(1) is not __builtin_return_address(1) 
but 0ul as CONFIG_FRAME_POINTER is not defined. So the crash can't be 
due to that, as it would then crash regardless of whether we set a stack 
frame or not.
And anyway, as far as I understand, if the stack is properly 
initialised, __builtin_return_address(X) returns NULL and doesn't crash 
when the top of the backtrace is reached.
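
[Editor's note: the definitions in question look roughly like this --
paraphrased from include/linux/ftrace.h, check the exact tree before
relying on it:]

```c
#ifdef CONFIG_FRAME_POINTER
# define ftrace_return_address(n) __builtin_return_address(n)
#else
/* without frame pointers, callers deeper than level 0 degrade to 0UL,
 * so ftrace_return_address(1) never walks stack frames at all */
# define ftrace_return_address(n) 0UL
#endif
```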


Do you have more details about the said crash? I think we should file 
an issue for it in our issue database.


For the time being, I'll get rid of that unnecessary stack frame in 
entry_32.S as part of my syscall prolog optimising series.


Christophe


Re: [PATCH v2 15/16] KVM: introduce a KVM_DESTROY_DEVICE ioctl

2019-03-15 Thread Paolo Bonzini
On 13/03/19 09:02, Cédric Le Goater wrote:
> The 'destroy' method is currently used to destroy all devices when the
> VM is destroyed after the vCPUs have been freed.
> 
> This new KVM ioctl exposes the same KVM device method. It acts as a
> software reset of the VM to 'destroy' selected devices when necessary
> and perform the required cleanups on the vCPUs. Called with the
> kvm->lock.
> 
> The 'destroy' method could be improved by returning an error code.
> 
> Signed-off-by: Cédric Le Goater 

Looks good to me.

Paolo


Re: serial driver cleanups v2

2019-03-15 Thread Andy Shevchenko
On Fri, Mar 15, 2019 at 11:36:04AM +0100, Enrico Weigelt, metux IT consult 
wrote:
> On 15.03.19 10:12, Andy Shevchenko wrote:
> 
> >> Part II will be about moving the mmio range from mapbase and
> >> mapsize (which are used quite inconsistently) to a struct resource
> >> and using helpers for that. But this one isn't finished yet.
> >> (if somebody likes to have a look at it, I can send it, too)
> > 
> > Let's do it that way: you prepare a branch somewhere and announce it
> > here as an RFC, since this was neither tested nor correct.
> 
> Okay, here it is:
> 
> I. https://github.com/metux/linux/tree/submit/serial-clean-v3
>--> general cleanups, as basis for II
> 
> II. https://github.com/metux/linux/tree/wip/serial-res
>--> moving towards using struct resource consistently
> 
> III. https://github.com/metux/linux/tree/hack/serial
> --> the final steps, which are yet completely broken
> (more a notepad for things still to do :o)
> 
> The actual goal is generalizing the whole iomem handling, so individual
> drivers usually just need to call some helpers that do most of the work.
> Finally, I also wanted to have all io region information consolidated
> in struct resource.

That should be a selling point, not just conversion per se.

> > And the selling point for many of them is not true: it doesn't make any
> > difference in code size, but increases the run time
> > (devm_ioremap_resource() does more than a plain devm_ioremap() call).
> 
> Okay, just seen it. Does the runtime overhead cause any problems?

You have to explain in each commit message that the change can bring a
new error message being printed.

The performance side of the deal, you are lucky here, is not significant
because it's a slow path.
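
[Editor's note: the extra work looks roughly like this -- a simplified
paraphrase of lib/devres.c; the real function has more checks, so treat
this as a sketch:]

```c
void __iomem *devm_ioremap_resource(struct device *dev, struct resource *res)
{
	void __iomem *dest_ptr;

	/* validate the resource -- plain devm_ioremap() never does this */
	if (!res || resource_type(res) != IORESOURCE_MEM) {
		dev_err(dev, "invalid resource\n");
		return IOMEM_ERR_PTR(-EINVAL);
	}
	/* request the region, again something devm_ioremap() skips */
	if (!devm_request_mem_region(dev, res->start, resource_size(res),
				     dev_name(dev))) {
		dev_err(dev, "can't request region for resource %pR\n", res);
		return IOMEM_ERR_PTR(-EBUSY);
	}
	/* only then map it, printing yet another error on failure */
	dest_ptr = devm_ioremap(dev, res->start, resource_size(res));
	if (!dest_ptr) {
		dev_err(dev, "ioremap failed for resource %pR\n", res);
		devm_release_mem_region(dev, res->start, resource_size(res));
		dest_ptr = IOMEM_ERR_PTR(-ENOMEM);
	}
	return dest_ptr;
}
```

This is where both the possible new error messages and the extra devres
allocations come from.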

-- 
With Best Regards,
Andy Shevchenko




Re: serial driver cleanups v2

2019-03-15 Thread Andy Shevchenko
On Fri, Mar 15, 2019 at 8:44 PM Enrico Weigelt, metux IT consult
 wrote:
> On 15.03.19 19:11, Andy Shevchenko wrote:


> OTOH, the name, IMHO, is a bit misleading. Any chance of ever changing
> it to a clearer name (e.g. devm_request_and_ioremap_resource())?

Compare  iomap vs. ioREmap.

-- 
With Best Regards,
Andy Shevchenko


Re: serial driver cleanups v2

2019-03-15 Thread Enrico Weigelt, metux IT consult
On 15.03.19 19:11, Andy Shevchenko wrote:

>> The actual goal is generalizing the whole iomem handling, so individual
>> drivers usually just need to call some helpers that do most of the work.
>> Finally, I also wanted to have all io region information consolidated
>> in struct resource.
> 
> That should be a selling point, not just conversion per se.

hmm, I never was good at selling :o

but I'll try anyway: it shall make the code smaller and easier to
read/understand.

does that count?

> You have to explain that in each commit message, that the change does bring a
> possible new error message printed.

Ok, I wasn't aware of that. Yes, I failed to read the code of that
function carefully and didn't expect it to be more than just a dumb
wrapper around devm_ioremap() that picks out the params from struct
resource.

OTOH, the name, IMHO, is a bit misleading. Any chance of ever changing
it to a clearer name (e.g. devm_request_and_ioremap_resource())?

> The performance side of the deal, you are lucky here, is not significant
> because it's a slow path.

okay :)


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287


Re: [PATCH v2 10/45] drivers: tty: serial: zs: use devm_* functions

2019-03-15 Thread Enrico Weigelt, metux IT consult
On 15.03.19 15:26, Greg KH wrote:
> On Fri, Mar 15, 2019 at 10:06:16AM +0100, Enrico Weigelt, metux IT consult 
> wrote:
>> On 14.03.19 23:52, Greg KH wrote:
>>> On Thu, Mar 14, 2019 at 11:33:40PM +0100, Enrico Weigelt, metux IT consult 
>>> wrote:
 Use the safer devm versions of memory mapping functions.
>>>
>>> What is "safer" about them?
>>
>> Garbage collection :)
> 
> At times, but not when you do not use the api correctly :)

Okay, my fault that I didn't read the code carefully enough :o

But still, I think the name is a bit misleading, as it *sounds* like
just a wrapper around devm_ioremap() that picks the params from a
struct resource. I guess we can't change the name easily ?

> Yes, there are lots of drivers for devices that are never unloaded or
> removed from the system.  The fact that no one has reported any problems
> with them means that they are never used in situations like this.

So, never touch a running system ?

> No, you need to have a good reason why it needs to be changed, when you
> can not verify that this actually resolves a problem.  As this patch
> shows, you just replaced one api call with another, so nothing changed
> at all, except you actually took up more memory and logic to do the same
> thing :(

Okay, I was on a wrong track here - I had the silly idea that it would
make things easier if we'd do it the same way everywhere.

>> IMHO, we should have a closer look at those and check whether that's
>> really okay (just adding request/release blindly could cause trouble)
> 
> I agree, please feel free to audit these to verify they are all correct.
> But that's not what you did here, so that's not a valid justification
> for these patches to be accepted.

Understood. Assuming I've found some of these cases, shall I use devm
or just add the missing release ?

>> In this consolidation process, I'm trying to move everything to
>> devm_*, to have it more generic (for now, I still need two versions
>> of the request/release/ioremap/iounmap helpers - one w/ and one
>> w/o devm).
> 
> Move everything in what part of the kernel?  The whole kernel or just
> one subsystem?

Just talking about the serial drivers.

>> My idea was moving to devm first, so it can be reviewed/tested
>> independently, before moving forward. Smaller, easily digestable
>> pieces should minimize the risk of breaking anything. But if you
>> prefer having this things squashed together, just let me know.
> 
> Small pieces are required, that's fine, but those pieces need to have a
> justification for why they should be accepted at all points along the
> way.

Hmm, okay, in these cases I agree there's no real justification unless
we see them as an intermediate step towards the upcoming stuff.
Having thought a bit more about this, my underlying mistake was
betting everything on the devm horse when introducing the helpers,
and then splitting out the change to devm into even more patches
... Silly me, I should have caught some sleep instead :o

>> In the queue are also other minor cleanups like using dev_err()
>> instead of printk(), etc. Should I send these separately ?
> 
> Of course.

Ok. I'll collect those things in a separate branch and send out the
queue from time to time:

https://github.com/metux/linux/tree/wip/serial/charlady

>> By the way: do you have some public branch where you're collecting
>> accepted patches, which I could base mine on ? (tty.git/tty-next ?)
> 
> Yes, that is the tree and branch, but remember that none of my trees can
> open up until after -rc1 is out.

So, within a merge window, you put everything else on hold ?


--mtx

-- 
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287


Re: [PATCH] powerpc: bpf: Fix generation of load/store DW instructions

2019-03-15 Thread Daniel Borkmann
On 03/15/2019 03:51 PM, Naveen N. Rao wrote:
> Yauheni Kaliuta pointed out that PTR_TO_STACK store/load verifier test
> was failing on powerpc64 BE, and rightfully indicated that the PPC_LD()
> macro is not masking away the last two bits of the offset per the ISA,
> resulting in the generation of 'lwa' instruction instead of the intended
> 'ld' instruction.
> 
> Segher also pointed out that we can't simply mask away the last two bits
> as that will result in loading/storing from/to a memory location that
> was not intended.
> 
> This patch addresses this by using ldx/stdx if the offset is not
> word-aligned. We load the offset into a temporary register (TMP_REG_2)
> and use that as the index register in a subsequent ldx/stdx. We fix
> PPC_LD() macro to mask off the last two bits, but enhance PPC_BPF_LL()
> and PPC_BPF_STL() to factor in the offset value and generate the proper
> instruction sequence. We also convert all existing users of PPC_LD() and
> PPC_STD() to use these macros. All existing uses of these macros have
> been audited to ensure that TMP_REG_2 can be clobbered.
> 
> Fixes: 156d0e290e96 ("powerpc/ebpf/jit: Implement JIT compiler for extended 
> BPF")
> Cc: sta...@vger.kernel.org # v4.9+
> 
> Reported-by: Yauheni Kaliuta 
> Signed-off-by: Naveen N. Rao 

Applied, thanks!


Re: [PATCH v2 10/45] drivers: tty: serial: zs: use devm_* functions

2019-03-15 Thread Greg KH
On Fri, Mar 15, 2019 at 08:12:47PM +0100, Enrico Weigelt, metux IT consult 
wrote:
> On 15.03.19 15:26, Greg KH wrote:
> > Yes, there are lots of drivers for devices that are never unloaded or
> > removed from the system.  The fact that no one has reported any problems
> > with them means that they are never used in situations like this.
> 
> So, never touch a running system ?

No, it's just that those systems do not allow those devices to be
removed because they are probably not on a removable bus.

> > No, you need to have a good reason why it needs to be changed, when you
> > can not verify that this actually resolves a problem.  As this patch
> > shows, you just replaced one api call with another, so nothing changed
> > at all, except you actually took up more memory and logic to do the same
> > thing :(
> 
> Okay, I was on a wrong track here - I had the silly idea that it would
> make things easier if we'd do it the same way everywhere.

"Consistent" is good, and valid, but touching old drivers that have few
users is always risky, and you need a solid reason to do so.

> >> IMHO, we should have a closer look at those and check whether that's
> >> really okay (just adding request/release blindly could cause trouble)
> > 
> > I agree, please feel free to audit these to verify they are all correct.
> > But that's not what you did here, so that's not a valid justification
> > for these patches to be accepted.
> 
> Understood. Assuming I've found some of these cases, shall I use devm
> or just add the missing release ?

If it actually makes the code "simpler" or "more obvious", sure, that's
fine.  But churn for churns sake is not ok.

> >> By the way: do you have some public branch where you're collecting
> >> accepted patches, which I could base mine on ? (tty.git/tty-next ?)
> > 
> > Yes, that is the tree and branch, but remember that none of my trees can
> > open up until after -rc1 is out.
> 
> So, within a merge window, you put everything else on hold ?

I put the review of new patch submissions on hold, yes.  Almost all
maintainers do that as we can not add new patches to our trees at that
point in time.

And I do have other things I do during that period so it's not like I'm
just sitting around doing nothing :)

thanks,

greg k-h