Re: [PATCH 2/3] ARM: rockchip: ensure CPU to enter WIF state

2015-06-04 Thread Caesar Wang



在 2015年06月05日 14:32, Kever Yang 写道:

Hi Caesar,

Subject typo WIF/WFI.

OK



On 06/05/2015 12:47 PM, Caesar Wang wrote:

In idle mode, cores 1/2/3 of the Cortex-A17 should be either powered off or
in the WFI/WFE state.
We can delay 1 ms to ensure the CPU has entered the WFI state.

Signed-off-by: Caesar Wang 
---

  arch/arm/mach-rockchip/platsmp.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/arch/arm/mach-rockchip/platsmp.c 
b/arch/arm/mach-rockchip/platsmp.c

index 1230d3d..978c357 100644
--- a/arch/arm/mach-rockchip/platsmp.c
+++ b/arch/arm/mach-rockchip/platsmp.c
@@ -316,6 +316,9 @@ static void __init 
rockchip_smp_prepare_cpus(unsigned int max_cpus)

  #ifdef CONFIG_HOTPLUG_CPU
  static int rockchip_cpu_kill(unsigned int cpu)
  {
+/* ensure CPU can enter the WFI/WFE state */
+mdelay(1);
+

Does it matter if core is not in WFI state when we want to power down it?


As HuangTao suggested:

In general, the core needs to be in the WFI state when it is powered down, right?
It would be better if the hardware could judge the state itself.

Anyway, we can delay 1 ms or more to wait for the WFI state.
That should be good enough, right?


Thanks,
- Kever

  pmu_set_power_domain(0 + cpu, false);
  return 1;
  }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Getting rid of i7300_idle's idle notifier?

2015-06-04 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> On Thu, Jun 4, 2015 at 4:32 PM, Andy Lutomirski  wrote:
>
> > AFAICT the sole purpose for the hideous x86_64 idle_notifier mess is to 
> > support i7300_idle.  IMO this junk does not belong in IRQ handling, etc.  
> > Can 
> > we redo this to work in some kind of generic way?
> >
> > I have no idea why it makes sense to twiddle I/O AT registers in the 
> > beginning 
> > of whatever IRQ wakes up the CPU.
> >
> > Note that, if absolutely necessary, the ECX bit 0 MWAIT extension can be 
> > used 
> > to reliably execute code before handling interrupts that wake us from idle. 
> >  
> > That is, there could be a real cpuidle driver for that chip that does:
> >
> > cli;
> > poke ioat;
> > mwait(ecx = 1);
> > poke ioat;
> > sti;
> >
> > Or we could delete the driver entirely.
> 
> It's even easier than that.  Just shove the hooks into acpi_idle_do_entry or 
> similar and remove them from every other exit_idle call site in the kernel.

Yes!

Interested in doing a patch?

Thanks,

Ingo


Re: [PATCH v2 2/2] drivers: ata: add support for Ceva sata host controller

2015-06-04 Thread Michal Simek
On 06/05/2015 08:02 AM, Suneel Garapati wrote:
> Adds support for Ceva sata host controller on Xilinx
> Zynq UltraScale+ MPSoC.
> 
> Signed-off-by: Suneel Garapati 
> ---
> Changes v2
>  - Change module license string to GPL v2
> ---
>  drivers/ata/Kconfig |   9 ++
>  drivers/ata/Makefile|   1 +
>  drivers/ata/ahci_ceva.c | 225 
> 
>  3 files changed, 235 insertions(+)
>  create mode 100644 drivers/ata/ahci_ceva.c
> 
> diff --git a/drivers/ata/Kconfig b/drivers/ata/Kconfig
> index b4524f4..6d17a3b 100644
> --- a/drivers/ata/Kconfig
> +++ b/drivers/ata/Kconfig
> @@ -133,6 +133,15 @@ config AHCI_IMX
> 
> If unsure, say N.
> 
> +config AHCI_CEVA
> + tristate "CEVA AHCI SATA support"
> + depends on OF
> + help
> +   This option enables support for the CEVA AHCI SATA.
> +   It can be found on the Xilinx Zynq UltraScale+ MPSoC.
> +
> +   If unsure, say N.
> +
>  config AHCI_MVEBU
>   tristate "Marvell EBU AHCI SATA support"
>   depends on ARCH_MVEBU
> diff --git a/drivers/ata/Makefile b/drivers/ata/Makefile
> index 5154753..af70919 100644
> --- a/drivers/ata/Makefile
> +++ b/drivers/ata/Makefile
> @@ -11,6 +11,7 @@ obj-$(CONFIG_SATA_SIL24)+= sata_sil24.o
>  obj-$(CONFIG_SATA_DWC)   += sata_dwc_460ex.o
>  obj-$(CONFIG_SATA_HIGHBANK)  += sata_highbank.o libahci.o
>  obj-$(CONFIG_AHCI_BRCMSTB)   += ahci_brcmstb.o libahci.o libahci_platform.o
> +obj-$(CONFIG_AHCI_CEVA)  += ahci_ceva.o libahci.o 
> libahci_platform.o
>  obj-$(CONFIG_AHCI_DA850) += ahci_da850.o libahci.o libahci_platform.o
>  obj-$(CONFIG_AHCI_IMX)   += ahci_imx.o libahci.o 
> libahci_platform.o
>  obj-$(CONFIG_AHCI_MVEBU) += ahci_mvebu.o libahci.o libahci_platform.o
> diff --git a/drivers/ata/ahci_ceva.c b/drivers/ata/ahci_ceva.c
> new file mode 100644
> index 000..559d960
> --- /dev/null
> +++ b/drivers/ata/ahci_ceva.c
> @@ -0,0 +1,225 @@
> +/*
> + * Copyright (C) 2015 Xilinx, Inc.
> + * CEVA AHCI SATA platform driver
> + *
> + * based on the AHCI SATA platform driver by Jeff Garzik and Anton Vorontsov
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along 
> with
> + * this program. If not, see .
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "ahci.h"
> +
> +/* Vendor Specific Register Offsets */
> +#define AHCI_VEND_PCFG  0xA4
> +#define AHCI_VEND_PPCFG 0xA8
> +#define AHCI_VEND_PP2C  0xAC
> +#define AHCI_VEND_PP3C  0xB0
> +#define AHCI_VEND_PP4C  0xB4
> +#define AHCI_VEND_PP5C  0xB8
> +#define AHCI_VEND_PAXIC 0xC0
> +#define AHCI_VEND_PTC   0xC8
> +
> +/* Vendor Specific Register bit definitions */
> +#define PAXIC_ADBW_BW64 0x1
> +#define PAXIC_MAWIDD (1 << 8)
> +#define PAXIC_MARIDD (1 << 16)
> +#define PAXIC_OTL (0x4 << 20)
> +
> +#define PCFG_TPSS_VAL (0x32 << 16)
> +#define PCFG_TPRS_VAL (0x2 << 12)
> +#define PCFG_PAD_VAL 0x2
> +
> +#define PPCFG_TTA 0x1FFFE
> +#define PPCFG_PSSO_EN (1 << 28)
> +#define PPCFG_PSS_EN (1 << 29)
> +#define PPCFG_ESDF_EN (1 << 31)
> +
> +#define PP2C_CIBGMN  0x0F
> +#define PP2C_CIBGMX  (0x25 << 8)
> +#define PP2C_CIBGN   (0x18 << 16)
> +#define PP2C_CINMP   (0x29 << 24)
> +
> +#define PP3C_CWBGMN  0x04
> +#define PP3C_CWBGMX  (0x0B << 8)
> +#define PP3C_CWBGN   (0x08 << 16)
> +#define PP3C_CWNMP   (0x0F << 24)
> +
> +#define PP4C_BMX 0x0a
> +#define PP4C_BNM (0x08 << 8)
> +#define PP4C_SFD (0x4a << 16)
> +#define PP4C_PTST (0x06 << 24)
> +
> +#define PP5C_RIT 0x60216
> +#define PP5C_RCT (0x7f0 << 20)
> +
> +#define PTC_RX_WM_VAL 0x40
> +#define PTC_RSVD (1 << 27)
> +
> +#define PORT0_BASE   0x100
> +#define PORT1_BASE   0x180
> +
> +/* Port Control Register Bit Definitions */
> +#define PORT_SCTL_SPD_GEN2   (0x2 << 4)
> +#define PORT_SCTL_SPD_GEN1   (0x1 << 4)
> +#define PORT_SCTL_IPM (0x3 << 8)
> +
> +#define PORT_BASE 0x100
> +#define PORT_OFFSET  0x80
> +#define NR_PORTS 2
> +#define DRV_NAME "ahci-ceva"
> +#define CEVA_FLAG_BROKEN_GEN2 1
> +
> +struct ceva_ahci_priv {
> + struct platform_device *ahci_pdev;
> + int flags;
> +};
> +
> +static struct ata_port_operations ahci_ceva_ops = {
> + .inherits = &ahci_platform_ops,
> +};
> +
> +static const struct ata_port_info ahci_ceva_port_info = {
> + .flags  = AHCI_FLAG_COMMON,
> + .pio_mask   = ATA_PIO4,
> +

[PATCH kernel v12 10/34] vfio: powerpc/spapr: Disable DMA mappings on disabled container

2015-06-04 Thread Alexey Kardashevskiy
At the moment DMA map/unmap requests are handled irrespective of
the container's state. This allows the user space to pin memory which
it might not be allowed to pin.

This adds checks to MAP/UNMAP that the container is enabled; otherwise
-EPERM is returned.

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 6e2e15f..5bbdf37 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -318,6 +318,9 @@ static long tce_iommu_ioctl(void *iommu_data,
struct iommu_table *tbl = container->tbl;
unsigned long tce;
 
+   if (!container->enabled)
+   return -EPERM;
+
if (!tbl)
return -ENXIO;
 
@@ -362,6 +365,9 @@ static long tce_iommu_ioctl(void *iommu_data,
struct vfio_iommu_type1_dma_unmap param;
struct iommu_table *tbl = container->tbl;
 
+   if (!container->enabled)
+   return -EPERM;
+
if (WARN_ON(!tbl))
return -ENXIO;
 
-- 
2.4.0.rc3.8.gfb3e7d5



Re: [PATCH v2 1/2] devicetree:bindings: add devicetree bindings for ceva ahci

2015-06-04 Thread Michal Simek
On 06/05/2015 08:02 AM, Suneel Garapati wrote:
> Adds bindings for the CEVA AHCI SATA controller. The optional property
> broken-gen2 is useful in case of a hardware speed limitation.
> 
> Signed-off-by: Suneel Garapati 
> ---
>  Documentation/devicetree/bindings/ata/ahci-ceva.txt | 20 
>  1 file changed, 20 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/ata/ahci-ceva.txt
> 
> diff --git a/Documentation/devicetree/bindings/ata/ahci-ceva.txt 
> b/Documentation/devicetree/bindings/ata/ahci-ceva.txt
> new file mode 100644
> index 000..7ca8b97
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/ata/ahci-ceva.txt
> @@ -0,0 +1,20 @@
> +Binding for CEVA AHCI SATA Controller
> +
> +Required properties:
> +  - reg: Physical base address and size of the controller's register area.
> +  - compatible: Compatibility string. Must be 'ceva,ahci-1v84'.
> +  - clocks: Input clock specifier. Refer to common clock bindings.
> +  - interrupts: Interrupt specifier. Refer to interrupt binding.
> +
> +Optional properties:
> +  - ceva,broken-gen2: limit to gen1 speed instead of gen2.
> +
> +Examples:
> + ahci@fd0c {
> + compatible = "ceva,ahci-1v84";
> + reg = <0xfd0c 0x200>;
> + interrupt-parent = <&gic>;
> + interrupts = <0 133 4>;
> + clocks = <&clkc SATA_CLK_ID>;
> + ceva,broken-gen2;
> + };
> --
> 2.1.2

Acked-by: Michal Simek 

FYI: Adding the ceva prefix to vendor-prefixes is already in the arm-soc tree,
and ceva,broken-gen2 targets a hardware limitation.

Thanks,
Michal


-- 
Michal Simek, Ing. (M.Eng), OpenPGP -> KeyID: FE3D1F91
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel - Microblaze cpu - http://www.monstr.eu/fdt/
Maintainer of Linux kernel - Xilinx Zynq ARM architecture
Microblaze U-BOOT custodian and responsible for u-boot arm zynq platform






[PATCH kernel v12 20/34] powerpc/powernv/ioda2: Move TCE kill register address to PE

2015-06-04 Thread Alexey Kardashevskiy
At the moment the DMA setup code looks for the "ibm,opal-tce-kill"
property which contains the TCE kill register address. Writing to
this register invalidates the TCE cache on the IODA/IODA2 hub.

This moves the register address from iommu_table to pnv_phb as this
register belongs to the PHB and invalidates the TCE cache for all tables
of all attached PEs.

This moves the property reading/remapping code to a helper which is
called when DMA is being configured for a PE and which does DMA setup
for both IODA1 and IODA2.

This adds a new pnv_pci_ioda2_tce_invalidate_entire() helper which
invalidates cache for the entire table. It should be called after
every call to opal_pci_map_pe_dma_window(). It was not required before
because there was just a single TCE table and 64bit DMA was handled via
the bypass window (which has no table, so no cache was used), but this is
going to change with Dynamic DMA windows (DDW).

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v11:
* s/pnv_pci_ioda2_tvt_invalidate/pnv_pci_ioda2_tce_invalidate_entire/g
(cannot think of better-and-shorter name)
* moved tce_inval_reg_phys/tce_inval_reg to pnv_phb

v10:
* fixed error from checkpatch.pl
* removed comment at "ibm,opal-tce-kill" parsing as irrelevant
* s/addr/val/ in pnv_pci_ioda2_tvt_invalidate() as it was not a kernel address

v9:
* new in the series
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 66 ++-
 arch/powerpc/platforms/powernv/pci.h  |  7 +++-
 2 files changed, 44 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 1d0bb5b..3fd8b18 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1679,8 +1679,8 @@ static void pnv_pci_ioda1_tce_invalidate(struct 
iommu_table *tbl,
struct pnv_ioda_pe *pe = container_of(tgl->table_group,
struct pnv_ioda_pe, table_group);
__be64 __iomem *invalidate = rm ?
-   (__be64 __iomem *)pe->tce_inval_reg_phys :
-   (__be64 __iomem *)tbl->it_index;
+   (__be64 __iomem *)pe->phb->ioda.tce_inval_reg_phys :
+   pe->phb->ioda.tce_inval_reg;
unsigned long start, end, inc;
const unsigned shift = tbl->it_page_shift;
 
@@ -1751,6 +1751,19 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = {
.get = pnv_tce_get,
 };
 
+static inline void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_ioda_pe *pe)
+{
+   /* 01xb - invalidate TCEs that match the specified PE# */
+   unsigned long val = (0x4ull << 60) | (pe->pe_number & 0xFF);
+   struct pnv_phb *phb = pe->phb;
+
+   if (!phb->ioda.tce_inval_reg)
+   return;
+
+   mb(); /* Ensure above stores are visible */
+   __raw_writeq(cpu_to_be64(val), phb->ioda.tce_inval_reg);
+}
+
 static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
unsigned long index, unsigned long npages, bool rm)
 {
@@ -1761,8 +1774,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
iommu_table *tbl,
struct pnv_ioda_pe, table_group);
unsigned long start, end, inc;
__be64 __iomem *invalidate = rm ?
-   (__be64 __iomem *)pe->tce_inval_reg_phys :
-   (__be64 __iomem *)tbl->it_index;
+   (__be64 __iomem *)pe->phb->ioda.tce_inval_reg_phys :
+   pe->phb->ioda.tce_inval_reg;
const unsigned shift = tbl->it_page_shift;
 
/* We'll invalidate DMA address in PE scope */
@@ -1820,7 +1833,6 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 {
 
struct page *tce_mem = NULL;
-   const __be64 *swinvp;
struct iommu_table *tbl;
unsigned int i;
int64_t rc;
@@ -1877,20 +1889,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
*phb,
  base << 28, IOMMU_PAGE_SHIFT_4K);
 
/* OPAL variant of P7IOC SW invalidated TCEs */
-   swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
-   if (swinvp) {
-   /* We need a couple more fields -- an address and a data
-* to or.  Since the bus is only printed out on table free
-* errors, and on the first pass the data will be a relative
-* bus number, print that out instead.
-*/
-   pe->tce_inval_reg_phys = be64_to_cpup(swinvp);
-   tbl->it_index = (unsigned long)ioremap(pe->tce_inval_reg_phys,
-   8);
+   if (phb->ioda.tce_inval_reg)
tbl->it_type |= (TCE_PCI_SWINV_CREATE |
 TCE_PCI_SWINV_FREE   |
 TCE_PCI_SWINV_PAIR);
-   }
+
tbl->it_ops = &pnv_ioda1_iommu_ops;
iommu_init_table(tbl, phb->hose->node);
 
@@ -1971,12 +1974,24 @@ static struct iommu_table_group_ops p

[PATCH kernel v12 23/34] powerpc/iommu/powernv: Release replaced TCE

2015-06-04 Thread Alexey Kardashevskiy
At the moment writing a new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However, the PAPR specification allows
the guest to write a new TCE value without clearing the old one first.

Another problem this patch addresses is the use of pool locks for
external IOMMU users such as VFIO. The pool locks are there to protect the
DMA page allocator rather than the entries, and since the host kernel does
not control what pages are in use, there is no point in pool locks;
exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same
thing as set() plus returns the replaced TCE and DMA direction so that
the caller can release the pages afterwards. exchange() receives
a physical address, unlike set() which receives a linear mapping address,
and it returns a physical address, as clear() does.

This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
for a platform to have exchange() implemented in order to support VFIO.

This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().

This makes sure that TCE permission bits are not set in TCE passed to
IOMMU API as those are to be calculated by platform code from
DMA direction.

This moves SetPageDirty() to the IOMMU code to make it work for both
the VFIO ioctl interface and in-kernel TCE acceleration (when the latter
becomes available later).

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
---
Changes:
v10:
* did s/tce/hpa/ in iommu_table_ops::exchange and tce_iommu_unuse_page()
* removed permission bits check from iommu_tce_put_param_check as
permission bits are not allowed in the address
* added BUG_ON(*hpa & ~IOMMU_PAGE_MASK(tbl)) to pnv_tce_xchg()

v9:
* changed exchange() to work with physical addresses as these addresses
are never accessed by the code and physical addresses are actual values
we put into the IOMMU table
---
 arch/powerpc/include/asm/iommu.h| 22 --
 arch/powerpc/kernel/iommu.c | 59 +--
 arch/powerpc/platforms/powernv/pci-ioda.c   | 34 
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  3 ++
 arch/powerpc/platforms/powernv/pci.c| 18 +
 arch/powerpc/platforms/powernv/pci.h|  2 +
 drivers/vfio/vfio_iommu_spapr_tce.c | 63 +
 7 files changed, 132 insertions(+), 69 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 489133c..4636734 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -45,13 +45,29 @@ extern int iommu_is_off;
 extern int iommu_force_on;
 
 struct iommu_table_ops {
+   /*
+* When called with direction==DMA_NONE, it is equal to clear().
+* uaddr is a linear map address.
+*/
int (*set)(struct iommu_table *tbl,
long index, long npages,
unsigned long uaddr,
enum dma_data_direction direction,
struct dma_attrs *attrs);
+#ifdef CONFIG_IOMMU_API
+   /*
+* Exchanges existing TCE with new TCE plus direction bits;
+* returns old TCE and DMA direction mask.
+* @tce is a physical address.
+*/
+   int (*exchange)(struct iommu_table *tbl,
+   long index,
+   unsigned long *hpa,
+   enum dma_data_direction *direction);
+#endif
void (*clear)(struct iommu_table *tbl,
long index, long npages);
+   /* get() returns a physical address */
unsigned long (*get)(struct iommu_table *tbl, long index);
void (*flush)(struct iommu_table *tbl);
 };
@@ -153,6 +169,8 @@ extern void iommu_register_group(struct iommu_table_group 
*table_group,
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
+extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
+   unsigned long *hpa, enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
@@ -225,10 +243,6 @@ extern int iommu_tce_clear_param_check(struct iommu_table 
*tbl,
unsigned long npages);
 extern int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce);
-extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
-   unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
-   unsigned long entry);
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
diff --git

[PATCH kernel v12 19/34] powerpc/iommu: Fix IOMMU ownership control functions

2015-06-04 Thread Alexey Kardashevskiy
This adds missing locks in iommu_take_ownership()/
iommu_release_ownership().

This marks all pages busy in iommu_table::it_map in order to catch
errors if there is an attempt to use this table while ownership over it
is taken.

This only clears the TCE content if there is no page marked busy in it_map.
Clearing must be done outside of the table locks as iommu_clear_tce(),
called from iommu_clear_tces_and_put_pages(), does this.

In order to use bitmap_empty(), the existing code clears bit#0, which
is set even in an empty table if it is bus-mapped at 0: iommu_init_table()
reserves page#0 to prevent buggy drivers from crashing when an allocated
page is bus-mapped at zero (which is correct). This restores the bit in
the case of failure, to bring it_map back to the state it was in when
iommu_take_ownership() was called.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v9:
* iommu_table_take_ownership() did not return @ret (and ignored EBUSY),
now it does return correct error.
* updated commit log about setting bit#0 in the case of failure

v5:
* do not store bit#0 value, it has to be set for zero-based table
anyway
* removed test_and_clear_bit
---
 arch/powerpc/kernel/iommu.c | 30 +-
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index e7f81b7..0fb8800 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1035,31 +1035,51 @@ EXPORT_SYMBOL_GPL(iommu_tce_build);
 
 int iommu_take_ownership(struct iommu_table *tbl)
 {
-   unsigned long sz = (tbl->it_size + 7) >> 3;
+   unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+   int ret = 0;
+
+   spin_lock_irqsave(&tbl->large_pool.lock, flags);
+   for (i = 0; i < tbl->nr_pools; i++)
+   spin_lock(&tbl->pools[i].lock);
 
if (tbl->it_offset == 0)
clear_bit(0, tbl->it_map);
 
if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
pr_err("iommu_tce: it_map is not empty");
-   return -EBUSY;
+   ret = -EBUSY;
+   /* Restore bit#0 set by iommu_init_table() */
+   if (tbl->it_offset == 0)
+   set_bit(0, tbl->it_map);
+   } else {
+   memset(tbl->it_map, 0xff, sz);
}
 
-   memset(tbl->it_map, 0xff, sz);
+   for (i = 0; i < tbl->nr_pools; i++)
+   spin_unlock(&tbl->pools[i].lock);
+   spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 
-   return 0;
+   return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
 
 void iommu_release_ownership(struct iommu_table *tbl)
 {
-   unsigned long sz = (tbl->it_size + 7) >> 3;
+   unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+
+   spin_lock_irqsave(&tbl->large_pool.lock, flags);
+   for (i = 0; i < tbl->nr_pools; i++)
+   spin_lock(&tbl->pools[i].lock);
 
memset(tbl->it_map, 0, sz);
 
/* Restore bit#0 set by iommu_init_table() */
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
+
+   for (i = 0; i < tbl->nr_pools; i++)
+   spin_unlock(&tbl->pools[i].lock);
+   spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel v12 27/34] powerpc/powernv: Implement multilevel TCE tables

2015-06-04 Thread Alexey Kardashevskiy
TCE tables might get too big in the case of 4K IOMMU pages and DDW enabled
on huge guests (hundreds of GB of RAM), so the kernel might be unable to
allocate a contiguous chunk of physical memory to store the TCE table.

To address this, the POWER8 CPU (actually, IODA2) supports multi-level
TCE tables of up to 5 levels, which split the table into a tree of
smaller subtables.

This adds multi-level TCE tables support to
pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages()
helpers.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v12:
* changed pnv_pci_ioda2_table_do_alloc_pages() to return NULL to
pnv_pci_ioda2_table_alloc_pages() only if the first level allocation
failed, otherwise it always returns non zero value
* pnv_pci_ioda2_table_do_free_pages() now takes __be64* rather than
unsigned long
* s/tce_table_allocated/current_offset/

v10:
* fixed multiple comments received for v9

v9:
* moved from ioda2 to common powernv pci code
* fixed cleanup if allocation fails in a middle
* removed check for the size - all boundary checks happen in the calling code
anyway
---
 arch/powerpc/include/asm/iommu.h  |   2 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 105 +++---
 arch/powerpc/platforms/powernv/pci.c  |  13 
 3 files changed, 111 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 4636734..706cfc0 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -96,6 +96,8 @@ struct iommu_pool {
 struct iommu_table {
unsigned long  it_busno; /* Bus number this table belongs to */
unsigned long  it_size;  /* Size of iommu table in entries */
+   unsigned long  it_indirect_levels;
+   unsigned long  it_level_size;
unsigned long  it_offset;/* Offset into global table */
unsigned long  it_base;  /* mapped address of tce table */
unsigned long  it_index; /* which iommu table this is */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index da14043..a253dda 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -50,6 +50,9 @@
 /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
> +#define TCE32_TABLE_SIZE   ((0x10000000 / 0x1000) * 8)
 
+#define POWERNV_IOMMU_DEFAULT_LEVELS   1
+#define POWERNV_IOMMU_MAX_LEVELS   5
+
 static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
 
 static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
@@ -1976,6 +1979,8 @@ static long pnv_pci_ioda2_set_window(struct 
iommu_table_group *table_group,
table_group);
struct pnv_phb *phb = pe->phb;
int64_t rc;
+   const unsigned long size = tbl->it_indirect_levels ?
+   tbl->it_level_size : tbl->it_size;
const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
const __u64 win_size = tbl->it_size << tbl->it_page_shift;
 
@@ -1990,9 +1995,9 @@ static long pnv_pci_ioda2_set_window(struct 
iommu_table_group *table_group,
rc = opal_pci_map_pe_dma_window(phb->opal_id,
pe->pe_number,
pe->pe_number << 1,
-   1,
+   tbl->it_indirect_levels + 1,
__pa(tbl->it_base),
-   tbl->it_size << 3,
+   size << 3,
IOMMU_PAGE_SIZE(tbl));
if (rc) {
pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
@@ -2072,11 +2077,16 @@ static void pnv_pci_ioda_setup_opal_tce_kill(struct 
pnv_phb *phb)
phb->ioda.tce_inval_reg = ioremap(phb->ioda.tce_inval_reg_phys, 8);
 }
 
-static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift)
+static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift,
+   unsigned levels, unsigned long limit,
+   unsigned long *current_offset)
 {
struct page *tce_mem = NULL;
-   __be64 *addr;
+   __be64 *addr, *tmp;
unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
+   unsigned long allocated = 1UL << (order + PAGE_SHIFT);
+   unsigned entries = 1UL << (shift - 3);
+   long i;
 
tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
if (!tce_mem) {
@@ -2084,31 +2094,79 @@ static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int 
nid, unsigned shift)
return NULL;
}
addr = page_address(tce_mem);
-   memset(addr, 0, 1UL << (order + PAGE_SHIFT));
+   memset(addr, 0, allocated);
+
+   --levels;
+   if (!levels) {
+   *current_offset += allocated;
+   return addr;
+   }
+
+   for (i = 0; i < entries; ++i) {
+   tmp = pnv_pci_ioda2_table_do_alloc_pages(nid, shift,
+   levels, limit, cur

[PATCH kernel v12 13/34] powerpc/powernv: Do not set "read" flag if direction==DMA_NONE

2015-06-04 Thread Alexey Kardashevskiy
Normally a bitmap from the iommu_table is used to track which TCE entries
are in use. Since we are going to use iommu_table without its locks and
do xchg() instead, it becomes essential not to set bits which are not
implied by the direction flag, as the old TCE value (more precisely,
the permission bits) will be used to decide whether to put the page or not.

This adds iommu_direction_to_tce_perm() (its counterpart is there already)
and uses it for powernv's pnv_tce_build().

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v9:
* added comment why we must put only valid permission bits
---
 arch/powerpc/include/asm/iommu.h |  1 +
 arch/powerpc/kernel/iommu.c  | 15 +++
 arch/powerpc/platforms/powernv/pci.c |  7 +--
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index e94a5e3..d91bd69 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -200,6 +200,7 @@ extern int iommu_take_ownership(struct iommu_table *tbl);
 extern void iommu_release_ownership(struct iommu_table *tbl);
 
 extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
+extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir);
 
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 0019c80..ac2f959 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -866,6 +866,21 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t 
size,
}
 }
 
+unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir)
+{
+   switch (dir) {
+   case DMA_BIDIRECTIONAL:
+   return TCE_PCI_READ | TCE_PCI_WRITE;
+   case DMA_FROM_DEVICE:
+   return TCE_PCI_WRITE;
+   case DMA_TO_DEVICE:
+   return TCE_PCI_READ;
+   default:
+   return 0;
+   }
+}
+EXPORT_SYMBOL_GPL(iommu_direction_to_tce_perm);
+
 #ifdef CONFIG_IOMMU_API
 /*
  * SPAPR TCE API
diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index bca2aeb..b7ea245 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -576,15 +576,10 @@ static int pnv_tce_build(struct iommu_table *tbl, long 
index, long npages,
 unsigned long uaddr, enum dma_data_direction direction,
 struct dma_attrs *attrs, bool rm)
 {
-   u64 proto_tce;
+   u64 proto_tce = iommu_direction_to_tce_perm(direction);
__be64 *tcep, *tces;
u64 rpn;
 
-   proto_tce = TCE_PCI_READ; // Read allowed
-
-   if (direction != DMA_TO_DEVICE)
-   proto_tce |= TCE_PCI_WRITE;
-
tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
rpn = __pa(uaddr) >> tbl->it_page_shift;
 
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel v12 15/34] powerpc/powernv/ioda/ioda2: Rework TCE invalidation in tce_build()/tce_free()

2015-06-04 Thread Alexey Kardashevskiy
The pnv_pci_ioda_tce_invalidate() helper invalidates the TCE cache. It is
supposed to be called on IODA1/2 and not on p5ioc2. It receives the
start and end host addresses of the TCE table.

IODA2 actually needs PCI addresses to invalidate the cache. Those
can be calculated from host addresses, but since we are going
to implement multi-level TCE tables, calculating a PCI address from
a host address might get either tricky or ugly as the TCE table remains
flat on the PCI bus but not in RAM.

This moves pnv_pci_ioda_tce_invalidate() out of the generic pnv_tce_build()/
pnv_tce_free() and defines IODA1/2-specific callbacks which call the generic
ones and do PHB-model-specific TCE cache invalidation. P5IOC2 keeps
using the generic callbacks as before.

This changes pnv_pci_ioda2_tce_invalidate() to receives TCE index and
number of pages which are PCI addresses shifted by IOMMU page shift.

No change in behaviour is expected.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v11:
* changed type of some "ret" to int as everywhere else

v10:
* moved before "Switch from iommu_table to new iommu_table_group" as it adds
list of groups to iommu_table and tce invalidation depends on it

v9:
* removed confusing comment from commit log about unintentional calling of
pnv_pci_ioda_tce_invalidate()
* moved mechanical changes away to "powerpc/iommu: Move tce_xxx callbacks from 
ppc_md to iommu_table"
* fixed bug with broken invalidation in pnv_pci_ioda2_tce_invalidate -
@index includes @tbl->it_offset but old code added it anyway which later broke
DDW
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 81 ++-
 arch/powerpc/platforms/powernv/pci.c  | 17 ++-
 2 files changed, 61 insertions(+), 37 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2924abe..3d32c37 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1678,18 +1678,19 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe 
*pe,
}
 }
 
-static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
-struct iommu_table *tbl,
-__be64 *startp, __be64 *endp, bool rm)
+static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
+   unsigned long index, unsigned long npages, bool rm)
 {
+   struct pnv_ioda_pe *pe = tbl->data;
__be64 __iomem *invalidate = rm ?
(__be64 __iomem *)pe->tce_inval_reg_phys :
(__be64 __iomem *)tbl->it_index;
unsigned long start, end, inc;
const unsigned shift = tbl->it_page_shift;
 
-   start = __pa(startp);
-   end = __pa(endp);
+   start = __pa(((__be64 *)tbl->it_base) + index - tbl->it_offset);
+   end = __pa(((__be64 *)tbl->it_base) + index - tbl->it_offset +
+   npages - 1);
 
/* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */
if (tbl->it_busno) {
@@ -1725,16 +1726,39 @@ static void pnv_pci_ioda1_tce_invalidate(struct 
pnv_ioda_pe *pe,
 */
 }
 
+static int pnv_ioda1_tce_build(struct iommu_table *tbl, long index,
+   long npages, unsigned long uaddr,
+   enum dma_data_direction direction,
+   struct dma_attrs *attrs)
+{
+   int ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
+   attrs);
+
+   if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE))
+   pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
+
+   return ret;
+}
+
+static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index,
+   long npages)
+{
+   pnv_tce_free(tbl, index, npages);
+
+   if (tbl->it_type & TCE_PCI_SWINV_FREE)
+   pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false);
+}
+
 static struct iommu_table_ops pnv_ioda1_iommu_ops = {
-   .set = pnv_tce_build,
-   .clear = pnv_tce_free,
+   .set = pnv_ioda1_tce_build,
+   .clear = pnv_ioda1_tce_free,
.get = pnv_tce_get,
 };
 
-static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
-struct iommu_table *tbl,
-__be64 *startp, __be64 *endp, bool rm)
+static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
+   unsigned long index, unsigned long npages, bool rm)
 {
+   struct pnv_ioda_pe *pe = tbl->data;
unsigned long start, end, inc;
__be64 __iomem *invalidate = rm ?
(__be64 __iomem *)pe->tce_inval_reg_phys :
@@ -1747,10 +1771,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
pnv_ioda_pe *pe,
end = start;
 
/* Figure out the start, end and step */
-   inc = tbl->it_offset + (((u64)startp - tbl->it_base) / sizeof(u64));
-   start |= (inc << shift);
-   inc = tbl->it_offset + (((u64)e

[PATCH kernel v12 09/34] vfio: powerpc/spapr: Move locked_vm accounting to helpers

2015-06-04 Thread Alexey Kardashevskiy
This moves locked-pages accounting to helpers.
Later they will be reused for Dynamic DMA windows (DDW).

This reworks debug messages to show the current value and the limit.

This stores the number of locked pages in the container so the iommu
table pointer won't be needed when unlocking. This does not have an effect
now but it will with multiple tables per container, as then we will
allow attaching/detaching groups on the fly and we may end up having
a container with no group attached but with the counter incremented.

While we are here, update the comment explaining why RLIMIT_MEMLOCK
might be required to be bigger than the guest RAM. This also prints
the pid of the current process in pr_warn/pr_debug.

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v12:
* added WARN_ON_ONCE() to decrement_locked_vm() for the sake of documentation

v4:
* new helpers do nothing if @npages == 0
* tce_iommu_disable() now can decrement the counter if the group was
detached (not possible now but will be in the future)
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 82 -
 1 file changed, 63 insertions(+), 19 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 64300cc..6e2e15f 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -29,6 +29,51 @@
 static void tce_iommu_detach_group(void *iommu_data,
struct iommu_group *iommu_group);
 
+static long try_increment_locked_vm(long npages)
+{
+   long ret = 0, locked, lock_limit;
+
+   if (!current || !current->mm)
+   return -ESRCH; /* process exited */
+
+   if (!npages)
+   return 0;
+
+   down_write(¤t->mm->mmap_sem);
+   locked = current->mm->locked_vm + npages;
+   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+   ret = -ENOMEM;
+   else
+   current->mm->locked_vm += npages;
+
+   pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
+   npages << PAGE_SHIFT,
+   current->mm->locked_vm << PAGE_SHIFT,
+   rlimit(RLIMIT_MEMLOCK),
+   ret ? " - exceeded" : "");
+
+   up_write(¤t->mm->mmap_sem);
+
+   return ret;
+}
+
+static void decrement_locked_vm(long npages)
+{
+   if (!current || !current->mm || !npages)
+   return; /* process exited */
+
+   down_write(¤t->mm->mmap_sem);
+   if (WARN_ON_ONCE(npages > current->mm->locked_vm))
+   npages = current->mm->locked_vm;
+   current->mm->locked_vm -= npages;
+   pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
+   npages << PAGE_SHIFT,
+   current->mm->locked_vm << PAGE_SHIFT,
+   rlimit(RLIMIT_MEMLOCK));
+   up_write(¤t->mm->mmap_sem);
+}
+
 /*
  * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
  *
@@ -45,6 +90,7 @@ struct tce_container {
struct mutex lock;
struct iommu_table *tbl;
bool enabled;
+   unsigned long locked_pages;
 };
 
 static bool tce_page_is_contained(struct page *page, unsigned page_shift)
@@ -60,7 +106,7 @@ static bool tce_page_is_contained(struct page *page, 
unsigned page_shift)
 static int tce_iommu_enable(struct tce_container *container)
 {
int ret = 0;
-   unsigned long locked, lock_limit, npages;
+   unsigned long locked;
struct iommu_table *tbl = container->tbl;
 
if (!container->tbl)
@@ -89,21 +135,22 @@ static int tce_iommu_enable(struct tce_container 
*container)
 * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
 * that would effectively kill the guest at random points, much better
 * enforcing the limit based on the max that the guest can map.
+*
+* Unfortunately at the moment it counts whole tables, no matter how
+* much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups
+* each with 2GB DMA window, 8GB will be counted here. The reason for
+* this is that we cannot tell here the amount of RAM used by the guest
+* as this information is only available from KVM and VFIO is
+* KVM agnostic.
 */
-   down_write(¤t->mm->mmap_sem);
-   npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
-   locked = current->mm->locked_vm + npages;
-   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-   if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
-   pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
-   rlimit(RLIMIT_MEMLOCK));
-   ret = -ENOMEM;
-   } else {
+   locked = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
+   ret = try_increment_locked_v

[PATCH kernel v12 06/34] vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver

2015-06-04 Thread Alexey Kardashevskiy
This moves the page pinning (get_user_pages_fast()/put_page()) code out of
the platform IOMMU code and puts it into the VFIO IOMMU driver, where it
belongs, as the platform code does not deal with page pinning.

This makes iommu_take_ownership()/iommu_release_ownership() deal with
the IOMMU table bitmap only.

This removes page unpinning from iommu_take_ownership() as the actual
TCE table might contain garbage and doing put_page() on it is undefined
behaviour.

Besides the last part, the rest of the patch is mechanical.

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v9:
* added missing tce_iommu_clear call after iommu_release_ownership()
* brought @offset (a local variable) back to make patch even more
mechanical

v4:
* s/iommu_tce_build(tbl, entry + 1/iommu_tce_build(tbl, entry + i/
---
 arch/powerpc/include/asm/iommu.h|  4 --
 arch/powerpc/kernel/iommu.c | 55 -
 drivers/vfio/vfio_iommu_spapr_tce.c | 80 +++--
 3 files changed, 67 insertions(+), 72 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 8353c86..e94a5e3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -194,10 +194,6 @@ extern int iommu_tce_build(struct iommu_table *tbl, 
unsigned long entry,
unsigned long hwaddr, enum dma_data_direction direction);
 extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
unsigned long entry);
-extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
-   unsigned long entry, unsigned long pages);
-extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
-   unsigned long entry, unsigned long tce);
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 73eb39a..0019c80 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -986,30 +986,6 @@ unsigned long iommu_clear_tce(struct iommu_table *tbl, 
unsigned long entry)
 }
 EXPORT_SYMBOL_GPL(iommu_clear_tce);
 
-int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
-   unsigned long entry, unsigned long pages)
-{
-   unsigned long oldtce;
-   struct page *page;
-
-   for ( ; pages; --pages, ++entry) {
-   oldtce = iommu_clear_tce(tbl, entry);
-   if (!oldtce)
-   continue;
-
-   page = pfn_to_page(oldtce >> PAGE_SHIFT);
-   WARN_ON(!page);
-   if (page) {
-   if (oldtce & TCE_PCI_WRITE)
-   SetPageDirty(page);
-   put_page(page);
-   }
-   }
-
-   return 0;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
-
 /*
  * hwaddr is a kernel virtual address here (0xc... bazillion),
  * tce_build converts it to a physical address.
@@ -1039,35 +1015,6 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned 
long entry,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_build);
 
-int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
-   unsigned long tce)
-{
-   int ret;
-   struct page *page = NULL;
-   unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
-   enum dma_data_direction direction = iommu_tce_direction(tce);
-
-   ret = get_user_pages_fast(tce & PAGE_MASK, 1,
-   direction != DMA_TO_DEVICE, &page);
-   if (unlikely(ret != 1)) {
-   /* pr_err("iommu_tce: get_user_pages_fast failed tce=%lx 
ioba=%lx ret=%d\n",
-   tce, entry << tbl->it_page_shift, ret); */
-   return -EFAULT;
-   }
-   hwaddr = (unsigned long) page_address(page) + offset;
-
-   ret = iommu_tce_build(tbl, entry, hwaddr, direction);
-   if (ret)
-   put_page(page);
-
-   if (ret < 0)
-   pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
-   __func__, entry << tbl->it_page_shift, tce, ret);
-
-   return ret;
-}
-EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
-
 int iommu_take_ownership(struct iommu_table *tbl)
 {
unsigned long sz = (tbl->it_size + 7) >> 3;
@@ -1081,7 +1028,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
}
 
memset(tbl->it_map, 0xff, sz);
-   iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 
/*
 * Disable iommu bypass, otherwise the user can DMA to all of
@@ -1099,7 +1045,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
 {
unsigned long sz = (tbl->it_size + 7) >> 3;
 
-   iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
memset(tbl->it_map, 0, sz);
 
/* Restore bit#0 set b

[PATCH kernel v12 14/34] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table

2015-06-04 Thread Alexey Kardashevskiy
This adds an iommu_table_ops struct and puts a pointer to it into
the iommu_table struct. This moves the tce_build/tce_free/tce_get/tce_flush
callbacks from ppc_md to the new struct, where they really belong.

This adds the requirement for @it_ops to be initialized before calling
iommu_init_table() to make sure that we do not leave any IOMMU table
with iommu_table_ops uninitialized. This is not a parameter of
iommu_init_table() though as there will be cases when iommu_init_table()
will not be called on TCE tables, for example - VFIO.

This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_"
redundant prefixes.

This removes the tce_xxx_rm handlers from ppc_md but does not add
them to iommu_table_ops as this will be done later if we decide to
support TCE hypercalls in real mode. This removes the _vm callbacks as
only virtual mode is supported for now, so this also removes the @rm
parameter.

For pSeries, this always uses tce_buildmulti_pSeriesLP/
tce_freemulti_pSeriesLP. This changes the multi callbacks to fall back to
tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
present. The reason for this is that we still have to support the
"multitce=off" boot parameter in disable_multitce() and we do not want to
walk through all IOMMU tables in the system and replace "multi" callbacks
with single ones.

For powernv, this defines _ops per PHB type which are P5IOC2/IODA1/IODA2.
This makes the callbacks for them public. Later patches will extend
callbacks for IODA1/2.

No change in behaviour is expected.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v9:
* pnv_tce_build/pnv_tce_free/pnv_tce_get have been made public and lost
"rm" parameters to make following patches simpler (realmode is not
supported here anyway)
* got rid of _vm versions of callbacks
---
 arch/powerpc/include/asm/iommu.h| 17 +++
 arch/powerpc/include/asm/machdep.h  | 25 ---
 arch/powerpc/kernel/iommu.c | 46 ++--
 arch/powerpc/kernel/vio.c   |  5 +++
 arch/powerpc/platforms/cell/iommu.c |  8 +++--
 arch/powerpc/platforms/pasemi/iommu.c   |  7 +++--
 arch/powerpc/platforms/powernv/pci-ioda.c   | 14 +
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  7 +
 arch/powerpc/platforms/powernv/pci.c| 47 +
 arch/powerpc/platforms/powernv/pci.h|  5 +++
 arch/powerpc/platforms/pseries/iommu.c  | 34 -
 arch/powerpc/sysdev/dart_iommu.c| 12 +---
 12 files changed, 116 insertions(+), 111 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index d91bd69..e2a45c3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -44,6 +44,22 @@
 extern int iommu_is_off;
 extern int iommu_force_on;
 
+struct iommu_table_ops {
+   int (*set)(struct iommu_table *tbl,
+   long index, long npages,
+   unsigned long uaddr,
+   enum dma_data_direction direction,
+   struct dma_attrs *attrs);
+   void (*clear)(struct iommu_table *tbl,
+   long index, long npages);
+   unsigned long (*get)(struct iommu_table *tbl, long index);
+   void (*flush)(struct iommu_table *tbl);
+};
+
+/* These are used by VIO */
+extern struct iommu_table_ops iommu_table_lpar_multi_ops;
+extern struct iommu_table_ops iommu_table_pseries_ops;
+
 /*
  * IOMAP_MAX_ORDER defines the largest contiguous block
  * of dma space we can get.  IOMAP_MAX_ORDER = 13
@@ -78,6 +94,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
 #endif
+   struct iommu_table_ops *it_ops;
void (*set_bypass)(struct iommu_table *tbl, bool enable);
 #ifdef CONFIG_PPC_POWERNV
void   *data;
diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index ef889943..ab721b4 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -65,31 +65,6 @@ struct machdep_calls {
 * destroyed as well */
void(*hpte_clear_all)(void);
 
-   int (*tce_build)(struct iommu_table *tbl,
-long index,
-long npages,
-unsigned long uaddr,
-enum dma_data_direction direction,
-struct dma_attrs *attrs);
-   void(*tce_free)(struct iommu_table *tbl,
-   long index,
-   long npages);
-   unsigned long   (*tce_get)(struct iommu_table *tbl,
-   long index);
-   void(*tce_flush)(struct iommu_table *tbl);
-
-   /* _rm versions are for real mode use only */
-   int 

[PATCH kernel v12 07/34] vfio: powerpc/spapr: Check that IOMMU page is fully contained by system page

2015-06-04 Thread Alexey Kardashevskiy
This checks that the TCE table page size is not bigger than the size of
the page we have just pinned and whose physical address we are going to
put into the table.

Otherwise the hardware gets unwanted access to the physical memory between
the end of the actual page and the end of the aligned-up TCE page.

Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for an additional check of whether the page is huge.

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v8: changed subject

v6:
* the helper is simplified to one line

v4:
* s/tce_check_page_size/tce_page_is_contained/
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index b95fa2b..735b308 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -47,6 +47,16 @@ struct tce_container {
bool enabled;
 };
 
+static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+{
+   /*
+* Check that the TCE table granularity is not bigger than the size of
+* a page we just found. Otherwise the hardware can get access to
+* a bigger memory chunk that it should.
+*/
+   return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
int ret = 0;
@@ -189,6 +199,12 @@ static long tce_iommu_build(struct tce_container 
*container,
ret = -EFAULT;
break;
}
+
+   if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+   ret = -EPERM;
+   break;
+   }
+
hva = (unsigned long) page_address(page) + offset;
 
ret = iommu_tce_build(tbl, entry + i, hva, direction);
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel v12 25/34] powerpc/powernv/ioda2: Introduce helpers to allocate TCE pages

2015-06-04 Thread Alexey Kardashevskiy
This is a part of moving TCE table allocation into an iommu_ops
callback to support multiple IOMMU groups per one VFIO container.

This moves the code which allocates the actual TCE tables to helpers:
pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages().
These do not allocate/free the iommu_table struct.

This enforces window size to be a power of two.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: Gavin Shan 
Reviewed-by: David Gibson 
---
Changes:
v10:
* removed @table_group parameter from pnv_pci_create_table as it was not used
* removed *tce_table_allocated from pnv_alloc_tce_table_pages()
* pnv_pci_create_table/pnv_pci_free_table renamed to
pnv_pci_ioda2_table_alloc_pages/pnv_pci_ioda2_table_free_pages and moved
back to pci-ioda.c as these only allocate pages for IODA2 and there is
no chance they will be reused for IODA1/P5IOC2
* shortened subject line

v9:
* moved helpers to the common powernv pci.c file from pci-ioda.c
* moved bits from pnv_pci_create_table() to pnv_alloc_tce_table_pages()
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 83 +++
 1 file changed, 63 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 95d3121..38d53dc 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -49,6 +50,8 @@
 /* 256M DMA window, 4K TCE pages, 8 bytes TCE */
 #define TCE32_TABLE_SIZE   ((0x1000 / 0x1000) * 8)
 
+static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl);
+
 static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
const char *fmt, ...)
 {
@@ -1313,8 +1316,8 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev 
*dev, struct pnv_ioda_pe
iommu_group_put(pe->table_group.group);
BUG_ON(pe->table_group.group);
}
+   pnv_pci_ioda2_table_free_pages(tbl);
iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
-   free_pages(addr, get_order(TCE32_TABLE_SIZE));
 }
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs)
@@ -2032,13 +2035,62 @@ static void pnv_pci_ioda_setup_opal_tce_kill(struct 
pnv_phb *phb)
phb->ioda.tce_inval_reg = ioremap(phb->ioda.tce_inval_reg_phys, 8);
 }
 
-static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
-  struct pnv_ioda_pe *pe)
+static __be64 *pnv_pci_ioda2_table_do_alloc_pages(int nid, unsigned shift)
 {
struct page *tce_mem = NULL;
+   __be64 *addr;
+   unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT;
+
+   tce_mem = alloc_pages_node(nid, GFP_KERNEL, order);
+   if (!tce_mem) {
+   pr_err("Failed to allocate a TCE memory, order=%d\n", order);
+   return NULL;
+   }
+   addr = page_address(tce_mem);
+   memset(addr, 0, 1UL << (order + PAGE_SHIFT));
+
+   return addr;
+}
+
+static long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
+   __u32 page_shift, __u64 window_size, struct iommu_table *tbl)
+{
void *addr;
+   const unsigned window_shift = ilog2(window_size);
+   unsigned entries_shift = window_shift - page_shift;
+   unsigned table_shift = max_t(unsigned, entries_shift + 3, PAGE_SHIFT);
+   const unsigned long tce_table_size = 1UL << table_shift;
+
+   if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
+   return -EINVAL;
+
+   /* Allocate TCE table */
+   addr = pnv_pci_ioda2_table_do_alloc_pages(nid, table_shift);
+   if (!addr)
+   return -ENOMEM;
+
+   /* Setup linux iommu table */
+   pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, bus_offset,
+   page_shift);
+
+   pr_devel("Created TCE table: ws=%08llx ts=%lx @%08llx\n",
+   window_size, tce_table_size, bus_offset);
+
+   return 0;
+}
+
+static void pnv_pci_ioda2_table_free_pages(struct iommu_table *tbl)
+{
+   if (!tbl->it_size)
+   return;
+
+   free_pages(tbl->it_base, get_order(tbl->it_size << 3));
+}
+
+static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
+  struct pnv_ioda_pe *pe)
+{
struct iommu_table *tbl;
-   unsigned int tce_table_size, end;
int64_t rc;
 
/* We shouldn't already have a 32-bit DMA associated */
@@ -2055,24 +2107,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
 
/* The PE will reserve all possible 32-bits space */
pe->tce32_seg = 0;
-   end = (1 << ilog2(phb->ioda.m32_pci_base));
-   tce_table_size = (end / 0x1000) * 8;
pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n",
-   end);
+

[PATCH kernel v12 18/34] vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control

2015-06-04 Thread Alexey Kardashevskiy
This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
which call in a loop iommu_take_ownership()/iommu_release_ownership()
for every table on the group. As there is just one now, no change in
behaviour is expected.

At the moment the iommu_table struct has a set_bypass() callback which
enables/disables DMA bypass on the IODA2 PHB. This is exposed to the
POWERPC IOMMU code, which calls this callback when external IOMMU users
such as VFIO are about to take over a PHB.

The set_bypass() callback is not really an iommu_table function but an
IOMMU/PE function. This introduces an iommu_table_group_ops struct and
adds take_ownership()/release_ownership() callbacks to it which are
called when an external user takes/releases control over the IOMMU.

This replaces set_bypass() with the ownership callbacks as it is not
necessarily just bypass enabling, it can be something else or more,
so let's give it a more generic name.

The callbacks are implemented for IODA2 only. Other platforms (P5IOC2,
IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
The following patches will replace iommu_take_ownership/
iommu_release_ownership calls in IODA2 with full IOMMU table release/
create.

While we are here touching the bypass control, this removes
pnv_pci_ioda2_setup_bypass_pe() as it does not do much more compared to
pnv_pci_ioda2_set_bypass(). This moves the tce_bypass_base
initialization to pnv_pci_ioda2_setup_dma_pe().

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: Gavin Shan 
Reviewed-by: David Gibson 
---
Changes:
v10:
* fixed comments around take_ownership/release_ownership in 
iommu_table_group_ops

v9:
* squashed "vfio: powerpc/spapr: powerpc/iommu: Rework IOMMU ownership control"
and "vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership control"
into a single patch
* moved helpers with a loop through tables in a group
to vfio_iommu_spapr_tce.c to keep the platform code free of IOMMU table
groups as much as possible
* added missing tce_iommu_clear() to tce_iommu_release_ownership()
* replaced the set_ownership(enable) callback with take_ownership() and
release_ownership()
---
 arch/powerpc/include/asm/iommu.h  | 11 -
 arch/powerpc/kernel/iommu.c   | 12 -
 arch/powerpc/platforms/powernv/pci-ioda.c | 73 ++-
 drivers/vfio/vfio_iommu_spapr_tce.c   | 70 ++---
 4 files changed, 118 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 44a20cc..489133c 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -93,7 +93,6 @@ struct iommu_table {
unsigned long  it_page_shift;/* table iommu page size */
struct list_head it_group_list;/* List of iommu_table_group_link */
struct iommu_table_ops *it_ops;
-   void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
 /* Pure 2^n version of get_order */
@@ -126,6 +125,15 @@ extern struct iommu_table *iommu_init_table(struct 
iommu_table * tbl,
int nid);
 #define IOMMU_TABLE_GROUP_MAX_TABLES   1
 
+struct iommu_table_group;
+
+struct iommu_table_group_ops {
+   /* Switch ownership from platform code to external user (e.g. VFIO) */
+   void (*take_ownership)(struct iommu_table_group *table_group);
+   /* Switch ownership from external user (e.g. VFIO) back to core */
+   void (*release_ownership)(struct iommu_table_group *table_group);
+};
+
 struct iommu_table_group_link {
struct list_head next;
struct rcu_head rcu;
@@ -135,6 +143,7 @@ struct iommu_table_group_link {
 struct iommu_table_group {
struct iommu_group *group;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
+   struct iommu_table_group_ops *ops;
 };
 
 #ifdef CONFIG_IOMMU_API
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index be258b2..e7f81b7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1047,14 +1047,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
 
memset(tbl->it_map, 0xff, sz);
 
-   /*
-* Disable iommu bypass, otherwise the user can DMA to all of
-* our physical memory via the bypass window instead of just
-* the pages that has been explicitly mapped into the iommu
-*/
-   if (tbl->set_bypass)
-   tbl->set_bypass(tbl, false);
-
return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
@@ -1068,10 +1060,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
/* Restore bit#0 set by iommu_init_table() */
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
-
-   /* The kernel owns the device now, we can restore the iommu bypass */
-   if (tbl->set_bypass)
-   tbl->set_bypass(tbl, true);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
diff --git a/arch/powerpc

[PATCH kernel v12 26/34] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window

2015-06-04 Thread Alexey Kardashevskiy
This is a part of moving DMA window programming to an iommu_ops
callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as
its first parameter (not a pnv_ioda_pe) as it is going to be used as
a callback for the VFIO DDW code.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v12:
* removed comment from commit log about pnv_pci_ioda2_tvt_invalidate()/
pnv_pci_ioda2_invalidate_entire()

v11:
* replaced some 1phb;
+   int64_t rc;
+   const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
+   const __u64 win_size = tbl->it_size << tbl->it_page_shift;
+
+   pe_info(pe, "Setting up window %llx..%llx pg=%x\n",
+   start_addr, start_addr + win_size - 1,
+   IOMMU_PAGE_SIZE(tbl));
+
+   /*
+* Map TCE table through TVT. The TVE index is the PE number
+* shifted by 1 bit for 32-bits DMA space.
+*/
+   rc = opal_pci_map_pe_dma_window(phb->opal_id,
+   pe->pe_number,
+   pe->pe_number << 1,
+   1,
+   __pa(tbl->it_base),
+   tbl->it_size << 3,
+   IOMMU_PAGE_SIZE(tbl));
+   if (rc) {
+   pe_err(pe, "Failed to configure TCE table, err %ld\n", rc);
+   return rc;
+   }
+
+   pnv_pci_link_table_and_group(phb->hose->node, num,
+   tbl, &pe->table_group);
+   pnv_pci_ioda2_tce_invalidate_entire(pe);
+
+   return 0;
+}
+
 static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 {
uint16_t window_id = (pe->pe_number << 1 ) + 1;
@@ -2124,21 +2161,13 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
pe->table_group.ops = &pnv_pci_ioda2_ops;
 #endif
 
-   /*
-* Map TCE table through TVT. The TVE index is the PE number
-* shifted by 1 bit for 32-bits DMA space.
-*/
-   rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
-   pe->pe_number << 1, 1, __pa(tbl->it_base),
-   tbl->it_size << 3, 1ULL << tbl->it_page_shift);
+   rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table,"
   " err %ld\n", rc);
goto fail;
}
 
-   pnv_pci_ioda2_tce_invalidate_entire(pe);
-
/* OPAL variant of PHB3 invalidated TCEs */
if (phb->ioda.tce_inval_reg)
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
-- 
2.4.0.rc3.8.gfb3e7d5



Re: [PATCH 14/15] KVM: MTRR: do not map huage page for non-consistent range

2015-06-04 Thread Xiao Guangrong


[ CCed Zhang Yang ]

On 06/04/2015 04:36 PM, Paolo Bonzini wrote:



On 04/06/2015 10:23, Xiao Guangrong wrote:


So, why do you need to always use IPAT=0?  Can patch 15 keep the current
logic for RAM, like this:

 if (is_mmio || kvm_arch_has_noncoherent_dma(vcpu->kvm))
 ret = kvm_mtrr_get_guest_memory_type(vcpu, gfn) <<
   VMX_EPT_MT_EPTE_SHIFT;
 else
 ret = (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT)
 | VMX_EPT_IPAT_BIT;


Yeah, it's okay; actually we considered that approach. However:
- it's light enough; it did not hurt guest performance in our
   benchmarks.
- the logic has always been used for the noncoherent_dma case;
   extending it to the normal case should be low risk and also helps us
   verify the logic.


But noncoherent_dma is not the common case, so it's not necessarily true
that the risk is low.


I thought noncoherent_dma existed on first-generation IOMMUs; it should
have been fully tested back then.




- completely following the MTRR spec would be better than having the host hide it.


We are a virtualization platform, we know well when MTRRs are necessary.

There is a risk in blindly obeying the guest MTRRs: userspace can see
stale data if the guest's accesses bypass the cache.  AMD sidesteps this
by enabling snooping even in cases that ordinarily wouldn't snoop; for
Intel the solution is that RAM-backed areas should always use IPAT.


Not sure if combinations of UC and other cacheable types on guest and
host will cause problems. The SDM mentions that snooping is not required
only when
"The UC attribute comes from the MTRRs and the processors are not required
 to snoop their caches since the data could never have been cached."
(Vol 3, 11.5.2.2)
VMX does not touch the hardware MTRR MSRs, so I guess snooping works in
this case.

I also noticed that if SS (self-snooping) is supported we need not
invalidate the cache when programming a memory type (Vol 3, 11.11.8), so
I guess the CPU copes with a page that has different cache types.

After thinking it over carefully, we (Zhang Yang) doubt that always
setting WB for DMA memory is a good idea, because we cannot assume WB
DMA works well for all devices. One example is audio DMA (not an MMIO
region), which requires WC to improve its performance.

However, we think the SDM is not clear enough, so let's do full vMTRR on
MMIO and noncoherent_dma first. :)


[PATCH kernel v12 22/34] powerpc/powernv: Implement accessor to TCE entry

2015-06-04 Thread Alexey Kardashevskiy
This replaces direct accesses to the TCE table with a helper which
returns a TCE entry address. This makes no difference now but will
when multi-level TCE tables are introduced.

No change in behavior is expected.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v9:
* new patch in the series to separate this mechanical change from
functional changes; this is not right before
"powerpc/powernv: Implement multilevel TCE tables" but here in order
to let the next patch - "powerpc/iommu/powernv: Release replaced TCE" -
use pnv_tce() and avoid changing the same code twice
---
 arch/powerpc/platforms/powernv/pci.c | 34 +-
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index 4b4c583..b2a32d0 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -572,38 +572,46 @@ struct pci_ops pnv_pci_ops = {
.write = pnv_pci_write_config,
 };
 
+static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
+{
+   __be64 *tmp = ((__be64 *)tbl->it_base);
+
+   return tmp + idx;
+}
+
 int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
unsigned long uaddr, enum dma_data_direction direction,
struct dma_attrs *attrs)
 {
u64 proto_tce = iommu_direction_to_tce_perm(direction);
-   __be64 *tcep;
-   u64 rpn;
+   u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
+   long i;
 
-   tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
-   rpn = __pa(uaddr) >> tbl->it_page_shift;
-
-   while (npages--)
-   *(tcep++) = cpu_to_be64(proto_tce |
-   (rpn++ << tbl->it_page_shift));
+   for (i = 0; i < npages; i++) {
+   unsigned long newtce = proto_tce |
+   ((rpn + i) << tbl->it_page_shift);
+   unsigned long idx = index - tbl->it_offset + i;
 
+   *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
+   }
 
return 0;
 }
 
 void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
 {
-   __be64 *tcep;
+   long i;
 
-   tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
+   for (i = 0; i < npages; i++) {
+   unsigned long idx = index - tbl->it_offset + i;
 
-   while (npages--)
-   *(tcep++) = cpu_to_be64(0);
+   *(pnv_tce(tbl, idx)) = cpu_to_be64(0);
+   }
 }
 
 unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
 {
-   return ((u64 *)tbl->it_base)[index - tbl->it_offset];
+   return *(pnv_tce(tbl, index - tbl->it_offset));
 }
 
 struct iommu_table *pnv_pci_table_alloc(int nid)
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel v12 31/34] vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership control

2015-06-04 Thread Alexey Kardashevskiy
Previously, the IOMMU user (VFIO) would take control of the IOMMU table
belonging to a specific IOMMU group. This approach did not allow sharing
tables between IOMMU groups attached to the same container.

This introduces a new IOMMU ownership flavour in which the user does not
just control the existing IOMMU table but can remove/create tables on
demand.
If an IOMMU implements take/release_ownership() callbacks, this lets
the user have full control over the IOMMU group. When the ownership
is taken, the platform code removes all the windows so the caller must
create them.
Before returning the ownership back to the platform code, VFIO
unprograms and removes all the tables it created.

This changes IODA2's ownership handler to remove the existing table
rather than manipulate it in place. From now on,
iommu_take_ownership() and iommu_release_ownership() are only called
from the vfio_iommu_spapr_tce driver.

Old-style ownership is still supported allowing VFIO to run on older
P5IOC2 and IODA IO controllers.

No change in userspace-visible behaviour is expected. Since it recreates
TCE tables on each ownership change, related kernel traces will appear
more often.

This adds a pnv_pci_ioda2_setup_default_config() which is called
when PE is being configured at boot time and when the ownership is
passed from VFIO to the platform code.

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
---
Changes:
v10:
* created pnv_pci_ioda2_setup_default_config() helper

v9:
* fixed crash in tce_iommu_detach_group() on tbl->it_ops->free as
tce_iommu_attach_group() used to initialize the table from a descriptor
on the stack (it does not matter for the series as this bit is changed
later anyway, but it ruins bisectability)

v6:
* fixed commit log that VFIO removes tables before passing ownership
back to the platform code, not userspace
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 101 --
 drivers/vfio/vfio_iommu_spapr_tce.c   |  88 +-
 2 files changed, 141 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 1cb96f0..dfd43ac 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2072,6 +2072,49 @@ static long pnv_pci_ioda2_create_table(struct 
iommu_table_group *table_group,
return 0;
 }
 
+static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
+{
+   struct iommu_table *tbl = NULL;
+   long rc;
+
+   rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
+   IOMMU_PAGE_SHIFT_4K,
+   pe->table_group.tce32_size,
+   POWERNV_IOMMU_DEFAULT_LEVELS, &tbl);
+   if (rc) {
+   pe_err(pe, "Failed to create 32-bit TCE table, err %ld",
+   rc);
+   return rc;
+   }
+
+   iommu_init_table(tbl, pe->phb->hose->node);
+
+   rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
+   if (rc) {
+   pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
+   rc);
+   pnv_ioda2_table_free(tbl);
+   return rc;
+   }
+
+   if (!pnv_iommu_bypass_disabled)
+   pnv_pci_ioda2_set_bypass(pe, true);
+
+   /* OPAL variant of PHB3 invalidated TCEs */
+   if (pe->phb->ioda.tce_inval_reg)
+   tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
+
+   /*
+* Setting table base here only for carrying iommu_group
+* further down to let iommu_add_device() do the job.
+* pnv_pci_ioda_dma_dev_setup will override it later anyway.
+*/
+   if (pe->flags & PNV_IODA_PE_DEV)
+   set_iommu_table_base(&pe->pdev->dev, tbl);
+
+   return 0;
+}
+
 #ifdef CONFIG_IOMMU_API
 static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
__u64 window_size, __u32 levels)
@@ -2133,9 +2176,12 @@ static void pnv_ioda2_take_ownership(struct 
iommu_table_group *table_group)
 {
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
table_group);
+   /* Store @tbl as pnv_pci_ioda2_unset_window() resets it */
+   struct iommu_table *tbl = pe->table_group.tables[0];
 
-   iommu_take_ownership(table_group->tables[0]);
pnv_pci_ioda2_set_bypass(pe, false);
+   pnv_pci_ioda2_unset_window(&pe->table_group, 0);
+   pnv_ioda2_table_free(tbl);
 }
 
 static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group)
@@ -2143,8 +2189,7 @@ static void pnv_ioda2_release_ownership(struct 
iommu_table_group *table_group)
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
table_group);
 
-   iommu_releas

[PATCH kernel v12 16/34] powerpc/spapr: vfio: Replace iommu_table with iommu_table_group

2015-06-04 Thread Alexey Kardashevskiy
Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
for TCE tables. Right now just one table is supported.

This defines iommu_table_group struct which stores pointers to
iommu_group and iommu_table(s). This replaces iommu_table with
iommu_table_group where iommu_table was used to identify a group:
- iommu_register_group();
- iommudata of generic iommu_group;

This removes @data from iommu_table, as it_table_group provides the
same access to pnv_ioda_pe.

For IODA, instead of embedding iommu_table, the new iommu_table_group
keeps pointers to those. The iommu_table structs are allocated
dynamically.

For P5IOC2, both iommu_table_group and iommu_table are embedded into
PE struct. As there is no EEH and SRIOV support for P5IOC2,
iommu_free_table() should not be called on iommu_table struct pointers
so we can keep it embedded in pnv_phb::p5ioc2.

For pSeries, this replaces multiple calls of kzalloc_node() with a new
iommu_pseries_alloc_group() helper and stores the table group struct
pointer into the pci_dn struct. For release, a iommu_table_free_group()
helper is added.

This moves iommu_table struct allocation from SR-IOV code to
the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
This change is here because those lines had to be changed anyway.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v11:
* iommu_table_group moved outside #ifdef CONFIG_IOMMU_API as iommu_table
is dynamically allocated and it needs a pointer to PE and
iommu_table_group is this pointer

v10:
* new to the series, separated from
"powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group"
* iommu_table is not embedded into iommu_table_group but allocated
dynamically in most cases
* iommu_table allocation is moved to a single place for IODA2's
pnv_pci_ioda_setup_dma_pe where it belongs to
* added list of groups into iommu_table; most of the code just looks at
the first item to keep the patch simpler
---
 arch/powerpc/include/asm/iommu.h|  19 ++---
 arch/powerpc/include/asm/pci-bridge.h   |   2 +-
 arch/powerpc/kernel/iommu.c |  17 ++---
 arch/powerpc/platforms/powernv/pci-ioda.c   |  55 +++---
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  18 +++--
 arch/powerpc/platforms/powernv/pci.h|   3 +-
 arch/powerpc/platforms/pseries/iommu.c  | 107 +++-
 drivers/vfio/vfio_iommu_spapr_tce.c |  23 +++---
 8 files changed, 152 insertions(+), 92 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index e2a45c3..5a7267f 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -91,14 +91,9 @@ struct iommu_table {
struct iommu_pool pools[IOMMU_NR_POOLS];
unsigned long *it_map;   /* A simple allocation bitmap for now */
unsigned long  it_page_shift;/* table iommu page size */
-#ifdef CONFIG_IOMMU_API
-   struct iommu_group *it_group;
-#endif
+   struct iommu_table_group *it_table_group;
struct iommu_table_ops *it_ops;
void (*set_bypass)(struct iommu_table *tbl, bool enable);
-#ifdef CONFIG_PPC_POWERNV
-   void   *data;
-#endif
 };
 
 /* Pure 2^n version of get_order */
@@ -129,14 +124,22 @@ extern void iommu_free_table(struct iommu_table *tbl, 
const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
int nid);
+#define IOMMU_TABLE_GROUP_MAX_TABLES   1
+
+struct iommu_table_group {
+   struct iommu_group *group;
+   struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
+};
+
 #ifdef CONFIG_IOMMU_API
-extern void iommu_register_group(struct iommu_table *tbl,
+
+extern void iommu_register_group(struct iommu_table_group *table_group,
 int pci_domain_number, unsigned long pe_num);
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 #else
-static inline void iommu_register_group(struct iommu_table *tbl,
+static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
unsigned long pe_num)
 {
diff --git a/arch/powerpc/include/asm/pci-bridge.h 
b/arch/powerpc/include/asm/pci-bridge.h
index 1811c44..e2d7479 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -185,7 +185,7 @@ struct pci_dn {
 
struct  pci_dn *parent;
struct  pci_controller *phb;/* for pci devices */
-   struct  iommu_table *iom

[PATCH kernel v12 04/34] powerpc/iommu: Put IOMMU group explicitly

2015-06-04 Thread Alexey Kardashevskiy
So far an iommu_table's lifetime has been the same as its PE's. Dynamic
DMA windows will change this, and iommu_free_table() will not always
require the group to be released.

This moves iommu_group_put() out of iommu_free_table().

This adds an iommu_pseries_free_table() helper which does
iommu_group_put() and iommu_free_table(). Later it will be changed to
receive a table_group, so fewer lines will need to change then.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: Gavin Shan 
Reviewed-by: David Gibson 
---
 arch/powerpc/kernel/iommu.c   |  7 ---
 arch/powerpc/platforms/powernv/pci-ioda.c |  5 +
 arch/powerpc/platforms/pseries/iommu.c| 16 +++-
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b054f33..3d47eb3 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -726,13 +726,6 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
if (tbl->it_offset == 0)
clear_bit(0, tbl->it_map);
 
-#ifdef CONFIG_IOMMU_API
-   if (tbl->it_group) {
-   iommu_group_put(tbl->it_group);
-   BUG_ON(tbl->it_group);
-   }
-#endif
-
/* verify that table contains no entries */
if (!bitmap_empty(tbl->it_map, tbl->it_size))
pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 8ca7abd..8c3c4bf 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1310,6 +1311,10 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev 
*dev, struct pnv_ioda_pe
if (rc)
pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
 
+   if (tbl->it_group) {
+   iommu_group_put(tbl->it_group);
+   BUG_ON(tbl->it_group);
+   }
iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
free_pages(addr, get_order(TCE32_TABLE_SIZE));
pe->tce32_table = NULL;
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 05ab06d..fe5117b 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -51,6 +52,18 @@
 
 #include "pseries.h"
 
+static void iommu_pseries_free_table(struct iommu_table *tbl,
+   const char *node_name)
+{
+#ifdef CONFIG_IOMMU_API
+   if (tbl->it_group) {
+   iommu_group_put(tbl->it_group);
+   BUG_ON(tbl->it_group);
+   }
+#endif
+   iommu_free_table(tbl, node_name);
+}
+
 static void tce_invalidate_pSeries_sw(struct iommu_table *tbl,
  __be64 *startp, __be64 *endp)
 {
@@ -1271,7 +1284,8 @@ static int iommu_reconfig_notifier(struct notifier_block 
*nb, unsigned long acti
 */
remove_ddw(np, false);
if (pci && pci->iommu_table)
-   iommu_free_table(pci->iommu_table, np->full_name);
+   iommu_pseries_free_table(pci->iommu_table,
+   np->full_name);
 
spin_lock(&direct_window_list_lock);
list_for_each_entry(window, &direct_window_list, list) {
-- 
2.4.0.rc3.8.gfb3e7d5



Re: [GIT PULL 0/6] perf/core improvements and fixes

2015-06-04 Thread Ingo Molnar

* Alexei Starovoitov  wrote:

> On 6/4/15 7:04 AM, Ingo Molnar wrote:
> >>>  # perf record -e bpf_source.c cmdline
> >>>
> >>>  to create a eBPF filter from source,
> >>>
> >>>Use
> >>>
> >>># perf record -e bpf_object.o cmdline
> >>>
> >>>to create a eBPF filter from object intermedia.
> >>>
> >>>Use
> >>>
> >>># perf bpf compile bpf_source.c --kbuild=kernel-build-dir -o bpf_object.o
> >>>
> >>>to create the .o
> >>>
> >>>I think this should be enough. Currently only the second case has been 
> >>>implemented.
> >
> > So if users cannot actually generate .o files then it's premature to merge 
> > this in such an incomplete form!
> >
> > It should be possible to use a feature that we are merging.
> 
> of course it's usable :) There is some confusion here.
> To compile .c into .o one can easily use
> clang -O2 -emit-llvm -c file.c -o - | llc -march=bpf -o file.o

There's no confusion here: you guys are trying to sell me what at this stage is 
incomplete and hard to use, and I'm resisting it as I should! :-)

We also have different definitions of 'easily'. It might be 'easy' to type:
 
clang -O2 -emit-llvm -c file.c -o - | llc -march=bpf -o file.o

... for some tooling developer intimate with eBPF, but to the first time user 
who 
found an interesting looking eBPF scriptlet on the net or in the documentation 
and 
wants to try his luck? It's absolutely non-obvious!

The current usage to get a _minimal_ eBPF script running is non-obvious and 
obscure to the level of being a show stopper.

I don't understand why you guys are even wasting time arguing about it: it's 
not 
that hard to auto-build from source code. It's one of the basic features of 
tooling. If you ever built perf you'll know that typing 'make install' will 
type 
in all those quirky build lines automatically for you, without requiring you to 
perform any other step, no matter how trivial.

Doubly annoying, you seem to have the UI principles wrong, you seem to think 
that 
a .o is a proper user interface. It absolutely is _not_ okay.

The Linux kernel project and as an extension the perf project deals with source 
code, and I'm 100% suspicious of approaches that somehow think that .o objects 
are 
the right UI for _anything_ except temporary files that sometimes show up in 
object directories...

Fix the 'newbie user' UI flow as a _first_ priority, not as a second thought!

Every single quirky line or nonsensical option you require a first time user to 
type halves the number of new users we'll get. You need to understand why 
dtrace 
is so popular:

   - it's bloody easy to use

   - it's a safe environment you can deploy in critical environments

   - it's flexible

   - instrumentation hacks are very easy to share

eBPF based scripting got 3 out of those 4 right, but please don't forget item 1 
either, because without that we have nothing but a bunch of unusable 
functionality 
in the kernel and in tooling that benefits only very few people. Okay?

> So I think we need to support both 'perf record -e file.[co]'

Why do you even need to ask? Of course!

Think through how users will meet eBPF scripts and how they will interact with 
them:

  - they'll see or download an eBPF scriptlet somewhere and will have a .c file.

  - ideally there will be built-in eBPF scriptlets just like we have tracing 
plugins, and there's a good UI to query them and see their description and 
source code.

  - then they will want to use it all with the minimum amount of fuss

  - they don't care how the eBPF scriptlet gets to the kernel: whether the 
kernel 
can read and build the .c files, or whether there's some user tooling that
turns it into bytecode. Most humans don't read bytecode!

  - they will absolutely not download random .o's and we should not encourage 
that
in any case - these things should be source code based.

These things compile in an eye blink, there's very little reason to ever deal 
with 
a .o, except some weird and rare usecases...

In fact I'm NAK-ing the whole .o based interface until the .c interface is made 
the _primary_ one and works well and until I see that you have thought through 
basic usability questions...

Thanks,

Ingo


[PATCH kernel v12 03/34] powerpc/powernv/ioda: Clean up IOMMU group registration

2015-06-04 Thread Alexey Kardashevskiy
The existing code has 3 calls to iommu_register_group() and
all 3 branches actually cover all possible cases.

This replaces 3 calls with one and moves the registration earlier;
the latter will make more sense when we add TCE table sharing.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: Gavin Shan 
Reviewed-by: David Gibson 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 28 
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9a77f3c..8ca7abd 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1784,6 +1784,9 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
if (WARN_ON(pe->tce32_seg >= 0))
return;
 
+   tbl = pe->tce32_table;
+   iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+
/* Grab a 32-bit TCE table */
pe->tce32_seg = base;
pe_info(pe, " Setting up 32-bit TCE table at %08x..%08x\n",
@@ -1818,7 +1821,6 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
}
 
/* Setup linux iommu table */
-   tbl = pe->tce32_table;
pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
  base << 28, IOMMU_PAGE_SHIFT_4K);
 
@@ -1840,8 +1842,6 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
iommu_init_table(tbl, phb->hose->node);
 
if (pe->flags & PNV_IODA_PE_DEV) {
-   iommu_register_group(tbl, phb->hose->global_number,
-pe->pe_number);
/*
 * Setting table base here only for carrying iommu_group
 * further down to let iommu_add_device() do the job.
@@ -1849,14 +1849,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
*phb,
 */
set_iommu_table_base(&pe->pdev->dev, tbl);
iommu_add_device(&pe->pdev->dev);
-   } else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
-   iommu_register_group(tbl, phb->hose->global_number,
-pe->pe_number);
+   } else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
pnv_ioda_setup_bus_dma(pe, pe->pbus);
-   } else if (pe->flags & PNV_IODA_PE_VF) {
-   iommu_register_group(tbl, phb->hose->global_number,
-pe->pe_number);
-   }
 
return;
  fail:
@@ -1923,6 +1917,9 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
if (WARN_ON(pe->tce32_seg >= 0))
return;
 
+   tbl = pe->tce32_table;
+   iommu_register_group(tbl, phb->hose->global_number, pe->pe_number);
+
/* The PE will reserve all possible 32-bits space */
pe->tce32_seg = 0;
end = (1 << ilog2(phb->ioda.m32_pci_base));
@@ -1954,7 +1951,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
}
 
/* Setup linux iommu table */
-   tbl = pe->tce32_table;
pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
IOMMU_PAGE_SHIFT_4K);
 
@@ -1974,8 +1970,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
iommu_init_table(tbl, phb->hose->node);
 
if (pe->flags & PNV_IODA_PE_DEV) {
-   iommu_register_group(tbl, phb->hose->global_number,
-pe->pe_number);
/*
 * Setting table base here only for carrying iommu_group
 * further down to let iommu_add_device() do the job.
@@ -1983,14 +1977,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
 */
set_iommu_table_base(&pe->pdev->dev, tbl);
iommu_add_device(&pe->pdev->dev);
-   } else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
-   iommu_register_group(tbl, phb->hose->global_number,
-pe->pe_number);
+   } else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
pnv_ioda_setup_bus_dma(pe, pe->pbus);
-   } else if (pe->flags & PNV_IODA_PE_VF) {
-   iommu_register_group(tbl, phb->hose->global_number,
-pe->pe_number);
-   }
 
/* Also create a bypass window */
if (!pnv_iommu_bypass_disabled)
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel v12 30/34] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table

2015-06-04 Thread Alexey Kardashevskiy
This adds a way for the IOMMU user to know how much memory a new table
will use so it can be accounted against the locked_vm limit before the
allocation happens.

This stores the allocated table size in pnv_pci_ioda2_get_table_size()
so the locked_vm counter can be updated correctly when a table is
being disposed.

This defines an iommu_table_group_ops callback to let VFIO know
how much memory will be locked if a table is created.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
Changes:
v10:
* s/ROUND_UP/_ALIGN_UP/
* fixed rounding up for @entries_shift (used to use ROUND_UP)

v9:
* reimplemented the whole patch
---
 arch/powerpc/include/asm/iommu.h  |  5 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 34 +++
 2 files changed, 39 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index e554175..9d37492 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -99,6 +99,7 @@ struct iommu_table {
unsigned long  it_size;  /* Size of iommu table in entries */
unsigned long  it_indirect_levels;
unsigned long  it_level_size;
+   unsigned long  it_allocated_size;
unsigned long  it_offset;/* Offset into global table */
unsigned long  it_base;  /* mapped address of tce table */
unsigned long  it_index; /* which iommu table this is */
@@ -147,6 +148,10 @@ extern struct iommu_table *iommu_init_table(struct 
iommu_table * tbl,
 struct iommu_table_group;
 
 struct iommu_table_group_ops {
+   unsigned long (*get_table_size)(
+   __u32 page_shift,
+   __u64 window_size,
+   __u32 levels);
long (*create_table)(struct iommu_table_group *table_group,
int num,
__u32 page_shift,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 612ab23..1cb96f0 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2073,6 +2073,38 @@ static long pnv_pci_ioda2_create_table(struct 
iommu_table_group *table_group,
 }
 
 #ifdef CONFIG_IOMMU_API
+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+   __u64 window_size, __u32 levels)
+{
+   unsigned long bytes = 0;
+   const unsigned window_shift = ilog2(window_size);
+   unsigned entries_shift = window_shift - page_shift;
+   unsigned table_shift = entries_shift + 3;
+   unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift);
+   unsigned long direct_table_size;
+
+   if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS) ||
+   (window_size > memory_hotplug_max()) ||
+   !is_power_of_2(window_size))
+   return 0;
+
+   /* Calculate a direct table size from window_size and levels */
+   entries_shift = (entries_shift + levels - 1) / levels;
+   table_shift = entries_shift + 3;
+   table_shift = max_t(unsigned, table_shift, PAGE_SHIFT);
+   direct_table_size =  1UL << table_shift;
+
+   for ( ; levels; --levels) {
+   bytes += _ALIGN_UP(tce_table_size, direct_table_size);
+
+   tce_table_size /= direct_table_size;
+   tce_table_size <<= 3;
+   tce_table_size = _ALIGN_UP(tce_table_size, direct_table_size);
+   }
+
+   return bytes;
+}
+
 static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
int num)
 {
@@ -2116,6 +2148,7 @@ static void pnv_ioda2_release_ownership(struct 
iommu_table_group *table_group)
 }
 
 static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
+   .get_table_size = pnv_pci_ioda2_get_table_size,
.create_table = pnv_pci_ioda2_create_table,
.set_window = pnv_pci_ioda2_set_window,
.unset_window = pnv_pci_ioda2_unset_window,
@@ -2227,6 +2260,7 @@ static long pnv_pci_ioda2_table_alloc_pages(int nid, 
__u64 bus_offset,
page_shift);
tbl->it_level_size = 1ULL << (level_shift - 3);
tbl->it_indirect_levels = levels - 1;
+   tbl->it_allocated_size = offset;
 
pr_devel("Created TCE table: ws=%08llx ts=%lx @%08llx\n",
window_size, tce_table_size, bus_offset);
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel v12 32/34] powerpc/mmu: Add userspace-to-physical addresses translation cache

2015-06-04 Thread Alexey Kardashevskiy
We are adding support for DMA memory pre-registration to be used in
conjunction with VFIO. The idea is that the userspace which is going to
run a guest may want to pre-register a user space memory region so
it all gets pinned once and never goes away. Having this done,
a hypervisor will not have to pin/unpin pages on every DMA map/unmap
request. This is going to help with multiple pinning of the same memory.

Another use of it is in-kernel real mode (mmu off) acceleration of
DMA requests where real time translation of guest physical to host
physical addresses is non-trivial and may fail as linux ptes may be
temporarily invalid. Also, having cached host physical addresses
(compared to just pinning at the start and then walking the page table
again on every H_PUT_TCE), we can be sure that the addresses which we put
into TCE table are the ones we already pinned.

This adds a list of memory regions to mm_context_t. Each region consists
of a header and a list of physical addresses. This adds API to:
1. register/unregister memory regions;
2. do final cleanup (which puts all pre-registered pages);
3. do userspace to physical address translation;
4. manage usage counters; multiple registration of the same memory
is allowed (once per container).

This implements 2 counters per registered memory region:
- @mapped: incremented on every DMA mapping; decremented on unmapping;
initialized to 1 when a region is just registered; once it becomes zero,
no more mappings are allowed;
- @used: incremented on every "register" ioctl; decremented on
"unregister"; unregistration is allowed for DMA mapped regions unless
it is the very last reference. For the very last reference this checks
that the region is still mapped and returns -EBUSY so the userspace
gets to know that memory is still pinned and unregistration needs to
be retried; @used remains 1.
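The interplay of the two counters can be modelled in plain C. This is a hedged userspace sketch of the scheme described above; the struct and function names are illustrative, not the kernel's actual API:

```c
#include <assert.h>
#include <errno.h>

/* Illustrative model of the @mapped/@used scheme; names are invented. */
struct mem_region {
	long mapped; /* DMA mappings + 1 self-reference; 0 => region released */
	long used;   /* one reference per "register" ioctl */
};

static void region_register(struct mem_region *m)
{
	if (m->used++ == 0)
		m->mapped = 1; /* self-reference taken on first registration */
}

static int region_dma_map(struct mem_region *m)
{
	if (m->mapped == 0)
		return -EBUSY; /* once zero, no more mappings are allowed */
	m->mapped++;
	return 0;
}

static void region_dma_unmap(struct mem_region *m)
{
	m->mapped--;
}

static int region_unregister(struct mem_region *m)
{
	/* very last reference: refuse while DMA mappings are outstanding */
	if (m->used == 1 && m->mapped != 1)
		return -EBUSY; /* @used remains 1, userspace retries */
	if (--m->used == 0)
		m->mapped = 0; /* drop the self-reference */
	return 0;
}
```

The sketch shows why unregistration of a still-mapped region fails only on the last reference: earlier "unregister" calls merely drop @used.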

Host physical addresses are stored in vmalloc'ed array. In order to
access these in the real mode (mmu off), there is a real_vmalloc_addr()
helper. In-kernel acceleration patchset will move it from KVM to MMU code.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
Changes:
v12:
* s/mmu_context_hash64_iommu.c/mmu_context_iommu.c/ as there is nothing
about hash64 in the new file
* added WARN_ON_ONCE() in mm_iommu_adjust_locked_vm()
* added mm_iommu_find() to find exact region (rather than overlapped),
as mm_iommu_get(), it takes @entries rather than @size
* mm_iommu_adjust_locked_vm() takes positive npages and a bool saying
whether to increment or decrement the limit

v11:
* added mutex to protect adding and removing
* added mm_iommu_init() helper
* kref is removed, now there are an atomic counter (@mapped) and a mutex
(for @used)
* merged mm_iommu_alloc into mm_iommu_get and do check-and-alloc under
one mutex lock; mm_iommu_get() returns old @used value so the caller can
know if it needs to elevate locked_vm counter
* do locked_vm counting in mmu_context_hash64_iommu.c

v10:
* split mm_iommu_mapped_update into mm_iommu_mapped_dec + mm_iommu_mapped_inc
* mapped counter now keep one reference for itself and mm_iommu_mapped_inc()
can tell if the region is being released
* updated commit log

v8:
* s/mm_iommu_table_group_mem_t/struct mm_iommu_table_group_mem_t/
* fixed error fallback look (s/[i]/[j]/)
---
 arch/powerpc/include/asm/mmu-hash64.h  |   3 +
 arch/powerpc/include/asm/mmu_context.h |  18 ++
 arch/powerpc/kernel/setup_64.c |   3 +
 arch/powerpc/mm/Makefile   |   1 +
 arch/powerpc/mm/mmu_context_hash64.c   |   6 +
 arch/powerpc/mm/mmu_context_iommu.c| 316 +
 6 files changed, 347 insertions(+)
 create mode 100644 arch/powerpc/mm/mmu_context_iommu.c

diff --git a/arch/powerpc/include/asm/mmu-hash64.h 
b/arch/powerpc/include/asm/mmu-hash64.h
index 1da6a81..a82f534 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -536,6 +536,9 @@ typedef struct {
/* for 4K PTE fragment support */
void *pte_frag;
 #endif
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+   struct list_head iommu_group_mem_list;
+#endif
 } mm_context_t;
 
 
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 73382eb..3e51842 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -16,6 +16,24 @@
  */
 extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
 extern void destroy_context(struct mm_struct *mm);
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+struct mm_iommu_table_group_mem_t;
+
+extern bool mm_iommu_preregistered(void);
+extern long mm_iommu_get(unsigned long ua, unsigned long entries,
+   struct mm_iommu_table_group_mem_t **pmem);
+extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
+extern void mm_iommu_init(mm_context_t *ctx);
+extern void mm_iommu_cleanup(mm_context_t *ctx);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
+

[PATCH kernel v12 08/34] vfio: powerpc/spapr: Use it_page_size

2015-06-04 Thread Alexey Kardashevskiy
This makes use of the it_page_size from the iommu_table struct,
as the page size can differ between tables.

This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code
as recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.
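The effect of the change can be sketched with a simplified userspace model of the per-table IOMMU_PAGE_XXX(tbl) macros; the struct here is trimmed to the fields the sketch needs and PAGE_SHIFT of 12 (4K system pages) is an assumption:

```c
#include <assert.h>

#define PAGE_SHIFT 12UL /* assuming 4K system pages */

/* Trimmed-down iommu_table: only the fields this sketch uses */
struct iommu_table {
	unsigned long it_offset;     /* window start, in IOMMU pages */
	unsigned long it_size;       /* window size, in IOMMU pages */
	unsigned long it_page_shift; /* log2 of the IOMMU page size */
};

/* Per-table variants of the former fixed IOMMU_PAGE_xxx_4K macros */
#define IOMMU_PAGE_SIZE(t) (1UL << (t)->it_page_shift)
#define IOMMU_PAGE_MASK(t) (~(IOMMU_PAGE_SIZE(t) - 1))

/* System pages to charge to locked_vm, cf. tce_iommu_enable() above */
static unsigned long table_locked_pages(const struct iommu_table *t)
{
	return (t->it_size << t->it_page_shift) >> PAGE_SHIFT;
}
```

For a 64K-page table the locked_vm charge is sixteen times the number of IOMMU pages, which the old hard-coded 4K shift would have miscounted.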

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: Gavin Shan 
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 735b308..64300cc 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -91,7 +91,7 @@ static int tce_iommu_enable(struct tce_container *container)
 * enforcing the limit based on the max that the guest can map.
 */
down_write(¤t->mm->mmap_sem);
-   npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+   npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
locked = current->mm->locked_vm + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
@@ -120,7 +120,7 @@ static void tce_iommu_disable(struct tce_container 
*container)
 
down_write(¤t->mm->mmap_sem);
current->mm->locked_vm -= (container->tbl->it_size <<
-   IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+   container->tbl->it_page_shift) >> PAGE_SHIFT;
up_write(¤t->mm->mmap_sem);
 }
 
@@ -215,7 +215,7 @@ static long tce_iommu_build(struct tce_container *container,
tce, ret);
break;
}
-   tce += IOMMU_PAGE_SIZE_4K;
+   tce += IOMMU_PAGE_SIZE(tbl);
}
 
if (ret)
@@ -260,8 +260,8 @@ static long tce_iommu_ioctl(void *iommu_data,
if (info.argsz < minsz)
return -EINVAL;
 
-   info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K;
-   info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K;
+   info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
+   info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
info.flags = 0;
 
if (copy_to_user((void __user *)arg, &info, minsz))
@@ -291,8 +291,8 @@ static long tce_iommu_ioctl(void *iommu_data,
VFIO_DMA_MAP_FLAG_WRITE))
return -EINVAL;
 
-   if ((param.size & ~IOMMU_PAGE_MASK_4K) ||
-   (param.vaddr & ~IOMMU_PAGE_MASK_4K))
+   if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
+   (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
return -EINVAL;
 
/* iova is checked by the IOMMU API */
@@ -307,8 +307,8 @@ static long tce_iommu_ioctl(void *iommu_data,
return ret;
 
ret = tce_iommu_build(container, tbl,
-   param.iova >> IOMMU_PAGE_SHIFT_4K,
-   tce, param.size >> IOMMU_PAGE_SHIFT_4K);
+   param.iova >> tbl->it_page_shift,
+   tce, param.size >> tbl->it_page_shift);
 
iommu_flush_tce(tbl);
 
@@ -334,17 +334,17 @@ static long tce_iommu_ioctl(void *iommu_data,
if (param.flags)
return -EINVAL;
 
-   if (param.size & ~IOMMU_PAGE_MASK_4K)
+   if (param.size & ~IOMMU_PAGE_MASK(tbl))
return -EINVAL;
 
ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
-   param.size >> IOMMU_PAGE_SHIFT_4K);
+   param.size >> tbl->it_page_shift);
if (ret)
return ret;
 
ret = tce_iommu_clear(container, tbl,
-   param.iova >> IOMMU_PAGE_SHIFT_4K,
-   param.size >> IOMMU_PAGE_SHIFT_4K);
+   param.iova >> tbl->it_page_shift,
+   param.size >> tbl->it_page_shift);
iommu_flush_tce(tbl);
 
return ret;
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel v12 11/34] vfio: powerpc/spapr: Moving pinning/unpinning to helpers

2015-06-04 Thread Alexey Kardashevskiy
This is a pretty mechanical patch to make next patches simpler.

New tce_iommu_unuse_page() helper does put_page() now but it might skip
that after the memory registering patch applied.

While here, this removes unnecessary checks of the value returned
by pfn_to_page() as it cannot possibly return NULL.

This moves tce_iommu_disable() later to let tce_iommu_clear() know if
the container has been enabled because if it has not been, then
put_page() must not be called on TCEs from the TCE table. This situation
is not yet possible but it will be once the KVM acceleration patchset
is applied.

This changes code to work with physical addresses rather than linear
mapping addresses for better code readability. Following patches will
add an xchg() callback for an IOMMU table which will accept/return
physical addresses (unlike current tce_build()) which will eliminate
redundant conversions.

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v9:
* changed helpers to work with physical addresses rather than linear
(for simplicity - later ::xchg() will receive physical and avoid
additional conversions)

v6:
* tce_get_hva() returns hva via a pointer
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 61 +
 1 file changed, 41 insertions(+), 20 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 5bbdf37..cf5d4a1 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -191,69 +191,90 @@ static void tce_iommu_release(void *iommu_data)
struct tce_container *container = iommu_data;
 
WARN_ON(container->tbl && !container->tbl->it_group);
-   tce_iommu_disable(container);
 
if (container->tbl && container->tbl->it_group)
tce_iommu_detach_group(iommu_data, container->tbl->it_group);
 
+   tce_iommu_disable(container);
mutex_destroy(&container->lock);
 
kfree(container);
 }
 
+static void tce_iommu_unuse_page(struct tce_container *container,
+   unsigned long oldtce)
+{
+   struct page *page;
+
+   if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
+   return;
+
+   page = pfn_to_page(oldtce >> PAGE_SHIFT);
+
+   if (oldtce & TCE_PCI_WRITE)
+   SetPageDirty(page);
+
+   put_page(page);
+}
+
 static int tce_iommu_clear(struct tce_container *container,
struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
 {
unsigned long oldtce;
-   struct page *page;
 
for ( ; pages; --pages, ++entry) {
oldtce = iommu_clear_tce(tbl, entry);
if (!oldtce)
continue;
 
-   page = pfn_to_page(oldtce >> PAGE_SHIFT);
-   WARN_ON(!page);
-   if (page) {
-   if (oldtce & TCE_PCI_WRITE)
-   SetPageDirty(page);
-   put_page(page);
-   }
+   tce_iommu_unuse_page(container, oldtce);
}
 
return 0;
 }
 
+static int tce_iommu_use_page(unsigned long tce, unsigned long *hpa)
+{
+   struct page *page = NULL;
+   enum dma_data_direction direction = iommu_tce_direction(tce);
+
+   if (get_user_pages_fast(tce & PAGE_MASK, 1,
+   direction != DMA_TO_DEVICE, &page) != 1)
+   return -EFAULT;
+
+   *hpa = __pa((unsigned long) page_address(page));
+
+   return 0;
+}
+
 static long tce_iommu_build(struct tce_container *container,
struct iommu_table *tbl,
unsigned long entry, unsigned long tce, unsigned long pages)
 {
long i, ret = 0;
-   struct page *page = NULL;
-   unsigned long hva;
+   struct page *page;
+   unsigned long hpa;
enum dma_data_direction direction = iommu_tce_direction(tce);
 
for (i = 0; i < pages; ++i) {
unsigned long offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK;
 
-   ret = get_user_pages_fast(tce & PAGE_MASK, 1,
-   direction != DMA_TO_DEVICE, &page);
-   if (unlikely(ret != 1)) {
-   ret = -EFAULT;
+   ret = tce_iommu_use_page(tce, &hpa);
+   if (ret)
break;
-   }
 
+   page = pfn_to_page(hpa >> PAGE_SHIFT);
if (!tce_page_is_contained(page, tbl->it_page_shift)) {
ret = -EPERM;
break;
}
 
-   hva = (unsigned long) page_address(page) + offset;
-
-   ret = iommu_tce_build(tbl, entry + i, hva, direction);
+   hpa |= offset;
+   ret = iommu_tce_build(tbl, entry + i, (unsigned long) __va(hpa),
+   direction);
   

[PATCH kernel v12 12/34] vfio: powerpc/spapr: Rework groups attaching

2015-06-04 Thread Alexey Kardashevskiy
This is to make extended ownership and multiple groups support patches
simpler for review.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 40 ++---
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index cf5d4a1..e65bc73 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -460,16 +460,21 @@ static int tce_iommu_attach_group(void *iommu_data,
iommu_group_id(container->tbl->it_group),
iommu_group_id(iommu_group));
ret = -EBUSY;
-   } else if (container->enabled) {
+   goto unlock_exit;
+   }
+
+   if (container->enabled) {
pr_err("tce_vfio: attaching group #%u to enabled container\n",
iommu_group_id(iommu_group));
ret = -EBUSY;
-   } else {
-   ret = iommu_take_ownership(tbl);
-   if (!ret)
-   container->tbl = tbl;
+   goto unlock_exit;
}
 
+   ret = iommu_take_ownership(tbl);
+   if (!ret)
+   container->tbl = tbl;
+
+unlock_exit:
mutex_unlock(&container->lock);
 
return ret;
@@ -487,19 +492,22 @@ static void tce_iommu_detach_group(void *iommu_data,
pr_warn("tce_vfio: detaching group #%u, expected group is 
#%u\n",
iommu_group_id(iommu_group),
iommu_group_id(tbl->it_group));
-   } else {
-   if (container->enabled) {
-   pr_warn("tce_vfio: detaching group #%u from enabled 
container, forcing disable\n",
-   iommu_group_id(tbl->it_group));
-   tce_iommu_disable(container);
-   }
+   goto unlock_exit;
+   }
 
-   /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
-   iommu_group_id(iommu_group), iommu_group); */
-   container->tbl = NULL;
-   tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
-   iommu_release_ownership(tbl);
+   if (container->enabled) {
+   pr_warn("tce_vfio: detaching group #%u from enabled container, 
forcing disable\n",
+   iommu_group_id(tbl->it_group));
+   tce_iommu_disable(container);
}
+
+   /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
+  iommu_group_id(iommu_group), iommu_group); */
+   container->tbl = NULL;
+   tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+   iommu_release_ownership(tbl);
+
+unlock_exit:
mutex_unlock(&container->lock);
 }
 
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel v12 29/34] powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE release

2015-06-04 Thread Alexey Kardashevskiy
The existing code programmed TVT#0 with some address and then
immediately released that memory.

This makes use of pnv_pci_ioda2_unset_window() and
pnv_pci_ioda2_set_bypass() which do correct resource release and
TVT update.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 25 ++---
 1 file changed, 6 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index ace0302..612ab23 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1288,34 +1288,21 @@ m64_failed:
return -EBUSY;
 }
 
+static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
+   int num);
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
+
 static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct 
pnv_ioda_pe *pe)
 {
-   struct pci_bus*bus;
-   struct pci_controller *hose;
-   struct pnv_phb*phb;
struct iommu_table*tbl;
-   unsigned long addr;
int64_t   rc;
 
-   bus = dev->bus;
-   hose = pci_bus_to_host(bus);
-   phb = hose->private_data;
tbl = pe->table_group.tables[0];
-   addr = tbl->it_base;
-
-   opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
-  pe->pe_number << 1, 1, __pa(addr),
-  0, 0x1000);
-
-   rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id,
-   pe->pe_number,
-   (pe->pe_number << 1) + 1,
-   pe->tce_bypass_base,
-   0);
+   rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0);
if (rc)
pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
 
-   pnv_pci_unlink_table_and_group(tbl, &pe->table_group);
+   pnv_pci_ioda2_set_bypass(pe, false);
if (pe->table_group.group) {
iommu_group_put(pe->table_group.group);
BUG_ON(pe->table_group.group);
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel v12 05/34] powerpc/iommu: Always release iommu_table in iommu_free_table()

2015-06-04 Thread Alexey Kardashevskiy
At the moment iommu_free_table() only releases memory if
the table was initialized for platform code use, i.e. it had
it_map initialized (whose purpose is to track DMA memory space use).

With dynamic DMA windows, we will need to be able to release
an iommu_table even if it was used for VFIO, in which case it_map is
NULL; this is what the patch does.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v11:
* fixed parameter checks
---
 arch/powerpc/kernel/iommu.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 3d47eb3..73eb39a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -713,9 +713,11 @@ void iommu_free_table(struct iommu_table *tbl, const char 
*node_name)
unsigned long bitmap_sz;
unsigned int order;
 
-   if (!tbl || !tbl->it_map) {
-   printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
-   node_name);
+   if (!tbl)
+   return;
+
+   if (!tbl->it_map) {
+   kfree(tbl);
return;
}
 
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel v12 28/34] vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API

2015-06-04 Thread Alexey Kardashevskiy
This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA windows management.

create_table() creates a TCE table with specific parameters.
it receives iommu_table_group to know nodeid in order to allocate
TCE table memory closer to the PHB. The exact format of allocated
multi-level table might be also specific to the PHB model (not
the case now though).
This callback calculates the DMA window offset on a PCI bus from @num
and stores it in a just created table.

set_window() sets the window at specified TVT index + @num on PHB.

unset_window() unsets the window from specified TVT.

This adds a free() callback to iommu_table_ops to free the memory
(potentially a tree of tables) allocated for the TCE table.

create_table() and free() are supposed to be called once per
VFIO container and set_window()/unset_window() are supposed to be
called for every group in a container.

This adds IOMMU capabilities to iommu_table_group such as default
32bit window parameters and others. This makes use of new values in
vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
advertise pagemasks to the userspace.
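How a pgsizes bitmap like the one added here can be interpreted is sketched below (a hedged illustration: bit N set means pages of size 1 << N are supported; the SZ_* names are assumptions for the sketch, not taken from this patch):

```c
#include <assert.h>
#include <stdint.h>

/* Common page sizes expressed as pgsizes bitmap bits (illustrative) */
#define SZ_4K  (1ULL << 12)
#define SZ_64K (1ULL << 16)
#define SZ_16M (1ULL << 24)

/* Bit N of pgsizes set => IOMMU pages of size 1 << N are supported */
static int page_shift_supported(uint64_t pgsizes, unsigned int shift)
{
	return (pgsizes >> shift) & 1;
}
```

Userspace can walk such a mask to pick the biggest supported page size when requesting a window.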

Signed-off-by: Alexey Kardashevskiy 
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
---
Changes:
v10:
* squashed "vfio: powerpc/spapr: Use 32bit DMA window properties from 
table_group"
into this
* shortened the subject

v9:
* new in the series - to make the next patch simpler
---
 arch/powerpc/include/asm/iommu.h| 19 ++
 arch/powerpc/platforms/powernv/pci-ioda.c   | 96 ++---
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  7 ++-
 drivers/vfio/vfio_iommu_spapr_tce.c | 19 +++---
 4 files changed, 124 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 706cfc0..e554175 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -70,6 +70,7 @@ struct iommu_table_ops {
/* get() returns a physical address */
unsigned long (*get)(struct iommu_table *tbl, long index);
void (*flush)(struct iommu_table *tbl);
+   void (*free)(struct iommu_table *tbl);
 };
 
 /* These are used by VIO */
@@ -146,6 +147,17 @@ extern struct iommu_table *iommu_init_table(struct 
iommu_table * tbl,
 struct iommu_table_group;
 
 struct iommu_table_group_ops {
+   long (*create_table)(struct iommu_table_group *table_group,
+   int num,
+   __u32 page_shift,
+   __u64 window_size,
+   __u32 levels,
+   struct iommu_table **ptbl);
+   long (*set_window)(struct iommu_table_group *table_group,
+   int num,
+   struct iommu_table *tblnew);
+   long (*unset_window)(struct iommu_table_group *table_group,
+   int num);
/* Switch ownership from platform code to external user (e.g. VFIO) */
void (*take_ownership)(struct iommu_table_group *table_group);
/* Switch ownership from external user (e.g. VFIO) back to core */
@@ -159,6 +171,13 @@ struct iommu_table_group_link {
 };
 
 struct iommu_table_group {
+   /* IOMMU properties */
+   __u32 tce32_start;
+   __u32 tce32_size;
+   __u64 pgsizes; /* Bitmap of supported page sizes */
+   __u32 max_dynamic_windows_supported;
+   __u32 max_levels;
+
struct iommu_group *group;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct iommu_table_group_ops *ops;
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index a253dda..ace0302 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1869,6 +1870,12 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, 
long index,
pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false);
 }
 
+static void pnv_ioda2_table_free(struct iommu_table *tbl)
+{
+   pnv_pci_ioda2_table_free_pages(tbl);
+   iommu_free_table(tbl, "pnv");
+}
+
 static struct iommu_table_ops pnv_ioda2_iommu_ops = {
.set = pnv_ioda2_tce_build,
 #ifdef CONFIG_IOMMU_API
@@ -1876,6 +1883,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = {
 #endif
.clear = pnv_ioda2_tce_free,
.get = pnv_tce_get,
+   .free = pnv_ioda2_table_free,
 };
 
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
@@ -1946,6 +1954,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 TCE_PCI_SWINV_PAIR);
 
tbl->it_ops = &pnv_ioda1_iommu_ops;
+   pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
+   pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
iommu_init_table(tbl, phb->hose->node);
 
if (pe->flags & PNV_IODA_PE_DEV) {
@@ -1

[PATCH kernel v12 00/34] powerpc/iommu/vfio: Enable Dynamic DMA windows

2015-06-04 Thread Alexey Kardashevskiy

This enables sPAPR defined feature called Dynamic DMA windows (DDW).

Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
where devices are allowed to do DMA. These ranges are called DMA windows.
By default, there is a single DMA window, 1 or 2GB big, mapped at zero
on a PCI bus.

Hi-speed devices may suffer from the limited size of the window.
Recent host kernels use a TCE bypass window on the POWER8 CPU which implements
direct PCI bus address range mapping (with offset of 1<<59) to the host memory.

For guests, PAPR defines a DDW RTAS API which allows pseries guests
to query the hypervisor about DDW support and capabilities (page size mask
for now). A pseries guest may request an additional (to the default)
DMA windows using this RTAS API.
Existing pseries Linux guests request an additional window as big as
the guest RAM and map the entire window, which effectively creates
a direct mapping of the guest memory to the PCI bus.

The multiple DMA windows feature is supported by POWER7/POWER8 CPUs; however
this patchset only adds support for POWER8 as TCE tables are implemented
in POWER7 in a quite different way and POWER7 is not the highest priority.

This patchset reworks PPC64 IOMMU code and adds necessary structures
to support big windows.

Once a Linux guest discovers the presence of DDW, it does:
1. query hypervisor about number of available windows and page size masks;
2. create a window with the biggest possible page size (today 4K/64K/16M);
3. map the entire guest RAM via H_PUT_TCE* hypercalls;
4. switch dma_ops to direct_dma_ops on the selected PE.

Once this is done, H_PUT_TCE is not called anymore for 64bit devices and
the guest does not waste time on DMA map/unmap operations.
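The arithmetic behind steps 2 and 3 above can be sketched as follows, assuming 8-byte TCE entries (a back-of-the-envelope model, not kernel code):

```c
#include <assert.h>

/* One 64-bit TCE per IOMMU page: bytes of TCE table needed to map
 * ram_bytes with a single window of the given page size (illustrative). */
static unsigned long long tce_table_bytes(unsigned long long ram_bytes,
					  unsigned int page_shift)
{
	return (ram_bytes >> page_shift) * 8;
}
```

This is why the guest prefers the biggest page size the hypervisor offers: 16M pages shrink the table (and the pinning work) by a factor of 4096 relative to 4K pages.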

Note that 32bit devices won't use DDW and will keep using the default
DMA window so KVM optimizations will be required (to be posted later).

This is pushed to g...@github.com:aik/linux.git
 + 93b347697...5ba9cbd vfio-for-github -> vfio-for-github (forced update)

The pushed branch contains all patches from this patchset and KVM
acceleration patches as well to give an idea about the current state
of in-kernel acceleration support.

Changes:
v12:
* fixed few issues in multilevel TCE tables
* fixed locked_vm counting in "userspace-to-physical addresses translation 
cache"
* fixed some commit logs
* rebased on 4.1-rc6

v11:
* reworked locking in pinned pages cache

v10:
* fixed and tested on an SRIOV system
* fixed multiple comments from David
* added bunch of iommu device attachment reworks

v9:
* rebased on top of SRIOV (which is in upstream now)
* fixed multiple comments from David
* reworked ownership patches
* removed vfio: powerpc/spapr: Do cleanup when releasing the group (used to be 
#2)
as updated #1 should do this
* moved "powerpc/powernv: Implement accessor to TCE entry" to a separate patch
* added a patch which moves TCE Kill register address to PE from IOMMU table

v8:
* fixed a bug in error fallback in "powerpc/mmu: Add userspace-to-physical
addresses translation cache"
* fixed subject in "vfio: powerpc/spapr: Check that IOMMU page is fully
contained by system page"
* moved v2 documentation to the correct patch
* added checks for failed vzalloc() in "powerpc/iommu: Add userspace view
of TCE table"

v7:
* moved memory preregistration to the current process's MMU context
* added code preventing unregistration if some pages are still mapped;
for this, a userspace view of the table is stored in iommu_table
* added locked_vm counting for DDW tables (including userspace view of those)

v6:
* fixed a bunch of errors in "vfio: powerpc/spapr: Support Dynamic DMA windows"
* moved static IOMMU properties from iommu_table_group to iommu_table_group_ops

v5:
* added SPAPR_TCE_IOMMU_v2 to tell the userspace that there is a memory
pre-registration feature
* added backward compatibility
* renamed few things (mostly powerpc_iommu -> iommu_table_group)

v4:
* moved patches around to have VFIO and PPC patches separated as much as
possible
* now works with the existing upstream QEMU

v3:
* redesigned the whole thing
* multiple IOMMU groups per PHB -> one PHB is needed for VFIO in the guest ->
no problems with locked_vm counting; also we save memory on actual tables
* guest RAM preregistration is required for DDW
* PEs (IOMMU groups) are passed to VFIO with no DMA windows at all so
we do not bother with iommu_table::it_map anymore
* added multilevel TCE tables support to support really huge guests

v2:
* added missing __pa() in "powerpc/powernv: Release replaced TCE"
* reposted to make some noise




Alexey Kardashevskiy (34):
  powerpc/eeh/ioda2: Use device::iommu_group to check IOMMU group
  powerpc/iommu/powernv: Get rid of set_iommu_table_base_and_group
  powerpc/powernv/ioda: Clean up IOMMU group registration
  powerpc/iommu: Put IOMMU group explicitly
  powerpc/iommu: Always release iommu_table in iommu_free_table()
  vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU
driver

[PATCH kernel v12 24/34] powerpc/powernv/ioda2: Rework iommu_table creation

2015-06-04 Thread Alexey Kardashevskiy
This moves iommu_table creation to the beginning to make following changes
easier to review. This starts using table parameters from the iommu_table
struct.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v9:
* updated commit log and did minor cleanup
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 19d89dc..95d3121 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2070,13 +2070,23 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
addr = page_address(tce_mem);
memset(addr, 0, tce_table_size);
 
+   /* Setup linux iommu table */
+   pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
+   IOMMU_PAGE_SHIFT_4K);
+
+   tbl->it_ops = &pnv_ioda2_iommu_ops;
+   iommu_init_table(tbl, phb->hose->node);
+#ifdef CONFIG_IOMMU_API
+   pe->table_group.ops = &pnv_pci_ioda2_ops;
+#endif
+
/*
 * Map TCE table through TVT. The TVE index is the PE number
 * shifted by 1 bit for 32-bits DMA space.
 */
rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number,
-   pe->pe_number << 1, 1, __pa(addr),
-   tce_table_size, 0x1000);
+   pe->pe_number << 1, 1, __pa(tbl->it_base),
+   tbl->it_size << 3, 1ULL << tbl->it_page_shift);
if (rc) {
pe_err(pe, "Failed to configure 32-bit TCE table,"
   " err %ld\n", rc);
@@ -2085,20 +2095,10 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
 
pnv_pci_ioda2_tce_invalidate_entire(pe);
 
-   /* Setup linux iommu table */
-   pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0,
-   IOMMU_PAGE_SHIFT_4K);
-
/* OPAL variant of PHB3 invalidated TCEs */
if (phb->ioda.tce_inval_reg)
tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE);
 
-   tbl->it_ops = &pnv_ioda2_iommu_ops;
-   iommu_init_table(tbl, phb->hose->node);
-#ifdef CONFIG_IOMMU_API
-   pe->table_group.ops = &pnv_pci_ioda2_ops;
-#endif
-
if (pe->flags & PNV_IODA_PE_DEV) {
/*
 * Setting table base here only for carrying iommu_group
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel v12 21/34] powerpc/powernv/ioda2: Add TCE invalidation for all attached groups

2015-06-04 Thread Alexey Kardashevskiy
The iommu_table struct keeps a list of IOMMU groups it is used for.
At the moment there is just a single group attached but further
patches will add TCE table sharing. When sharing is enabled, the TCE cache
in each PE needs to be invalidated; this is what the patch does.

This does not change pnv_pci_ioda1_tce_invalidate() as there is no plan
to enable TCE table sharing on PHBs older than IODA2.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Reviewed-by: Gavin Shan 
---
Changes:
v10:
* new to the series
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 35 ---
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 3fd8b18..88a799a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1764,23 +1765,15 @@ static inline void 
pnv_pci_ioda2_tce_invalidate_entire(struct pnv_ioda_pe *pe)
__raw_writeq(cpu_to_be64(val), phb->ioda.tce_inval_reg);
 }
 
-static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
-   unsigned long index, unsigned long npages, bool rm)
+static void pnv_pci_ioda2_do_tce_invalidate(unsigned pe_number, bool rm,
+   __be64 __iomem *invalidate, unsigned shift,
+   unsigned long index, unsigned long npages)
 {
-   struct iommu_table_group_link *tgl = list_first_entry_or_null(
-   &tbl->it_group_list, struct iommu_table_group_link,
-   next);
-   struct pnv_ioda_pe *pe = container_of(tgl->table_group,
-   struct pnv_ioda_pe, table_group);
unsigned long start, end, inc;
-   __be64 __iomem *invalidate = rm ?
-   (__be64 __iomem *)pe->phb->ioda.tce_inval_reg_phys :
-   pe->phb->ioda.tce_inval_reg;
-   const unsigned shift = tbl->it_page_shift;
 
/* We'll invalidate DMA address in PE scope */
start = 0x2ull << 60;
-   start |= (pe->pe_number & 0xFF);
+   start |= (pe_number & 0xFF);
end = start;
 
/* Figure out the start, end and step */
@@ -1798,6 +1791,24 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
iommu_table *tbl,
}
 }
 
+static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
+   unsigned long index, unsigned long npages, bool rm)
+{
+   struct iommu_table_group_link *tgl;
+
+   list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
+   struct pnv_ioda_pe *pe = container_of(tgl->table_group,
+   struct pnv_ioda_pe, table_group);
+   __be64 __iomem *invalidate = rm ?
+   (__be64 __iomem *)pe->phb->ioda.tce_inval_reg_phys :
+   pe->phb->ioda.tce_inval_reg;
+
+   pnv_pci_ioda2_do_tce_invalidate(pe->pe_number, rm,
+   invalidate, tbl->it_page_shift,
+   index, npages);
+   }
+}
+
 static int pnv_ioda2_tce_build(struct iommu_table *tbl, long index,
long npages, unsigned long uaddr,
enum dma_data_direction direction,
-- 
2.4.0.rc3.8.gfb3e7d5



[PATCH kernel v12 02/34] powerpc/iommu/powernv: Get rid of set_iommu_table_base_and_group

2015-06-04 Thread Alexey Kardashevskiy
The set_iommu_table_base_and_group() name suggests that the function
sets a table base and adds a device to an IOMMU group.

The actual purpose of the table base setting is to put a reference
into a device so that iommu_add_device() can later get the IOMMU group
reference and add the device to the group.

At the moment a group cannot be explicitly passed to iommu_add_device()
as we want it to work from the bus notifier; we can fix this later and
remove the confusing calls of set_iommu_table_base().

This replaces set_iommu_table_base_and_group() with a pair of
set_iommu_table_base() + iommu_add_device() calls, which makes the code
easier to read.

This adds a few comments explaining why set_iommu_table_base() and
iommu_add_device() are called where they are called.

For IODA1/2, this essentially removes the iommu_add_device() call from
pnv_pci_ioda_dma_dev_setup() as it will always fail at this particular
place:
- for a physical PE, the device is already attached by iommu_add_device()
in pnv_pci_ioda_setup_dma_pe();
- for a virtual PE, the sysfs entries are not ready to create all symlinks,
so the actual adding happens in tce_iommu_bus_notifier.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: Gavin Shan 
Reviewed-by: David Gibson 
---
Changes:
v10:
* new to the series
---
 arch/powerpc/include/asm/iommu.h|  7 ---
 arch/powerpc/platforms/powernv/pci-ioda.c   | 27 +++
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  3 ++-
 arch/powerpc/platforms/pseries/iommu.c  | 15 ---
 4 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 1e27d63..8353c86 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -140,13 +140,6 @@ static inline int __init tce_iommu_bus_notifier_init(void)
 }
 #endif /* !CONFIG_IOMMU_API */
 
-static inline void set_iommu_table_base_and_group(struct device *dev,
- void *base)
-{
-   set_iommu_table_base(dev, base);
-   iommu_add_device(dev);
-}
-
 extern int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
struct scatterlist *sglist, int nelems,
unsigned long mask,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2f092bb..9a77f3c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1598,7 +1598,13 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb 
*phb, struct pci_dev *pdev
 
pe = &phb->ioda.pe_array[pdn->pe_number];
WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-   set_iommu_table_base_and_group(&pdev->dev, pe->tce32_table);
+   set_iommu_table_base(&pdev->dev, pe->tce32_table);
+   /*
+* Note: iommu_add_device() will fail here as
+* for physical PE: the device is already added by now;
+* for virtual PE: sysfs entries are not ready yet and
+* tce_iommu_bus_notifier will add the device to a group later.
+*/
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -1659,7 +1665,8 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
struct pci_dev *dev;
 
list_for_each_entry(dev, &bus->devices, bus_list) {
-   set_iommu_table_base_and_group(&dev->dev, pe->tce32_table);
+   set_iommu_table_base(&dev->dev, pe->tce32_table);
+   iommu_add_device(&dev->dev);
 
if (dev->subordinate)
pnv_ioda_setup_bus_dma(pe, dev->subordinate);
@@ -1835,7 +1842,13 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
*phb,
if (pe->flags & PNV_IODA_PE_DEV) {
iommu_register_group(tbl, phb->hose->global_number,
 pe->pe_number);
-   set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
+   /*
+* Setting table base here only for carrying iommu_group
+* further down to let iommu_add_device() do the job.
+* pnv_pci_ioda_dma_dev_setup will override it later anyway.
+*/
+   set_iommu_table_base(&pe->pdev->dev, tbl);
+   iommu_add_device(&pe->pdev->dev);
} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
iommu_register_group(tbl, phb->hose->global_number,
 pe->pe_number);
@@ -1963,7 +1976,13 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
if (pe->flags & PNV_IODA_PE_DEV) {
iommu_register_group(tbl, phb->hose->global_number,
 pe->pe_number);
-   set_iommu_table_base_and_group(&pe->pdev->dev, tbl);
+   /*
+* Setting table base here only for carrying iommu_group
+* further down to let iommu_add_device()

[PATCH kernel v12 17/34] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group

2015-06-04 Thread Alexey Kardashevskiy
So far one TCE table could only be used by one IOMMU group. However
IODA2 hardware allows programming the same TCE table address into
multiple PEs, allowing tables to be shared.

This replaces the single pointer to a group in the iommu_table struct
with a linked list of groups, which provides a way of invalidating
the TCE cache for every PE when an actual TCE table is updated. This adds
pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
helpers to manage the list. However, without VFIO it is still going
to be a single IOMMU group per iommu_table.

This changes iommu_add_device() to add a device to the first group
from the group list of a table, as it is only called from the platform
init code or the PCI bus notifier, and at these moments there is only
one group per table.

This does not change the TCE invalidation code to loop through all
attached groups, in order to simplify this patch and because
it is not really needed in most cases. IODA2 is fixed in a later
patch.

This should cause no behavioural change.
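
A toy user-space model (all names hypothetical) of the lookup
iommu_add_device() now performs — take the first link of the table's group
list, or skip the device when the list is empty (the
list_first_entry_or_null() case mentioned in the v12 changelog):

```c
#include <stddef.h>

/* Stand-ins for iommu_table_group_link / it_group_list (toy model). */
struct toy_group_link {
	int group_id;
	struct toy_group_link *next;
};

struct toy_table {
	struct toy_group_link *group_list;	/* it_group_list analogue */
};

/* Returns the id of the first attached group, or -1 when no group is
 * linked yet - mirroring the "skip device with no group" path. */
int toy_first_group_id(const struct toy_table *tbl)
{
	return tbl->group_list ? tbl->group_list->group_id : -1;
}
```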

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: Gavin Shan 
---
Changes:
v12:
* fixed iommu_add_device() to check what list_first_entry_or_null()
returned
* changed commit log
* removed loops from iommu_pseries_free_group as it does not support
tables sharing anyway

v10:
* iommu_table is not embedded into iommu_table_group but allocated
dynamically
* iommu_table allocation is moved to a single place for IODA2's
pnv_pci_ioda_setup_dma_pe where it belongs to
* added list of groups into iommu_table; most of the code just looks at
the first item to keep the patch simpler

v9:
* s/it_group/it_table_group/
* added and used iommu_table_group_free(), from now iommu_free_table()
is only used for VIO
* added iommu_pseries_group_alloc()
* squashed "powerpc/iommu: Introduce iommu_table_alloc() helper" into this
---
 arch/powerpc/include/asm/iommu.h|   8 +-
 arch/powerpc/kernel/iommu.c |  14 +++-
 arch/powerpc/platforms/powernv/pci-ioda.c   |  45 ++
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   3 +
 arch/powerpc/platforms/powernv/pci.c|  76 +
 arch/powerpc/platforms/powernv/pci.h|   7 ++
 arch/powerpc/platforms/pseries/iommu.c  |  25 +-
 drivers/vfio/vfio_iommu_spapr_tce.c | 122 
 8 files changed, 240 insertions(+), 60 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 5a7267f..44a20cc 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -91,7 +91,7 @@ struct iommu_table {
struct iommu_pool pools[IOMMU_NR_POOLS];
unsigned long *it_map;   /* A simple allocation bitmap for now */
unsigned long  it_page_shift;/* table iommu page size */
-   struct iommu_table_group *it_table_group;
+   struct list_head it_group_list;/* List of iommu_table_group_link */
struct iommu_table_ops *it_ops;
void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
@@ -126,6 +126,12 @@ extern struct iommu_table *iommu_init_table(struct 
iommu_table * tbl,
int nid);
 #define IOMMU_TABLE_GROUP_MAX_TABLES   1
 
+struct iommu_table_group_link {
+   struct list_head next;
+   struct rcu_head rcu;
+   struct iommu_table_group *table_group;
+};
+
 struct iommu_table_group {
struct iommu_group *group;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 719f048..be258b2 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1078,6 +1078,7 @@ EXPORT_SYMBOL_GPL(iommu_release_ownership);
 int iommu_add_device(struct device *dev)
 {
struct iommu_table *tbl;
+   struct iommu_table_group_link *tgl;
 
/*
 * The sysfs entries should be populated before
@@ -1095,15 +1096,22 @@ int iommu_add_device(struct device *dev)
}
 
tbl = get_iommu_table_base(dev);
-   if (!tbl || !tbl->it_table_group || !tbl->it_table_group->group) {
+   if (!tbl) {
pr_debug("%s: Skipping device %s with no tbl\n",
 __func__, dev_name(dev));
return 0;
}
 
+   tgl = list_first_entry_or_null(&tbl->it_group_list,
+   struct iommu_table_group_link, next);
+   if (!tgl) {
+   pr_debug("%s: Skipping device %s with no group\n",
+__func__, dev_name(dev));
+   return 0;
+   }
pr_debug("%s: Adding %s to iommu group %d\n",
 __func__, dev_name(dev),
-iommu_group_id(tbl->it_table_group->group));
+iommu_group_id(tgl->table_group->group));
 
if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) {
pr_err("%s: Invalid IOMMU page size %lx (%lx) on %s\n",
@@ -1112,7 +1

[PATCH kernel v12 33/34] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-06-04 Thread Alexey Kardashevskiy
The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to get worse with multiple
containers and huge DMA windows. Also, real-time accounting would require
additional tracking of accounted pages due to the page size difference -
the IOMMU uses 4K pages while the system uses 4K or 64K pages.

Another issue is that the actual page pinning/unpinning happens on every
DMA map/unmap request. This does not affect performance much now as
we spend way too much time switching context between
guest/userspace/host, but it will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
The new IOMMU splits physical page pinning and TCE table updates
into 2 different operations. It requires:
1) guest pages to be registered first
2) consequent map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest RAM
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required; otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA windows and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell which
region the just-cleared TCE came from.

This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).

As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support
DDW API), this creates a default DMA window for IODA2 for consistency.
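
The unregistration rule described above can be modelled in a few lines of
user-space C (all names hypothetical; the real code lives in the kernel's
mm_iommu_* helpers):

```c
#include <stdbool.h>

/* Toy pre-registered region: "mapped" counts TCEs currently pointing
 * into it (the real counter is bumped via mm_iommu_mapped_inc()). */
struct toy_region {
	unsigned long ua;	/* userspace address */
	unsigned long entries;
	long mapped;
};

void toy_region_map(struct toy_region *r)   { r->mapped++; }
void toy_region_unmap(struct toy_region *r) { if (r->mapped > 0) r->mapped--; }

/* Unregistration is refused while any TCE still uses the region. */
bool toy_region_unregister(const struct toy_region *r)
{
	return r->mapped == 0;
}
```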

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
---
Changes:
v12:
* tce_iommu_unregister_pages() is fixed to use mm_iommu_find() which
enforces the requirement to unregister exactly the same region which
was registered (not overlapped)
* added a clause about creating default DMA window on IODA2

v11:
* mm_iommu_put() does not return a code so this does not check it
* moved "v2" in tce_container to pack the struct

v10:
* moved it_userspace allocation to vfio_iommu_spapr_tce as it VFIO
specific thing
* squashed "powerpc/iommu: Add userspace view of TCE table" into this as
it is
a part of IOMMU v2
* s/tce_iommu_use_page_v2/tce_iommu_prereg_ua_to_hpa/
* fixed some function names to have "tce_iommu_" in the beginning rather
just "tce_"
* as mm_iommu_mapped_inc() can now fail, check for the return code

v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/

v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioclts return -ENOTTY instead of -EPERM

v6:
* tce_get_hva_cached() returns hva via a pointer

v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
and removed duplicating vfio_iommu_spapr_register_memory
---
 Documentation/vfio.txt  |  31 ++-
 arch/powerpc/include/asm/iommu.h|   6 +
 drivers/vfio/vfio_iommu_spapr_tce.c | 513 ++--
 include/uapi/linux/vfio.h   |  27 ++
 4 files changed, 488 insertions(+), 89 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..7dcf2b5 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
 
 This implementation has some specifics:
 
-1) Only one IOMMU group per container is supported as an IOMMU group
-represents the minimal entity which isolation can be guaranteed for and
-groups are allocated statically, one per a Partitionable Endpoint (PE)
+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
+container is supported as an IOMMU table is allocated at the boot time,
+one table per a IOMMU group which is a Partitionable Endpoint (PE)
 (PE is often a PCI domain but not always).
+Newer systems (POWER8 with IODA2) have improved 

[PATCH kernel v12 34/34] vfio: powerpc/spapr: Support Dynamic DMA windows

2015-06-04 Thread Alexey Kardashevskiy
This adds create/remove window ioctls to create and remove DMA windows.
sPAPR defines a Dynamic DMA windows capability which allows
para-virtualized guests to create additional DMA windows on a PCI bus.
The existing linux kernels use this new window to map the entire guest
memory and switch to the direct DMA operations saving time on map/unmap
requests which would normally happen in a big amounts.

This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
Up to 2 windows are supported now by the hardware and by this driver.

This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
information such as a number of supported windows and maximum number
levels of TCE tables.

DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature
as we still want to support v2 on platforms which cannot do DDW for
the sake of TCE acceleration in KVM (coming soon).

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
Reviewed-by: David Gibson 
---
Changes:
v7:
* s/VFIO_IOMMU_INFO_DDW/VFIO_IOMMU_SPAPR_INFO_DDW/
* fixed typos in and updated vfio.txt
* fixed VFIO_IOMMU_SPAPR_TCE_GET_INFO handler
* moved ddw properties to vfio_iommu_spapr_tce_ddw_info

v6:
* added explicit VFIO_IOMMU_INFO_DDW flag to vfio_iommu_spapr_tce_info,
it used to be page mask flags from platform code
* added explicit pgsizes field
* added cleanup if tce_iommu_create_window() failed in a middle
* added checks for callbacks in tce_iommu_create_window and remove those
from tce_iommu_remove_window when it is too late to test anyway
* spapr_tce_find_free_table returns sensible error code now
* updated description of VFIO_IOMMU_SPAPR_TCE_CREATE/
VFIO_IOMMU_SPAPR_TCE_REMOVE

v4:
* moved code to tce_iommu_create_window()/tce_iommu_remove_window()
helpers
* added docs
---
 Documentation/vfio.txt  |  19 
 arch/powerpc/include/asm/iommu.h|   2 +-
 drivers/vfio/vfio_iommu_spapr_tce.c | 196 +++-
 include/uapi/linux/vfio.h   |  61 ++-
 4 files changed, 273 insertions(+), 5 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 7dcf2b5..8b1ec51 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -452,6 +452,25 @@ address is from pre-registered range.
 
 This separation helps in optimizing DMA for guests.
 
+6) The sPAPR specification allows guests to have additional DMA window(s) on
+a PCI bus with a variable page size. Two ioctls have been added to support
+this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
+The platform has to support the functionality or an error will be returned to
+the userspace. The existing hardware supports up to 2 DMA windows: one is
+2GB long, uses 4K pages and is called the "default 32bit window"; the other
+can be as big as the entire RAM, uses a different page size, and is optional -
+guests create it at run-time if the guest driver supports 64bit DMA.
+
+VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
+a number of TCE table levels (if a TCE table is going to be big enough and
+the kernel may not be able to allocate enough of physically contiguous memory).
+It creates a new window in the available slot and returns the bus address where
+the new window starts. Due to hardware limitation, the user space cannot choose
+the location of DMA windows.
+
+VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
+and removes it.
+
 ---
 
 [1] VFIO was originally an acronym for "Virtual Function I/O" in its
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f9957eb..ca18cff 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -149,7 +149,7 @@ extern void iommu_free_table(struct iommu_table *tbl, const 
char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
int nid);
-#define IOMMU_TABLE_GROUP_MAX_TABLES   1
+#define IOMMU_TABLE_GROUP_MAX_TABLES   2
 
 struct iommu_table_group;
 
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 91a3223..0582b72 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -211,6 +211,18 @@ static long tce_iommu_find_table(struct tce_container 
*container,
return -1;
 }
 
+static int tce_iommu_find_free_table(struct tce_container *container)
+{
+   int i;
+
+   for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+   if (!container->tables[i])
+   return i;
+   }
+
+   return -ENOSPC;
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
int ret = 0;
@@ -593,11 +605,115 @@ static void tce_iommu_free_table(struct iommu_table *tbl)
decrement_locked_

[PATCH kernel v12 01/34] powerpc/eeh/ioda2: Use device::iommu_group to check IOMMU group

2015-06-04 Thread Alexey Kardashevskiy
This relies on the fact that a PCI device always has an IOMMU table
which may not be the case when we get dynamic DMA windows so
let's use more reliable check for IOMMU group here.

As we do not rely on the table presence here, remove the workaround
from pnv_pci_ioda2_set_bypass(); also remove the @add_to_iommu_group
parameter from pnv_ioda_setup_bus_dma().

Signed-off-by: Alexey Kardashevskiy 
Acked-by: Gavin Shan 
Reviewed-by: David Gibson 
---
 arch/powerpc/kernel/eeh.c |  4 +---
 arch/powerpc/platforms/powernv/pci-ioda.c | 27 +--
 2 files changed, 6 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 9ee61d1..defd874 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1412,13 +1412,11 @@ static int dev_has_iommu_table(struct device *dev, void 
*data)
 {
struct pci_dev *pdev = to_pci_dev(dev);
struct pci_dev **ppdev = data;
-   struct iommu_table *tbl;
 
if (!dev)
return 0;
 
-   tbl = get_iommu_table_base(dev);
-   if (tbl && tbl->it_group) {
+   if (dev->iommu_group) {
*ppdev = pdev;
return 1;
}
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index f8bc950..2f092bb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1654,21 +1654,15 @@ static u64 pnv_pci_ioda_dma_get_required_mask(struct 
pnv_phb *phb,
 }
 
 static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
-  struct pci_bus *bus,
-  bool add_to_iommu_group)
+  struct pci_bus *bus)
 {
struct pci_dev *dev;
 
list_for_each_entry(dev, &bus->devices, bus_list) {
-   if (add_to_iommu_group)
-   set_iommu_table_base_and_group(&dev->dev,
-  pe->tce32_table);
-   else
-   set_iommu_table_base(&dev->dev, pe->tce32_table);
+   set_iommu_table_base_and_group(&dev->dev, pe->tce32_table);
 
if (dev->subordinate)
-   pnv_ioda_setup_bus_dma(pe, dev->subordinate,
-  add_to_iommu_group);
+   pnv_ioda_setup_bus_dma(pe, dev->subordinate);
}
 }
 
@@ -1845,7 +1839,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
iommu_register_group(tbl, phb->hose->global_number,
 pe->pe_number);
-   pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+   pnv_ioda_setup_bus_dma(pe, pe->pbus);
} else if (pe->flags & PNV_IODA_PE_VF) {
iommu_register_group(tbl, phb->hose->global_number,
 pe->pe_number);
@@ -1882,17 +1876,6 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table 
*tbl, bool enable)
 window_id,
 pe->tce_bypass_base,
 0);
-
-   /*
-* EEH needs the mapping between IOMMU table and group
-* of those VFIO/KVM pass-through devices. We can postpone
-* resetting DMA ops until the DMA mask is configured in
-* host side.
-*/
-   if (pe->pdev)
-   set_iommu_table_base(&pe->pdev->dev, tbl);
-   else
-   pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
}
if (rc)
pe_err(pe, "OPAL error %lld configuring bypass window\n", rc);
@@ -1984,7 +1967,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
} else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) {
iommu_register_group(tbl, phb->hose->global_number,
 pe->pe_number);
-   pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+   pnv_ioda_setup_bus_dma(pe, pe->pbus);
} else if (pe->flags & PNV_IODA_PE_VF) {
iommu_register_group(tbl, phb->hose->global_number,
 pe->pe_number);
-- 
2.4.0.rc3.8.gfb3e7d5



Re: [PATCH] staging: unisys: drop format string in kthread_run

2015-06-04 Thread Sudip Mukherjee
On Thu, Jun 04, 2015 at 11:37:01AM -0700, Kees Cook wrote:
> Calling kthread_run with a single name parameter causes it to be handled
> as a format string. Since the uisthread interface lacks format parameters,
> use "%s" to avoid any potential accidents from callers passing in dynamic
> string content.
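
The hazard the quoted description refers to can be shown with a user-space
analogue (sketch, not the unisys code): any caller-controlled text passed
as the format argument has its '%' sequences interpreted, while forwarding
it through "%s" keeps it literal:

```c
#include <stdio.h>
#include <string.h>

/* Safe pattern: the caller-supplied name goes through "%s", so a '%'
 * inside it (e.g. "vhba%d") is copied verbatim instead of being parsed
 * as a conversion - the same fix kthread_run gets in the patch. */
void format_name_safely(char *dst, size_t n, const char *name)
{
	snprintf(dst, n, "%s", name);
}
```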

The uislib folder has already been deleted in the staging-testing tree.

regards
sudip


Re: [PATCH v3 5/6] usb: chipidea: allow multiple instances to use default ci_default_pdata

2015-06-04 Thread Peter Chen
On Fri, May 29, 2015 at 11:38:45AM -0500, Rob Herring wrote:
> Currently, ci_default_pdata is common to all instances of the driver and
> gets modified by the core driver code. This is bad if there are multiple
> instances of the device with different settings such as the phy type. Fix
> this by making a copy of the default platform_data.
> 
> Signed-off-by: Rob Herring 
> Cc: Peter Chen 
> Cc: Greg Kroah-Hartman 
> Cc: linux-...@vger.kernel.org
> ---
>  drivers/usb/chipidea/ci_hdrc_usb2.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/usb/chipidea/ci_hdrc_usb2.c 
> b/drivers/usb/chipidea/ci_hdrc_usb2.c
> index 45844c9..9eae1a1 100644
> --- a/drivers/usb/chipidea/ci_hdrc_usb2.c
> +++ b/drivers/usb/chipidea/ci_hdrc_usb2.c
> @@ -25,7 +25,7 @@ struct ci_hdrc_usb2_priv {
>   struct clk  *clk;
>  };
>  
> -static struct ci_hdrc_platform_data ci_default_pdata = {
> +static const struct ci_hdrc_platform_data ci_default_pdata = {
>   .capoffset  = DEF_CAPOFFSET,
>   .flags  = CI_HDRC_DISABLE_STREAMING,
>  };
> @@ -37,8 +37,10 @@ static int ci_hdrc_usb2_probe(struct platform_device *pdev)
>   struct ci_hdrc_platform_data *ci_pdata = dev_get_platdata(dev);
>   int ret;
>  
> - if (!ci_pdata)
> - ci_pdata = &ci_default_pdata;
> + if (!ci_pdata) {
> + ci_pdata = devm_kmalloc(dev, sizeof(*ci_pdata), GFP_KERNEL);
> + *ci_pdata = ci_default_pdata;   /* struct copy */
> + }
>  
>   priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>   if (!priv)
> -- 
> 2.1.0
> 
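The per-instance copy can be sketched in user space (hypothetical names;
malloc stands in for devm_kmalloc). Note the sketch also checks the
allocation result before the struct copy, which the hunk above skips:

```c
#include <stdlib.h>

/* Toy platform data: a const default that each probe copies so one
 * instance's modifications cannot leak into another instance. */
struct toy_pdata {
	unsigned long flags;
	unsigned int capoffset;
};

static const struct toy_pdata toy_default_pdata = {
	.flags = 0x1,
	.capoffset = 0x100,
};

/* Returns the given pdata, or a fresh writable copy of the default;
 * NULL on allocation failure (the check the quoted hunk omits). */
struct toy_pdata *toy_probe_pdata(struct toy_pdata *given)
{
	struct toy_pdata *p;

	if (given)
		return given;
	p = malloc(sizeof(*p));
	if (!p)
		return NULL;
	*p = toy_default_pdata;		/* struct copy */
	return p;
}
```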

Acked-by: Peter Chen 

-- 

Best Regards,
Peter Chen


Re: [PATCH v3 6/6] usb: chipidea: add work-around for Marvell HSIC PHY startup

2015-06-04 Thread Peter Chen
On Fri, May 29, 2015 at 11:38:46AM -0500, Rob Herring wrote:
> The Marvell 28nm HSIC PHY requires the port to be forced to HS mode after
> the port power is applied. This is done using the test mode in the PORTSC
> register.
> 
> As HSIC is always HS, this work-around should be safe to do with all HSIC
> PHYs and has been tested on i.MX6S.

tested on i.mx6sx

> 
> Signed-off-by: Rob Herring 
> Tested-by: Peter Chen 
> Cc: Greg Kroah-Hartman 
> Cc: linux-...@vger.kernel.org
> ---
>  drivers/usb/chipidea/host.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/usb/chipidea/host.c b/drivers/usb/chipidea/host.c
> index 21fe1a3..6cf87b8 100644
> --- a/drivers/usb/chipidea/host.c
> +++ b/drivers/usb/chipidea/host.c
> @@ -37,12 +37,14 @@ static int (*orig_bus_suspend)(struct usb_hcd *hcd);
>  
>  struct ehci_ci_priv {
>   struct regulator *reg_vbus;
> + struct ci_hdrc *ci;
>  };
>  
>  static int ehci_ci_portpower(struct usb_hcd *hcd, int portnum, bool enable)
>  {
>   struct ehci_hcd *ehci = hcd_to_ehci(hcd);
>   struct ehci_ci_priv *priv = (struct ehci_ci_priv *)ehci->priv;
> + struct ci_hdrc *ci = priv->ci;
>   struct device *dev = hcd->self.controller;

Forgot to comment last time: ci is the private driver_data for
ci->dev, so you can get it by calling dev_get_drvdata(dev); there is
no need to add an entry to struct ehci_ci_priv.

>   int ret = 0;
>   int port = HCS_N_PORTS(ehci->hcs_params);
> @@ -64,6 +66,15 @@ static int ehci_ci_portpower(struct usb_hcd *hcd, int 
> portnum, bool enable)
>   return ret;
>   }
>   }
> +
> + if (enable && (ci->platdata->phy_mode == USBPHY_INTERFACE_MODE_HSIC)) {
> + /*
> +  * Marvell 28nm HSIC PHY requires forcing the port to HS mode.
> +  * As HSIC is always HS, this should be safe for others.
> +  */
> + hw_port_test_set(ci, 5);
> + hw_port_test_set(ci, 0);
> + }
>   return 0;
>  };
>  
> @@ -112,6 +123,7 @@ static int host_start(struct ci_hdrc *ci)
>  
>   priv = (struct ehci_ci_priv *)ehci->priv;
>   priv->reg_vbus = NULL;
> + priv->ci = ci;
>  
>   if (ci->platdata->reg_vbus && !ci_otg_is_fsm_mode(ci)) {
>   if (ci->platdata->flags & CI_HDRC_TURN_VBUS_EARLY_ON) {
> -- 
> 2.1.0
> 
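The hw_port_test_set(ci, 5) / hw_port_test_set(ci, 0) pair in the hunk
above drives the EHCI PORTSC Port Test Control field; the bit manipulation
can be sketched stand-alone (PTC is bits 19:16 per EHCI, value 5 = Force
Enable HS, 0 = test mode off; treat the exact layout as an assumption
here, not something taken from the patch):

```c
#include <stdint.h>

#define PORTSC_PTC_SHIFT 16
#define PORTSC_PTC_MASK  (0xFu << PORTSC_PTC_SHIFT)

/* Return portsc with the Port Test Control field set to mode,
 * leaving all other bits untouched. */
uint32_t portsc_set_test_mode(uint32_t portsc, unsigned int mode)
{
	portsc &= ~PORTSC_PTC_MASK;
	portsc |= (mode << PORTSC_PTC_SHIFT) & PORTSC_PTC_MASK;
	return portsc;
}
```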

-- 

Best Regards,
Peter Chen


Re: [PATCH 2/3] ARM: rockchip: ensure CPU to enter WIF state

2015-06-04 Thread Kever Yang

Hi Caesar,

Subject typo WIF/WFI.

On 06/05/2015 12:47 PM, Caesar Wang wrote:

In idle mode, cores 1/2/3 of the Cortex-A17 should be either powered off
or in the WFI/WFE state.
We can delay 1 ms to ensure the CPU enters the WFI state.

Signed-off-by: Caesar Wang 
---

  arch/arm/mach-rockchip/platsmp.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/arch/arm/mach-rockchip/platsmp.c b/arch/arm/mach-rockchip/platsmp.c
index 1230d3d..978c357 100644
--- a/arch/arm/mach-rockchip/platsmp.c
+++ b/arch/arm/mach-rockchip/platsmp.c
@@ -316,6 +316,9 @@ static void __init rockchip_smp_prepare_cpus(unsigned int 
max_cpus)
  #ifdef CONFIG_HOTPLUG_CPU
  static int rockchip_cpu_kill(unsigned int cpu)
  {
+   /* ensure CPU can enter the WFI/WFE state */
+   mdelay(1);
+

Does it matter if the core is not in the WFI state when we want to power it down?

Thanks,
- Kever

pmu_set_power_domain(0 + cpu, false);
return 1;
  }
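
If reaching WFI actually matters before the power-down, a bounded poll
would be more robust than a fixed mdelay(1). A user-space sketch of the
pattern (the kernel version would read the SoC's power-status register in
the predicate; every name below is hypothetical):

```c
#include <stdbool.h>

#define WFI_POLL_RETRIES 100

/* Poll cond(arg) up to WFI_POLL_RETRIES times; true once it holds.
 * In kernel code a small udelay() would sit between the reads. */
bool wait_for_wfi(bool (*cond)(void *), void *arg)
{
	int i;

	for (i = 0; i < WFI_POLL_RETRIES; i++)
		if (cond(arg))
			return true;
	return false;
}

/* Demonstration predicate: "core reports WFI" after a few polls. */
struct fake_core { int polls_left; };

bool fake_core_in_wfi(void *arg)
{
	struct fake_core *c = arg;

	return --c->polls_left <= 0;
}
```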




[PATCH v14 02/19] perf, tools, jevents: Program to convert JSON file to C style file

2015-06-04 Thread Sukadev Bhattiprolu
From: Andi Kleen 

This is a modified version of an earlier patch by Andi Kleen.

We expect architectures to describe the performance monitoring events
for each CPU in a corresponding JSON file, which looks like:

[
{
"EventCode": "0x00",
"UMask": "0x01",
"EventName": "INST_RETIRED.ANY",
"BriefDescription": "Instructions retired from execution.",
"PublicDescription": "Instructions retired from execution.",
"Counter": "Fixed counter 1",
"CounterHTOff": "Fixed counter 1",
"SampleAfterValue": "203",
"MSRIndex": "0",
"MSRValue": "0",
"TakenAlone": "0",
"CounterMask": "0",
"Invert": "0",
"AnyThread": "0",
"EdgeDetect": "0",
"PEBS": "0",
"PRECISE_STORE": "0",
"Errata": "null",
"Offcore": "0"
}
]

We also expect the architectures to provide a mapping between individual
CPUs to their JSON files. Eg:

GenuineIntel-6-1E,V1,/NHM-EP/NehalemEP_core_V1.json,core

which maps each CPU, identified by [vendor, family, model, version, type]
to a JSON file.

Given these files, the program jevents:
- locates all JSON files for the architecture,
- parses each JSON file and generates a C-style "PMU-events table"
  (pmu-events.c)
- locates a mapfile for the architecture
- builds a global table, mapping each model of CPU to the
  corresponding PMU-events table.

The 'pmu-events.c' is generated when building perf and added to libperf.a.
The global table pmu_events_map[] table in this pmu-events.c will be used
in perf in a follow-on patch.

If the architecture does not have any JSON files or there is an error in
processing them, an empty mapping file is created. This would allow the
build of perf to proceed even if we are not able to provide aliases for
events.

The parser for JSON files can parse Intel-style JSON event files. This
allows using an Intel event list directly with perf. The Intel event lists
can be quite large and are too big to store in unswappable kernel memory.

The conversion from JSON to C-style is straightforward. The parser knows
(very little) Intel-specific information, and can be easily extended to
handle fields for other CPUs.

The parser code is partially shared with an independent parsing library,
which is 2-clause BSD licensed. To avoid any conflicts I marked those
files as BSD licensed too. As part of perf they become GPLv2.

Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 

v2: Address review feedback. Rename option to --event-files
v3: Add JSON example
v4: Update manpages.
v5: Don't remove dot in fixname. Fix compile error. Add include
protection. Comment realloc.
v6: Include debug/util.h
v7: (Sukadev Bhattiprolu)
Rebase to 4.0 and fix some conflicts.
v8: (Sukadev Bhattiprolu)
Move jevents.[hc] to tools/perf/pmu-events/
Rewrite to locate and process arch specific JSON and "map" files;
and generate a C file.
(Removed acked-by Namhyung Kim due to modest changes to patch)
Compile the generated pmu-events.c and add the pmu-events.o to
libperf.a
v9: [Sukadev Bhattiprolu/Andi Kleen] Rename ->vfm to ->cpuid and use
that field to encode the PVR in Power.
Allow blank lines in mapfile.
[Jiri Olsa] Pass ARCH as a parameter to jevents so we don't have
to detect it.
[Jiri Olsa] Use the infrastructure to build pmu-events/perf
(Makefile changes from Jiri included in this patch).
[Jiri Olsa, Andi Kleen] Detect changes to JSON files and rebuild
pmu-events.o only if necessary.

v11:- [Andi Kleen] Add mapfile, jevents dependency on pmu-events.c
- [Jiri Olsa] Be silent if arch doesn't have JSON files
- Also silence 'jevents' when parsing JSON files unless V=1 is
  specified during build. Cleanup error messages.

v14:-   - [Jiri Olsa] Fix compile error with DEBUG=1; drop unlink() and
  use "w" mode with fopen(); simplify file_name_to_table_name()
---
 tools/perf/Makefile.perf   |   25 +-
 tools/perf/pmu-events/Build|   11 +
 tools/perf/pmu-events/jevents.c|  686 
 tools/perf/pmu-events/jevents.h|   17 +
 tools/perf/pmu-events/json.h   |3 +
 tools/perf/pmu-events/pmu-events.h |   35 ++
 6 files changed, 773 insertions(+), 4 deletions(-)
 create mode 100644 tools/perf/pmu-events/Build
 create mode 100644 tools/perf/pmu-events/jevents.c
 create mode 100644 tools/perf/pmu-events/jevents.h
 create mode 100644 tools/perf/pmu-events/pmu-events.h

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 5816a3b..6a50fc1 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -272,14 +272,29 @@ strip: $(PROGRAMS) $(OUTPUT)perf
 
 PERF_IN := $(OUTPUT)perf-in.o
 
+JEVENTS 

[PATCH v14 04/19] perf, tools: Split perf_pmu__new_alias()

2015-06-04 Thread Sukadev Bhattiprolu
Separate the event parsing code in perf_pmu__new_alias() out into
a separate function, __perf_pmu__new_alias(), so that the code can be
called independently.

This is based on an earlier patch from Andi Kleen.

Signed-off-by: Sukadev Bhattiprolu 
---
 tools/perf/util/pmu.c |   42 +++---
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index c6b16b1..7bcb8c3 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -206,17 +206,12 @@ static int perf_pmu__parse_snapshot(struct perf_pmu_alias 
*alias,
return 0;
 }
 
-static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, 
FILE *file)
+static int __perf_pmu__new_alias(struct list_head *list, char *dir, char *name,
+char *desc __maybe_unused, char *val)
 {
struct perf_pmu_alias *alias;
-   char buf[256];
int ret;
 
-   ret = fread(buf, 1, sizeof(buf), file);
-   if (ret == 0)
-   return -EINVAL;
-   buf[ret] = 0;
-
alias = malloc(sizeof(*alias));
if (!alias)
return -ENOMEM;
@@ -226,26 +221,43 @@ static int perf_pmu__new_alias(struct list_head *list, 
char *dir, char *name, FI
alias->unit[0] = '\0';
alias->per_pkg = false;
 
-   ret = parse_events_terms(&alias->terms, buf);
+   ret = parse_events_terms(&alias->terms, val);
if (ret) {
+   pr_err("Cannot parse alias %s: %d\n", val, ret);
free(alias);
return ret;
}
 
alias->name = strdup(name);
-   /*
-* load unit name and scale if available
-*/
-   perf_pmu__parse_unit(alias, dir, name);
-   perf_pmu__parse_scale(alias, dir, name);
-   perf_pmu__parse_per_pkg(alias, dir, name);
-   perf_pmu__parse_snapshot(alias, dir, name);
+   if (dir) {
+   /*
+* load unit name and scale if available
+*/
+   perf_pmu__parse_unit(alias, dir, name);
+   perf_pmu__parse_scale(alias, dir, name);
+   perf_pmu__parse_per_pkg(alias, dir, name);
+   perf_pmu__parse_snapshot(alias, dir, name);
+   }
 
list_add_tail(&alias->list, list);
 
return 0;
 }
 
+static int perf_pmu__new_alias(struct list_head *list, char *dir, char *name, 
FILE *file)
+{
+   char buf[256];
+   int ret;
+
+   ret = fread(buf, 1, sizeof(buf), file);
+   if (ret == 0)
+   return -EINVAL;
+
+   buf[ret] = 0;
+
+   return __perf_pmu__new_alias(list, dir, name, NULL, buf);
+}
+
 static inline bool pmu_alias_info_file(char *name)
 {
size_t len;
-- 
1.7.9.5



[PATCH v14 05/19] perf, tools: Use pmu_events table to create aliases

2015-06-04 Thread Sukadev Bhattiprolu
At run time (when 'perf' is starting up), locate the specific table
of PMU events that corresponds to the current CPU. Using that table,
create aliases for each of the PMU events in the CPU. Then use
these aliases to parse the user-specified perf event.

In short this would allow the user to specify events using their
aliases rather than raw event codes.

Based on input and some earlier patches from Andi Kleen, Jiri Olsa.

Signed-off-by: Sukadev Bhattiprolu 

Changelog[v4]
- Split off unrelated code into separate patches.
Changelog[v3]
- [Jiri Olsa] Fix memory leak in cpuid
Changelog[v2]
- [Andi Kleen] Replace pmu_events_map->vfm with a generic "cpuid".
---
 tools/perf/util/header.h |1 +
 tools/perf/util/pmu.c|   61 ++
 2 files changed, 62 insertions(+)

diff --git a/tools/perf/util/header.h b/tools/perf/util/header.h
index d4d5796..996e899 100644
--- a/tools/perf/util/header.h
+++ b/tools/perf/util/header.h
@@ -157,4 +157,5 @@ int write_padded(int fd, const void *bf, size_t count, 
size_t count_aligned);
  */
 int get_cpuid(char *buffer, size_t sz);
 
+char *get_cpuid_str(void);
 #endif /* __PERF_HEADER_H */
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 7bcb8c3..7863d05 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -12,6 +12,8 @@
 #include "pmu.h"
 #include "parse-events.h"
 #include "cpumap.h"
+#include "header.h"
+#include "pmu-events/pmu-events.h"
 
 struct perf_pmu_format {
char *name;
@@ -449,6 +451,62 @@ static struct cpu_map *pmu_cpumask(const char *name)
return cpus;
 }
 
+/*
+ * Return the CPU id as a raw string.
+ *
+ * Each architecture should provide a more precise id string that
+ * can be used to match the architecture's "mapfile".
+ */
+char * __weak get_cpuid_str(void)
+{
+   return NULL;
+}
+
+/*
+ * From the pmu_events_map, find the table of PMU events that corresponds
+ * to the current running CPU. Then, add all PMU events from that table
+ * as aliases.
+ */
+static int pmu_add_cpu_aliases(struct list_head *head)
+{
+   int i;
+   struct pmu_events_map *map;
+   struct pmu_event *pe;
+   char *cpuid;
+
+   cpuid = get_cpuid_str();
+   if (!cpuid)
+   return 0;
+
+   i = 0;
+   while (1) {
+   map = &pmu_events_map[i++];
+   if (!map->table)
+   goto out;
+
+   if (!strcmp(map->cpuid, cpuid))
+   break;
+   }
+
+   /*
+* Found a matching PMU events table. Create aliases
+*/
+   i = 0;
+   while (1) {
+   pe = &map->table[i++];
+   if (!pe->name)
+   break;
+
+   /* need type casts to override 'const' */
+   __perf_pmu__new_alias(head, NULL, (char *)pe->name,
+   (char *)pe->desc, (char *)pe->event);
+   }
+
+out:
+   free(cpuid);
+   return 0;
+}
+
 struct perf_event_attr * __weak
 perf_pmu__get_default_config(struct perf_pmu *pmu __maybe_unused)
 {
@@ -477,6 +535,9 @@ static struct perf_pmu *pmu_lookup(const char *name)
if (pmu_aliases(name, &aliases))
return NULL;
 
+   if (!strcmp(name, "cpu"))
+   (void)pmu_add_cpu_aliases(&aliases);
+
if (pmu_type(name, &type))
return NULL;
 
-- 
1.7.9.5



[PATCH v14 06/19] perf, tools: Support CPU ID matching for Powerpc

2015-06-04 Thread Sukadev Bhattiprolu
Implement code that returns the generic CPU ID string for Powerpc.
This will be used to identify the specific table of PMU events to
parse/compare user-specified events against.

Signed-off-by: Sukadev Bhattiprolu 

Changelog[v14]
- [Jiri Olsa] Move this independent code off into a separate patch.
---
 tools/perf/arch/powerpc/util/header.c |   11 +++
 1 file changed, 11 insertions(+)

diff --git a/tools/perf/arch/powerpc/util/header.c 
b/tools/perf/arch/powerpc/util/header.c
index 6c1b8a7..65f9391 100644
--- a/tools/perf/arch/powerpc/util/header.c
+++ b/tools/perf/arch/powerpc/util/header.c
@@ -32,3 +32,14 @@ get_cpuid(char *buffer, size_t sz)
}
return -1;
 }
+
+char *
+get_cpuid_str(void)
+{
+   char *bufp;
+
+   if (asprintf(&bufp, "%.8lx", mfspr(SPRN_PVR)) < 0)
+   bufp = NULL;
+
+   return bufp;
+}
-- 
1.7.9.5



[PATCH v14 03/19] Use __weak definition from

2015-06-04 Thread Sukadev Bhattiprolu
Jiri Olsa pointed out that the  defines the
attribute '__weak'. We might as well use that.

Signed-off-by: Sukadev Bhattiprolu 
---
 tools/perf/util/pmu.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 0fcc624..c6b16b1 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -1,4 +1,5 @@
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -436,7 +437,7 @@ static struct cpu_map *pmu_cpumask(const char *name)
return cpus;
 }
 
-struct perf_event_attr *__attribute__((weak))
+struct perf_event_attr * __weak
 perf_pmu__get_default_config(struct perf_pmu *pmu __maybe_unused)
 {
return NULL;
-- 
1.7.9.5



[PATCH v14 09/19] perf, tools: Support alias descriptions

2015-06-04 Thread Sukadev Bhattiprolu
From: Andi Kleen 

Add support to print alias descriptions in perf list, which
are taken from the generated event files.

The sorting code is changed to put the events with descriptions
at the end. The descriptions are printed as possibly multiple
word-wrapped lines.

Example output:

% perf list
...
  arith.fpu_div
   [Divide operations executed]
  arith.fpu_div_active
   [Cycles when divider is busy executing divide operations]

Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 

Changelog
- Delete a redundant free()

Changelog[v14]
- [Jiri Olsa] Fail, rather than continue if strdup() returns NULL.
---
 tools/perf/util/pmu.c |   80 +++--
 tools/perf/util/pmu.h |1 +
 2 files changed, 66 insertions(+), 15 deletions(-)

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 7863d05..e377598 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -241,6 +241,8 @@ static int __perf_pmu__new_alias(struct list_head *list, 
char *dir, char *name,
perf_pmu__parse_snapshot(alias, dir, name);
}
 
+   alias->desc = desc ? strdup(desc) : NULL;
+
list_add_tail(&alias->list, list);
 
return 0;
@@ -989,11 +991,42 @@ static char *format_alias_or(char *buf, int len, struct 
perf_pmu *pmu,
return buf;
 }
 
-static int cmp_string(const void *a, const void *b)
+struct pair {
+   char *name;
+   char *desc;
+};
+
+static int cmp_pair(const void *a, const void *b)
+{
+   const struct pair *as = a;
+   const struct pair *bs = b;
+
+   /* Put extra events last */
+   if (!!as->desc != !!bs->desc)
+   return !!as->desc - !!bs->desc;
+   return strcmp(as->name, bs->name);
+}
+
+static void wordwrap(char *s, int start, int max, int corr)
 {
-   const char * const *as = a;
-   const char * const *bs = b;
-   return strcmp(*as, *bs);
+   int column = start;
+   int n;
+
+   while (*s) {
+   int wlen = strcspn(s, " \t");
+
+   if (column + wlen >= max && column > start) {
+   printf("\n%*s", start, "");
+   column = start + corr;
+   }
+   n = printf("%s%.*s", column > start ? " " : "", wlen, s);
+   if (n <= 0)
+   break;
+   s += wlen;
+   column += n;
+   while (isspace(*s))
+   s++;
+   }
 }
 
 void print_pmu_events(const char *event_glob, bool name_only)
@@ -1003,7 +1036,9 @@ void print_pmu_events(const char *event_glob, bool 
name_only)
char buf[1024];
int printed = 0;
int len, j;
-   char **aliases;
+   struct pair *aliases;
+   int numdesc = 0;
+   int columns = 78;
 
pmu = NULL;
len = 0;
@@ -1013,14 +1048,15 @@ void print_pmu_events(const char *event_glob, bool 
name_only)
if (pmu->selectable)
len++;
}
-   aliases = zalloc(sizeof(char *) * len);
+   aliases = zalloc(sizeof(struct pair) * len);
if (!aliases)
goto out_enomem;
pmu = NULL;
j = 0;
while ((pmu = perf_pmu__scan(pmu)) != NULL) {
list_for_each_entry(alias, &pmu->aliases, list) {
-   char *name = format_alias(buf, sizeof(buf), pmu, alias);
+   char *name = alias->desc ? alias->name :
+   format_alias(buf, sizeof(buf), pmu, alias);
bool is_cpu = !strcmp(pmu->name, "cpu");
 
if (event_glob != NULL &&
@@ -1029,37 +1065,51 @@ void print_pmu_events(const char *event_glob, bool 
name_only)
   event_glob
continue;
 
-   if (is_cpu && !name_only)
+   if (is_cpu && !name_only && !alias->desc)
name = format_alias_or(buf, sizeof(buf), pmu, 
alias);
 
-   aliases[j] = strdup(name);
-   if (aliases[j] == NULL)
+   aliases[j].name = name;
+   if (is_cpu && !name_only && !alias->desc)
+   aliases[j].name = format_alias_or(buf, 
sizeof(buf),
+ pmu, alias);
+   aliases[j].name = strdup(aliases[j].name);
+   if (!aliases[j].name)
goto out_enomem;
+
+   aliases[j].desc = alias->desc;
j++;
}
if (pmu->selectable) {
char *s;
if (asprintf(&s, "%s//", pmu->name) < 0)
goto out_enomem;
-   aliases[j] = s;
+   aliases[j].name = 

[PATCH v14 08/19] perf, tools: Support CPU id matching for x86 v2

2015-06-04 Thread Sukadev Bhattiprolu
From: Andi Kleen 

Implement the code to match CPU types to mapfile types for x86
based on CPUID. This extends an existing similar function,
but changes it to use the x86 mapfile CPU description.
This allows resolving event lists generated by jevents.

Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 

v2: Update to new get_cpuid_str() interface
---
 tools/perf/arch/x86/util/header.c |   24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/tools/perf/arch/x86/util/header.c 
b/tools/perf/arch/x86/util/header.c
index 146d12a..a74a48d 100644
--- a/tools/perf/arch/x86/util/header.c
+++ b/tools/perf/arch/x86/util/header.c
@@ -19,8 +19,8 @@ cpuid(unsigned int op, unsigned int *a, unsigned int *b, 
unsigned int *c,
: "a" (op));
 }
 
-int
-get_cpuid(char *buffer, size_t sz)
+static int
+__get_cpuid(char *buffer, size_t sz, const char *fmt)
 {
unsigned int a, b, c, d, lvl;
int family = -1, model = -1, step = -1;
@@ -48,7 +48,7 @@ get_cpuid(char *buffer, size_t sz)
if (family >= 0x6)
model += ((a >> 16) & 0xf) << 4;
}
-   nb = scnprintf(buffer, sz, "%s,%u,%u,%u$", vendor, family, model, step);
+   nb = scnprintf(buffer, sz, fmt, vendor, family, model, step);
 
/* look for end marker to ensure the entire data fit */
if (strchr(buffer, '$')) {
@@ -57,3 +57,21 @@ get_cpuid(char *buffer, size_t sz)
}
return -1;
 }
+
+int
+get_cpuid(char *buffer, size_t sz)
+{
+   return __get_cpuid(buffer, sz, "%s,%u,%u,%u$");
+}
+
+char *
+get_cpuid_str(void)
+{
+   char *buf = malloc(128);
+
+   if (__get_cpuid(buf, 128, "%s-%u-%X$") < 0) {
+   free(buf);
+   return NULL;
+   }
+   return buf;
+}
-- 
1.7.9.5



[PATCH v14 01/19] perf, tools: Add jsmn `jasmine' JSON parser

2015-06-04 Thread Sukadev Bhattiprolu
From: Andi Kleen 

I need a JSON parser. This adds the simplest JSON
parser I could find -- Serge Zaitsev's jsmn `jasmine' --
to the perf library. I merely converted it to (mostly)
Linux style and added support for non-zero-terminated input.

The parser is quite straightforward and does not
copy any data; it just returns tokens with offsets
into the input buffer. So it's relatively efficient
and simple to use.

The code is not fully checkpatch clean, but I didn't
want to completely fork the upstream code.

Original source: http://zserge.bitbucket.org/jsmn.html

In addition I added a simple wrapper that mmaps a JSON
file and provides some straightforward access functions.

Used in follow-on patches to parse event files.

Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 
---

v2: Address review feedback.
v3: Minor checkpatch fixes.
v4 (by Sukadev Bhattiprolu)
- Rebase to 4.0 and fix minor conflicts in tools/perf/Makefile.perf
- Report error if specified events file is invalid.
v5 (Sukadev Bhattiprolu)
- Move files to tools/perf/pmu-events/ since parsing of JSON file
now occurs when _building_ rather than running perf.
---
 tools/perf/pmu-events/jsmn.c |  313 ++
 tools/perf/pmu-events/jsmn.h |   67 +
 tools/perf/pmu-events/json.c |  162 ++
 tools/perf/pmu-events/json.h |   36 +
 4 files changed, 578 insertions(+)
 create mode 100644 tools/perf/pmu-events/jsmn.c
 create mode 100644 tools/perf/pmu-events/jsmn.h
 create mode 100644 tools/perf/pmu-events/json.c
 create mode 100644 tools/perf/pmu-events/json.h

diff --git a/tools/perf/pmu-events/jsmn.c b/tools/perf/pmu-events/jsmn.c
new file mode 100644
index 000..11d1fa1
--- /dev/null
+++ b/tools/perf/pmu-events/jsmn.c
@@ -0,0 +1,313 @@
+/*
+ * Copyright (c) 2010 Serge A. Zaitsev
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ *
+ * Slightly modified by AK to not assume 0 terminated input.
+ */
+
+#include 
+#include "jsmn.h"
+
+/*
+ * Allocates a fresh unused token from the token pool.
+ */
+static jsmntok_t *jsmn_alloc_token(jsmn_parser *parser,
+  jsmntok_t *tokens, size_t num_tokens)
+{
+   jsmntok_t *tok;
+
+   if ((unsigned)parser->toknext >= num_tokens)
+   return NULL;
+   tok = &tokens[parser->toknext++];
+   tok->start = tok->end = -1;
+   tok->size = 0;
+   return tok;
+}
+
+/*
+ * Fills token type and boundaries.
+ */
+static void jsmn_fill_token(jsmntok_t *token, jsmntype_t type,
+   int start, int end)
+{
+   token->type = type;
+   token->start = start;
+   token->end = end;
+   token->size = 0;
+}
+
+/*
+ * Fills next available token with JSON primitive.
+ */
+static jsmnerr_t jsmn_parse_primitive(jsmn_parser *parser, const char *js,
+ size_t len,
+ jsmntok_t *tokens, size_t num_tokens)
+{
+   jsmntok_t *token;
+   int start;
+
+   start = parser->pos;
+
+   for (; parser->pos < len; parser->pos++) {
+   switch (js[parser->pos]) {
+#ifndef JSMN_STRICT
+   /*
+* In strict mode primitive must be followed by ","
+* or "}" or "]"
+*/
+   case ':':
+#endif
+   case '\t':
+   case '\r':
+   case '\n':
+   case ' ':
+   case ',':
+   case ']':
+   case '}':
+   goto found;
+   default:
+   break;
+   }
+   if (js[parser->pos] < 32 || js[parser->pos] >= 127) {
+   parser->pos = start;
+   return JSMN_ERROR_INVAL;
+   }
+   }
+#ifdef JSMN_STRICT
+   /*
+* In strict mode primitive must be fol

[PATCH v14 07/19] perf, tools: Allow events with dot

2015-06-04 Thread Sukadev Bhattiprolu
From: Andi Kleen 

The Intel events use a dot to separate event name and unit mask.
Allow dot in names in the scanner, and remove special handling
of dot as EOF. Also remove the hack in jevents to replace dot
with underscore. This way dotted events can be specified
directly by the user.

I'm not fully sure this change to the scanner is correct
(what was the dot special case good for?), but I haven't
found anything that breaks with it so far at least.

V2: Add the dot to name too, to handle events outside cpu//
Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 
---
 tools/perf/util/parse-events.l |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 09e738f..13cef3c 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -119,8 +119,8 @@ event   [^,{}/]+
 num_dec[0-9]+
 num_hex0x[a-fA-F0-9]+
 num_raw_hex[a-fA-F0-9]+
-name   [a-zA-Z_*?][a-zA-Z0-9_*?]*
-name_minus [a-zA-Z_*?][a-zA-Z0-9\-_*?]*
+name   [a-zA-Z_*?][a-zA-Z0-9_*?.]*
+name_minus [a-zA-Z_*?][a-zA-Z0-9\-_*?.]*
 /* If you add a modifier you need to update check_modifier() */
 modifier_event [ukhpGHSDI]+
 modifier_bp[rwx]{1,3}
@@ -165,7 +165,6 @@ modifier_bp [rwx]{1,3}
return PE_EVENT_NAME;
}
 
-.  |
 <>{
BEGIN(INITIAL);
REWIND(0);
-- 
1.7.9.5



[PATCH v14 16/19] perf, tools, jevents: Add support for event topics

2015-06-04 Thread Sukadev Bhattiprolu
Allow assigning categories to PMU events via a "Topic" field, i.e.
process the "Topic" field from the JSON file and add a corresponding
topic field to the generated C events tables.

Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 

Changelog[v14]
[Jiri Olsa] Move this independent code off into a separate patch.
---
 tools/perf/pmu-events/jevents.c|   12 +---
 tools/perf/pmu-events/jevents.h|2 +-
 tools/perf/pmu-events/pmu-events.h |1 +
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/tools/perf/pmu-events/jevents.c b/tools/perf/pmu-events/jevents.c
index a8507c9..ea3474b 100644
--- a/tools/perf/pmu-events/jevents.c
+++ b/tools/perf/pmu-events/jevents.c
@@ -203,7 +203,7 @@ static void print_events_table_prefix(FILE *fp, const char 
*tblname)
 }
 
 static int print_events_table_entry(void *data, char *name, char *event,
-   char *desc, char *long_desc)
+   char *desc, char *long_desc, char *topic)
 {
FILE *outfp = data;
/*
@@ -217,6 +217,8 @@ static int print_events_table_entry(void *data, char *name, 
char *event,
fprintf(outfp, "\t.desc = \"%s\",\n", desc);
if (long_desc && long_desc[0])
fprintf(outfp, "\t.long_desc = \"%s\",\n", long_desc);
+   if (topic)
+   fprintf(outfp, "\t.topic = \"%s\",\n", topic);
 
fprintf(outfp, "},\n");
 
@@ -238,7 +240,7 @@ static void print_events_table_suffix(FILE *outfp)
 /* Call func with each event in the json file */
 int json_events(const char *fn,
  int (*func)(void *data, char *name, char *event, char *desc,
- char *long_desc),
+ char *long_desc, char *topic),
  void *data)
 {
int err = -EIO;
@@ -259,6 +261,7 @@ int json_events(const char *fn,
char *event = NULL, *desc = NULL, *name = NULL;
char *long_desc = NULL;
char *extra_desc = NULL;
+   char *topic = NULL;
struct msrmap *msr = NULL;
jsmntok_t *msrval = NULL;
jsmntok_t *precise = NULL;
@@ -297,6 +300,8 @@ int json_events(const char *fn,
   !json_streq(map, val, "null")) {
addfield(map, &extra_desc, ". ",
" Spec update: ", val);
+   } else if (json_streq(map, field, "Topic")) {
+   addfield(map, &topic, "", "", val);
} else if (json_streq(map, field, "Data_LA") && nz) {
addfield(map, &extra_desc, ". ",
" Supports address when precise",
@@ -320,12 +325,13 @@ int json_events(const char *fn,
addfield(map, &event, ",", msr->pname, msrval);
fixname(name);
 
-   err = func(data, name, event, desc, long_desc);
+   err = func(data, name, event, desc, long_desc, topic);
free(event);
free(desc);
free(name);
free(long_desc);
free(extra_desc);
+   free(topic);
if (err)
break;
tok += j;
diff --git a/tools/perf/pmu-events/jevents.h b/tools/perf/pmu-events/jevents.h
index b0eb274..9ffcb89 100644
--- a/tools/perf/pmu-events/jevents.h
+++ b/tools/perf/pmu-events/jevents.h
@@ -3,7 +3,7 @@
 
 int json_events(const char *fn,
int (*func)(void *data, char *name, char *event, char *desc,
-   char *long_desc),
+   char *long_desc, char *topic),
void *data);
 char *get_cpu_str(void);
 
diff --git a/tools/perf/pmu-events/pmu-events.h 
b/tools/perf/pmu-events/pmu-events.h
index 711f049..6b69f4b 100644
--- a/tools/perf/pmu-events/pmu-events.h
+++ b/tools/perf/pmu-events/pmu-events.h
@@ -9,6 +9,7 @@ struct pmu_event {
const char *event;
const char *desc;
const char *long_desc;
+   const char *topic;
 };
 
 /*
-- 
1.7.9.5



[PATCH v14 14/19] perf, tools: Add alias support for long descriptions

2015-06-04 Thread Sukadev Bhattiprolu
Previously we were completely dropping the useful longer descriptions
that some events have in the event list. Now that jevents provides
support for longer descriptions (see previous patch), add support for
parsing the long descriptions.

Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 

Changelog[v14]
- [Jiri Olsa] Break up independent parts of the patch into
  separate patches.
---
 tools/perf/util/parse-events.c |5 +++--
 tools/perf/util/parse-events.h |3 ++-
 tools/perf/util/pmu.c  |   16 +++-
 tools/perf/util/pmu.h  |4 +++-
 4 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 65f7572..c4ee41d 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -1521,7 +1521,8 @@ out_enomem:
 /*
  * Print the help text for the event symbols:
  */
-void print_events(const char *event_glob, bool name_only, bool quiet_flag)
+void print_events(const char *event_glob, bool name_only, bool quiet_flag,
+   bool long_desc)
 {
print_symbol_events(event_glob, PERF_TYPE_HARDWARE,
event_symbols_hw, PERF_COUNT_HW_MAX, name_only);
@@ -1531,7 +1532,7 @@ void print_events(const char *event_glob, bool name_only, 
bool quiet_flag)
 
print_hwcache_events(event_glob, name_only);
 
-   print_pmu_events(event_glob, name_only, quiet_flag);
+   print_pmu_events(event_glob, name_only, quiet_flag, long_desc);
 
if (event_glob != NULL)
return;
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index d11f854..5c93814 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -133,7 +133,8 @@ void parse_events_update_lists(struct list_head *list_event,
 void parse_events_evlist_error(struct parse_events_evlist *data,
   int idx, const char *str);
 
-void print_events(const char *event_glob, bool name_only, bool quiet);
+void print_events(const char *event_glob, bool name_only, bool quiet,
+ bool long_desc);
 
 struct event_symbol {
const char  *symbol;
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 300975e..05653ec 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -210,7 +210,8 @@ static int perf_pmu__parse_snapshot(struct perf_pmu_alias 
*alias,
 }
 
 static int __perf_pmu__new_alias(struct list_head *list, char *dir, char *name,
-char *desc __maybe_unused, char *val)
+char *desc __maybe_unused, char *val,
+char *long_desc)
 {
struct perf_pmu_alias *alias;
int ret;
@@ -243,6 +244,8 @@ static int __perf_pmu__new_alias(struct list_head *list, 
char *dir, char *name,
}
 
alias->desc = desc ? strdup(desc) : NULL;
+   alias->long_desc = long_desc ? strdup(long_desc) :
+   desc ? strdup(desc) : NULL;
 
list_add_tail(&alias->list, list);
 
@@ -260,7 +263,7 @@ static int perf_pmu__new_alias(struct list_head *list, char 
*dir, char *name, FI
 
buf[ret] = 0;
 
-   return __perf_pmu__new_alias(list, dir, name, NULL, buf);
+   return __perf_pmu__new_alias(list, dir, name, NULL, buf, NULL);
 }
 
 static inline bool pmu_alias_info_file(char *name)
@@ -508,7 +511,8 @@ static int pmu_add_cpu_aliases(struct list_head *head)
 
/* need type casts to override 'const' */
__perf_pmu__new_alias(head, NULL, (char *)pe->name,
-   (char *)pe->desc, (char *)pe->event);
+   (char *)pe->desc, (char *)pe->event,
+   (char *)pe->long_desc);
}
 
 out:
@@ -1036,7 +1040,8 @@ static void wordwrap(char *s, int start, int max, int 
corr)
}
 }
 
-void print_pmu_events(const char *event_glob, bool name_only, bool quiet_flag)
+void print_pmu_events(const char *event_glob, bool name_only, bool quiet_flag,
+   bool long_desc)
 {
struct perf_pmu *pmu;
struct perf_pmu_alias *alias;
@@ -1083,7 +1088,8 @@ void print_pmu_events(const char *event_glob, bool 
name_only, bool quiet_flag)
if (!aliases[j].name)
goto out_enomem;
 
-   aliases[j].desc = alias->desc;
+   aliases[j].desc = long_desc ? alias->long_desc :
+   alias->desc;
j++;
}
if (pmu->selectable) {
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 9966c1a..10e981c 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -39,6 +39,7 @@ struct perf_pmu_info {
 struct perf_pmu_alias {
char *name;
char *desc;
+   char *long_desc;
struct 

[PATCH v14 17/19] perf, tools: Add support for event list topics

2015-06-04 Thread Sukadev Bhattiprolu
From: Andi Kleen 

Add support to group the output of perf list by the Topic field
in the JSON file.

Example output:

% perf list
...
Cache:
  l1d.replacement
   [L1D data line replacements]
  l1d_pend_miss.pending
   [L1D miss oustandings duration in cycles]
  l1d_pend_miss.pending_cycles
   [Cycles with L1D load Misses outstanding]
  l2_l1d_wb_rqsts.all
   [Not rejected writebacks from L1D to L2 cache lines in any state]
  l2_l1d_wb_rqsts.hit_e
   [Not rejected writebacks from L1D to L2 cache lines in E state]
  l2_l1d_wb_rqsts.hit_m
   [Not rejected writebacks from L1D to L2 cache lines in M state]

...
Pipeline:
  arith.fpu_div
   [Divide operations executed]
  arith.fpu_div_active
   [Cycles when divider is busy executing divide operations]
  baclears.any
   [Counts the total number when the front end is resteered, mainly
   when the BPU cannot provide a correct prediction and this is
   corrected by other branch handling mechanisms at the front end]
  br_inst_exec.all_branches
   [Speculative and retired branches]
  br_inst_exec.all_conditional
   [Speculative and retired macro-conditional branches]
  br_inst_exec.all_direct_jmp
   [Speculative and retired macro-unconditional branches excluding
   calls and indirects]
  br_inst_exec.all_direct_near_call
   [Speculative and retired direct near calls]
  br_inst_exec.all_indirect_jump_non_call_ret

Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 

Changelog[v14]
- [Jiri Olsa] Move jevents support for Topic to a separate patch.
---
 tools/perf/util/pmu.c |   36 ++--
 tools/perf/util/pmu.h |1 +
 2 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 05653ec..5ecbd1e 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -211,7 +211,7 @@ static int perf_pmu__parse_snapshot(struct perf_pmu_alias 
*alias,
 
 static int __perf_pmu__new_alias(struct list_head *list, char *dir, char *name,
 char *desc __maybe_unused, char *val,
-char *long_desc)
+char *long_desc, char *topic)
 {
struct perf_pmu_alias *alias;
int ret;
@@ -246,6 +246,7 @@ static int __perf_pmu__new_alias(struct list_head *list, 
char *dir, char *name,
alias->desc = desc ? strdup(desc) : NULL;
alias->long_desc = long_desc ? strdup(long_desc) :
desc ? strdup(desc) : NULL;
+   alias->topic = topic ? strdup(topic) : NULL;
 
list_add_tail(&alias->list, list);
 
@@ -263,7 +264,7 @@ static int perf_pmu__new_alias(struct list_head *list, char 
*dir, char *name, FI
 
buf[ret] = 0;
 
-   return __perf_pmu__new_alias(list, dir, name, NULL, buf, NULL);
+   return __perf_pmu__new_alias(list, dir, name, NULL, buf, NULL, NULL);
 }
 
 static inline bool pmu_alias_info_file(char *name)
@@ -512,7 +513,7 @@ static int pmu_add_cpu_aliases(struct list_head *head)
/* need type casts to override 'const' */
__perf_pmu__new_alias(head, NULL, (char *)pe->name,
(char *)pe->desc, (char *)pe->event,
-   (char *)pe->long_desc);
+   (char *)pe->long_desc, (char *)pe->topic);
}
 
 out:
@@ -1002,19 +1003,26 @@ static char *format_alias_or(char *buf, int len, struct 
perf_pmu *pmu,
return buf;
 }
 
-struct pair {
+struct sevent {
char *name;
char *desc;
+   char *topic;
 };
 
-static int cmp_pair(const void *a, const void *b)
+static int cmp_sevent(const void *a, const void *b)
 {
-   const struct pair *as = a;
-   const struct pair *bs = b;
+   const struct sevent *as = a;
+   const struct sevent *bs = b;
 
/* Put extra events last */
if (!!as->desc != !!bs->desc)
return !!as->desc - !!bs->desc;
+   if (as->topic && bs->topic) {
+   int n = strcmp(as->topic, bs->topic);
+
+   if (n)
+   return n;
+   }
return strcmp(as->name, bs->name);
 }
 
@@ -1048,9 +1056,10 @@ void print_pmu_events(const char *event_glob, bool 
name_only, bool quiet_flag,
char buf[1024];
int printed = 0;
int len, j;
-   struct pair *aliases;
+   struct sevent *aliases;
int numdesc = 0;
int columns = pager_get_columns();
+   char *topic = NULL;
 
pmu = NULL;
len = 0;
@@ -1060,7 +1069,7 @@ void print_pmu_events(const char *event_glob, bool 
name_only, bool quiet_flag,
if (pmu->selectable)
len++;
}
-   aliases = zalloc(sizeof(struct pair) * len);
+   aliases = zalloc(sizeof(struct sevent) * len);
if (!aliases)
goto out_enomem;
pmu = NULL;
@@ -1090,6 +1099,7 @@ void print_p

[PATCH v14 10/19] perf, tools: Query terminal width and use in perf list

2015-06-04 Thread Sukadev Bhattiprolu
From: Andi Kleen 

Automatically adapt the now wider and word wrapped perf list
output to wider terminals. This requires querying the terminal
before the auto pager takes over, and exporting this
information from the pager subsystem.

Acked-by: Namhyung Kim 
Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 
---
 tools/perf/util/cache.h |1 +
 tools/perf/util/pager.c |   15 +++
 tools/perf/util/pmu.c   |3 ++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/cache.h b/tools/perf/util/cache.h
index c861373..8e0d4b8 100644
--- a/tools/perf/util/cache.h
+++ b/tools/perf/util/cache.h
@@ -32,6 +32,7 @@ extern const char *perf_config_dirname(const char *, const 
char *);
 extern void setup_pager(void);
 extern int pager_in_use(void);
 extern int pager_use_color;
+int pager_get_columns(void);
 
 char *alias_lookup(const char *alias);
 int split_cmdline(char *cmdline, const char ***argv);
diff --git a/tools/perf/util/pager.c b/tools/perf/util/pager.c
index 53ef006..1770c88 100644
--- a/tools/perf/util/pager.c
+++ b/tools/perf/util/pager.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "run-command.h"
 #include "sigchain.h"
+#include 
 
 /*
  * This is split up from the rest of git so that we can do
@@ -8,6 +9,7 @@
  */
 
 static int spawned_pager;
+static int pager_columns;
 
 static void pager_preexec(void)
 {
@@ -47,9 +49,12 @@ static void wait_for_pager_signal(int signo)
 void setup_pager(void)
 {
const char *pager = getenv("PERF_PAGER");
+   struct winsize sz;
 
if (!isatty(1))
return;
+   if (ioctl(1, TIOCGWINSZ, &sz) == 0)
+   pager_columns = sz.ws_col;
if (!pager)
pager = getenv("PAGER");
if (!(pager || access("/usr/bin/pager", X_OK)))
@@ -93,3 +98,13 @@ int pager_in_use(void)
env = getenv("PERF_PAGER_IN_USE");
return env ? perf_config_bool("PERF_PAGER_IN_USE", env) : 0;
 }
+
+int pager_get_columns(void)
+{
+   char *s;
+
+   s = getenv("COLUMNS");
+   if (s)
+   return atoi(s);
+   return (pager_columns ? pager_columns : 80) - 2;
+}
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index e377598..443086e 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -14,6 +14,7 @@
 #include "cpumap.h"
 #include "header.h"
 #include "pmu-events/pmu-events.h"
+#include "cache.h"
 
 struct perf_pmu_format {
char *name;
@@ -1038,7 +1039,7 @@ void print_pmu_events(const char *event_glob, bool 
name_only)
int len, j;
struct pair *aliases;
int numdesc = 0;
-   int columns = 78;
+   int columns = pager_get_columns();
 
pmu = NULL;
len = 0;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v14 19/19] perf, tools: Add README for info on parsing JSON/map files

2015-06-04 Thread Sukadev Bhattiprolu
Signed-off-by: Sukadev Bhattiprolu 
---
 tools/perf/pmu-events/README |  122 ++
 1 file changed, 122 insertions(+)
 create mode 100644 tools/perf/pmu-events/README

diff --git a/tools/perf/pmu-events/README b/tools/perf/pmu-events/README
new file mode 100644
index 000..da57cb5
--- /dev/null
+++ b/tools/perf/pmu-events/README
@@ -0,0 +1,122 @@
+
+The contents of this directory allow users to specify PMU events in
+their CPUs by their symbolic names rather than raw event codes (see
+example below).
+
+The main program in this directory is 'jevents', which is built and
+executed _before_ the perf binary itself is built.
+
+The 'jevents' program tries to locate and process JSON files in the directory
+tree tools/perf/pmu-events/arch/foo.
+
+   - Regular files with '.json' extension in the name are assumed to be
+ JSON files, each of which describes a set of PMU events.
+
+   - Regular files with basename starting with 'mapfile.csv' are assumed
+ to be a CSV file that maps a specific CPU to its set of PMU events.
+ (see below for mapfile format)
+
+   - Directories are traversed, but all other files are ignored.
+
+Using the JSON files and the mapfile, 'jevents' generates the C source file,
+'pmu-events.c', which encodes the two sets of tables:
+
+   - Set of 'PMU events tables' for all known CPUs in the architecture,
+ (one table like the following, per JSON file; table name 'pme_power8'
+ is derived from JSON file name, 'power8.json').
+
+   struct pmu_event pme_power8[] = {
+
+   ...
+
+   {
+   .name = "pm_1plus_ppc_cmpl",
+   .event = "event=0x100f2",
+   .desc = "1 or more ppc insts finished,",
+   },
+
+   ...
+   }
+
+   - A 'mapping table' that maps each CPU of the architecture, to its
+ 'PMU events table'
+
+   struct pmu_events_map pmu_events_map[] = {
+   {
+   .cpuid = "004b",
+   .version = "1",
+   .type = "core",
+   .table = pme_power8
+   },
+   ...
+
+   };
+
+After the 'pmu-events.c' is generated, it is compiled and the resulting
+'pmu-events.o' is added to 'libperf.a' which is then used to build perf.
+
+NOTES:
+   1. Several CPUs can support same set of events and hence use a common
+  JSON file. Hence several entries in the pmu_events_map[] could map
+  to a single 'PMU events table'.
+
+   2. The 'pmu-events.h' has an extern declaration for the mapping table
+  and the generated 'pmu-events.c' defines this table.
+
+   3. _All_ known CPU tables for architecture are included in the perf
+  binary.
+
+At run time, perf determines the actual CPU it is running on, finds the
+matching events table and builds aliases for those events. This allows
+users to specify events by their name:
+
+   $ perf stat -e pm_1plus_ppc_cmpl sleep 1
+
+where 'pm_1plus_ppc_cmpl' is a Power8 PMU event.
+
+In case of errors when processing files in the tools/perf/pmu-events/arch
+directory, 'jevents' tries to create an empty mapping file to allow the perf
+build to succeed even if the PMU event aliases cannot be used.
+
+However, some errors in processing may cause the perf build to fail.
+
+Mapfile format
+==============
+
+The mapfile.csv format is expected to be:
+
+   Header line
+   CPUID,Version,File/path/name.json,Type
+
+where:
+
+   Comma:
+   is the required field delimiter (i.e. other fields cannot
+   have commas within them).
+
+   Comments:
+   Lines in which the first character is either '\n' or '#'
+   are ignored.
+
+   Header line
+   The header line is the first line in the file, which is
+   _IGNORED_. It can be a comment (begin with '#') or empty.
+
+   CPUID:
+   CPUID is an arch-specific char string, that can be used
+   to identify CPU (and associate it with a set of PMU events
+   it supports). Multiple CPUIDS can point to the same
+   File/path/name.json.
+
+   Example:
+   CPUID == 'GenuineIntel-6-2E' (on x86).
+   CPUID == '004b0100' (PVR value in Powerpc)
+   Version:
+   is the Version of the mapfile.
+
+   File/path/name.json:
+   is the pathname for the JSON file, relative to the directory
+   containing the mapfile.csv
+
+   Type:
+   indicates whether the events are "core" or "uncore" events.
-- 
1.7.9.5

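
Putting the mapfile rules above together, an illustrative mapfile.csv is
sketched below. The CPUID and version values are taken from the examples
earlier in this patch; the JSON file names are hypothetical:

```
CPUID,Version,File/path/name.json,Type
# x86 entries match on the CPUID string
GenuineIntel-6-2E,1,nehalem.json,core
# Powerpc entries match on the PVR value
004b0100,1,power8.json,core
```

The first line is the header, which jevents ignores; the '#' lines are
comments and are also skipped.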

[PATCH v14 18/19] perf, tools: Handle header line in mapfile

2015-06-04 Thread Sukadev Bhattiprolu
From: Andi Kleen 

To work with existing mapfiles, assume that the first line in
'mapfile.csv' is a header line and skip over it.

Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 

Changelog[v2]
Not all architectures use the "Family" field to identify the CPU,
so assume the first line is a header.
---
 tools/perf/pmu-events/jevents.c |9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/tools/perf/pmu-events/jevents.c b/tools/perf/pmu-events/jevents.c
index ea3474b..7347cca 100644
--- a/tools/perf/pmu-events/jevents.c
+++ b/tools/perf/pmu-events/jevents.c
@@ -462,7 +462,12 @@ static int process_mapfile(FILE *outfp, char *fpath)
 
print_mapping_table_prefix(outfp);
 
-   line_num = 0;
+   /* Skip first line (header) */
+   p = fgets(line, n, mapfp);
+   if (!p)
+   goto out;
+
+   line_num = 1;
while (1) {
char *cpuid, *version, *type, *fname;
 
@@ -506,8 +511,8 @@ static int process_mapfile(FILE *outfp, char *fpath)
fprintf(outfp, "},\n");
}
 
+out:
print_mapping_table_suffix(outfp);
-
return 0;
 }
 
-- 
1.7.9.5



[PATCH v14 11/19] perf, tools: Add a --no-desc flag to perf list

2015-06-04 Thread Sukadev Bhattiprolu
From: Andi Kleen 

Add a --no-desc flag to perf list to not print the event descriptions
that were earlier added for JSON events. This may be useful to
get a less crowded listing.

Printing descriptions remains the default, as that is the more
useful behavior for most users.

Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 

v2: Rename --quiet to --no-desc. Add option to man page.
---
 tools/perf/Documentation/perf-list.txt |8 +++-
 tools/perf/builtin-list.c  |   12 
 tools/perf/util/parse-events.c |4 ++--
 tools/perf/util/parse-events.h |2 +-
 tools/perf/util/pmu.c  |4 ++--
 tools/perf/util/pmu.h  |2 +-
 6 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/tools/perf/Documentation/perf-list.txt 
b/tools/perf/Documentation/perf-list.txt
index bada893..9507552 100644
--- a/tools/perf/Documentation/perf-list.txt
+++ b/tools/perf/Documentation/perf-list.txt
@@ -8,13 +8,19 @@ perf-list - List all symbolic event types
 SYNOPSIS
 
 [verse]
-'perf list' [hw|sw|cache|tracepoint|pmu|event_glob]
+'perf list' [--no-desc] [hw|sw|cache|tracepoint|pmu|event_glob]
 
 DESCRIPTION
 ---
 This command displays the symbolic event types which can be selected in the
 various perf commands with the -e option.
 
+OPTIONS
+---
+--no-desc::
+Don't print descriptions.
+
+
 [[EVENT_MODIFIERS]]
 EVENT MODIFIERS
 ---
diff --git a/tools/perf/builtin-list.c b/tools/perf/builtin-list.c
index af5bd05..3f058f7 100644
--- a/tools/perf/builtin-list.c
+++ b/tools/perf/builtin-list.c
@@ -16,16 +16,20 @@
 #include "util/pmu.h"
 #include "util/parse-options.h"
 
+static bool desc_flag = true;
+
 int cmd_list(int argc, const char **argv, const char *prefix __maybe_unused)
 {
int i;
bool raw_dump = false;
struct option list_options[] = {
OPT_BOOLEAN(0, "raw-dump", &raw_dump, "Dump raw events"),
+	OPT_BOOLEAN('d', "desc", &desc_flag,
+		    "Print extra event descriptions. --no-desc to not print."),
OPT_END()
};
const char * const list_usage[] = {
-   "perf list [hw|sw|cache|tracepoint|pmu|event_glob]",
+   "perf list [--no-desc] [hw|sw|cache|tracepoint|pmu|event_glob]",
NULL
};
 
@@ -40,7 +44,7 @@ int cmd_list(int argc, const char **argv, const char *prefix 
__maybe_unused)
printf("\nList of pre-defined events (to be used in -e):\n\n");
 
if (argc == 0) {
-   print_events(NULL, raw_dump);
+   print_events(NULL, raw_dump, !desc_flag);
return 0;
}
 
@@ -59,13 +63,13 @@ int cmd_list(int argc, const char **argv, const char 
*prefix __maybe_unused)
 strcmp(argv[i], "hwcache") == 0)
print_hwcache_events(NULL, raw_dump);
else if (strcmp(argv[i], "pmu") == 0)
-   print_pmu_events(NULL, raw_dump);
+   print_pmu_events(NULL, raw_dump, !desc_flag);
else {
char *sep = strchr(argv[i], ':'), *s;
int sep_idx;
 
if (sep == NULL) {
-   print_events(argv[i], raw_dump);
+   print_events(argv[i], raw_dump, !desc_flag);
continue;
}
sep_idx = sep - argv[i];
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 2a4d1ec..65f7572 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -1521,7 +1521,7 @@ out_enomem:
 /*
  * Print the help text for the event symbols:
  */
-void print_events(const char *event_glob, bool name_only)
+void print_events(const char *event_glob, bool name_only, bool quiet_flag)
 {
print_symbol_events(event_glob, PERF_TYPE_HARDWARE,
event_symbols_hw, PERF_COUNT_HW_MAX, name_only);
@@ -1531,7 +1531,7 @@ void print_events(const char *event_glob, bool name_only)
 
print_hwcache_events(event_glob, name_only);
 
-   print_pmu_events(event_glob, name_only);
+   print_pmu_events(event_glob, name_only, quiet_flag);
 
if (event_glob != NULL)
return;
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index 131f29b..d11f854 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -133,7 +133,7 @@ void parse_events_update_lists(struct list_head *list_event,
 void parse_events_evlist_error(struct parse_events_evlist *data,
   int idx, const char *str);
 
-void print_events(const char *event_glob, bool name_only);
+void print_events(const char *event_glob, bool name_only, bool quiet);
 
 struct event_symbol {
const char  *symbol;
diff --git a/tools

[PATCH v14 13/19] perf, tools, jevents: Add support for long descriptions

2015-06-04 Thread Sukadev Bhattiprolu
Implement support in jevents to parse long descriptions for events
that may have them in the JSON files. A follow on patch will make this
long description available to user through the 'perf list' command.

Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 

Changelog[v14]
- [Jiri Olsa] Break up independent parts of the patch into
  separate patches.
---
 tools/perf/pmu-events/jevents.c|   31 +++
 tools/perf/pmu-events/jevents.h|3 ++-
 tools/perf/pmu-events/pmu-events.h |1 +
 3 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/tools/perf/pmu-events/jevents.c b/tools/perf/pmu-events/jevents.c
index 5f7603b..a8507c9 100644
--- a/tools/perf/pmu-events/jevents.c
+++ b/tools/perf/pmu-events/jevents.c
@@ -203,7 +203,7 @@ static void print_events_table_prefix(FILE *fp, const char 
*tblname)
 }
 
 static int print_events_table_entry(void *data, char *name, char *event,
-   char *desc)
+   char *desc, char *long_desc)
 {
FILE *outfp = data;
/*
@@ -215,6 +215,8 @@ static int print_events_table_entry(void *data, char *name, 
char *event,
fprintf(outfp, "\t.name = \"%s\",\n", name);
fprintf(outfp, "\t.event = \"%s\",\n", event);
fprintf(outfp, "\t.desc = \"%s\",\n", desc);
+   if (long_desc && long_desc[0])
+   fprintf(outfp, "\t.long_desc = \"%s\",\n", long_desc);
 
fprintf(outfp, "},\n");
 
@@ -235,7 +237,8 @@ static void print_events_table_suffix(FILE *outfp)
 
 /* Call func with each event in the json file */
 int json_events(const char *fn,
- int (*func)(void *data, char *name, char *event, char *desc),
+ int (*func)(void *data, char *name, char *event, char *desc,
+ char *long_desc),
  void *data)
 {
int err = -EIO;
@@ -254,6 +257,8 @@ int json_events(const char *fn,
tok = tokens + 1;
for (i = 0; i < tokens->size; i++) {
char *event = NULL, *desc = NULL, *name = NULL;
+   char *long_desc = NULL;
+   char *extra_desc = NULL;
struct msrmap *msr = NULL;
jsmntok_t *msrval = NULL;
jsmntok_t *precise = NULL;
@@ -279,6 +284,9 @@ int json_events(const char *fn,
} else if (json_streq(map, field, "BriefDescription")) {
addfield(map, &desc, "", "", val);
fixdesc(desc);
+		} else if (json_streq(map, field, "PublicDescription")) {
+   addfield(map, &long_desc, "", "", val);
+   fixdesc(long_desc);
} else if (json_streq(map, field, "PEBS") && nz) {
precise = val;
} else if (json_streq(map, field, "MSRIndex") && nz) {
@@ -287,10 +295,10 @@ int json_events(const char *fn,
msrval = val;
} else if (json_streq(map, field, "Errata") &&
   !json_streq(map, val, "null")) {
-   addfield(map, &desc, ". ",
+   addfield(map, &extra_desc, ". ",
" Spec update: ", val);
} else if (json_streq(map, field, "Data_LA") && nz) {
-   addfield(map, &desc, ". ",
+   addfield(map, &extra_desc, ". ",
" Supports address when precise",
NULL);
}
@@ -298,19 +306,26 @@ int json_events(const char *fn,
}
if (precise && !strstr(desc, "(Precise Event)")) {
if (json_streq(map, precise, "2"))
-   addfield(map, &desc, " ", "(Must be precise)",
-   NULL);
+   addfield(map, &extra_desc, " ",
+   "(Must be precise)", NULL);
else
-   addfield(map, &desc, " ",
+   addfield(map, &extra_desc, " ",
"(Precise event)", NULL);
}
+   if (desc && extra_desc)
+   addfield(map, &desc, " ", extra_desc, NULL);
+   if (long_desc && extra_desc)
+   addfield(map, &long_desc, " ", extra_desc, NULL);
if (msr != NULL)
addfield(map, &event, ",", msr->pname, msrval);
fixname(name);
-   err = func(data, name, event, desc);
+
+   err = func(data, name, event, desc, long_desc);
free(event);
free(desc);
   
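
For illustration, a JSON event entry carrying both of the description
fields parsed above might look as follows. The BriefDescription and
PublicDescription keys match the parser in this patch; the EventName and
EventCode key names and values are assumptions for the sketch, and the
description text is the baclears example used elsewhere in this series:

```
{
    "EventName": "baclears.all",
    "EventCode": "0xE6",
    "BriefDescription": "Counts the number of baclears",
    "PublicDescription": "The BACLEARS event counts the number of times the front end is resteered, mainly when the Branch Prediction Unit cannot provide a correct prediction and this is corrected by the Branch Address Calculator at the front end."
}
```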

[PATCH v14 15/19] perf, tools: Support long descriptions with perf list

2015-06-04 Thread Sukadev Bhattiprolu
Previously, the longer descriptions that some events carry were dropped
from the event list entirely. This patch makes them appear with
perf list.

Old perf list:

baclears:
  baclears.all
   [Counts the number of baclears]

vs new:

perf list -v:
...
baclears:
  baclears.all
   [The BACLEARS event counts the number of times the front end is
resteered, mainly when the Branch Prediction Unit cannot provide
a correct prediction and this is corrected by the Branch Address
Calculator at the front end. The BACLEARS.ANY event counts the
number of baclears for any type of branch]

Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 

Changelog[v14]
- [Jiri Olsa] Break up independent parts of the patch into
  separate patches.
---
 tools/perf/builtin-list.c |   11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-list.c b/tools/perf/builtin-list.c
index 3f058f7..d0f7a18 100644
--- a/tools/perf/builtin-list.c
+++ b/tools/perf/builtin-list.c
@@ -22,10 +22,13 @@ int cmd_list(int argc, const char **argv, const char 
*prefix __maybe_unused)
 {
int i;
bool raw_dump = false;
+   bool long_desc_flag = false;
struct option list_options[] = {
OPT_BOOLEAN(0, "raw-dump", &raw_dump, "Dump raw events"),
	OPT_BOOLEAN('d', "desc", &desc_flag,
		    "Print extra event descriptions. --no-desc to not print."),
+   OPT_BOOLEAN('d', "long-desc", &long_desc_flag,
+   "Print longer event descriptions."),
OPT_END()
};
const char * const list_usage[] = {
@@ -44,7 +47,7 @@ int cmd_list(int argc, const char **argv, const char *prefix 
__maybe_unused)
printf("\nList of pre-defined events (to be used in -e):\n\n");
 
if (argc == 0) {
-   print_events(NULL, raw_dump, !desc_flag);
+   print_events(NULL, raw_dump, !desc_flag, long_desc_flag);
return 0;
}
 
@@ -63,13 +66,15 @@ int cmd_list(int argc, const char **argv, const char 
*prefix __maybe_unused)
 strcmp(argv[i], "hwcache") == 0)
print_hwcache_events(NULL, raw_dump);
else if (strcmp(argv[i], "pmu") == 0)
-   print_pmu_events(NULL, raw_dump, !desc_flag);
+   print_pmu_events(NULL, raw_dump, !desc_flag,
+   long_desc_flag);
else {
char *sep = strchr(argv[i], ':'), *s;
int sep_idx;
 
if (sep == NULL) {
-   print_events(argv[i], raw_dump, !desc_flag);
+   print_events(argv[i], raw_dump, !desc_flag,
+   long_desc_flag);
continue;
}
sep_idx = sep - argv[i];
-- 
1.7.9.5



[PATCH v14 12/19] perf, tools: Add override support for event list CPUID

2015-06-04 Thread Sukadev Bhattiprolu
From: Andi Kleen 

Add a PERF_CPUID variable to override the CPUID of the current CPU (within
the current architecture). This is useful for testing, so that all event
lists can be tested on a single system.

Signed-off-by: Andi Kleen 
Signed-off-by: Sukadev Bhattiprolu 

v2: Fix double free in earlier version.
Print actual CPUID being used with verbose option.
---
 tools/perf/util/pmu.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 790c64f..300975e 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -477,10 +477,16 @@ static int pmu_add_cpu_aliases(struct list_head *head)
struct pmu_event *pe;
char *cpuid;
 
-   cpuid = get_cpuid_str();
+   cpuid = getenv("PERF_CPUID");
+   if (cpuid)
+   cpuid = strdup(cpuid);
+   if (!cpuid)
+   cpuid = get_cpuid_str();
if (!cpuid)
return 0;
 
+   pr_debug("Using CPUID %s\n", cpuid);
+
i = 0;
while (1) {
map = &pmu_events_map[i++];
-- 
1.7.9.5



[PATCH V14 00/14] perf, tools: Add support for PMU events in JSON format

2015-06-04 Thread Sukadev Bhattiprolu
CPUs support a large number of performance monitoring events (PMU events)
and often these events are very specific to an architecture/model of the
CPU. To use most of these PMU events with perf, we currently have to identify
them by their raw codes:

perf stat -e r100f2 sleep 1

This patchset allows architectures to specify these PMU events in JSON
files located in 'tools/perf/pmu-events/arch/' of the mainline tree.
The events from the JSON files for the architecture are then built into
the perf binary.

At run time, perf identifies the specific set of events for the CPU and
creates "event aliases". These aliases allow users to specify events by
"name" as:

perf stat -e pm_1plus_ppc_cmpl sleep 1

The file, 'tools/perf/pmu-events/README' in [PATCH 14/14] gives more
details.

Note:
- All known events tables for the architecture are included in the
  perf binary.

- For architectures that don't have any JSON files, an empty mapping
  table is created and they should continue to build.

Thanks to input from Andi Kleen, Jiri Olsa, Namhyung Kim and Ingo Molnar.

These patches are available from:

https://github.com:sukadev/linux.git 

Branch              Description
------------------  -------------------------------
json-v14            Source Code only
json-files-3        x86 and Powerpc datafiles only
json-v14-with-data  Both code and data (build/test)

NOTE:   Only "source code" patches (i.e those in json-v14) are being emailed.
Please pull the "data files" from the json-files-3 branch.

Changelog[v14]
Comments from Jiri Olsa:
- Change parameter name/type for pmu_add_cpu_aliases (from void *data
  to list_head *head)
- Use asprintf() in file_name_to_tablename() and simplify/reorg code.
- Use __weak definition from 
- Use fopen() with mode "w" and eliminate unlink()
- Remove minor TODO.
- Add error check for return value from strdup() in print_pmu_events().
- Move independent changes from patches 3,11,12 .. to separate patches
  for easier review/backport.
- Clarify mapfile's "header line support" in patch description.
- Fix build failure with DEBUG=1

Comment from Andi Kleen:
- In tools/perf/pmu-events/Build, check for 'mapfile.csv' rather than
  'mapfile*'

Misc:
- Minor changes/clarifications to tools/perf/pmu-events/README.


Changelog[v13]
Version: Individual patches have their own history :-) that I am
preserving. Patchset version (v13) is for overall patchset and is
somewhat arbitrary.

- Added support for "categories" of events to perf
- Add mapfile, jevents build dependency on pmu-events.c
- Silence jevents when parsing JSON files unless V=1 is specified
- Cleanup error messages
- Fix memory leak with ->cpuid
- Rebase to Arnaldo's tree
- Allow overriding CPUID via environment variable
- Support long descriptions for events
- Handle header line in mapfile.csv
- Cleanup JSON files (trim PublicDescription if identical to/prefix of
  BriefDescription field)


Andi Kleen (10):
  perf, tools: Add jsmn `jasmine' JSON parser
  perf, tools, jevents: Program to convert JSON file to C style file
  perf, tools: Allow events with dot
  perf, tools: Support CPU id matching for x86 v2
  perf, tools: Support alias descriptions
  perf, tools: Query terminal width and use in perf list
  perf, tools: Add a --no-desc flag to perf list
  perf, tools: Add override support for event list CPUID
  perf, tools: Add support for event list topics
  perf, tools: Handle header line in mapfile

Sukadev Bhattiprolu (9):
  Use __weak definition from 
  perf, tools: Split perf_pmu__new_alias()
  perf, tools: Use pmu_events table to create aliases
  perf, tools: Support CPU ID matching for Powerpc
  perf, tools, jevents: Add support for long descriptions
  perf, tools: Add alias support for long descriptions
  perf, tools: Support long descriptions with perf list
  perf, tools, jevents: Add support for event topics
  perf, tools: Add README for info on parsing JSON/map files

 tools/perf/Documentation/perf-list.txt |8 +-
 tools/perf/Makefile.perf   |   25 +-
 tools/perf/arch/powerpc/util/header.c  |   11 +
 tools/perf/arch/x86/util/header.c  |   24 +-
 tools/perf/builtin-list.c  |   17 +-
 tools/perf/pmu-events/Build|   11 +
 tools/perf/pmu-events/README   |  122 ++
 tools/perf/pmu-events/jevents.c|  712 
 tools/perf/pmu-events/jevents.h|   18 +
 tools/perf/pmu-events/jsmn.c   |  313 ++
 tools/perf/pmu-events/jsmn.h   |   67 +++
 tools/perf/pmu-events/json.c   |  162 
 t

Re: [PATCH v3 4/6] dt-bindings: Consolidate ChipIdea USB ci13xxx bindings

2015-06-04 Thread Peter Chen
On Fri, May 29, 2015 at 11:38:44AM -0500, Rob Herring wrote:
> Combine the ChipIdea USB binding into a single document to reduce
> duplication and fragmentation. This marks use of the old PHY bindings as
> deprecated. Future compatible bindings should use generic PHY binding.
> 
> Signed-off-by: Rob Herring 
> Cc: Ivan T. Ivanov 
> Cc: Peter Chen 
> Cc: Daniel Tang 
> Cc: Pawel Moll 
> Cc: Mark Rutland 
> Cc: Ian Campbell 
> Cc: Kumar Gala 
> Cc: devicet...@vger.kernel.org
> ---
>  .../devicetree/bindings/usb/ci-hdrc-imx.txt| 35 
> --
>  .../devicetree/bindings/usb/ci-hdrc-qcom.txt   | 17 ---
>  .../devicetree/bindings/usb/ci-hdrc-usb2.txt   | 22 +-
>  .../devicetree/bindings/usb/ci-hdrc-zevio.txt  | 17 ---
>  4 files changed, 21 insertions(+), 70 deletions(-)
>  delete mode 100644 Documentation/devicetree/bindings/usb/ci-hdrc-imx.txt
>  delete mode 100644 Documentation/devicetree/bindings/usb/ci-hdrc-qcom.txt
>  delete mode 100644 Documentation/devicetree/bindings/usb/ci-hdrc-zevio.txt
> 
> diff --git a/Documentation/devicetree/bindings/usb/ci-hdrc-imx.txt 
> b/Documentation/devicetree/bindings/usb/ci-hdrc-imx.txt
> deleted file mode 100644
> index 38a5480..000
> --- a/Documentation/devicetree/bindings/usb/ci-hdrc-imx.txt
> +++ /dev/null
> @@ -1,35 +0,0 @@
> -* Freescale i.MX ci13xxx usb controllers
> -
> -Required properties:
> -- compatible: Should be "fsl,imx27-usb"
> -- reg: Should contain registers location and length
> -- interrupts: Should contain controller interrupt
> -- fsl,usbphy: phandle of usb phy that connects to the port
> -
> -Recommended properies:
> -- phy_type: the type of the phy connected to the core. Should be one
> -  of "utmi", "utmi_wide", "ulpi", "serial" or "hsic". Without this
> -  property the PORTSC register won't be touched
> -- dr_mode: One of "host", "peripheral" or "otg". Defaults to "otg"
> -
> -Optional properties:
> -- fsl,usbmisc: phandler of non-core register device, with one argument
> -  that indicate usb controller index
> -- vbus-supply: regulator for vbus
> -- disable-over-current: disable over current detect
> -- external-vbus-divider: enables off-chip resistor divider for Vbus
> -- maximum-speed: limit the maximum connection speed to "full-speed".
> -- tpl-support: TPL (Targeted Peripheral List) feature for targeted hosts
> -
> -Examples:
> -usb@02184000 { /* USB OTG */
> - compatible = "fsl,imx6q-usb", "fsl,imx27-usb";
> - reg = <0x02184000 0x200>;
> - interrupts = <0 43 0x04>;
> - fsl,usbphy = <&usbphy1>;
> - fsl,usbmisc = <&usbmisc 0>;
> - disable-over-current;
> - external-vbus-divider;
> - maximum-speed = "full-speed";
> - tpl-support;
> -};
> diff --git a/Documentation/devicetree/bindings/usb/ci-hdrc-qcom.txt 
> b/Documentation/devicetree/bindings/usb/ci-hdrc-qcom.txt
> deleted file mode 100644
> index f2899b5..000
> --- a/Documentation/devicetree/bindings/usb/ci-hdrc-qcom.txt
> +++ /dev/null
> @@ -1,17 +0,0 @@
> -Qualcomm CI13xxx (Chipidea) USB controllers
> -
> -Required properties:
> -- compatible:   should contain "qcom,ci-hdrc"
> -- reg:  offset and length of the register set in the memory map
> -- interrupts:   interrupt-specifier for the controller interrupt.
> -- usb-phy:  phandle for the PHY device
> -- dr_mode:  Should be "peripheral"
> -
> -Examples:
> - gadget@f9a55000 {
> - compatible = "qcom,ci-hdrc";
> - reg = <0xf9a55000 0x400>;
> - dr_mode = "peripheral";
> - interrupts = <0 134 0>;
> - usb-phy = <&usbphy0>;
> - };
> diff --git a/Documentation/devicetree/bindings/usb/ci-hdrc-usb2.txt 
> b/Documentation/devicetree/bindings/usb/ci-hdrc-usb2.txt
> index 27f8b1e..553e2fa 100644
> --- a/Documentation/devicetree/bindings/usb/ci-hdrc-usb2.txt
> +++ b/Documentation/devicetree/bindings/usb/ci-hdrc-usb2.txt
> @@ -1,15 +1,35 @@
>  * USB2 ChipIdea USB controller for ci13xxx
>  
>  Required properties:
> -- compatible: should be "chipidea,usb2"
> +- compatible: should be one of:
> + "fsl,imx27-usb"
> + "lsi,zevio-usb"
> + "qcom,ci-hdrc"
> + "chipidea,usb2"
>  - reg: base address and length of the registers
>  - interrupts: interrupt for the USB controller
>  
> +Recommended properties:
> +- phy_type: the type of the phy connected to the core. Should be one
> +  of "utmi", "utmi_wide", "ulpi", "serial" or "hsic". Without this
> +  property the PORTSC register won't be touched.
> +- dr_mode: One of "host", "peripheral" or "otg". Defaults to "otg"
> +
> +Deprecated properties:
> +- usb-phy:  phandle for the PHY device. Use "phys" instead.
> +- fsl,usbphy: phandle of usb phy that connects to the port. Use "phys" 
> instead.
> +
>  Optional properties:
>  - clocks: reference to the USB clock
>  - phys: reference to the USB PHY
>  - phy-names: should be "usb-phy"
>  - vbus-supply: reference to the VBUS regulator
> +- maximum-speed: limit the max

[PATCH v2 0/2] Support for CEVA SATA Host controller

2015-06-04 Thread Suneel Garapati
Adds support for the CEVA SATA host controller found on the Xilinx Zynq
UltraScale+ MPSoC.

Changes v2
 - change module license to GPL v2

Suneel Garapati (2):
  devicetree:bindings: add devicetree bindings for ceva ahci
  drivers: ata: add support for Ceva sata host controller

 .../devicetree/bindings/ata/ahci-ceva.txt  |  20 ++
 drivers/ata/Kconfig|   9 +
 drivers/ata/Makefile   |   1 +
 drivers/ata/ahci_ceva.c| 225 +
 4 files changed, 255 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/ata/ahci-ceva.txt
 create mode 100644 drivers/ata/ahci_ceva.c

--
2.1.2


Re: [PATCH RFC] x86, tsc: Allow for high latency in quick_pit_calibrate()

2015-06-04 Thread Ingo Molnar

* George Spelvin  wrote:

> It's running at 3.4 GHz, so I expect 729478 ticks per 256 PIT counts, and 
> 415039 
> ticks per 8192 Hz RTC tick.

> (PIT reads are 1353 ns each, while RTC reads are 1142 ns.)
> 
> RTC edge at  99172986783, delta   0, range   7764, iter 7
> RTC edge at  99173401719, delta  414936, range   7764, iter 106
> RTC edge at  99173816543, delta  414824, range   7764, iter 106
> RTC edge at  99174231391, delta  414848, range   7764, iter 106
> RTC edge at  99174646119, delta  414728, range   7740, iter 106

> +static inline unsigned
> +rtc_wait_bit(u64 *tscp, unsigned long *deltap)
> +{
> + int count = 0;
> + u64 prev_tsc, tsc = 0;
> +
> + do {
> + if (++count > 5000)
> + return 0;
> + prev_tsc = tsc;
> + tsc = get_cycles();
> + } while (~inb(RTC_PORT(1)) & RTC_PF);   /* Wait for bit 6 to be set */
> + *deltap = get_cycles() - prev_tsc;
> + *tscp = tsc;

> +/* This is skanky stuff that requires rewritten RTC locking to do properly */

[ Note that no RTC locking is needed so early during bootup: this is the boot 
CPU 
  only, with only a single task running, guaranteed. ]

So your code is very close to how I did the RTC sampling, except that I got:

[0.00] tsc: RTC edge 57 from  0 to 64, at  29694678517, delta: 
246360, jitter: 2456, loops:7,35194 cycles/loop
[0.00] tsc: RTC edge 58 from  0 to 64, at  29695169485, delta: 
490968, jitter:   244608, loops:  118, 4160 cycles/loop
[0.00] tsc: RTC edge 59 from  0 to 64, at  29695413981, delta: 
244496, jitter:  -246472, loops:6,40749 cycles/loop
[0.00] tsc: RTC edge 60 from  0 to 64, at  29695660661, delta: 
246680, jitter: 2184, loops:7,35240 cycles/loop
[0.00] tsc: RTC edge 61 from  0 to 64, at  29695904853, delta: 
244192, jitter:-2488, loops:6,40698 cycles/loop
[0.00] tsc: RTC edge 62 from  0 to 64, at  29696151141, delta: 
246288, jitter: 2096, loops:7,35184 cycles/loop
[0.00] tsc: RTC edge 63 from  0 to 64, at  29696396445, delta: 
245304, jitter: -984, loops:6,40884 cycles/loop
[0.00] tsc: RTC edge 64 from  0 to 64, at  29696642669, delta: 
246224, jitter:  920, loops:7,35174 cycles/loop
[0.00] tsc: RTC edge 65 from  0 to 64, at  29696887245, delta: 
244576, jitter:-1648, loops:6,40762 cycles/loop
[0.00] tsc: RTC edge 66 from  0 to 64, at  29697377909, delta: 
490664, jitter:   246088, loops:  117, 4193 cycles/loop
[0.00] tsc: RTC edge 67 from  0 to 64, at  29697622701, delta: 
244792, jitter:  -245872, loops:6,40798 cycles/loop
[0.00] tsc: RTC edge 68 from  0 to 64, at  29697868773, delta: 
246072, jitter: 1280, loops:7,35153 cycles/loop
[0.00] tsc: RTC edge 69 from  0 to 64, at  29700569301, delta:
2700528, jitter:  2454456, loops:   13,   207732 cycles/loop
[0.00] tsc: RTC edge 70 from  0 to 64, at  29700813805, delta: 
244504, jitter: -2456024, loops:6,40750 cycles/loop
[0.00] tsc: RTC edge 71 from  0 to 64, at  29701060125, delta: 
246320, jitter: 1816, loops:7,35188 cycles/loop
[0.00] tsc: RTC edge 72 from  0 to 64, at  29701550189, delta: 
490064, jitter:   243744, loops:  117, 4188 cycles/loop
[0.00] tsc: RTC edge 73 from  0 to 64, at  29701796677, delta: 
246488, jitter:  -243576, loops:7,35212 cycles/loop
[0.00] tsc: RTC edge 74 from  0 to 64, at  29702040829, delta: 
244152, jitter:-2336, loops:6,40692 cycles/loop
[0.00] tsc: RTC edge 75 from  0 to 64, at  29702287597, delta: 
246768, jitter: 2616, loops:7,35252 cycles/loop
[0.00] tsc: RTC edge 76 from  0 to 64, at  29702531741, delta: 
244144, jitter:-2624, loops:6,40690 cycles/loop
[0.00] tsc: RTC edge 77 from  0 to 64, at  29702778341, delta: 
246600, jitter: 2456, loops:7,35228 cycles/loop
[0.00] tsc: RTC edge 78 from  0 to 64, at  29703022661, delta: 
244320, jitter:-2280, loops:6,40720 cycles/loop
[0.00] tsc: RTC edge 79 from  0 to 64, at  29703514245, delta: 
491584, jitter:   247264, loops:  118, 4165 cycles/loop
[0.00] tsc: RTC edge 80 from  0 to 64, at  29703759165, delta: 
244920, jitter:  -246664, loops:6,4

Re: [PATCH v2 0/2] Add MediaTek display PWM driver

2015-06-04 Thread YH Huang
On Mon, 2015-05-25 at 10:14 +0800, Yingjoe Chen wrote:
> On Thu, 2015-05-21 at 21:29 +0800, YH Huang wrote:

This patch series adds the display PWM driver and documentation
for MediaTek SoCs. The driver is used to support the backlight of
the panel. This is based on v4.1-rc1.

> > YH Huang (2):
> >   dt-bindings: pwm: add MediaTek display PWM bindings
> >   pwm: add MediaTek display PWM driver support
> > 
> >  .../devicetree/bindings/pwm/pwm-mtk-disp.txt   |  25 +++
> >  drivers/pwm/Kconfig|  10 +
> >  drivers/pwm/Makefile   |   1 +
> >  drivers/pwm/pwm-mtk-disp.c | 228 
> > +
> >  4 files changed, 264 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/pwm/pwm-mtk-disp.txt
> >  create mode 100644 drivers/pwm/pwm-mtk-disp.c
> 
> Hi YH,
> 
> It would be easier for reviewer if you have a summary here on what you
> have changed compare to last version.
> Also, please add patch series summary even for v2, it remind reviewer
> what this series is about.
> 
> Joe.C
> 

Patch v2 is refined based on everybody's suggestions.
It is much more readable and consistent.

If anyone has any suggestions, please just let me know.
Thank you.

Regards,
YH Huang



Re: [PATCH 1/1] gpio_wdt: change initcall level

2015-06-04 Thread Jean-Baptiste Theou
Hi Guenter,

I based my work on the work done in mpc8xxx_wdt.c, which is in mainline.

The point of my patch is the built-in scenario.
I have an external chip that controls the watchdog, and it needs to have
its IN pin toggled within 1.6s, otherwise it triggers the watchdog.

With a default built-in gpio_wdt module, the module_init initcall level
is too late and the board reboots (the watchdog cannot be disabled; I am
using the "always-running" property of this module).

The point of my patch is to start the watchdog at arch_initcall level,
and the "tweak" for late init is due to the fact that miscdev is not
ready at that initcall level, as explained in the comment.

If some parts aren't clear, or if you have a better idea on how to raise
this module's initcall level in a cleaner way, I am all ears.

Best regards,

On Thu, 4 Jun 2015 21:37:03 -0700
Guenter Roeck  wrote:

> On 06/04/2015 12:21 PM, Jean-Baptiste Theou wrote:
> > gpio_wdt may need to start the GPIO toggle as soon as possible,
> > when the watchdog cannot be disabled. Raise the initcall to
> > arch_initcall.
> >
> > We need to split the initiation, because of miscdev, as done in
> > mpc8xxx_wdt.c
> >
> > Signed-off-by: Jean-Baptiste Theou 
> > ---
> >   drivers/watchdog/gpio_wdt.c | 78 
> > ++---
> >   1 file changed, 74 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/watchdog/gpio_wdt.c b/drivers/watchdog/gpio_wdt.c
> > index cbc313d..8ecfe7e 100644
> > --- a/drivers/watchdog/gpio_wdt.c
> > +++ b/drivers/watchdog/gpio_wdt.c
> > @@ -14,6 +14,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -223,10 +224,11 @@ static int gpio_wdt_probe(struct platform_device 
> > *pdev)
> >
> > setup_timer(&priv->timer, gpio_wdt_hwping, (unsigned long)&priv->wdd);
> >
> > -   ret = watchdog_register_device(&priv->wdd);
> > +#ifdef MODULE
> > +   ret = gpio_wdt_init_late();
> > if (ret)
> > return ret;
> > -
> > +#endif
> > priv->notifier.notifier_call = gpio_wdt_notify_sys;
> > ret = register_reboot_notifier(&priv->notifier);
> > if (ret)
> > @@ -235,10 +237,13 @@ static int gpio_wdt_probe(struct platform_device 
> > *pdev)
> > if (priv->always_running)
> > gpio_wdt_start_impl(priv);
> >
> > +   platform_set_drvdata(pdev, priv);
> > return 0;
> >
> >   error_unregister:
> > -   watchdog_unregister_device(&priv->wdd);
> > +#ifdef MODULE
> > +   ret = gpio_wdt_remove_late(&priv->wdd);
> > +#endif
> > return ret;
> >   }
> >
> > @@ -267,7 +272,72 @@ static struct platform_driver gpio_wdt_driver = {
> > .probe  = gpio_wdt_probe,
> > .remove = gpio_wdt_remove,
> >   };
> > -module_platform_driver(gpio_wdt_driver);
> > +
> > +/*
> > + * We do wdt initialization in two steps: arch_initcall probes the wdt
> > + * very early to start pinging the watchdog (misc devices are not yet
> > + * available), and later module_init() just registers the misc device.
> > + */
> > +static int gpio_wdt_init_late(void)
> > +{
> > +   struct platform_device *pdev;
> > +   struct device_node *wdt_node;
> > +   struct gpio_wdt_priv *priv;
> > +   int ret;
> > +
> > +   for_each_compatible_node(wdt_node, NULL, "linux,wdt-gpio") {
> > +   pdev = of_find_device_by_node(wdt_node);
> > +   priv = platform_get_drvdata(pdev);
> > +   if (&priv->wdd) {
> > +   ret = watchdog_register_device(&priv->wdd);
> > +   if (ret)
> > +   return ret;
> > +   } else {
> > +   dev_err(&pdev->dev, "Unable to register the 
> > watchdog\n");
> > +   return -1;
> > +   }
> > +   }
> > +   return 0;
> > +}
> > +#ifndef MODULE
> > +module_init(gpio_wdt_init_late);
> > +#endif
> > +
> > +#ifdef MODULE
> > +int gpio_wdt_remove_late(void)
> > +{
> > +   struct platform_device *pdev;
> > +   struct device_node *wdt_node;
> > +   struct gpio_wdt_priv *priv;
> > +   int ret;
> > +
> > +   for_each_compatible_node(wdt_node, NULL, "linux,wdt-gpio") {
> > +   pdev = of_find_device_by_node(wdt_node);
> > +   priv = platform_get_drvdata(pdev);
> > +   if (&priv->wdd) {
> > +   ret = watchdog_unregister_device(&priv->wdd);
> > +   if (ret)
> > +   return ret;
> > +   } else {
> > +   dev_err(&pdev->dev, "Unable to register the 
> > watchdog\n");
> > +   return -1;
> > +   }
> > +   }
> > +   return 0;
> > +}
> > +#endif
> > +
> > +static int __init gpio_wdt_init(void)
> > +{
> > +   return platform_driver_register(&gpio_wdt_driver);
> > +}
> > +arch_initcall(gpio_wdt_init);
> > +
> > +static void __exit gpio_wdt_exit(void)
> > +{
> > +   platform_driver_unregister(&gpio_wdt_driver);
> > +}
> > +module_exit(gpio_wdt_exit);
> >
> >   MODULE_AUTHOR("Alexander Shiyan ");
> 

[PATCH v2 1/2] devicetree:bindings: add devicetree bindings for ceva ahci

2015-06-04 Thread Suneel Garapati
Adds bindings for the CEVA AHCI SATA controller. The optional property
broken-gen2 is useful in case of a hardware speed limitation.

Signed-off-by: Suneel Garapati 
---
 Documentation/devicetree/bindings/ata/ahci-ceva.txt | 20 
 1 file changed, 20 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/ata/ahci-ceva.txt

diff --git a/Documentation/devicetree/bindings/ata/ahci-ceva.txt 
b/Documentation/devicetree/bindings/ata/ahci-ceva.txt
new file mode 100644
index 000..7ca8b97
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/ahci-ceva.txt
@@ -0,0 +1,20 @@
+Binding for CEVA AHCI SATA Controller
+
+Required properties:
+  - reg: Physical base address and size of the controller's register area.
+  - compatible: Compatibility string. Must be 'ceva,ahci-1v84'.
+  - clocks: Input clock specifier. Refer to common clock bindings.
+  - interrupts: Interrupt specifier. Refer to interrupt binding.
+
+Optional properties:
+  - ceva,broken-gen2: limit to gen1 speed instead of gen2.
+
+Examples:
+   ahci@fd0c {
+   compatible = "ceva,ahci-1v84";
+   reg = <0xfd0c 0x200>;
+   interrupt-parent = <&gic>;
+   interrupts = <0 133 4>;
+   clocks = <&clkc SATA_CLK_ID>;
+   ceva,broken-gen2;
+   };
--
2.1.2


[PATCH v2 2/2] drivers: ata: add support for Ceva sata host controller

2015-06-04 Thread Suneel Garapati
Adds support for the CEVA SATA host controller on the Xilinx
Zynq UltraScale+ MPSoC.

Signed-off-by: Suneel Garapati 
---
Changes v2
 - Change module license string to GPL v2
---
 drivers/ata/Kconfig |   9 ++
 drivers/ata/Makefile|   1 +
 drivers/ata/ahci_ceva.c | 225 
 3 files changed, 235 insertions(+)
 create mode 100644 drivers/ata/ahci_ceva.c

diff --git a/drivers/ata/Kconfig b/drivers/ata/Kconfig
index b4524f4..6d17a3b 100644
--- a/drivers/ata/Kconfig
+++ b/drivers/ata/Kconfig
@@ -133,6 +133,15 @@ config AHCI_IMX

  If unsure, say N.

+config AHCI_CEVA
+   tristate "CEVA AHCI SATA support"
+   depends on OF
+   help
+ This option enables support for the CEVA AHCI SATA.
+ It can be found on the Xilinx Zynq UltraScale+ MPSoC.
+
+ If unsure, say N.
+
 config AHCI_MVEBU
tristate "Marvell EBU AHCI SATA support"
depends on ARCH_MVEBU
diff --git a/drivers/ata/Makefile b/drivers/ata/Makefile
index 5154753..af70919 100644
--- a/drivers/ata/Makefile
+++ b/drivers/ata/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_SATA_SIL24)  += sata_sil24.o
 obj-$(CONFIG_SATA_DWC) += sata_dwc_460ex.o
 obj-$(CONFIG_SATA_HIGHBANK)+= sata_highbank.o libahci.o
 obj-$(CONFIG_AHCI_BRCMSTB) += ahci_brcmstb.o libahci.o libahci_platform.o
+obj-$(CONFIG_AHCI_CEVA)+= ahci_ceva.o libahci.o 
libahci_platform.o
 obj-$(CONFIG_AHCI_DA850)   += ahci_da850.o libahci.o libahci_platform.o
 obj-$(CONFIG_AHCI_IMX) += ahci_imx.o libahci.o libahci_platform.o
 obj-$(CONFIG_AHCI_MVEBU)   += ahci_mvebu.o libahci.o libahci_platform.o
diff --git a/drivers/ata/ahci_ceva.c b/drivers/ata/ahci_ceva.c
new file mode 100644
index 000..559d960
--- /dev/null
+++ b/drivers/ata/ahci_ceva.c
@@ -0,0 +1,225 @@
+/*
+ * Copyright (C) 2015 Xilinx, Inc.
+ * CEVA AHCI SATA platform driver
+ *
+ * based on the AHCI SATA platform driver by Jeff Garzik and Anton Vorontsov
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program. If not, see .
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "ahci.h"
+
+/* Vendor Specific Register Offsets */
+#define AHCI_VEND_PCFG  0xA4
+#define AHCI_VEND_PPCFG 0xA8
+#define AHCI_VEND_PP2C  0xAC
+#define AHCI_VEND_PP3C  0xB0
+#define AHCI_VEND_PP4C  0xB4
+#define AHCI_VEND_PP5C  0xB8
+#define AHCI_VEND_PAXIC 0xC0
+#define AHCI_VEND_PTC   0xC8
+
+/* Vendor Specific Register bit definitions */
+#define PAXIC_ADBW_BW64 0x1
+#define PAXIC_MAWIDD   (1 << 8)
+#define PAXIC_MARIDD   (1 << 16)
+#define PAXIC_OTL  (0x4 << 20)
+
+#define PCFG_TPSS_VAL  (0x32 << 16)
+#define PCFG_TPRS_VAL  (0x2 << 12)
+#define PCFG_PAD_VAL   0x2
+
+#define PPCFG_TTA  0x1FFFE
+#define PPCFG_PSSO_EN  (1 << 28)
+#define PPCFG_PSS_EN   (1 << 29)
+#define PPCFG_ESDF_EN  (1 << 31)
+
+#define PP2C_CIBGMN0x0F
+#define PP2C_CIBGMX(0x25 << 8)
+#define PP2C_CIBGN (0x18 << 16)
+#define PP2C_CINMP (0x29 << 24)
+
+#define PP3C_CWBGMN0x04
+#define PP3C_CWBGMX(0x0B << 8)
+#define PP3C_CWBGN (0x08 << 16)
+#define PP3C_CWNMP (0x0F << 24)
+
+#define PP4C_BMX   0x0a
+#define PP4C_BNM   (0x08 << 8)
+#define PP4C_SFD   (0x4a << 16)
+#define PP4C_PTST  (0x06 << 24)
+
+#define PP5C_RIT   0x60216
+#define PP5C_RCT   (0x7f0 << 20)
+
+#define PTC_RX_WM_VAL  0x40
+#define PTC_RSVD   (1 << 27)
+
+#define PORT0_BASE 0x100
+#define PORT1_BASE 0x180
+
+/* Port Control Register Bit Definitions */
+#define PORT_SCTL_SPD_GEN2 (0x2 << 4)
+#define PORT_SCTL_SPD_GEN1 (0x1 << 4)
+#define PORT_SCTL_IPM  (0x3 << 8)
+
+#define PORT_BASE  0x100
+#define PORT_OFFSET0x80
+#define NR_PORTS   2
+#define DRV_NAME   "ahci-ceva"
+#define CEVA_FLAG_BROKEN_GEN2  1
+
+struct ceva_ahci_priv {
+   struct platform_device *ahci_pdev;
+   int flags;
+};
+
+static struct ata_port_operations ahci_ceva_ops = {
+   .inherits = &ahci_platform_ops,
+};
+
+static const struct ata_port_info ahci_ceva_port_info = {
+   .flags  = AHCI_FLAG_COMMON,
+   .pio_mask   = ATA_PIO4,
+   .udma_mask  = ATA_UDMA6,
+   .port_ops   = &ahci_ceva_ops,
+};
+
+static void ahci_ceva_setup(struct ahci_host_priv *hpriv)
+{
+   void __iomem *mmio = hpriv->mmio;
+   struct ceva_ahci_priv *cevapriv = hpriv->plat_data;
+   u32 tmp;
+   int i;
+
+   /*
+* AXI Data bus w

Re: [PATCH RFC] x86, tsc: Allow for high latency in quick_pit_calibrate()

2015-06-04 Thread Ingo Molnar

* George Spelvin  wrote:

> Ingo Molnar wrote:
> > - Alternatively, I also tried a different method: to set up the RTC
> >   periodic IRQ during early boot, but not have an IRQ handler, polling
> >   RTC_PF in the rtc_cmos_read(RTC_INTR_FLAGS) IRQ status byte.
> >
> >   Unfortunately when I do this then PIO based RTC accesses can take
> >   tens of thousands of cycles, and the resulting jitter is pretty bad
> >   and hard to filter:
> 
> Did you use rtc_cmos_read()?  [...]

Yeah, so initially I did, but then, after noticing the overhead, I introduced:

+unsigned char rtc_cmos_read_again(void)
+{
+   return inb(RTC_PORT(1));
+}
+

which compiles to a single INB instruction.

This didn't change the delay/cost behavior.

The numbers I cited, with tens of thousands of cycles per iteration, were from 
such an optimized poll loop already.

Thanks,

Ingo


[v9 4/9] iommu, x86: Save the mode (posted or remapped) of an IRTE

2015-06-04 Thread Feng Wu
This patch adds a new field in struct irq_2_iommu, which can
capture whether the entry is in posted mode or remapped mode.

Signed-off-by: Feng Wu 
Suggested-by: Thomas Gleixner 
---
 drivers/iommu/intel_irq_remapping.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/intel_irq_remapping.c 
b/drivers/iommu/intel_irq_remapping.c
index 9bbc235..028d628 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -18,6 +18,11 @@
 
 #include "irq_remapping.h"
 
+enum irq_mode {
+   IRQ_REMAPPING,
+   IRQ_POSTING,
+};
+
 struct ioapic_scope {
struct intel_iommu *iommu;
unsigned int id;
@@ -37,6 +42,7 @@ struct irq_2_iommu {
u16 irte_index;
u16 sub_handle;
u8  irte_mask;
+   enum irq_mode mode;
 };
 
 struct intel_ir_data {
@@ -104,6 +110,7 @@ static int alloc_irte(struct intel_iommu *iommu, int irq,
irq_iommu->irte_index =  index;
irq_iommu->sub_handle = 0;
irq_iommu->irte_mask = mask;
+   irq_iommu->mode = IRQ_REMAPPING;
}
raw_spin_unlock_irqrestore(&irq_2_ir_lock, flags);
 
@@ -144,6 +151,8 @@ static int modify_irte(struct irq_2_iommu *irq_iommu,
__iommu_flush_cache(iommu, irte, sizeof(*irte));
 
rc = qi_flush_iec(iommu, index, 0);
+
+   irq_iommu->mode = irte->pst ? IRQ_POSTING : IRQ_REMAPPING;
raw_spin_unlock_irqrestore(&irq_2_ir_lock, flags);
 
return rc;
-- 
2.1.0



[v9 3/9] iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip

2015-06-04 Thread Feng Wu
Implement irq_set_vcpu_affinity for intel_ir_chip.

Signed-off-by: Feng Wu 
Reviewed-by: Jiang Liu 
Acked-by: David Woodhouse 
---
 arch/x86/include/asm/irq_remapping.h |  5 +
 drivers/iommu/intel_irq_remapping.c  | 43 
 2 files changed, 48 insertions(+)

diff --git a/arch/x86/include/asm/irq_remapping.h 
b/arch/x86/include/asm/irq_remapping.h
index 0953723..202e040 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -57,6 +57,11 @@ static inline struct irq_domain 
*arch_get_ir_parent_domain(void)
return x86_vector_domain;
 }
 
+struct vcpu_data {
+   u64 pi_desc_addr;   /* Physical address of PI Descriptor */
+   u32 vector; /* Guest vector of the interrupt */
+};
+
 #else  /* CONFIG_IRQ_REMAP */
 
 static inline void set_irq_remapping_broken(void) { }
diff --git a/drivers/iommu/intel_irq_remapping.c 
b/drivers/iommu/intel_irq_remapping.c
index 8fad71c..9bbc235 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -1013,10 +1013,53 @@ static void intel_ir_compose_msi_msg(struct irq_data 
*irq_data,
*msg = ir_data->msi_entry;
 }
 
+static int intel_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
+{
+   struct intel_ir_data *ir_data = data->chip_data;
+   struct irte irte_pi;
+   struct vcpu_data *vcpu_pi_info;
+
+   /* stop posting interrupts, back to remapping mode */
+   if (!vcpu_info) {
+   modify_irte(&ir_data->irq_2_iommu, &ir_data->irte_entry);
+   } else {
+   vcpu_pi_info = (struct vcpu_data *)vcpu_info;
+
+   /*
+* "ir_data->irte_entry" saves the remapped format of IRTE,
+* which being a cached irte is still updated when setting
+* the affinity even when we are in posted mode. So this makes
+* it possible to switch back to remapped mode from posted mode,
+* we can just set "ir_data->irte_entry" to hardware for that
+* purpose.
+*/
+   memcpy(&irte_pi, &ir_data->irte_entry, sizeof(struct irte));
+
+   irte_pi.p_urgent = 0;
+   irte_pi.p_vector = vcpu_pi_info->vector;
+   irte_pi.pda_l = (vcpu_pi_info->pi_desc_addr >>
+   (32 - PDA_LOW_BIT)) & ~(-1UL << PDA_LOW_BIT);
+   irte_pi.pda_h = (vcpu_pi_info->pi_desc_addr >> 32) &
+   ~(-1UL << PDA_HIGH_BIT);
+
+   irte_pi.p_res0 = 0;
+   irte_pi.p_res1 = 0;
+   irte_pi.p_res2 = 0;
+   irte_pi.p_res3 = 0;
+
+   irte_pi.p_pst = 1;
+
+   modify_irte(&ir_data->irq_2_iommu, &irte_pi);
+   }
+
+   return 0;
+}
+
 static struct irq_chip intel_ir_chip = {
.irq_ack = ir_ack_apic_edge,
.irq_set_affinity = intel_ir_set_affinity,
.irq_compose_msi_msg = intel_ir_compose_msi_msg,
+   .irq_set_vcpu_affinity = intel_ir_set_vcpu_affinity,
 };
 
 static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
-- 
2.1.0



[v9 9/9] iommu, x86: Properly handle PI for IOMMU hotplug

2015-06-04 Thread Feng Wu
Return an error when inserting a new IOMMU that doesn't support PI
while PI is currently in use.

Signed-off-by: Feng Wu 
---
 drivers/iommu/intel_irq_remapping.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/intel_irq_remapping.c 
b/drivers/iommu/intel_irq_remapping.c
index 554e203..4fb3576 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -1360,6 +1360,9 @@ int dmar_ir_hotplug(struct dmar_drhd_unit *dmaru, bool 
insert)
return -EINVAL;
if (!ecap_ir_support(iommu->ecap))
return 0;
+   if (irq_remapping_cap(IRQ_POSTING_CAP) &&
+   !cap_pi_support(iommu->cap))
+   return -EBUSY;
 
if (insert) {
if (!iommu->ir_table)
-- 
2.1.0



[v9 5/9] iommu, x86: No need to migrate irqs for VT-d Posted-Interrupts

2015-06-04 Thread Feng Wu
We don't need to migrate the irqs for VT-d Posted-Interrupts here.
When 'pst' is set in IRTE, the associated irq will be posted to
guests instead of interrupt remapping. The destination of the
interrupt is set in Posted-Interrupts Descriptor, and the migration
happens during vCPU scheduling.

However, we still update the cached irte here, which can be used
when changing back to remapping mode.

Signed-off-by: Feng Wu 
Reviewed-by: Jiang Liu 
Acked-by: David Woodhouse 
---
 drivers/iommu/intel_irq_remapping.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel_irq_remapping.c 
b/drivers/iommu/intel_irq_remapping.c
index 028d628..c30845a 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -1002,7 +1002,10 @@ intel_ir_set_affinity(struct irq_data *data, const 
struct cpumask *mask,
 */
irte->vector = cfg->vector;
irte->dest_id = IRTE_DEST(cfg->dest_apicid);
-   modify_irte(&ir_data->irq_2_iommu, irte);
+
+   /* Update the hardware only if the interrupt is in remapped mode. */
+   if (ir_data->irq_2_iommu.mode == IRQ_REMAPPING)
+   modify_irte(&ir_data->irq_2_iommu, irte);
 
/*
 * After this point, all the interrupts will start arriving
-- 
2.1.0



[v9 2/9] iommu: dmar: Extend struct irte for VT-d Posted-Interrupts

2015-06-04 Thread Feng Wu
From: Thomas Gleixner 

The IRTE (Interrupt Remapping Table Entry) is either an entry for
remapped or for posted interrupts. The hardware distinguishes between
remapped and posted entries by bit 15 in the low 64 bit of the
IRTE. If cleared the entry is remapped, if set it's posted.

The entries have common fields and dependent on the posted bit fields
with different meanings.

Extend struct irte to handle the differences between remap and posted
mode by having three structs in the unions:

- Shared
- Remapped
- Posted

Signed-off-by: Thomas Gleixner 
Signed-off-by: Feng Wu 
---
 include/linux/dmar.h | 70 +---
 1 file changed, 55 insertions(+), 15 deletions(-)

diff --git a/include/linux/dmar.h b/include/linux/dmar.h
index 8473756..0dbcabc 100644
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -185,33 +185,73 @@ static inline int dmar_device_remove(void *handle)
 
 struct irte {
union {
+   /* Shared between remapped and posted mode*/
struct {
-   __u64   present : 1,
-   fpd : 1,
-   dst_mode: 1,
-   redir_hint  : 1,
-   trigger_mode: 1,
-   dlvry_mode  : 3,
-   avail   : 4,
-   __reserved_1: 4,
-   vector  : 8,
-   __reserved_2: 8,
-   dest_id : 32;
+   __u64   present : 1,  /*  0  */
+   fpd : 1,  /*  1  */
+   __res0  : 6,  /*  2 -  6 */
+   avail   : 4,  /*  8 - 11 */
+   __res1  : 3,  /* 12 - 14 */
+   pst : 1,  /* 15  */
+   vector  : 8,  /* 16 - 23 */
+   __res2  : 40; /* 24 - 63 */
+   };
+
+   /* Remapped mode */
+   struct {
+   __u64   r_present   : 1,  /*  0  */
+   r_fpd   : 1,  /*  1  */
+   dst_mode: 1,  /*  2  */
+   redir_hint  : 1,  /*  3  */
+   trigger_mode: 1,  /*  4  */
+   dlvry_mode  : 3,  /*  5 -  7 */
+   r_avail : 4,  /*  8 - 11 */
+   r_res0  : 4,  /* 12 - 15 */
+   r_vector: 8,  /* 16 - 23 */
+   r_res1  : 8,  /* 24 - 31 */
+   dest_id : 32; /* 32 - 63 */
+   };
+
+   /* Posted mode */
+   struct {
+   __u64   p_present   : 1,  /*  0  */
+   p_fpd   : 1,  /*  1  */
+   p_res0  : 6,  /*  2 -  7 */
+   p_avail : 4,  /*  8 - 11 */
+   p_res1  : 2,  /* 12 - 13 */
+   p_urgent: 1,  /* 14  */
+   p_pst   : 1,  /* 15  */
+   p_vector: 8,  /* 16 - 23 */
+   p_res2  : 14, /* 24 - 37 */
+   pda_l   : 26; /* 38 - 63 */
};
__u64 low;
};
 
union {
+   /* Shared between remapped and posted mode*/
struct {
-   __u64   sid : 16,
-   sq  : 2,
-   svt : 2,
-   __reserved_3: 44;
+   __u64   sid : 16,  /* 64 - 79  */
+   sq  : 2,   /* 80 - 81  */
+   svt : 2,   /* 82 - 83  */
+   __res3  : 44;  /* 84 - 127 */
+   };
+
+   /* Posted mode*/
+   struct {
+   __u64   p_sid   : 16,  /* 64 - 79  */
+   p_sq: 2,   /* 80 - 81  */
+   p_svt   : 2,   /* 82 - 83  */
+   p_res3  : 12,  /* 84 - 95  */
+   pda_h   : 32;  /* 96 - 127 */
};
__u64 high;
};
 };
 
+#define PDA_LOW_BIT26
+#define PDA_HIGH_BIT   32
+
 enu

[v9 6/9] iommu, x86: Add cap_pi_support() to detect VT-d PI capability

2015-06-04 Thread Feng Wu
Add helper function to detect VT-d Posted-Interrupts capability.

Signed-off-by: Feng Wu 
Reviewed-by: Jiang Liu 
Acked-by: David Woodhouse 
---
 include/linux/intel-iommu.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 0af9b03..0c251be 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -87,6 +87,7 @@ static inline void dmar_writeq(void __iomem *addr, u64 val)
 /*
  * Decoding Capability Register
  */
+#define cap_pi_support(c)  (((c) >> 59) & 1)
 #define cap_read_drain(c)  (((c) >> 55) & 1)
 #define cap_write_drain(c) (((c) >> 54) & 1)
 #define cap_max_amask_val(c)   (((c) >> 48) & 0x3f)
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] x86, tsc: Allow for high latency in quick_pit_calibrate()

2015-06-04 Thread George Spelvin
FWIW, I wrote my own test routine, with some interesting results.
It's a rude bodge and obviously not kernel-quality, but included if
anyone wants to mess with it.

My machine is an X79 motherboard with a common ITE IT8728F SuperIO
chip providing both RTC and PIT.

The interesting bit is that I can double the speed of the PIT code, and
the really interesting part is that the RTC code is 18% faster still
(85% of the time).

It's running at 3.4 GHz, so I expect 729478 ticks per 256 PIT
counts, and 415039 ticks per 8192 Hz RTC tick.


Anyway, here are the results using the current pit_expect_msb().
The values printed are the returned tsc, the delta to the previous one,
and the uncertainty range, which is the time for two reads
(4 inb() operations).


PIT edge at  99066193034, delta   0, range  18372
PIT edge at  99066918775, delta  725741, range  18372
PIT edge at  99067644199, delta  725424, range  18372
PIT edge at  99068379191, delta  734992, range  18372
PIT edge at  99069104615, delta  725424, range  18372
PIT edge at  99069839127, delta  734512, range  18372
PIT edge at  99070564551, delta  725424, range  18372
PIT edge at  99071299583, delta  735032, range  18348
PIT edge at  99072025530, delta  725947, range  18372
PIT edge at  99072750839, delta  725309, range  18372
PIT edge at  99073485447, delta  734608, range  18372
PIT edge at  99074210778, delta  725331, range  18372
PIT edge at  99074945471, delta  734693, range  18372
PIT edge at  99075670807, delta  725336, range  18372
PIT edge at  99076406543, delta  735736, range  18372
PIT edge at  99077132874, delta  726331, range  18372
PIT edge at  99077858095, delta  725221, range  18372
PIT edge at  99078593719, delta  735624, range  18372
PIT edge at  99079319255, delta  725536, range  18372
PIT edge at  99080053767, delta  734512, range  18372
PIT edge at  99080779079, delta  725312, range  18372
PIT edge at  99081504322, delta  725243, range  18372
PIT edge at  99082239311, delta  734989, range  18372
PIT edge at  99082964554, delta  725243, range  18372
PIT edge at  99083699543, delta  734989, range  18372
PIT edge at  99084425602, delta  726059, range  18372
PIT edge at  99085160095, delta  734493, range  18372
PIT edge at  99085885311, delta  725216, range  18372
PIT edge at  99086610535, delta  725224, range  18372
PIT edge at  99087345751, delta  735216, range  18372
PIT edge at  99088071399, delta  725648, range  18372
PIT edge at  99088805911, delta  734512, range  18372
PIT edge at  99089531519, delta  725608, range  18372
PIT edge at  99090266327, delta  734808, range  18372
PIT edge at  99090991567, delta  725240, range  18372
PIT edge at  99091716767, delta  725200, range  18372
PIT edge at  99092451279, delta  734512, range  18372
PIT edge at  99093176615, delta  725336, range  18487
PIT edge at  99093911423, delta  734808, range  18372
PIT edge at  99094636847, delta  725424, range  18372
PIT edge at  99095371447, delta  734600, range  18372
PIT edge at  99096096671, delta  725224, range  18372
PIT edge at  99096831703, delta  735032, range  18372
PIT edge at  99097557535, delta  725832, range  18372
PIT edge at  99098282959, delta  725424, range  18372
PIT edge at  99099018063, delta  735104, range  18372
PIT edge at  99099743303, delta  725240, range  18372
PIT edge at  99100477703, delta  734400, range  18372
PIT edge at  99101203015, delta  725312, range  18372
PIT edge at  99101937415, delta  734400, range  18372

Here's the same for an optimized PIT routine, which places the PIT in
msbyte-only mode, so it needs only one read to poll the PIT.

It also prints the number of iterations inside the PIT spin loop.

Note that it's exactly twice the speed, but the variance
is much higher.

PIT edge at  99131203367, delta   0, range   9215, iter 158
PIT edge at  99131929383, delta  726016, range   9172, iter 157
PIT edge at  99132659519, delta  730136, range   9215, iter 158
PIT edge at  99133389546, delta  730027, range   9172, iter 158
PIT edge at  99134120047, delta  730501, range   9188, iter 158
PIT edge at  99134850095, delta  730048, range   9508, iter 158
PIT edge at  99135580303, delta  730208, range   9188, iter 158
PIT edge at  99136310623, delta  730320, range   9188, iter 158
PIT edge at  99137035935, delta  725312, range   9193, iter 157
PIT edge at  99137765754, delta  729819, range   9172, iter 158
PIT edge at  99138495666, delta  729912, range   9172, iter 158
PIT edge at  99139225578, delta  729912, range   9172, iter 158
PIT edge at  99139955511, delta  729933, range   9172, iter 158
PIT edge at  99140685311, delta  729800, range   9212, iter 158
PIT edge at  99141415743, delta  730432, range   9215, iter 158
PIT edge at  99142146247, delta  730504, range   9169, iter 158
PIT edge at  99142872303, delta  726056, range   9215, iter 157
PIT edge at  99143603031, delta  730728, range   9215, iter 158
PIT edge at  99144333559, delta  730528, range   9169, iter 158
PIT edge at  99145063879, delta  730320, range   9193, 

[v9 7/9] iommu, x86: Setup Posted-Interrupts capability for Intel iommu

2015-06-04 Thread Feng Wu
Set the Posted-Interrupts capability for the Intel iommu when IR is
enabled, and clear it when IR is disabled.

Signed-off-by: Feng Wu 
---
 drivers/iommu/intel_irq_remapping.c | 30 ++
 drivers/iommu/irq_remapping.c   |  2 ++
 drivers/iommu/irq_remapping.h   |  3 +++
 3 files changed, 35 insertions(+)

diff --git a/drivers/iommu/intel_irq_remapping.c 
b/drivers/iommu/intel_irq_remapping.c
index c30845a..554e203 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -580,6 +580,26 @@ error:
return -ENODEV;
 }
 
+/*
+ * Set Posted-Interrupts capability.
+ */
+static inline void set_irq_posting_cap(void)
+{
+   struct dmar_drhd_unit *drhd;
+   struct intel_iommu *iommu;
+
+   if (!disable_irq_post) {
+   intel_irq_remap_ops.capability |= 1 << IRQ_POSTING_CAP;
+
+   for_each_iommu(iommu, drhd)
+   if (!cap_pi_support(iommu->cap)) {
+   intel_irq_remap_ops.capability &=
+   ~(1 << IRQ_POSTING_CAP);
+   break;
+   }
+   }
+}
+
 static int __init intel_enable_irq_remapping(void)
 {
struct dmar_drhd_unit *drhd;
@@ -655,6 +675,8 @@ static int __init intel_enable_irq_remapping(void)
 
irq_remapping_enabled = 1;
 
+   set_irq_posting_cap();
+
pr_info("Enabled IRQ remapping in %s mode\n", eim ? "x2apic" : "xapic");
 
return eim ? IRQ_REMAP_X2APIC_MODE : IRQ_REMAP_XAPIC_MODE;
@@ -855,6 +877,12 @@ static void disable_irq_remapping(void)
 
iommu_disable_irq_remapping(iommu);
}
+
+   /*
+* Clear Posted-Interrupts capability.
+*/
+   if (!disable_irq_post)
+   intel_irq_remap_ops.capability &= ~(1 << IRQ_POSTING_CAP);
 }
 
 static int reenable_irq_remapping(int eim)
@@ -882,6 +910,8 @@ static int reenable_irq_remapping(int eim)
if (!setup)
goto error;
 
+   set_irq_posting_cap();
+
return 0;
 
 error:
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index fc78b0d..ed605a9 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -22,6 +22,8 @@ int irq_remap_broken;
 int disable_sourceid_checking;
 int no_x2apic_optout;
 
+int disable_irq_post = 1;
+
 static int disable_irq_remap;
 static struct irq_remap_ops *remap_ops;
 
diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index b6ca30d..039c7af 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -34,6 +34,8 @@ extern int disable_sourceid_checking;
 extern int no_x2apic_optout;
 extern int irq_remapping_enabled;
 
+extern int disable_irq_post;
+
 struct irq_remap_ops {
/* The supported capabilities */
int capability;
@@ -69,6 +71,7 @@ extern void ir_ack_apic_edge(struct irq_data *data);
 
 #define irq_remapping_enabled 0
 #define irq_remap_broken  0
+#define disable_irq_post  1
 
 #endif /* CONFIG_IRQ_REMAP */
 
-- 
2.1.0



[v9 8/9] iommu, x86: define irq_remapping_cap()

2015-06-04 Thread Feng Wu
This patch adds a new interface, irq_remapping_cap(), to detect
whether irq remapping supports new features such as VT-d
Posted-Interrupts. The function is exported so that KVM code can
check the capability and use the mechanism properly.

Signed-off-by: Feng Wu 
Reviewed-by: Jiang Liu 
---
 arch/x86/include/asm/irq_remapping.h | 2 ++
 drivers/iommu/irq_remapping.c| 9 +
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/irq_remapping.h 
b/arch/x86/include/asm/irq_remapping.h
index 202e040..61aa8ad 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -35,6 +35,7 @@ enum irq_remap_cap {
IRQ_POSTING_CAP = 0,
 };
 
+extern bool irq_remapping_cap(enum irq_remap_cap cap);
 extern void set_irq_remapping_broken(void);
 extern int irq_remapping_prepare(void);
 extern int irq_remapping_enable(void);
@@ -64,6 +65,7 @@ struct vcpu_data {
 
 #else  /* CONFIG_IRQ_REMAP */
 
+static inline bool irq_remapping_cap(enum irq_remap_cap cap) { return false; }
 static inline void set_irq_remapping_broken(void) { }
 static inline int irq_remapping_prepare(void) { return -ENODEV; }
 static inline int irq_remapping_enable(void) { return -ENODEV; }
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index ed605a9..2d99930 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -81,6 +81,15 @@ void set_irq_remapping_broken(void)
irq_remap_broken = 1;
 }
 
+bool irq_remapping_cap(enum irq_remap_cap cap)
+{
+   if (!remap_ops || disable_irq_post)
+   return 0;
+
+   return (remap_ops->capability & (1 << cap));
+}
+EXPORT_SYMBOL_GPL(irq_remapping_cap);
+
 int __init irq_remapping_prepare(void)
 {
if (disable_irq_remap)
-- 
2.1.0



[v9 0/9] Add VT-d Posted-Interrupts support - IOMMU part

2015-06-04 Thread Feng Wu
VT-d Posted-Interrupts is an enhancement to the CPU-side
Posted-Interrupt mechanism. With VT-d Posted-Interrupts enabled,
external interrupts from direct-assigned devices can be delivered to
guests without VMM intervention while the guest is running in non-root
mode.

You can find the VT-d Posted-Interrupts Spec. at the following URL:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html

This series was part of http://thread.gmane.org/gmane.linux.kernel.iommu/7708. 
To make things clear, the IOMMU part is sent out here.

This patch-set is based on the latest x86/apic branch of the tip tree.

Divide the whole series which contain multiple components into three parts:
- Prerequisite changes to irq subsystem (already merged in tip tree x86/apic 
branch)
- IOMMU part (in this series)
- KVM and VFIO parts (will send out this part once the first two parts are 
accepted)

v8->v9:
* Remove member "irte_pi_entry" in struct intel_ir_data.
* Some changes to the comments.

v7->v8:
* Save the irq mode (posted or remapped) of an IRTE in struct irq_2_iommu.
* Use this new mode to decide whether to update the hardware when
modifying irte in intel_ir_set_affinity().

v6->v7:
* Add an static inline helper function set_irq_posting_cap() to set
the PI capability.
* Add some comments for the new member "ir_data->irte_pi_entry".

v5->v6:
* Extend 'struct irte' for VT-d Posted-Interrupts, combine remapped
and posted mode into one irte structure.

v4->v5:
* Abstract modify_irte() to accept two formats of irte.

v3->v4:
* Change capability to an int flags variable instead of a function call.
* Add hotplug case for VT-d PI.

Feng Wu (8):
  iommu: Add new member capability to struct irq_remap_ops
  iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip
  iommu, x86: Save the mode (posted or remapped) of an IRTE
  iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
  iommu, x86: Add cap_pi_support() to detect VT-d PI capability
  iommu, x86: Setup Posted-Interrupts capability for Intel iommu
  iommu, x86: define irq_remapping_cap()
  iommu, x86: Properly handle PI for IOMMU hotplug

Thomas Gleixner (1):
  iommu: dmar: Extend struct irte for VT-d Posted-Interrupts

 arch/x86/include/asm/irq_remapping.h | 11 +
 drivers/iommu/intel_irq_remapping.c  | 90 +++-
 drivers/iommu/irq_remapping.c| 11 +
 drivers/iommu/irq_remapping.h|  6 +++
 include/linux/dmar.h | 70 ++--
 include/linux/intel-iommu.h  |  1 +
 6 files changed, 173 insertions(+), 16 deletions(-)

-- 
2.1.0



[v9 1/9] iommu: Add new member capability to struct irq_remap_ops

2015-06-04 Thread Feng Wu
This patch adds a new member, capability, to struct irq_remap_ops.
This new member can be used to check whether features such as
VT-d Posted-Interrupts are supported.

Signed-off-by: Feng Wu 
Reviewed-by: Jiang Liu 
---
 arch/x86/include/asm/irq_remapping.h | 4 
 drivers/iommu/irq_remapping.h| 3 +++
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/asm/irq_remapping.h 
b/arch/x86/include/asm/irq_remapping.h
index 78974fb..0953723 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -31,6 +31,10 @@ struct irq_alloc_info;
 
 #ifdef CONFIG_IRQ_REMAP
 
+enum irq_remap_cap {
+   IRQ_POSTING_CAP = 0,
+};
+
 extern void set_irq_remapping_broken(void);
 extern int irq_remapping_prepare(void);
 extern int irq_remapping_enable(void);
diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index 91d5a11..b6ca30d 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -35,6 +35,9 @@ extern int no_x2apic_optout;
 extern int irq_remapping_enabled;
 
 struct irq_remap_ops {
+   /* The supported capabilities */
+   int capability;
+
/* Initializes hardware and makes it ready for remapping interrupts */
int  (*prepare)(void);
 
-- 
2.1.0



linux-next: manual merge of the vfs tree with the tree

2015-06-04 Thread m...@ellerman.id.au
Hi Al,

Today's linux-next merge of the vfs tree got a conflict in fs/namei.c between
commit 890458a43dbd ("path: New helpers path_get_pin/path_put_unpin for path
pin") from the nfsd tree and commit 894bc8c4662b ("namei: remove restrictions
on nesting depth") from the vfs tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

cheers

diff --cc fs/namei.c
index d41a29efca67,2dad0eaf91d3..
--- a/fs/namei.c
+++ b/fs/namei.c
@@@ -492,32 -492,7 +492,33 @@@ void path_put(const struct path *path
  }
  EXPORT_SYMBOL(path_put);
  
 +/**
 + * path_get_pin - get a reference to a path's dentry
 + *and pin to path's vfsmnt
 + * @path: path to get the reference to
 + * @p: the fs_pin pin to vfsmnt
 + */
 +void path_get_pin(struct path *path, struct fs_pin *p)
 +{
 +  dget(path->dentry);
 +  pin_insert_group(p, path->mnt, NULL);
 +}
 +EXPORT_SYMBOL(path_get_pin);
 +
 +/**
 + * path_put_unpin - put a reference to a path's dentry
 + *  and remove pin to path's vfsmnt
 + * @path: path to put the reference to
 + * @p: the fs_pin removed from vfsmnt
 + */
 +void path_put_unpin(struct path *path, struct fs_pin *p)
 +{
 +  dput(path->dentry);
 +  pin_remove(p);
 +}
 +EXPORT_SYMBOL(path_put_unpin);
 +
+ #define EMBEDDED_LEVELS 2
  struct nameidata {
struct path path;
struct qstr last;


Re: [PATCH v2] staging: fbtft: fix out of bound access

2015-06-04 Thread Joe Perches
On Fri, 2015-06-05 at 10:22 +0530, Sudip Mukherjee wrote:
> On Thu, Jun 04, 2015 at 01:48:31PM -0700, Joe Perches wrote:
[]
> ccing you just slipped out of my mind.

No worries.

> > > diff --git a/drivers/staging/fbtft/fbtft-core.c 
> > > b/drivers/staging/fbtft/fbtft-core.c
> > []
> > > @@ -1067,8 +1067,6 @@ static int fbtft_init_display_dt(struct fbtft_par 
> > > *par)
> > >   const __be32 *p;
> > >   u32 val;
> > >   int buf[64], i, j;
> > []
> > >   par->fbtftops.write_register(par, i,
> > >   buf[0], buf[1], buf[2], buf[3],
> > 
> > It seems there are only 2 callers of (*write_register)()
> > and the arguments are always an in-order array int[64]
> > 
> > Maybe it'd be nicer to change the prototypes of the
> > write_register functions to take a const int * 
> > instead of pushing 64 ints on the stack.
> yes, I will send it as a separate patch as that is another change.

I looked at it a bit more and there's a macro that calls
write_register so there are actually many more call sites.

It's a bit non-trivial to change the macro, as all the
called (*write_register) functions would need changing,
and these functions use va_list.

Maybe if you _really_ feel like it, but it's a bit of work.




Re: "Directly mapped persistent memory page cache"

2015-06-04 Thread Dan Williams
On Tue, May 12, 2015 at 7:47 AM, Jerome Glisse  wrote:
> On Tue, May 12, 2015 at 10:53:47AM +1000, Dave Chinner wrote:
>> On Mon, May 11, 2015 at 11:18:36AM +0200, Ingo Molnar wrote:
>> IMO, we need to be designing around the concept that the filesytem
>> manages the pmem space, and the MM subsystem simply uses the block
>> mapping information provided to it from the filesystem to decide how
>> it references and maps the regions into the user's address space or
>> for DMA. The mm subsystem does not manage the pmem space, it's
>> alignment or how it is allocated to user files. Hence page mappings
>> can only be - at best - reactive to what the filesystem does with
>> it's free space. The mm subsystem already has to query the block
>> layer to get mappings on page faults, so it's only a small stretch
>> to enhance the DAX mapping request to ask for a large page mapping
>> rather than a 4k mapping.  If the fs can't do a large page mapping,
>> you'll get a 4k aligned mapping back.
>>
>> What I'm trying to say is that the mapping behaviour needs to be
>> designed with the way filesystems and the mm subsystem interact in
>> mind, not from a pre-formed "direct Io is bad, we must use the page
>> cache" point of view. The filesystem and the mm subsystem must
>> co-operate to allow things like large page mappings to be made and
>> hence looking at the problem purely from a mm<->pmem device
>> perspective as you are ignores an important chunk of the system:
>> the part that actually manages the pmem space...
>
> I am all for letting the filesystem manage pmem, but i think having
> struct page expose to mm allow the mm side to stay ignorant of what
> is really behind. Also if i could share more code with other i would
> be happier :)
>

As this thread is directly referencing one of the topics listed for
the Persistent Memory microconference I do not think it is
unreasonable to shamelessly hijack it to promote Linux Plumbers 2015.
Tomorrow is the deadline for earlybird registration and topic
submission tool is now open for submission of this or any other
persistent memory topic.

https://linuxplumbersconf.org/2015/attend/
https://linuxplumbersconf.org/2015/how-to-submit-microconference-discussions-topics/


[RFC PATCH 3/3] Introduce trace log output function for STM

2015-06-04 Thread Chunyan Zhang
This patch introduced a few functions to print the event trace log to
STM buffer when the trace event happen and the event information
would be committed to ring buffer.

Before outputting the trace log to STM, we have to get the human readable
trace log content and print it into a local buffer in the format of a
string, the function 'trace_event_buf_vprintf()' is just for this purpose.

Signed-off-by: Chunyan Zhang 
---
 kernel/trace/Makefile   |  1 +
 kernel/trace/trace_output_stm.c | 99 +
 2 files changed, 100 insertions(+)
 create mode 100644 kernel/trace/trace_output_stm.c

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 9b1044e..002de34 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -67,4 +67,5 @@ obj-$(CONFIG_UPROBE_EVENT) += trace_uprobe.o

 obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o

+obj-$(CONFIG_STM_TRACE_EVENT) += trace_output_stm.o
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace_output_stm.c b/kernel/trace/trace_output_stm.c
new file mode 100644
index 000..1cf6d87
--- /dev/null
+++ b/kernel/trace/trace_output_stm.c
@@ -0,0 +1,99 @@
+#include 
+#include 
+#include 
+#include "trace.h"
+
+#define STM_OUTPUT_STRLEN 128
+
+/* store the event trace log for STM */
+struct trace_buffer_stm {
+   char buffer[STM_OUTPUT_STRLEN];
+   unsigned int used_len;
+   unsigned int size;
+};
+
+static struct trace_buffer_stm *trace_event_stm_buffer;
+static struct trace_seq *stm_tmp_seq;
+static int stm_buffers_allocated;
+
+void trace_event_buf_vprintf(struct trace_buffer_stm *tb, const char *fmt, ...)
+{
+   va_list ap;
+   char *buffer = tb->buffer + tb->used_len;
+   unsigned int size = tb->size - tb->used_len;
+
+   va_start(ap, fmt);
+   tb->used_len += vsnprintf(buffer, size, fmt, ap);
+   va_end(ap);
+}
+EXPORT_SYMBOL_GPL(trace_event_buf_vprintf);
+
+static inline void stm_buf_reset(struct trace_buffer_stm *tb)
+{
+   tb->used_len = 0;
+}
+
+void trace_event_stm_log(struct ftrace_event_buffer *fbuffer)
+{
+
+   struct trace_seq *p = stm_tmp_seq;
+   struct trace_buffer_stm *tb;
+   struct ftrace_event_call *event_call = fbuffer->ftrace_file->event_call;
+   struct trace_entry *entry = (struct trace_entry *)fbuffer->entry;
+
+   if (!stm_buffers_allocated)
+   return;
+
+   tb = trace_event_stm_buffer;
+
+   if (event_call->output_stm)
+   event_call->output_stm(p, entry, tb);
+
+   stm_trace_event_write(tb->buffer, tb->used_len);
+
+   stm_buf_reset(tb);
+}
+EXPORT_SYMBOL_GPL(trace_event_stm_log);
+
+static int alloc_stm_tmp_seq(void)
+{
+   struct trace_seq *seq;
+
+   seq = kzalloc(sizeof(struct trace_seq), GFP_KERNEL);
+   if (!seq)
+   return -ENOMEM;
+
+   stm_tmp_seq = seq;
+
+   return 0;
+}
+
+static int alloc_stm_trace_buffer(void)
+{
+   struct trace_buffer_stm *buffer;
+
+   buffer = kzalloc(sizeof(struct trace_buffer_stm), GFP_KERNEL);
+   if (!buffer)
+   return -ENOMEM;
+
+   buffer->used_len = 0;
+   buffer->size = ARRAY_SIZE(buffer->buffer);
+
+   trace_event_stm_buffer = buffer;
+
+   return 0;
+}
+
+static __init int trace_stm_init_buffers(void)
+{
+   if (alloc_stm_trace_buffer())
+   return -ENOMEM;
+
+   if (alloc_stm_tmp_seq())
+   return -ENOMEM;
+
+   stm_buffers_allocated = 1;
+
+   return 0;
+}
+fs_initcall(trace_stm_init_buffers);
--
1.9.1


Re: [PATCH] CHROMIUM: elants_i2c: Solved previous issue on 3.10 and 3.14.

2015-06-04 Thread Dmitry Torokhov
Hi James,

On Wed, Jun 03, 2015 at 03:06:16PM +0800, james.chen wrote:
> From: "james.chen" 
> 
> This patch refer 3.10 driver code to solve firmware upgrade
> issue(Change 266813) and enable noise-immunity(Change 243875).
> 
> BUG=chrome-os-partner:39373
> TEST=Test Elan Touch Screen on cyan project without problems.

As I mentioned elsewhere, this does not simply address the firmware
upgrade and noise immunity issues but rather overlays the version found
in the 3.10 ChromeOS kernel on top of the mainline driver, removing a
lot of cleanups and fixes that went into preparing the driver for
mainline inclusion.

Thanks.

-- 
Dmitry


[RFC PATCH 2/3] Trace log handler for logging into STM blocks

2015-06-04 Thread Chunyan Zhang
Add the function 'trace_event_stm_output_##call' for printing the event
trace log into STM blocks.

This patch also adds a function call at the point where events are
committed to the ring buffer, to export the trace event information to
STM blocks.

Signed-off-by: Chunyan Zhang 
---
 include/linux/ftrace_event.h | 15 ++
 include/trace/ftrace.h   | 47 
 2 files changed, 62 insertions(+)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 46e83c2..f0c7426 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -14,6 +14,7 @@ struct trace_buffer;
 struct tracer;
 struct dentry;
 struct bpf_prog;
+struct trace_buffer_stm;

 struct trace_print_flags {
unsigned long   mask;
@@ -304,6 +305,9 @@ struct ftrace_event_call {
 */
int flags; /* static flags of different events */

+   void (*output_stm)(struct trace_seq *tmp_seq, void *entry,
+  struct trace_buffer_stm *tb);
+
 #ifdef CONFIG_PERF_EVENTS
int perf_refcount;
struct hlist_head __percpu  *perf_events;
@@ -423,6 +427,17 @@ enum event_trigger_type {
ETT_EVENT_ENABLE= (1 << 3),
 };

+#ifdef CONFIG_STM_TRACE_EVENT
+extern void trace_event_stm_log(struct ftrace_event_buffer *fbuffer);
+extern void trace_event_buf_vprintf(struct trace_buffer_stm *tb,
+   const char *fmt, ...) __attribute__ ((weak));
+extern void stm_trace_event_write(const char *buf, unsigned len);
+#else
+static inline void trace_event_stm_log(struct ftrace_event_buffer *fbuffer) {}
+static inline void trace_event_buf_vprintf(struct trace_buffer_stm *tb,
+   const char *fmt, ...) {}
+#endif
+
 extern int filter_match_preds(struct event_filter *filter, void *rec);

 extern int filter_check_discard(struct ftrace_event_file *file, void *rec,
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 37d4b10..cc1b426 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -302,6 +302,50 @@ TRACE_MAKE_SYSTEM_STR();
})

 #undef DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)  \
+static notrace void \
+trace_event_stm_output_##call(struct trace_seq *tmp_seq,   \
+void *entry,   \
+struct trace_buffer_stm *trace_buf)\
+{  \
+   struct ftrace_raw_##call *field = entry;\
+   struct trace_seq *p = tmp_seq;  \
+   \
+   trace_seq_init(p);  \
+   \
+   trace_event_buf_vprintf(trace_buf, print);  \
+   \
+   return; \
+}
+
+#undef DEFINE_EVENT_PRINT
+#define DEFINE_EVENT_PRINT(template, call, proto, args, print) \
+static notrace void\
+trace_event_stm_output_##call(struct trace_seq *tmp_seq,   \
+void *entry,   \
+struct trace_buffer_stm *trace_buf)\
+{  \
+   struct trace_seq *p = tmp_seq;  \
+   struct trace_entry *ent = entry;\
+   struct ftrace_raw_##template *field;\
+   \
+   if (ent->type != event_##call.event.type) { \
+   WARN_ON_ONCE(1);\
+   return; \
+   }   \
+   \
+   field = (typeof(field))entry;   \
+   \
+   trace_seq_init(p);  \
+   \
+   trace_event_buf_vprintf(trace_buf, print);  \
+   \
+   return; \
+}
+
+#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
+
+#undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto

[RFC PATCH 0/3] Integration of trace events with System Trace IP blocks

2015-06-04 Thread Chunyan Zhang
IP blocks allowing a variety of trace sources to log debugging
information to a pre-defined area have been introduced on a couple of
architectures [1][2]. These system trace blocks (also known as STM)
typically follow the MIPI STPv2 protocol [3] and provide a system-wide
logging facility to any device, running a kernel or not, with access
to the block's log entry port(s).  Since each trace message has a
timestamp it is possible to correlate events happening in the entire
system rather than being confined to the logging facility of a single
entity.

This patch is using a very simple "stm_source" introduced in [2] to
duplicate the output of the trace event subsystem to an STM, in this
case coresight STM.  That way logging information generated by the
trace event subsystem and gathered in the coresight sink can be used
in conjunction with trace data from other board components, also
collected in the same trace sink.  This example is using coresight but
the same would apply to any architecture wishing to do the same.

The goal of this RFC is to solicit comments on the method used to
connect trace event logging with STMs (using the generic STM API)
rather than function "stm_ftrace_write()" itself, which was provided
for completeness of the proof of concept only.

I'm eager to see your comments on this, and if you have some good
ideas that can reduce the overhead, please let me know.

Regards,
Chunyan


[1]. https://lkml.org/lkml/2015/2/4/729
[2]. http://comments.gmane.org/gmane.linux.kernel/1914526
[3]. http://mipi.org/specifications/debug#STP

Chunyan Zhang (2):
  Trace log handler for logging into STM blocks
  Introduce trace log output function for STM

Mathieu Poirier (1):
  STM trace event: Adding generic buffer interface driver

 drivers/stm/Kconfig | 11 +
 drivers/stm/Makefile|  2 +
 drivers/stm/stm_trace_event.c   | 46 +++
 include/linux/ftrace_event.h| 15 +++
 include/trace/ftrace.h  | 47 +++
 kernel/trace/Makefile   |  1 +
 kernel/trace/trace_output_stm.c | 99 +
 7 files changed, 221 insertions(+)
 create mode 100644 drivers/stm/stm_trace_event.c
 create mode 100644 kernel/trace/trace_output_stm.c

--
1.9.1


[RFC PATCH 1/3] STM trace event: Adding generic buffer interface driver

2015-06-04 Thread Chunyan Zhang
From: Mathieu Poirier 

This patch adds a driver that models itself as an stm_source and
whose sole purpose is to export an interface to the rest of the
kernel.  Once the stm and stm_source have been linked via sysfs,
everything that is passed to the interface will end up in the STM
trace engine.

Signed-off-by: Mathieu Poirier 
Signed-off-by: Chunyan Zhang 
---
 drivers/stm/Kconfig   | 11 +++
 drivers/stm/Makefile  |  2 ++
 drivers/stm/stm_trace_event.c | 46 +++
 3 files changed, 59 insertions(+)
 create mode 100644 drivers/stm/stm_trace_event.c

diff --git a/drivers/stm/Kconfig b/drivers/stm/Kconfig
index 6f2db70..8ead418 100644
--- a/drivers/stm/Kconfig
+++ b/drivers/stm/Kconfig
@@ -25,3 +25,14 @@ config STM_SOURCE_CONSOLE

  If you want to send kernel console messages over STM devices,
  say Y.
+
+config STM_TRACE_EVENT
+   tristate "Redirect/copy the output from kernel trace event to STM engine"
+   depends on STM
+   help
+ This option can be used to redirect or copy the output from kernel trace
+ event to STM engine. Enabling this option will introduce a slight
+ timing effect.
+
+ If you want to send kernel trace event messages over STM devices,
+ say Y.
diff --git a/drivers/stm/Makefile b/drivers/stm/Makefile
index 74baf59..55b152c 100644
--- a/drivers/stm/Makefile
+++ b/drivers/stm/Makefile
@@ -5,3 +5,5 @@ stm_core-y  := core.o policy.o
 obj-$(CONFIG_STM_DUMMY)+= dummy_stm.o

 obj-$(CONFIG_STM_SOURCE_CONSOLE)   += console.o
+
+obj-$(CONFIG_STM_TRACE_EVENT)  += stm_trace_event.o
diff --git a/drivers/stm/stm_trace_event.c b/drivers/stm/stm_trace_event.c
new file mode 100644
index 000..0d787ce
--- /dev/null
+++ b/drivers/stm/stm_trace_event.c
@@ -0,0 +1,46 @@
+/*
+ * Simple kernel driver to link kernel trace event and an STM device
+ * Copyright (c) 2015, Linaro Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static struct stm_source_data stm_trace_event_data = {
+   .name   = "stm_trace_event",
+   .nr_chans   = 1,
+};
+
+void stm_trace_event_write(const char *buf, unsigned len)
+{
+   stm_source_write(&stm_trace_event_data, 0, buf, len);
+}
+
+static int stm_trace_event_init(void)
+{
+   return stm_source_register_device(NULL, &stm_trace_event_data);
+}
+
+static void stm_trace_event_exit(void)
+{
+   stm_source_unregister_device(&stm_trace_event_data);
+}
+
+module_init(stm_trace_event_init);
+module_exit(stm_trace_event_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("stm_trace_event driver");
+MODULE_AUTHOR("Mathieu Poirier ");
--
1.9.1


Re: [PATCH 1/1] serial: earlycon: Add support for big-endian MMIO accesses

2015-06-04 Thread Vineet Gupta
On Monday 25 May 2015 09:24 AM, Noam Camus wrote:
> From: Noam Camus 
> 
> Support command line parameters of the form:
> earlycon=,io|mmio|mmio32|mmio32be,,
> 
> This commit seem to be needed even after commit:
> serial: 8250: Add support for big-endian MMIO accesses
> c627f2ceb692e8a9358b64ac2d139314e7bb0d17
> 
> Signed-off-by: Noam Camus 
> ---
>  Documentation/kernel-parameters.txt |9 +
>  drivers/tty/serial/earlycon.c   |9 ++---
>  drivers/tty/serial/serial_core.c|7 +--
>  3 files changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index 61ab162..55bb093 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -959,14 +959,15 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   uart[8250],io,[,options]
>   uart[8250],mmio,[,options]
>   uart[8250],mmio32,[,options]
> + uart[8250],mmio32be,[,options]
>   uart[8250],0x[,options]
>   Start an early, polled-mode console on the 8250/16550
>   UART at the specified I/O port or MMIO address.
>   MMIO inter-register address stride is either 8-bit
> - (mmio) or 32-bit (mmio32).
> - If none of [io|mmio|mmio32],  is assumed to be
> - equivalent to 'mmio'. 'options' are specified in the
> - same format described for "console=ttyS"; if
> + (mmio) or 32-bit (mmio32 or mmio32be).
> + If none of [io|mmio|mmio32|mmio32be],  is assumed
> + to be equivalent to 'mmio'. 'options' are specified
> + in the same format described for "console=ttyS"; if
>   unspecified, the h/w is not initialized.
>  
>   pl011,
> diff --git a/drivers/tty/serial/earlycon.c b/drivers/tty/serial/earlycon.c
> index 5fdc9f3..a840732 100644
> --- a/drivers/tty/serial/earlycon.c
> +++ b/drivers/tty/serial/earlycon.c
> @@ -72,6 +72,7 @@ static int __init parse_options(struct earlycon_device 
> *device, char *options)
>  
>   switch (port->iotype) {
>   case UPIO_MEM32:
> + case UPIO_MEM32BE:
>   port->regshift = 2; /* fall-through */
>   case UPIO_MEM:
>   port->mapbase = addr;
> @@ -90,9 +91,11 @@ static int __init parse_options(struct earlycon_device 
> *device, char *options)
>   strlcpy(device->options, options, length);
>   }
>  
> - if (port->iotype == UPIO_MEM || port->iotype == UPIO_MEM32)
> + if (port->iotype == UPIO_MEM || port->iotype == UPIO_MEM32 ||
> + port->iotype == UPIO_MEM32BE)
>   pr_info("Early serial console at MMIO%s 0x%llx (options 
> '%s')\n",
> - (port->iotype == UPIO_MEM32) ? "32" : "",
> + (port->iotype == UPIO_MEM) ? "" :
> + (port->iotype == UPIO_MEM32) ? "32" : "32be",
>   (unsigned long long)port->mapbase,
>   device->options);
>   else
> @@ -133,7 +136,7 @@ static int __init register_earlycon(char *buf, const 
> struct earlycon_id *match)
>   *
>   *   Registers the earlycon console matching the earlycon specified
>   *   in the param string @buf. Acceptable param strings are of the form
> - *  ,io|mmio|mmio32,,
> + *  ,io|mmio|mmio32|mmio32be,,
>   *  ,0x,
>   *  ,
>   *  
> diff --git a/drivers/tty/serial/serial_core.c 
> b/drivers/tty/serial/serial_core.c
> index 0b7bb12..1124090 100644
> --- a/drivers/tty/serial/serial_core.c
> +++ b/drivers/tty/serial/serial_core.c
> @@ -1816,8 +1816,8 @@ uart_get_console(struct uart_port *ports, int nr, 
> struct console *co)
>   *   @options: ptr for  field; NULL if not present (out)
>   *
>   *   Decodes earlycon kernel command line parameters of the form
> - *  earlycon=,io|mmio|mmio32,,
> - *  console=,io|mmio|mmio32,,
> + *  earlycon=,io|mmio|mmio32|mmio32be,,
> + *  console=,io|mmio|mmio32|mmio32be,,
>   *
>   *   The optional form
>   *  earlycon=,0x,
> @@ -1835,6 +1835,9 @@ int uart_parse_earlycon(char *p, unsigned char *iotype, 
> unsigned long *addr,
>   } else if (strncmp(p, "mmio32,", 7) == 0) {
>   *iotype = UPIO_MEM32;
>   p += 7;
> + } else if (strncmp(p, "mmio32be,", 9) == 0) {
> + *iotype = UPIO_MEM32BE;
> + p += 9;
>   } else if (strncmp(p, "io,", 3) == 0) {
>   *iotype = UPIO_PORT;
>   p += 3;
> 


Ping! Peter, Rob, Kevin - can you please take a look?




Re: [PATCH v6 1/5] random: Blocking API for accessing nonblocking_pool

2015-06-04 Thread Herbert Xu
On Tue, May 19, 2015 at 10:18:05PM +0800, Herbert Xu wrote:
> On Tue, May 19, 2015 at 09:50:28AM -0400, Theodore Ts'o wrote:
> >
> > Finally, this is only going to block *once*, when the system is
> > initially botting up.  Why is it so important that we get the
> > asynchronous nature of this right, and why can't we solve it simply by
> > just simply doing the work in a workqueue, with a completion barrier
> > getting triggered once /dev/random initializes itself, and just simply
> > blocking the module unload until /dev/random is initialized?
> 
> I guess I'm still thinking of the old work queue code before
> Tejun's cmwq work.  Yes blocking in a work queue should be fine
> as there is usually just one DRBG instance.

It looks like waiting for it in a workqueue isn't good enough
after all.  I just got this in a KVM machine:

INFO: task kworker/0:1:121 blocked for more than 120 seconds.
  Tainted: G   O4.1.0-rc1+ #34
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/0:1 D 88001eb47d18 0   121  2 0x
Workqueue: events drbg_async_seed [drbg]
 88001eb47d18 88001e1bcec0 88001eb84010 0246
 88001eb48000 0020 88001d54ea50 
 88001f613080 88001eb47d38 813fe692 0020
Call Trace:
 [] schedule+0x32/0x80
 [] get_blocking_random_bytes+0x65/0xa0
 [] ? add_wait_queue+0x60/0x60
 [] drbg_async_seed+0x2c/0xc0 [drbg]
 [] process_one_work+0x129/0x310
 [] worker_thread+0x119/0x430
 [] ? __schedule+0x7fb/0x85e
 [] ? process_scheduled_works+0x40/0x40
 [] kthread+0xc4/0xe0
 [] ? proc_cap_handler+0x180/0x1b0
 [] ? kthread_freezable_should_stop+0x60/0x60
 [] ret_from_fork+0x42/0x70
 [] ? kthread_freezable_should_stop+0x60/0x60

Steffen, I think we need to revisit the idea of having a list
of callbacks.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v4 01/11] block: make generic_make_request handle arbitrarily sized bios

2015-06-04 Thread Ming Lin
On Thu, Jun 4, 2015 at 5:06 PM, Mike Snitzer  wrote:
> On Thu, Jun 04 2015 at  6:21pm -0400,
> Ming Lin  wrote:
>
>> On Thu, Jun 4, 2015 at 2:06 PM, Mike Snitzer  wrote:
>> >
>> > We need to test on large HW raid setups like a Netapp filer (or even
>> > local SAS drives connected via some SAS controller).  Like a 8+2 drive
>> > RAID6 or 8+1 RAID5 setup.  Testing with MD raid on JBOD setups with 8
>> > devices is also useful.  It is larger RAID setups that will be more
>> > sensitive to IO sizes being properly aligned on RAID stripe and/or chunk
>> > size boundaries.
>>
>> I'll test it on large HW raid setup.
>>
>> Here is HW RAID5 setup with 19 278G HDDs on Dell R730xd(2sockets/48
>> logical cpus/264G mem).
>> http://minggr.net/pub/20150604/hw_raid5.jpg
>>
>> The stripe size is 64K.
>>
>> I'm going to test ext4/btrfs/xfs on it.
>> "bs" set to 1216k(64K * 19 = 1216k)
>> and run 48 jobs.
>
> Definitely an odd blocksize (though 1280K full stripe is pretty common
> for 10+2 HW RAID6 w/ 128K chunk size).

I can change it to 10 HDDs HW RAID6 w/ 128K chunk size, then use bs=1280K

>
>> [global]
>> ioengine=libaio
>> iodepth=64
>> direct=1
>> runtime=1800
>> time_based
>> group_reporting
>> numjobs=48
>> rw=read
>>
>> [job1]
>> bs=1216K
>> directory=/mnt
>> size=1G
>
> How does time_based relate to size=1G?  It'll rewrite the same 1 gig
> file repeatedly?

The above job file is for read.
For write, I think it would rewrite the same 1G file repeatedly.
Does that make sense for a performance test?

>
>> Or do you have other suggestions of what tests I should run?
>
> You're welcome to run this job but I'll also check with others here to
> see what fio jobs we used in the recent past when assessing performance
> of the dm-crypt parallelization changes.

That's very helpful.

>
> Also, a lot of care needs to be taken to eliminate jitter in the system
> while the test is running.  We got a lot of good insight from Bart Van
> Assche on that and put it to practice.  I'll see if we can (re)summarize
> that too.

Very helpful too.

Thanks.

>
> Mike


Re: [PATCH] extcon: max77843: Clear IRQ bits state before request IRQ

2015-06-04 Thread Chanwoo Choi
On 06/05/2015 01:54 PM, MyungJoo Ham wrote:
>>   
>>  IRQ signal before driver probe is needless because driver sends
>> current state after platform booting done.
>> So, this patch clears MUIC IRQ bits before request IRQ.
>>
>> Signed-off-by: Jaewon Kim 
>> ---
>>  drivers/extcon/extcon-max77843.c |9 +
>>  1 file changed, 9 insertions(+)
> 
> Q1. Is this because the pending bits are USELESS?
> or because the pendeing bits incurs INCORRECT behaviors?

The max77843 datasheet includes the following sentence about the
INT1/INT2/INT3 registers: "All bits are cleared after a read".
So there is no problem with interrupt handling.

> 
> Q2. Does clearing (by reading) INT1 do everything you need?
> What about INT2 and INT3?

The MAXIM MAX77843 MUIC supports several more interrupts (e.g., ADC1K,
VBVolt, ChgTyp, ...), and each of them is located in one of the
INT1/2/3 registers.

This patch clears all MAX77843 interrupts before requesting them.

> 
> Q3. I presume that "driver sends current state after..." is
> coming from the invokation of "queue_delayed_work()" at the end 
> of the probe function. It appears that you are only serving
> the pending status of "cable detection" with it while INT1
> seems to have more functionalities. Does that delayed work
> do everything that are pending, really?

When kernel booting completes, the delayed work of the extcon-max77843.c
driver uses the MAX77843_MUIC_STATUSx registers to detect the type of
connected external connector. So there is no problem with clearing all
bits of the INT1/2/3 interrupt registers.

Also, if the extcon driver sends a uevent during kernel booting, before
user space has finished initializing all of its daemons, that uevent is
never handled by the user-space daemons.

Thanks,
Chanwoo Choi


Re: Interaction issue of intel wifi and broadcom bluetooth - it appears that nobody feels responsible for doing something

2015-06-04 Thread Jonas Thiem
Hi Jeremiah,

thanks for responding!

I did have my mobile phone very nearby also connected to the bluetooth
headphones while my laptop was still using 11n wifi. I didn't have any
noticeable issues with bluetooth there.

But I got the feeling that my phone's android drivers + hardware for
bluetooth are tuned better than the laptop ones, so maybe the phone is
just better at hopping frequencies to avoid interference.

I guess the best test would be the same laptop model in direct
proximity, but sadly I only own that laptop once. ;)

I hope wireless interference wouldn't rule out some driver work being
considered to make this behave better - after all, both chips are in
the same laptop, and as per the intel comment, bluetooth is supposed
to work despite wifi activity.

Regards,
Jonas Thiem

On 06/05/2015 06:45 AM, Jeremiah Mahler wrote:
> Jonas,
> 
> On Fri, Jun 05, 2015 at 01:00:32AM +0200, Jonas Thiem wrote:
>> Hi *,
>>
>> this is my first post to this mailing list, sorry if it's not supposed
>> to go here. (also CC in responses would be nice since I'm not
>> subscribed)
>>
>> I filed a bug about an intel centrino wifi interaction with broadcom's 
>> BCM2045B:
>> https://bugzilla.kernel.org/show_bug.cgi?id=97101
>>
> Those are some unhelpful replies :-(
> 
>> In short, the two seem to kinda fight over the wireless spectrum and
>> both drop connections all the time - unless the 'iwlwifi' module is
>> loaded with 11n_disabled=1.
>>
> 
> I don't have a solution but I think the problem is interesting.
> 
> Both Bluetooth and 11n share the same frequency band near 2.4 GHz so it
> is possible that they could conflict.  If you had two laptops, and you
> ran just Bluetooth on one and just 11n on the other, would they
> both have problems?  This would tell as whether it was something inside
> the kernel or if it was really wireless interference.
> 
> [...]
>>
>> Regards,
>> Jonas Thiem
> 





Re: [PATCHv5] [media] saa7164: use an MSI interrupt when available

2015-06-04 Thread Brendan McGrath

Hi Kyle,

Great to hear you haven't had any problems since applying this patch! 
I'm looking forward to seeing it in the Linux master branch too.


Version 5 of the patch has been accepted and committed to the media tree 
by Mauro:

http://git.linuxtv.org/cgit.cgi/media_tree.git/commit/?id=77978089ddc90347644cc057e6b6cd169ac9abd4

I'm guessing it will therefore go into the main Linux tree with the 
release of kernel version v4.2? (or it could be submitted as a fix for 
v4.1 - but I have no idea how a patch is selected by those criteria).


Hopefully someone can confirm or elaborate.

Regards,
Brendan McGrath


On 05/06/15 14:42, Kyle Sanderson wrote:

This has been plaguing users for years (there are a number of threads on
the Ubuntu board). I've been using revision 1 of the patch without
issue since early February. It took me from having to constantly reboot
the system to flawless recording. If something is still outstanding
from Brendan, please let me know and I'll happily make the requested
changes.

Can we please merge this? There are at-least three consumers in this
thread alone that have confirmed this fixes the saa7164 driver for the
HVR-2250 device.
Kyle.

PS: I can't seem to work out who owns this in the MAINTAINERS file.

On Thu, Apr 9, 2015 at 11:39 PM, Brendan McGrath
 wrote:

Enhances driver to use an MSI interrupt when available.

Adds the module option 'enable_msi' (type bool) which by default is
enabled. Can be set to 'N' to disable.

Fixes (or can reduce the occurrence of) a crash which is most commonly
reported when both digital tuners of the saa7164 chip is in use. A
reported example can be found here:
http://permalink.gmane.org/gmane.linux.drivers.video-input-infrastructure/83948

Reviewed-by: Steven Toth 
Signed-off-by: Brendan McGrath 
---
Changes since v4:
   - improved readability by taking on suggestions made by Mauro
   - the msi variable in the saa7164_dev structure is now a bool

Thanks Mauro - good suggestions and I think I've taken on board all of them.

  drivers/media/pci/saa7164/saa7164-core.c | 66 
  drivers/media/pci/saa7164/saa7164.h  |  1 +
  2 files changed, 60 insertions(+), 7 deletions(-)

diff --git a/drivers/media/pci/saa7164/saa7164-core.c 
b/drivers/media/pci/saa7164/saa7164-core.c
index 9cf3c6c..5e4a9f0 100644
--- a/drivers/media/pci/saa7164/saa7164-core.c
+++ b/drivers/media/pci/saa7164/saa7164-core.c
@@ -85,6 +85,11 @@ module_param(guard_checking, int, 0644);
  MODULE_PARM_DESC(guard_checking,
 "enable dma sanity checking for buffer overruns");

+static bool enable_msi = true;
+module_param(enable_msi, bool, 0444);
+MODULE_PARM_DESC(enable_msi,
+   "enable the use of an msi interrupt if available");
+
  static unsigned int saa7164_devcount;

  static DEFINE_MUTEX(devlist);
@@ -1184,6 +1189,39 @@ static int saa7164_thread_function(void *data)
 return 0;
  }

+static bool saa7164_enable_msi(struct pci_dev *pci_dev, struct saa7164_dev 
*dev)
+{
+   int err;
+
+   if (!enable_msi) {
+   printk(KERN_WARNING "%s() MSI disabled by module parameter 
'enable_msi'"
+  , __func__);
+   return false;
+   }
+
+   err = pci_enable_msi(pci_dev);
+
+   if (err) {
+   printk(KERN_ERR "%s() Failed to enable MSI interrupt."
+   " Falling back to a shared IRQ\n", __func__);
+   return false;
+   }
+
+   /* no error - so request an msi interrupt */
+   err = request_irq(pci_dev->irq, saa7164_irq, 0,
+   dev->name, dev);
+
+   if (err) {
+   /* fall back to legacy interrupt */
+   printk(KERN_ERR "%s() Failed to get an MSI interrupt."
+  " Falling back to a shared IRQ\n", __func__);
+   pci_disable_msi(pci_dev);
+   return false;
+   }
+
+   return true;
+}
+
  static int saa7164_initdev(struct pci_dev *pci_dev,
const struct pci_device_id *pci_id)
  {
@@ -1230,13 +1268,22 @@ static int saa7164_initdev(struct pci_dev *pci_dev,
 goto fail_irq;
 }

-   err = request_irq(pci_dev->irq, saa7164_irq,
-   IRQF_SHARED, dev->name, dev);
-   if (err < 0) {
-   printk(KERN_ERR "%s: can't get IRQ %d\n", dev->name,
-   pci_dev->irq);
-   err = -EIO;
-   goto fail_irq;
+   /* irq bit */
+   if (saa7164_enable_msi(pci_dev, dev)) {
+   dev->msi = true;
+   } else {
+   /* if we have an error (i.e. we don't have an interrupt)
+or msi is not enabled - fallback to shared interrupt */
+
+   err = request_irq(pci_dev->irq, saa7164_irq,
+   IRQF_SHARED, dev->name, dev);
+
+   if (err < 0) {
+   printk(KERN_ERR "%s: can't get IRQ %d\n", dev->name,
+   

Re: [PATCH 2/4] ARC: [axs101] support early 8250 uart

2015-06-04 Thread Vineet Gupta
+CC  linux-serial

On Thursday 14 May 2015 06:34 PM, Vineet Gupta wrote:
> On Thursday 14 May 2015 06:23 PM, Arnd Bergmann wrote:
> 
> On Thursday 14 May 2015 15:48:42 Alexey Brodkin wrote:
> 
> 
>> >
>> > chosen {
>> > -   bootargs = "console=tty0 console=ttyS3,115200n8 
>> > consoleblank=0";
>> > +   bootargs = "earlycon=uart8250,mmio32,0xe0022000,115200n8 
>> > console=tty0 console=ttyS3,115200n8 consoleblank=0";
>> > };
>> >  };
>> >
> 
> When you do earlycon with DT, better use a 'stdout-path' property that points
> to the device, and just put 'earlycon' without arguments on the command line.
> 
> Arnd
> 
> 
> Sure ! I tried that once (3.16) and even the dts patch got merged but had to 
> be reverted out !
> 
> 2014-07-27 22524b02b17b Revert "ARC: [arcfpga] stdout-path now suffices for 
> earlycon/console"
> 
> Let me see if that works again since serial land has seen some significant 
> churn in recent times
> 
> Thx for pointing this out !

so specifying console with stdout-path works for me,

-  bootargs = "earlycon=uart8250,mmio32,0xf000,115200n8 console=tty0
console=ttyS0,115200n8 consoleblank=0 debug";
+  bootargs = "earlycon=uart8250,mmio32,0xf000,115200n8";
+  stdout-path = &uart0;
..

But I don't see earlycon working with paramless earlycon

-  bootargs = "earlycon=uart8250,mmio32,0xf000,115200n8";
+  bootargs = "earlycon";
   stdout-path = &uart0;

And I don't see how it would work for others as of 4.1-rc6
Relevant config items I have are:

CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_OF_PLATFORM=y
...

There are 2 earlyparam handlers for earlcon,
(1) param_setup_earlycon() -> setup_earlycon() -> register_console()
(2) setup_of_earlycon() -> early_init_dt_scan_chosen_serial -> 
of_setup_earlycon()

#1 only works when arg to earlycon is *not NULL*
#2 only works when arg is *NULL*.

For my case, #2 bails out early as __earlycon_of_table happens to be empty.

8071d8a0 T __earlycon_of_table
8071d8a0 00c4 t __earlycon_of_table_sentinel

This makes sense since I don't see any OF_EARLYCON_DECLARE() in the 8250 driver.
As a quick hack I added one in 8250/8250_early.c

@@ -152,3 +154,4 @@ static int __init early_serial8250_setup(struct
earlycon_device *device,
 }
 EARLYCON_DECLARE(uart8250, early_serial8250_setup);
 EARLYCON_DECLARE(uart, early_serial8250_setup);
+OF_EARLYCON_DECLARE(uart8250, "ns8250", early_serial8250_setup);

I needed another fine adjustment, as of_setup_earlycon() assumes mmio, while it
needs to be mmio32 in my case.

@@ -199,7 +199,7 @@ int __init of_setup_earlycon(unsigned long addr,
int err;
struct uart_port *port = &early_console_dev.port;

-   port->iotype = UPIO_MEM;
+   port->iotype = UPIO_MEM32;

With this paramless earlycon works.

Now both of the above are hacks, but I want to understand whether I'm
missing something in the ARC port or whether the core needs some
adjustments along the lines of the above, since presumably others have
it working!

P.S. with respect to the original patch, I would fold it into for-next with 
change
to stdout-path and keep earlycon as before - we can fix it up later.

Thx,
-vineet



What Time?

2015-06-04 Thread Jane
What Time?
Did you get my mail? What time should i call you?

