Re: [PATCH 3/4] scripts/gdb: Add rb tree iterating utilities

2019-03-26 Thread Jan Kiszka

On 26.03.19 18:05, Stephen Boyd wrote:

Quoting Kieran Bingham (2019-03-26 01:52:10)

Hi Stephen,

On 25/03/2019 18:45, Stephen Boyd wrote:

Implement gdb functions for rb_first(), rb_last(), rb_next(), and
rb_prev(). These can be useful to iterate through the kernel's red-black
trees.
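
For reference, a minimal sketch (not part of the patch) of the in-kernel C
iteration pattern these helpers mirror, assuming a hypothetical host struct
with an embedded struct rb_node:

#include <linux/rbtree.h>
#include <linux/printk.h>

struct my_item {
	struct rb_node node;	/* embedded node, as in most rb tree users */
	unsigned long key;
};

/* Walk the tree in sorted order with rb_first()/rb_next(), the same
 * primitives the gdb scripts re-implement for debugger use. */
static void my_walk(struct rb_root *root)
{
	struct rb_node *n;

	for (n = rb_first(root); n; n = rb_next(n)) {
		struct my_item *item = rb_entry(n, struct my_item, node);

		pr_info("key=%lu\n", item->key);
	}
}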


I definitely approve of getting data-structure helpers into scripts/gdb,
as it will greatly assist debugging, but my last attempt to do this
was with the radix-tree, which I had to give up on as the internals were
changing rapidly and caused continuous breakage to the helpers.


Thanks for the background on radix-tree. I haven't looked at that yet,
but I suppose I'll want to have that too at some point.



Do you foresee any similar issue here? Or is the corresponding RB code
in the kernel fairly 'stable'?


Please could we make sure whoever maintains the rb-tree code is aware of
the Python implementation?

That said, MAINTAINERS doesn't actually seem to list any ownership over
the rb-tree code, and get_maintainers.pl [0] seems to be pointing at
Andrew as the probable route in for that code so perhaps that's already
in place :D


I don't think that the rb tree implementation is going to change. It
feels similar to the list API. I suppose this problem of keeping things
in sync is a more general problem than just data-structures changing.
The only solution I can offer is to have more testing and usage of these
scripts. Unless gdb can "simulate" or run arbitrary code for us, I think
we're stuck reimplementing kernel-internal code in gdb scripts so that
we can get debug info out.



Could we possibly leave some link, in the form of a comment, in the related
headers or implementations? It won't magically solve the problem, but it would
at least increase the chances that authors actually read them when they start
changing the C implementations.


Jan

--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
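
Concretely, the kind of pointer Jan suggests might look like the note below
(an illustration only; the script path assumes the helpers land as
scripts/gdb/linux/rbtree.py):

/*
 * NOTE: scripts/gdb/linux/rbtree.py re-implements rb_first()/rb_last()/
 * rb_next()/rb_prev() for use from gdb.  Please keep it in sync when
 * changing the rb tree internals.
 */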


Re: [PATCH] ASoC: core: Fix use-after-free after deferred card registration

2019-03-26 Thread Curtis Malainey
This has already been patched. See
https://mailman.alsa-project.org/pipermail/alsa-devel/2019-March/146150.html

On Tue, Mar 26, 2019 at 10:23 AM Guenter Roeck  wrote:
>
> If snd_soc_register_card() fails because one of its links fails
> to instantiate with -EPROBE_DEFER, and the to-be-registered link
> is a legacy link, a subsequent retry will trigger a use-after-free
> and quite often a system crash.
>
> Example:
>
> byt-max98090 byt-max98090: ASoC: failed to init link Baytrail Audio
> byt-max98090 byt-max98090: snd_soc_register_card failed -517
> 
> BUG: KASAN: use-after-free in snd_soc_init_platform+0x233/0x312
> Read of size 8 at addr ffff888067c43070 by task kworker/1:1/23
>
> snd_soc_init_platform() allocates memory attached to the card device.
> This memory is released when the card device is released. However,
> the pointer to the memory (dai_link->platforms) is only cleared from
> soc_cleanup_platform(), which is called from soc_cleanup_card_resources(),
> but not if snd_soc_register_card() fails early.
>
> Add the missing call to soc_cleanup_platform() in the error handling
> code of snd_soc_register_card() to fix the problem.
>
> Fixes: 78a24e10cd94 ("ASoC: soc-core: clear platform pointers on error")
> Cc: Curtis Malainey 
> Signed-off-by: Guenter Roeck 
> ---
>  sound/soc/soc-core.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
> index 93d316d5bf8e..6bf9884d0863 100644
> --- a/sound/soc/soc-core.c
> +++ b/sound/soc/soc-core.c
> @@ -2799,6 +2799,7 @@ int snd_soc_register_card(struct snd_soc_card *card)
> if (ret) {
> dev_err(card->dev, "ASoC: failed to init link %s\n",
> link->name);
> +   soc_cleanup_platform(card);
> mutex_unlock(&client_mutex);
> return ret;
> }
> --
> 2.7.4
>
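
As a loose user-space analogy of the lifetime problem described above (a
sketch only, not the actual ASoC code): the link outlives a failed
registration attempt, while the memory it points to is owned by the card
device, so the cached pointer has to be cleared when the card's resources
are released.

#include <stdlib.h>
#include <string.h>

/* Stands in for a persistent snd_soc_dai_link; "platforms" is owned by
 * the shorter-lived card. */
struct link {
	char *platforms;
};

static int init_platform(struct link *l)
{
	/* Only allocate when nothing is cached, otherwise reuse the pointer. */
	if (!l->platforms)
		l->platforms = malloc(16);
	if (!l->platforms)
		return -1;
	memset(l->platforms, 0, 16);	/* touches the pointed-to memory */
	return 0;
}

static int register_card(struct link *l, int defer)
{
	if (init_platform(l))
		return -1;
	if (defer) {
		free(l->platforms);	/* card resources are torn down ... */
		l->platforms = NULL;	/* ... and the fix clears the pointer */
		return -1;
	}
	return 0;
}

int main(void)
{
	struct link persistent = { 0 };

	register_card(&persistent, 1);	/* first attempt defers */
	register_card(&persistent, 0);	/* safe again only because of the reset */
	free(persistent.platforms);
	return 0;
}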


Re: [RESEND PATCH v6 02/11] dt-bindings: power: supply: add DT bindings for max77650

2019-03-26 Thread Bartosz Golaszewski
On Fri, 22 Mar 2019 at 10:00, Pavel Machek  wrote:
>
> On Mon 2019-03-18 18:40:31, Bartosz Golaszewski wrote:
> > From: Bartosz Golaszewski 
> >
> > Add the DT binding document for the battery charger module of max77650.
> >
> > Signed-off-by: Bartosz Golaszewski 
> > ---
> >  .../power/supply/max77650-charger.txt | 27 +++
> >  1 file changed, 27 insertions(+)
> >  create mode 100644 
> > Documentation/devicetree/bindings/power/supply/max77650-charger.txt
> >
> > diff --git 
> > a/Documentation/devicetree/bindings/power/supply/max77650-charger.txt 
> > b/Documentation/devicetree/bindings/power/supply/max77650-charger.txt
> > new file mode 100644
> > index ..d25c95369616
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/power/supply/max77650-charger.txt
> > @@ -0,0 +1,27 @@
> > +Battery charger driver for MAX77650 PMIC from Maxim Integrated.
> > +
> > +This module is part of the MAX77650 MFD device. For more details
> > +see Documentation/devicetree/bindings/mfd/max77650.txt.
> > +
> > +The charger is represented as a sub-node of the PMIC node on the device 
> > tree.
> > +
> > +Required properties:
> > +
> > +- compatible:Must be "maxim,max77650-charger"
> > +
> > +Optional properties:
> > +
> > +- min-microvolt: Minimum CHGIN regulation voltage (in microvolts). 
> > Must be
> > + one of: 4000000, 4100000, 4200000, 4300000, 4400000,
> > + 4500000, 4600000, 4700000.
>
> Probably needs "max," prefix. And .. what does this mean? Will charger
> shutdown if input is less than this?
>

The charger will enter the undervoltage lockout state and stop
charging. This is explained in the manual, so I don't think the
bindings are the right place to add this information.

Bart

> > +- curr-lim-microamp: CHGIN input current limit (in microamps). Must be one 
> > of:
> > + 95000, 190000, 285000, 380000, 475000.
>
> "current-limit-microamp", I guess. And probably "max,current-limit-microamp".
>
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [PATCHv4 16/28] PCI: mobiveil: refactor Mobiveil PCIe Host Bridge IP driver

2019-03-26 Thread Lorenzo Pieralisi
On Mon, Mar 11, 2019 at 09:32:04AM +, Z.q. Hou wrote:
> From: Hou Zhiqiang 
> 
> As the Mobiveil PCIe controller supports RC dual mode, and to
> make it easier for platforms that integrate the Mobiveil PCIe IP
> to add their drivers, this patch moves the Mobiveil driver to
> a new directory 'drivers/pci/controller/mobiveil' and refactors
> it according to the abstraction of RC (an EP driver will be added
> later).

I do not want to create a subdirectory for every controller that
can work in RC mode, so drop this patch, especially given that the
split will only be required "later". We will create a directory when
and if we actually have to.

Thanks,
Lorenzo

> Signed-off-by: Hou Zhiqiang 
> Reviewed-by: Minghuan Lian 
> Reviewed-by: Subrahmanya Lingappa 
> ---
> V4:
>  - no change
> 
>  MAINTAINERS   |   2 +-
>  drivers/pci/controller/Kconfig|  11 +-
>  drivers/pci/controller/Makefile   |   2 +-
>  drivers/pci/controller/mobiveil/Kconfig   |  24 +
>  drivers/pci/controller/mobiveil/Makefile  |   4 +
>  .../pcie-mobiveil-host.c} | 528 +++---
>  .../controller/mobiveil/pcie-mobiveil-plat.c  |  54 ++
>  .../pci/controller/mobiveil/pcie-mobiveil.c   | 228 
>  .../pci/controller/mobiveil/pcie-mobiveil.h   | 187 +++
>  9 files changed, 587 insertions(+), 453 deletions(-)
>  create mode 100644 drivers/pci/controller/mobiveil/Kconfig
>  create mode 100644 drivers/pci/controller/mobiveil/Makefile
>  rename drivers/pci/controller/{pcie-mobiveil.c => 
> mobiveil/pcie-mobiveil-host.c} (55%)
>  create mode 100644 drivers/pci/controller/mobiveil/pcie-mobiveil-plat.c
>  create mode 100644 drivers/pci/controller/mobiveil/pcie-mobiveil.c
>  create mode 100644 drivers/pci/controller/mobiveil/pcie-mobiveil.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1e64279f338a..1013e74b14f2 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11877,7 +11877,7 @@ M:Subrahmanya Lingappa 
> 
>  L:   linux-...@vger.kernel.org
>  S:   Supported
>  F:   Documentation/devicetree/bindings/pci/mobiveil-pcie.txt
> -F:   drivers/pci/controller/pcie-mobiveil.c
> +F:   drivers/pci/controller/mobiveil/pcie-mobiveil*
>  
>  PCI DRIVER FOR MVEBU (Marvell Armada 370 and Armada XP SOC support)
>  M:   Thomas Petazzoni 
> diff --git a/drivers/pci/controller/Kconfig b/drivers/pci/controller/Kconfig
> index 6671946dbf66..0e981ed00a75 100644
> --- a/drivers/pci/controller/Kconfig
> +++ b/drivers/pci/controller/Kconfig
> @@ -241,16 +241,6 @@ config PCIE_MEDIATEK
> Say Y here if you want to enable PCIe controller support on
> MediaTek SoCs.
>  
> -config PCIE_MOBIVEIL
> - bool "Mobiveil AXI PCIe controller"
> - depends on ARCH_ZYNQMP || COMPILE_TEST
> - depends on OF
> - depends on PCI_MSI_IRQ_DOMAIN
> - help
> -   Say Y here if you want to enable support for the Mobiveil AXI PCIe
> -   Soft IP. It has up to 8 outbound and inbound windows
> -   for address translation and it is a PCIe Gen4 IP.
> -
>  config PCIE_TANGO_SMP8759
>   bool "Tango SMP8759 PCIe controller (DANGEROUS)"
>   depends on ARCH_TANGO && PCI_MSI && OF
> @@ -281,4 +271,5 @@ config VMD
> module will be called vmd.
>  
>  source "drivers/pci/controller/dwc/Kconfig"
> +source "drivers/pci/controller/mobiveil/Kconfig"
>  endmenu
> diff --git a/drivers/pci/controller/Makefile b/drivers/pci/controller/Makefile
> index d56a507495c5..b79a615041a0 100644
> --- a/drivers/pci/controller/Makefile
> +++ b/drivers/pci/controller/Makefile
> @@ -26,11 +26,11 @@ obj-$(CONFIG_PCIE_ROCKCHIP) += pcie-rockchip.o
>  obj-$(CONFIG_PCIE_ROCKCHIP_EP) += pcie-rockchip-ep.o
>  obj-$(CONFIG_PCIE_ROCKCHIP_HOST) += pcie-rockchip-host.o
>  obj-$(CONFIG_PCIE_MEDIATEK) += pcie-mediatek.o
> -obj-$(CONFIG_PCIE_MOBIVEIL) += pcie-mobiveil.o
>  obj-$(CONFIG_PCIE_TANGO_SMP8759) += pcie-tango.o
>  obj-$(CONFIG_VMD) += vmd.o
>  # pcie-hisi.o quirks are needed even without CONFIG_PCIE_DW
>  obj-y+= dwc/
> +obj-y+= mobiveil/
>  
>  
>  # The following drivers are for devices that use the generic ACPI
> diff --git a/drivers/pci/controller/mobiveil/Kconfig 
> b/drivers/pci/controller/mobiveil/Kconfig
> new file mode 100644
> index ..64343c07bfed
> --- /dev/null
> +++ b/drivers/pci/controller/mobiveil/Kconfig
> @@ -0,0 +1,24 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +menu "Mobiveil PCIe Core Support"
> + depends on PCI
> +
> +config PCIE_MOBIVEIL
> + bool
> +
> +config PCIE_MOBIVEIL_HOST
> +bool
> + depends on PCI_MSI_IRQ_DOMAIN
> +select PCIE_MOBIVEIL
> +
> +config PCIE_MOBIVEIL_PLAT
> + bool "Mobiveil AXI PCIe controller"
> + depends on ARCH_ZYNQMP || COMPILE_TEST
> + depends on OF
> + select PCIE_MOBIVEIL_HOST
> + help
> +   Say Y here if you want to enable support for the Mobiveil AXI PCIe
> +   Soft IP. It has up to 8 outbound and inbound windows
> +   for address translation and it is a PCIe Gen4 IP.

[PATCH v7 04/11] dt-bindings: input: add DT bindings for max77650

2019-03-26 Thread Bartosz Golaszewski
From: Bartosz Golaszewski 

Add the DT binding document for the onkey module of max77650.

Signed-off-by: Bartosz Golaszewski 
Reviewed-by: Rob Herring 
---
 .../bindings/input/max77650-onkey.txt | 26 +++
 1 file changed, 26 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/input/max77650-onkey.txt

diff --git a/Documentation/devicetree/bindings/input/max77650-onkey.txt 
b/Documentation/devicetree/bindings/input/max77650-onkey.txt
new file mode 100644
index ..477dc74f452a
--- /dev/null
+++ b/Documentation/devicetree/bindings/input/max77650-onkey.txt
@@ -0,0 +1,26 @@
+Onkey driver for MAX77650 PMIC from Maxim Integrated.
+
+This module is part of the MAX77650 MFD device. For more details
+see Documentation/devicetree/bindings/mfd/max77650.txt.
+
+The onkey controller is represented as a sub-node of the PMIC node on
+the device tree.
+
+Required properties:
+
+- compatible:  Must be "maxim,max77650-onkey".
+
+Optional properties:
+- linux,code:  The key-code to be reported when the key is pressed.
+   Defaults to KEY_POWER.
+- maxim,onkey-slide:   The system's button is a slide switch, not the default
+   push button.
+
+Example:
+
+
+   onkey {
+   compatible = "maxim,max77650-onkey";
+   linux,code = ;
+   maxim,onkey-slide;
+   };
-- 
2.20.1



[PATCH v7 02/11] dt-bindings: power: supply: add DT bindings for max77650

2019-03-26 Thread Bartosz Golaszewski
From: Bartosz Golaszewski 

Add the DT binding document for the battery charger module of max77650.

Signed-off-by: Bartosz Golaszewski 
---
 .../power/supply/max77650-charger.txt | 27 +++
 1 file changed, 27 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/power/supply/max77650-charger.txt

diff --git 
a/Documentation/devicetree/bindings/power/supply/max77650-charger.txt 
b/Documentation/devicetree/bindings/power/supply/max77650-charger.txt
new file mode 100644
index ..fef188144386
--- /dev/null
+++ b/Documentation/devicetree/bindings/power/supply/max77650-charger.txt
@@ -0,0 +1,27 @@
+Battery charger driver for MAX77650 PMIC from Maxim Integrated.
+
+This module is part of the MAX77650 MFD device. For more details
+see Documentation/devicetree/bindings/mfd/max77650.txt.
+
+The charger is represented as a sub-node of the PMIC node on the device tree.
+
+Required properties:
+
+- compatible:  Must be "maxim,max77650-charger"
+
+Optional properties:
+
+- min-microvolt:   Minimum CHGIN regulation voltage (in microvolts). Must 
be
+   one of: 4000000, 4100000, 4200000, 4300000, 4400000,
+   4500000, 4600000, 4700000.
+- current-limit-microamp:  CHGIN input current limit (in microamps). Must
+   be one of: 95000, 190000, 285000, 380000, 
475000.
+
+Example:
+
+
+   charger {
+   compatible = "maxim,max77650-charger";
+   min-microvolt = <4200000>;
+   current-limit-microamp = <285000>;
+   };
-- 
2.20.1



[PATCH v7 03/11] dt-bindings: leds: add DT bindings for max77650

2019-03-26 Thread Bartosz Golaszewski
From: Bartosz Golaszewski 

Add the DT binding document for the LEDs module of max77650.

Signed-off-by: Bartosz Golaszewski 
Reviewed-by: Rob Herring 
---
 .../bindings/leds/leds-max77650.txt   | 57 +++
 1 file changed, 57 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/leds/leds-max77650.txt

diff --git a/Documentation/devicetree/bindings/leds/leds-max77650.txt 
b/Documentation/devicetree/bindings/leds/leds-max77650.txt
new file mode 100644
index ..3a67115cc1da
--- /dev/null
+++ b/Documentation/devicetree/bindings/leds/leds-max77650.txt
@@ -0,0 +1,57 @@
+LED driver for MAX77650 PMIC from Maxim Integrated.
+
+This module is part of the MAX77650 MFD device. For more details
+see Documentation/devicetree/bindings/mfd/max77650.txt.
+
+The LED controller is represented as a sub-node of the PMIC node on
+the device tree.
+
+This device has three current sinks.
+
+Required properties:
+
+- compatible:  Must be "maxim,max77650-led"
+- #address-cells:  Must be <1>.
+- #size-cells: Must be <0>.
+
+Each LED is represented as a sub-node of the LED-controller node. Up to
+three sub-nodes can be defined.
+
+Required properties of the sub-node:
+
+
+- reg: Must be <0>, <1> or <2>.
+
+Optional properties of the sub-node:
+
+
+- label:   See Documentation/devicetree/bindings/leds/common.txt
+- linux,default-trigger: See Documentation/devicetree/bindings/leds/common.txt
+
+For more details, please refer to the generic GPIO DT binding document
+.
+
+Example:
+
+
+   leds {
+   compatible = "maxim,max77650-led";
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   led@0 {
+   reg = <0>;
+   label = "blue:usr0";
+   };
+
+   led@1 {
+   reg = <1>;
+   label = "red:usr1";
+   linux,default-trigger = "heartbeat";
+   };
+
+   led@2 {
+   reg = <2>;
+   label = "green:usr2";
+   };
+   };
-- 
2.20.1



Re: [PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-03-26 Thread Joel Savitz
Forgot to add cc's... my bad.

Best,
Joel Savitz

On Tue, Mar 26, 2019 at 1:31 PM Joel Savitz  wrote:
>
> Ping!
>
> Does anyone have any comments or concerns about this patch?
>
> Best,
> Joel Savitz
>
> Best,
> Joel Savitz
>
>
> On Thu, Mar 7, 2019 at 9:42 AM Joel Savitz  wrote:
> >
> > On Wed, Mar 6, 2019 at 7:55 PM Joel Savitz  wrote:
> > >
> > > If a process is limited by taskset (i.e. cpuset) to only be allowed to
> > > run on cpu N, and then cpu N is offlined via hotplug, the process will
> > > be assigned the current value of its cpuset cgroup's effective_cpus field
> > > in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
> > > This argument's value does not make sense for this case, because
> > > task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn()
> > > to reflect the new value of cpu_active_mask after cpu N is removed from
> > > the mask. While this may make sense for the cgroup affinity mask, it
> > > does not make sense on a per-task basis, as a task that was previously
> > > limited to only be run on cpu N will be limited to every cpu _except_ for
> > > cpu N after it is offlined/onlined via hotplug.
> > >
> > > Pre-patch behavior:
> > >
> > > $ grep Cpus /proc/$$/status
> > > Cpus_allowed:   ff
> > > Cpus_allowed_list:  0-7
> > >
> > > $ taskset -p 4 $$
> > > pid 19202's current affinity mask: f
> > > pid 19202's new affinity mask: 4
> > >
> > > $ grep Cpus /proc/self/status
> > > Cpus_allowed:   04
> > > Cpus_allowed_list:  2
> > >
> > > # echo off > /sys/devices/system/cpu/cpu2/online
> > > $ grep Cpus /proc/$$/status
> > > Cpus_allowed:   0b
> > > Cpus_allowed_list:  0-1,3
> > >
> > > # echo on > /sys/devices/system/cpu/cpu2/online
> > > $ grep Cpus /proc/$$/status
> > > Cpus_allowed:   0b
> > > Cpus_allowed_list:  0-1,3
> > >
> > > On a patched system, the final grep produces the following
> > > output instead:
> > >
> > > $ grep Cpus /proc/$$/status
> > > Cpus_allowed:   ff
> > > Cpus_allowed_list:  0-7
> > >
> > > This patch changes the above behavior by instead simply resetting the mask
> > > to cpu_possible_mask.
> > >
> > > Signed-off-by: Joel Savitz 
> > > ---
> > >  kernel/cgroup/cpuset.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > > index 479743db6c37..5f65a2167bdf 100644
> > > --- a/kernel/cgroup/cpuset.c
> > > +++ b/kernel/cgroup/cpuset.c
> > > @@ -3243,7 +3243,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, 
> > > struct cpumask *pmask)
> > >  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> > >  {
> > > rcu_read_lock();
> > > -   do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
> > > +   do_set_cpus_allowed(tsk, cpu_possible_mask);
> > > rcu_read_unlock();
> > >
> > > /*
> > > --
> > > 2.20.1
> > >


[PATCH v7 05/11] mfd: core: document mfd_add_devices()

2019-03-26 Thread Bartosz Golaszewski
From: Bartosz Golaszewski 

Add a kernel doc for mfd_add_devices().

Signed-off-by: Bartosz Golaszewski 
Acked-by: Pavel Machek 
---
 drivers/mfd/mfd-core.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/mfd/mfd-core.c b/drivers/mfd/mfd-core.c
index 94e3f32ce935..0898a8db1747 100644
--- a/drivers/mfd/mfd-core.c
+++ b/drivers/mfd/mfd-core.c
@@ -269,6 +269,20 @@ static int mfd_add_device(struct device *parent, int id,
return ret;
 }
 
+/**
+ * mfd_add_devices - register a set of child devices
+ *
+ * @parent: Parent device for all sub-nodes.
+ * @id: Platform device id. If >= 0, each sub-device will have its cell_id
+ *  added to this number and use it as the platform device id.
+ * @cells: Array of mfd cells describing sub-devices.
+ * @n_devs: Number of sub-devices to register.
+ * @mem_base: Parent register range resource for sub-devices.
+ * @irq_base: Base of the range of virtual interrupt numbers allocated for
+ *this MFD device. Unused if @domain is specified.
+ * @domain: Interrupt domain used to create mappings for HW interrupt numbers
+ *  specified in sub-devices' IRQ resources.
+ */
 int mfd_add_devices(struct device *parent, int id,
const struct mfd_cell *cells, int n_devs,
struct resource *mem_base,
-- 
2.20.1
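
As a quick illustration of how the documented arguments line up at a call
site (a sketch with made-up "foo" names, not code from this series):

#include <linux/kernel.h>
#include <linux/mfd/core.h>
#include <linux/platform_device.h>

/* Hypothetical sub-devices of a parent "foo" chip. */
static const struct mfd_cell foo_cells[] = {
	{ .name = "foo-gpio" },
	{ .name = "foo-charger" },
};

static int foo_probe(struct platform_device *pdev)
{
	/* No id offset, no parent resource, no irq base, no irq domain. */
	return mfd_add_devices(&pdev->dev, PLATFORM_DEVID_NONE, foo_cells,
			       ARRAY_SIZE(foo_cells), NULL, 0, NULL);
}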



[PATCH v7 06/11] mfd: max77650: new core mfd driver

2019-03-26 Thread Bartosz Golaszewski
From: Bartosz Golaszewski 

Add the core mfd driver for max77650 PMIC. We define five sub-devices
for which the drivers will be added in subsequent patches.

Signed-off-by: Bartosz Golaszewski 
---
 drivers/mfd/Kconfig  |  14 +++
 drivers/mfd/Makefile |   1 +
 drivers/mfd/max77650.c   | 234 +++
 include/linux/mfd/max77650.h |  59 +
 4 files changed, 308 insertions(+)
 create mode 100644 drivers/mfd/max77650.c
 create mode 100644 include/linux/mfd/max77650.h

diff --git a/drivers/mfd/Kconfig b/drivers/mfd/Kconfig
index 0ce2d8dfc5f1..ade04e124aa0 100644
--- a/drivers/mfd/Kconfig
+++ b/drivers/mfd/Kconfig
@@ -733,6 +733,20 @@ config MFD_MAX77620
  provides common support for accessing the device; additional drivers
  must be enabled in order to use the functionality of the device.
 
+config MFD_MAX77650
+   tristate "Maxim MAX77650/77651 PMIC Support"
+   depends on I2C
+   depends on OF || COMPILE_TEST
+   select MFD_CORE
+   select REGMAP_I2C
+   help
+ Say Y here to add support for Maxim Semiconductor MAX77650 and
+ MAX77651 Power Management ICs. This is the core multifunction
+ driver for interacting with the device. The module name is
+ 'max77650'. Additional drivers can be enabled in order to use
+ the following functionalities of the device: GPIO, regulator,
+ charger, LED, onkey.
+
 config MFD_MAX77686
tristate "Maxim Semiconductor MAX77686/802 PMIC Support"
depends on I2C
diff --git a/drivers/mfd/Makefile b/drivers/mfd/Makefile
index b4569ed7f3f3..5727d099c16f 100644
--- a/drivers/mfd/Makefile
+++ b/drivers/mfd/Makefile
@@ -155,6 +155,7 @@ obj-$(CONFIG_MFD_DA9150)+= da9150-core.o
 
 obj-$(CONFIG_MFD_MAX14577) += max14577.o
 obj-$(CONFIG_MFD_MAX77620) += max77620.o
+obj-$(CONFIG_MFD_MAX77650) += max77650.o
 obj-$(CONFIG_MFD_MAX77686) += max77686.o
 obj-$(CONFIG_MFD_MAX77693) += max77693.o
 obj-$(CONFIG_MFD_MAX77843) += max77843.o
diff --git a/drivers/mfd/max77650.c b/drivers/mfd/max77650.c
new file mode 100644
index ..7a6c0a5cf602
--- /dev/null
+++ b/drivers/mfd/max77650.c
@@ -0,0 +1,234 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Copyright (C) 2018 BayLibre SAS
+// Author: Bartosz Golaszewski 
+//
+// Core MFD driver for MAXIM 77650/77651 charger/power-supply.
+// Programming manual: https://pdfserv.maximintegrated.com/en/an/AN6428.pdf
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MAX77650_INT_GPI_F_MSK BIT(0)
+#define MAX77650_INT_GPI_R_MSK BIT(1)
+#define MAX77650_INT_GPI_MSK \
+   (MAX77650_INT_GPI_F_MSK | MAX77650_INT_GPI_R_MSK)
+#define MAX77650_INT_nEN_F_MSK BIT(2)
+#define MAX77650_INT_nEN_R_MSK BIT(3)
+#define MAX77650_INT_TJAL1_R_MSK   BIT(4)
+#define MAX77650_INT_TJAL2_R_MSK   BIT(5)
+#define MAX77650_INT_DOD_R_MSK BIT(6)
+
+#define MAX77650_INT_THM_MSK   BIT(0)
+#define MAX77650_INT_CHG_MSK   BIT(1)
+#define MAX77650_INT_CHGIN_MSK BIT(2)
+#define MAX77650_INT_TJ_REG_MSKBIT(3)
+#define MAX77650_INT_CHGIN_CTRL_MSKBIT(4)
+#define MAX77650_INT_SYS_CTRL_MSK  BIT(5)
+#define MAX77650_INT_SYS_CNFG_MSK  BIT(6)
+
+#define MAX77650_INT_GLBL_OFFSET   0
+#define MAX77650_INT_CHG_OFFSET1
+
+#define MAX77650_SBIA_LPM_MASK BIT(5)
+#define MAX77650_SBIA_LPM_DISABLED 0x00
+
+enum {
+   MAX77650_INT_GPI,
+   MAX77650_INT_nEN_F,
+   MAX77650_INT_nEN_R,
+   MAX77650_INT_TJAL1_R,
+   MAX77650_INT_TJAL2_R,
+   MAX77650_INT_DOD_R,
+   MAX77650_INT_THM,
+   MAX77650_INT_CHG,
+   MAX77650_INT_CHGIN,
+   MAX77650_INT_TJ_REG,
+   MAX77650_INT_CHGIN_CTRL,
+   MAX77650_INT_SYS_CTRL,
+   MAX77650_INT_SYS_CNFG,
+};
+
+static const struct resource max77650_charger_resources[] = {
+   DEFINE_RES_IRQ_NAMED(MAX77650_INT_CHG, "CHG"),
+   DEFINE_RES_IRQ_NAMED(MAX77650_INT_CHGIN, "CHGIN"),
+};
+
+static const struct resource max77650_gpio_resources[] = {
+   DEFINE_RES_IRQ_NAMED(MAX77650_INT_GPI, "GPI"),
+};
+
+static const struct resource max77650_onkey_resources[] = {
+   DEFINE_RES_IRQ_NAMED(MAX77650_INT_nEN_F, "nEN_F"),
+   DEFINE_RES_IRQ_NAMED(MAX77650_INT_nEN_R, "nEN_R"),
+};
+
+static const struct mfd_cell max77650_cells[] = {
+   {
+   .name   = "max77650-regulator",
+   .of_compatible  = "maxim,max77650-regulator",
+   },
+   {
+   .name   = "max77650-charger",
+   .of_compatible  = "maxim,max77650-charger",
+   .resources  = max77650_charger_resources,
+   .num_resources  = ARRAY_SIZE(max77650_charger_resources),
+   },
+   {
+   .name   = "max77650-gpio",
+   .of_compatible  = 

[PATCH v7 09/11] leds: max77650: add LEDs support

2019-03-26 Thread Bartosz Golaszewski
From: Bartosz Golaszewski 

This adds basic support for LEDs for the max77650 PMIC. The device has
three current sinks for driving LEDs.

Signed-off-by: Bartosz Golaszewski 
Acked-by: Jacek Anaszewski 
---
 drivers/leds/Kconfig |   6 ++
 drivers/leds/Makefile|   1 +
 drivers/leds/leds-max77650.c | 147 +++
 3 files changed, 154 insertions(+)
 create mode 100644 drivers/leds/leds-max77650.c

diff --git a/drivers/leds/Kconfig b/drivers/leds/Kconfig
index a72f97fca57b..d8c70cc6a714 100644
--- a/drivers/leds/Kconfig
+++ b/drivers/leds/Kconfig
@@ -608,6 +608,12 @@ config LEDS_TLC591XX
  This option enables support for Texas Instruments TLC59108
  and TLC59116 LED controllers.
 
+config LEDS_MAX77650
+   tristate "LED support for Maxim MAX77650 PMIC"
+   depends on LEDS_CLASS && MFD_MAX77650
+   help
+ LEDs driver for MAX77650 family of PMICs from Maxim Integrated.
+
 config LEDS_MAX77693
tristate "LED support for MAX77693 Flash"
depends on LEDS_CLASS_FLASH
diff --git a/drivers/leds/Makefile b/drivers/leds/Makefile
index 4c1b0054f379..f48b2404dbb7 100644
--- a/drivers/leds/Makefile
+++ b/drivers/leds/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_LEDS_MC13783)+= leds-mc13783.o
 obj-$(CONFIG_LEDS_NS2) += leds-ns2.o
 obj-$(CONFIG_LEDS_NETXBIG) += leds-netxbig.o
 obj-$(CONFIG_LEDS_ASIC3)   += leds-asic3.o
+obj-$(CONFIG_LEDS_MAX77650)+= leds-max77650.o
 obj-$(CONFIG_LEDS_MAX77693)+= leds-max77693.o
 obj-$(CONFIG_LEDS_MAX8997) += leds-max8997.o
 obj-$(CONFIG_LEDS_LM355x)  += leds-lm355x.o
diff --git a/drivers/leds/leds-max77650.c b/drivers/leds/leds-max77650.c
new file mode 100644
index ..6b74ce9cac12
--- /dev/null
+++ b/drivers/leds/leds-max77650.c
@@ -0,0 +1,147 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Copyright (C) 2018 BayLibre SAS
+// Author: Bartosz Golaszewski 
+//
+// LED driver for MAXIM 77650/77651 charger/power-supply.
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MAX77650_LED_NUM_LEDS  3
+
+#define MAX77650_LED_A_BASE0x40
+#define MAX77650_LED_B_BASE0x43
+
+#define MAX77650_LED_BR_MASK   GENMASK(4, 0)
+#define MAX77650_LED_EN_MASK   GENMASK(7, 6)
+
+#define MAX77650_LED_MAX_BRIGHTNESSMAX77650_LED_BR_MASK
+
+/* Enable EN_LED_MSTR. */
+#define MAX77650_LED_TOP_DEFAULT   BIT(0)
+
+#define MAX77650_LED_ENABLEGENMASK(7, 6)
+#define MAX77650_LED_DISABLE   0x00
+
+#define MAX77650_LED_A_DEFAULT MAX77650_LED_DISABLE
+/* 100% on duty */
+#define MAX77650_LED_B_DEFAULT GENMASK(3, 0)
+
+struct max77650_led {
+   struct led_classdev cdev;
+   struct regmap *map;
+   unsigned int regA;
+   unsigned int regB;
+};
+
+static struct max77650_led *max77650_to_led(struct led_classdev *cdev)
+{
+   return container_of(cdev, struct max77650_led, cdev);
+}
+
+static int max77650_led_brightness_set(struct led_classdev *cdev,
+  enum led_brightness brightness)
+{
+   struct max77650_led *led = max77650_to_led(cdev);
+   int val, mask;
+
+   mask = MAX77650_LED_BR_MASK | MAX77650_LED_EN_MASK;
+
+   if (brightness == LED_OFF)
+   val = MAX77650_LED_DISABLE;
+   else
+   val = MAX77650_LED_ENABLE | brightness;
+
+   return regmap_update_bits(led->map, led->regA, mask, val);
+}
+
+static int max77650_led_probe(struct platform_device *pdev)
+{
+   struct device_node *of_node, *child;
+   struct max77650_led *leds, *led;
+   struct device *parent;
+   struct device *dev;
+   struct regmap *map;
+   const char *label;
+   int rv, num_leds;
+   u32 reg;
+
+   dev = &pdev->dev;
+   parent = dev->parent;
+   of_node = dev->of_node;
+
+   if (!of_node)
+   return -ENODEV;
+
+   leds = devm_kcalloc(dev, sizeof(*leds),
+   MAX77650_LED_NUM_LEDS, GFP_KERNEL);
+   if (!leds)
+   return -ENOMEM;
+
+   map = dev_get_regmap(dev->parent, NULL);
+   if (!map)
+   return -ENODEV;
+
+   num_leds = of_get_child_count(of_node);
+   if (!num_leds || num_leds > MAX77650_LED_NUM_LEDS)
+   return -ENODEV;
+
+   for_each_child_of_node(of_node, child) {
+   rv = of_property_read_u32(child, "reg", &reg);
+   if (rv || reg >= MAX77650_LED_NUM_LEDS)
+   return -EINVAL;
+
+   led = &leds[reg];
+   led->map = map;
+   led->regA = MAX77650_LED_A_BASE + reg;
+   led->regB = MAX77650_LED_B_BASE + reg;
+   led->cdev.brightness_set_blocking = max77650_led_brightness_set;
+   led->cdev.max_brightness = MAX77650_LED_MAX_BRIGHTNESS;
+
+   label = of_get_property(child, "label", 

[PATCH v7 10/11] input: max77650: add onkey support

2019-03-26 Thread Bartosz Golaszewski
From: Bartosz Golaszewski 

Add support for the push- and slide-button events for max77650.

Signed-off-by: Bartosz Golaszewski 
Acked-by: Dmitry Torokhov 
Acked-by: Pavel Machek 
---
 drivers/input/misc/Kconfig  |   9 +++
 drivers/input/misc/Makefile |   1 +
 drivers/input/misc/max77650-onkey.c | 121 
 3 files changed, 131 insertions(+)
 create mode 100644 drivers/input/misc/max77650-onkey.c

diff --git a/drivers/input/misc/Kconfig b/drivers/input/misc/Kconfig
index e15ed1bb8558..85bc675eecd3 100644
--- a/drivers/input/misc/Kconfig
+++ b/drivers/input/misc/Kconfig
@@ -190,6 +190,15 @@ config INPUT_M68K_BEEP
tristate "M68k Beeper support"
depends on M68K
 
+config INPUT_MAX77650_ONKEY
+   tristate "Maxim MAX77650 ONKEY support"
+   depends on MFD_MAX77650
+   help
+ Support the ONKEY of the MAX77650 PMIC as an input device.
+
+ To compile this driver as a module, choose M here: the module
+ will be called max77650-onkey.
+
 config INPUT_MAX77693_HAPTIC
tristate "MAXIM MAX77693/MAX77843 haptic controller support"
depends on (MFD_MAX77693 || MFD_MAX77843) && PWM
diff --git a/drivers/input/misc/Makefile b/drivers/input/misc/Makefile
index b936c5b1d4ac..ffd72161c79b 100644
--- a/drivers/input/misc/Makefile
+++ b/drivers/input/misc/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_INPUT_IXP4XX_BEEPER) += ixp4xx-beeper.o
 obj-$(CONFIG_INPUT_KEYSPAN_REMOTE) += keyspan_remote.o
 obj-$(CONFIG_INPUT_KXTJ9)  += kxtj9.o
 obj-$(CONFIG_INPUT_M68K_BEEP)  += m68kspkr.o
+obj-$(CONFIG_INPUT_MAX77650_ONKEY) += max77650-onkey.o
 obj-$(CONFIG_INPUT_MAX77693_HAPTIC)+= max77693-haptic.o
 obj-$(CONFIG_INPUT_MAX8925_ONKEY)  += max8925_onkey.o
 obj-$(CONFIG_INPUT_MAX8997_HAPTIC) += max8997_haptic.o
diff --git a/drivers/input/misc/max77650-onkey.c 
b/drivers/input/misc/max77650-onkey.c
new file mode 100644
index ..fbf6caab7217
--- /dev/null
+++ b/drivers/input/misc/max77650-onkey.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Copyright (C) 2018 BayLibre SAS
+// Author: Bartosz Golaszewski 
+//
+// ONKEY driver for MAXIM 77650/77651 charger/power-supply.
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MAX77650_ONKEY_MODE_MASK   BIT(3)
+#define MAX77650_ONKEY_MODE_PUSH   0x00
+#define MAX77650_ONKEY_MODE_SLIDE  BIT(3)
+
+struct max77650_onkey {
+   struct input_dev *input;
+   unsigned int code;
+};
+
+static irqreturn_t max77650_onkey_falling(int irq, void *data)
+{
+   struct max77650_onkey *onkey = data;
+
+   input_report_key(onkey->input, onkey->code, 0);
+   input_sync(onkey->input);
+
+   return IRQ_HANDLED;
+}
+
+static irqreturn_t max77650_onkey_rising(int irq, void *data)
+{
+   struct max77650_onkey *onkey = data;
+
+   input_report_key(onkey->input, onkey->code, 1);
+   input_sync(onkey->input);
+
+   return IRQ_HANDLED;
+}
+
+static int max77650_onkey_probe(struct platform_device *pdev)
+{
+   int irq_r, irq_f, error, mode;
+   struct max77650_onkey *onkey;
+   struct device *dev, *parent;
+   struct regmap *map;
+   unsigned int type;
+
+   dev = &pdev->dev;
+   parent = dev->parent;
+
+   map = dev_get_regmap(parent, NULL);
+   if (!map)
+   return -ENODEV;
+
+   onkey = devm_kzalloc(dev, sizeof(*onkey), GFP_KERNEL);
+   if (!onkey)
+   return -ENOMEM;
+
+   error = device_property_read_u32(dev, "linux,code", &onkey->code);
+   if (error)
+   onkey->code = KEY_POWER;
+
+   if (device_property_read_bool(dev, "maxim,onkey-slide")) {
+   mode = MAX77650_ONKEY_MODE_SLIDE;
+   type = EV_SW;
+   } else {
+   mode = MAX77650_ONKEY_MODE_PUSH;
+   type = EV_KEY;
+   }
+
+   error = regmap_update_bits(map, MAX77650_REG_CNFG_GLBL,
+  MAX77650_ONKEY_MODE_MASK, mode);
+   if (error)
+   return error;
+
+   irq_f = platform_get_irq_byname(pdev, "nEN_F");
+   if (irq_f < 0)
+   return irq_f;
+
+   irq_r = platform_get_irq_byname(pdev, "nEN_R");
+   if (irq_r < 0)
+   return irq_r;
+
+   onkey->input = devm_input_allocate_device(dev);
+   if (!onkey->input)
+   return -ENOMEM;
+
+   onkey->input->name = "max77650_onkey";
+   onkey->input->phys = "max77650_onkey/input0";
+   onkey->input->id.bustype = BUS_I2C;
+   input_set_capability(onkey->input, type, onkey->code);
+
+   error = devm_request_any_context_irq(dev, irq_f, max77650_onkey_falling,
+IRQF_ONESHOT, "onkey-down", onkey);
+   if (error < 0)
+   return error;
+
+   error = devm_request_any_context_irq(dev, irq_r, max77650_onkey_rising,
+

[PATCH v7 11/11] MAINTAINERS: add an entry for max77650 mfd driver

2019-03-26 Thread Bartosz Golaszewski
From: Bartosz Golaszewski 

I plan on extending this set of drivers, so add myself as maintainer.

Signed-off-by: Bartosz Golaszewski 
---
 MAINTAINERS | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3e5a5d263f29..b32fe859c341 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9407,6 +9407,20 @@ S:   Maintained
 F: Documentation/devicetree/bindings/sound/max9860.txt
 F: sound/soc/codecs/max9860.*
 
+MAXIM MAX77650 PMIC MFD DRIVER
+M: Bartosz Golaszewski 
+L: linux-kernel@vger.kernel.org
+S: Maintained
+F: Documentation/devicetree/bindings/*/*max77650.txt
+F: Documentation/devicetree/bindings/*/max77650*.txt
+F: include/linux/mfd/max77650.h
+F: drivers/mfd/max77650.c
+F: drivers/regulator/max77650-regulator.c
+F: drivers/power/supply/max77650-charger.c
+F: drivers/input/misc/max77650-onkey.c
+F: drivers/leds/leds-max77650.c
+F: drivers/gpio/gpio-max77650.c
+
 MAXIM MAX77802 PMIC REGULATOR DEVICE DRIVER
 M: Javier Martinez Canillas 
 L: linux-kernel@vger.kernel.org
-- 
2.20.1



[PATCH v7 08/11] gpio: max77650: add GPIO support

2019-03-26 Thread Bartosz Golaszewski
From: Bartosz Golaszewski 

Add GPIO support for max77650 mfd device. This PMIC exposes a single
GPIO line.

Signed-off-by: Bartosz Golaszewski 
Reviewed-by: Linus Walleij 
---
 drivers/gpio/Kconfig |   7 ++
 drivers/gpio/Makefile|   1 +
 drivers/gpio/gpio-max77650.c | 190 +++
 3 files changed, 198 insertions(+)
 create mode 100644 drivers/gpio/gpio-max77650.c

diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index 3f50526a771f..c4f912104440 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -1112,6 +1112,13 @@ config GPIO_MAX77620
  driver also provides interrupt support for each of the gpios.
  Say yes here to enable the max77620 to be used as gpio controller.
 
+config GPIO_MAX77650
+   tristate "Maxim MAX77650/77651 GPIO support"
+   depends on MFD_MAX77650
+   help
+ GPIO driver for MAX77650/77651 PMIC from Maxim Semiconductor.
+ These chips have a single pin that can be configured as GPIO.
+
 config GPIO_MSIC
bool "Intel MSIC mixed signal gpio support"
depends on (X86 || COMPILE_TEST) && MFD_INTEL_MSIC
diff --git a/drivers/gpio/Makefile b/drivers/gpio/Makefile
index 54d55274b93a..075722d8317d 100644
--- a/drivers/gpio/Makefile
+++ b/drivers/gpio/Makefile
@@ -80,6 +80,7 @@ obj-$(CONFIG_GPIO_MAX7300)+= gpio-max7300.o
 obj-$(CONFIG_GPIO_MAX7301) += gpio-max7301.o
 obj-$(CONFIG_GPIO_MAX732X) += gpio-max732x.o
 obj-$(CONFIG_GPIO_MAX77620)+= gpio-max77620.o
+obj-$(CONFIG_GPIO_MAX77650)+= gpio-max77650.o
 obj-$(CONFIG_GPIO_MB86S7X) += gpio-mb86s7x.o
 obj-$(CONFIG_GPIO_MENZ127) += gpio-menz127.o
 obj-$(CONFIG_GPIO_MERRIFIELD)  += gpio-merrifield.o
diff --git a/drivers/gpio/gpio-max77650.c b/drivers/gpio/gpio-max77650.c
new file mode 100644
index ..3f03f4e8956c
--- /dev/null
+++ b/drivers/gpio/gpio-max77650.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Copyright (C) 2018 BayLibre SAS
+// Author: Bartosz Golaszewski 
+//
+// GPIO driver for MAXIM 77650/77651 charger/power-supply.
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MAX77650_GPIO_DIR_MASK BIT(0)
+#define MAX77650_GPIO_INVAL_MASK   BIT(1)
+#define MAX77650_GPIO_DRV_MASK BIT(2)
+#define MAX77650_GPIO_OUTVAL_MASK  BIT(3)
+#define MAX77650_GPIO_DEBOUNCE_MASKBIT(4)
+
+#define MAX77650_GPIO_DIR_OUT  0x00
+#define MAX77650_GPIO_DIR_IN   BIT(0)
+#define MAX77650_GPIO_OUT_LOW  0x00
+#define MAX77650_GPIO_OUT_HIGH BIT(3)
+#define MAX77650_GPIO_DRV_OPEN_DRAIN   0x00
+#define MAX77650_GPIO_DRV_PUSH_PULLBIT(2)
+#define MAX77650_GPIO_DEBOUNCE BIT(4)
+
+#define MAX77650_GPIO_DIR_BITS(_reg) \
+   ((_reg) & MAX77650_GPIO_DIR_MASK)
+#define MAX77650_GPIO_INVAL_BITS(_reg) \
+   (((_reg) & MAX77650_GPIO_INVAL_MASK) >> 1)
+
+struct max77650_gpio_chip {
+   struct regmap *map;
+   struct gpio_chip gc;
+   int irq;
+};
+
+static int max77650_gpio_direction_input(struct gpio_chip *gc,
+unsigned int offset)
+{
+   struct max77650_gpio_chip *chip = gpiochip_get_data(gc);
+
+   return regmap_update_bits(chip->map,
+ MAX77650_REG_CNFG_GPIO,
+ MAX77650_GPIO_DIR_MASK,
+ MAX77650_GPIO_DIR_IN);
+}
+
+static int max77650_gpio_direction_output(struct gpio_chip *gc,
+ unsigned int offset, int value)
+{
+   struct max77650_gpio_chip *chip = gpiochip_get_data(gc);
+   int mask, regval;
+
+   mask = MAX77650_GPIO_DIR_MASK | MAX77650_GPIO_OUTVAL_MASK;
+   regval = value ? MAX77650_GPIO_OUT_HIGH : MAX77650_GPIO_OUT_LOW;
+   regval |= MAX77650_GPIO_DIR_OUT;
+
+   return regmap_update_bits(chip->map,
+ MAX77650_REG_CNFG_GPIO, mask, regval);
+}
+
+static void max77650_gpio_set_value(struct gpio_chip *gc,
+   unsigned int offset, int value)
+{
+   struct max77650_gpio_chip *chip = gpiochip_get_data(gc);
+   int rv, regval;
+
+   regval = value ? MAX77650_GPIO_OUT_HIGH : MAX77650_GPIO_OUT_LOW;
+
+   rv = regmap_update_bits(chip->map, MAX77650_REG_CNFG_GPIO,
+   MAX77650_GPIO_OUTVAL_MASK, regval);
+   if (rv)
+   dev_err(gc->parent, "cannot set GPIO value: %d\n", rv);
+}
+
+static int max77650_gpio_get_value(struct gpio_chip *gc,
+  unsigned int offset)
+{
+   struct max77650_gpio_chip *chip = gpiochip_get_data(gc);
+   unsigned int val;
+   int rv;
+
+   rv = regmap_read(chip->map, MAX77650_REG_CNFG_GPIO, &val);
+   if (rv)
+   return rv;
+
+   return MAX77650_GPIO_INVAL_BITS(val);
+}
+
+static int max77650_gpio_get_direction(struct gpio_chip *gc,
+

[PATCH v7 07/11] power: supply: max77650: add support for battery charger

2019-03-26 Thread Bartosz Golaszewski
From: Bartosz Golaszewski 

Add basic support for the battery charger for max77650 PMIC.

Signed-off-by: Bartosz Golaszewski 
---
 drivers/power/supply/Kconfig|   7 +
 drivers/power/supply/Makefile   |   1 +
 drivers/power/supply/max77650-charger.c | 367 
 3 files changed, 375 insertions(+)
 create mode 100644 drivers/power/supply/max77650-charger.c

diff --git a/drivers/power/supply/Kconfig b/drivers/power/supply/Kconfig
index e901b9879e7e..0230c96fa94d 100644
--- a/drivers/power/supply/Kconfig
+++ b/drivers/power/supply/Kconfig
@@ -499,6 +499,13 @@ config CHARGER_DETECTOR_MAX14656
  Revision 1.2 and can be found e.g. in Kindle 4/5th generation
  readers and certain LG devices.
 
+config CHARGER_MAX77650
+   tristate "Maxim MAX77650 battery charger driver"
+   depends on MFD_MAX77650
+   help
+ Say Y to enable support for the battery charger control of MAX77650
+ PMICs.
+
 config CHARGER_MAX77693
tristate "Maxim MAX77693 battery charger driver"
depends on MFD_MAX77693
diff --git a/drivers/power/supply/Makefile b/drivers/power/supply/Makefile
index b731c2a9b695..b73eb8c5c1a9 100644
--- a/drivers/power/supply/Makefile
+++ b/drivers/power/supply/Makefile
@@ -70,6 +70,7 @@ obj-$(CONFIG_CHARGER_MANAGER) += charger-manager.o
 obj-$(CONFIG_CHARGER_LTC3651)  += ltc3651-charger.o
 obj-$(CONFIG_CHARGER_MAX14577) += max14577_charger.o
 obj-$(CONFIG_CHARGER_DETECTOR_MAX14656)+= max14656_charger_detector.o
+obj-$(CONFIG_CHARGER_MAX77650) += max77650-charger.o
 obj-$(CONFIG_CHARGER_MAX77693) += max77693_charger.o
 obj-$(CONFIG_CHARGER_MAX8997)  += max8997_charger.o
 obj-$(CONFIG_CHARGER_MAX8998)  += max8998_charger.o
diff --git a/drivers/power/supply/max77650-charger.c 
b/drivers/power/supply/max77650-charger.c
new file mode 100644
index ..e7cca32944bd
--- /dev/null
+++ b/drivers/power/supply/max77650-charger.c
@@ -0,0 +1,367 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// Copyright (C) 2018 BayLibre SAS
+// Author: Bartosz Golaszewski 
+//
+// Battery charger driver for MAXIM 77650/77651 charger/power-supply.
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MAX77650_CHARGER_ENABLED   BIT(0)
+#define MAX77650_CHARGER_DISABLED  0x00
+#define MAX77650_CHARGER_CHG_EN_MASK   BIT(0)
+
+#define MAX77650_CHG_DETAILS_MASK  GENMASK(7, 4)
+#define MAX77650_CHG_DETAILS_BITS(_reg) \
+   (((_reg) & MAX77650_CHG_DETAILS_MASK) >> 4)
+
+/* Charger is OFF. */
+#define MAX77650_CHG_OFF   0x00
+/* Charger is in prequalification mode. */
+#define MAX77650_CHG_PREQ  0x01
+/* Charger is in fast-charge constant current mode. */
+#define MAX77650_CHG_ON_CURR   0x02
+/* Charger is in JEITA modified fast-charge constant-current mode. */
+#define MAX77650_CHG_ON_CURR_JEITA 0x03
+/* Charger is in fast-charge constant-voltage mode. */
+#define MAX77650_CHG_ON_VOLT   0x04
+/* Charger is in JEITA modified fast-charge constant-voltage mode. */
+#define MAX77650_CHG_ON_VOLT_JEITA 0x05
+/* Charger is in top-off mode. */
+#define MAX77650_CHG_ON_TOPOFF 0x06
+/* Charger is in JEITA modified top-off mode. */
+#define MAX77650_CHG_ON_TOPOFF_JEITA   0x07
+/* Charger is done. */
+#define MAX77650_CHG_DONE  0x08
+/* Charger is JEITA modified done. */
+#define MAX77650_CHG_DONE_JEITA0x09
+/* Charger is suspended due to a prequalification timer fault. */
+#define MAX77650_CHG_SUSP_PREQ_TIM_FAULT   0x0a
+/* Charger is suspended due to a fast-charge timer fault. */
+#define MAX77650_CHG_SUSP_FAST_CHG_TIM_FAULT   0x0b
+/* Charger is suspended due to a battery temperature fault. */
+#define MAX77650_CHG_SUSP_BATT_TEMP_FAULT  0x0c
+
+#define MAX77650_CHGIN_DETAILS_MASKGENMASK(3, 2)
+#define MAX77650_CHGIN_DETAILS_BITS(_reg) \
+   (((_reg) & MAX77650_CHGIN_DETAILS_MASK) >> 2)
+
+#define MAX77650_CHGIN_UNDERVOLTAGE_LOCKOUT0x00
+#define MAX77650_CHGIN_OVERVOLTAGE_LOCKOUT 0x01
+#define MAX77650_CHGIN_OKAY0x11
+
+#define MAX77650_CHARGER_CHG_MASK  BIT(1)
+#define MAX77650_CHARGER_CHG_CHARGING(_reg) \
+   (((_reg) & MAX77650_CHARGER_CHG_MASK) > 1)
+
+#define MAX77650_CHARGER_VCHGIN_MIN_MASK   0xc0
+#define MAX77650_CHARGER_VCHGIN_MIN_SHIFT(_val)((_val) << 5)
+
+#define MAX77650_CHARGER_ICHGIN_LIM_MASK   0x1c
+#define MAX77650_CHARGER_ICHGIN_LIM_SHIFT(_val)((_val) << 2)
+
+struct max77650_charger_data {
+   struct regmap *map;
+   struct device *dev;
+};
+
+static enum power_supply_property max77650_charger_properties[] = {
+   POWER_SUPPLY_PROP_STATUS,
+   POWER_SUPPLY_PROP_ONLINE,
+   POWER_SUPPLY_PROP_CHARGE_TYPE
+};
+
+static const unsigned int 

Re: [PATCH net-next v5 05/22] ethtool: introduce ethtool netlink interface

2019-03-26 Thread Michal Kubecek
On Tue, Mar 26, 2019 at 05:36:40PM +0100, Jiri Pirko wrote:
> Mon, Mar 25, 2019 at 06:08:09PM CET, mkube...@suse.cz wrote:
> >+/* genetlink setup */
> >+
> >+static const struct genl_ops ethtool_genl_ops[] = {
> 
> Please be consistent with prefixes. Either use "ethtool_" or "ethnl_"
> for all functions and variables in this code.

OK

> >+/* module setup */
> >+
> >+static int __init ethnl_init(void)
> >+{
> >+int ret;
> >+
> >+ret = genl_register_family(&ethtool_genl_family);
> >+if (WARN(ret < 0, "ethtool: genetlink family registration failed"))
> 
> Why do you need this warning? Please avoid it.

I'm confused now... a few days ago you replied "+1" to the idea:

  http://lkml.kernel.org/r/20190321162105.GU2087@nanopsycho

I agreed that panic() (which is what e.g. rtnetlink does) would be
overkill, but I would definitely be opposed to not having anything in the
log at all and just silently going on without the interface (which may
result in a misconfigured network). I believe that if this fails, it is
a sign of something going very wrong inside the kernel, so the "W"
taint flag would be appropriate.

Michal


[PATCH v7 00/11] mfd: add support for max77650 PMIC

2019-03-26 Thread Bartosz Golaszewski
From: Bartosz Golaszewski 

This series adds support for max77650 ultra low-power PMIC. It provides
the core mfd driver and a set of five sub-drivers for the regulator,
power supply, gpio, leds and input subsystems.

Patches 1-4 add the DT binding documents. Patch 5 documents mfd_add_devices().
Patches 6-10 add all drivers. Last patch adds a MAINTAINERS entry for this
device.

The regulator part is already upstream.

v1 -> v2:
=

General:
- use C++ style comments for the SPDX license identifier and the
  copyright header
- s/MODULE_LICENSE("GPL")/MODULE_LICENSE("GPL v2")/
- lookup the virtual interrupt numbers in the MFD driver, setup
  resources for child devices and use platform_get_irq_byname()
  in sub-drivers
- picked up review tags
- use devm_request_any_context_irq() for interrupt requests

LEDs:
- changed the max77650_leds_ prefix to max77650_led_
- drop the max77650_leds structure as the only field it held was the
  regmap pointer, move said pointer to struct max77650_led
- change the driver name to "max77650-led"
- drop the last return value check and return the result of
  regmap_write() directly
- change the labeling scheme to one consistent with other LED drivers

ONKEY:
- drop the key reporting helper and call the input functions directly
  from interrupt handlers
- rename the rv local variable to error
- drop parent device assignment

Regulator:
- drop the unnecessary init_data lookup from the driver code
- drop unnecessary include

Charger:
- disable the charger on driver remove
- change the power supply type to POWER_SUPPLY_TYPE_USB

GPIO:
- drop interrupt support until we have correct implementation of hierarchical
  irqs in gpiolib

v2 -> v3:
=

General:
- dropped regulator patches as they're already in Mark Brown's branch

LED:
- fix the compatible string in the DT binding example
- use the max_brightness property
- use a common prefix ("MAX77650_LED") for all defines in the driver

MFD:
- add the MODULE_DEVICE_TABLE()
- add a sentinel to the of_device_id array
- constify the pointers to irq names
- use an enum instead of defines for interrupt indexes

v3 -> v4:
=

GPIO:
- as discussed with Linus Walleij: the gpio-controller is now part of
  the core mfd module (we don't spawn a sub-node anymore), the binding
  document for GPIO has been dropped, the GPIO properties have been
  defined in the binding document for the mfd core, the interrupt
  functionality has been reintroduced with the irq directly passed from
  the mfd part
- due to the above changes the Reviewed-by tag from Linus was dropped

v4 -> v5:
=

General:
- add a patch documenting mfd_add_devices()

MFD:
- pass the regmap irq_chip irq domain to mfd over mfd_add_devices so that
  the hw interrupts from resources can be correctly mapped to virtual irqs
- remove the enum listing cell indexes
- extend Kconfig help
- add a link to the programming manual
- use REGMAP_IRQ_REG() for regmap interrupts (except for GPI, which
  is composed of two hw interrupts for rising and falling edge)
- add error messages in probe
- use PLATFORM_DEVID_NONE constant in devm_mfd_add_devices()
- set irq_base to 0 in regmap_add_irq_chip() as other users do; it's only
  relevant if it's > 0

Charger:
- use non-maxim specific property names for minimum input voltage and current
  limit
- code shrink by using the enable/disable charger helpers everywhere
- use more descriptive names for constants

Onkey:
- use EV_SW event type for slide mode

LED:
- remove stray " from Kconfig help

v5 -> v6:
=

MFD:
- remove stray spaces in the binding document
- rename the example dt node
- remove unnecessary interrupt-parent property from the bindings

LED:
- add a missing dependency on LEDS_CLASS to Kconfig

Onkey:
- use boolean for the slide button property

Charger:
- fix the property names in DT example
- make constants even more readable

v6 -> v7:
=

Charger:
- rename the current limit property to current-limit-microamp

Bartosz Golaszewski (11):
  dt-bindings: mfd: add DT bindings for max77650
  dt-bindings: power: supply: add DT bindings for max77650
  dt-bindings: leds: add DT bindings for max77650
  dt-bindings: input: add DT bindings for max77650
  mfd: core: document mfd_add_devices()
  mfd: max77650: new core mfd driver
  power: supply: max77650: add support for battery charger
  gpio: max77650: add GPIO support
  leds: max77650: add LEDs support
  input: max77650: add onkey support
  MAINTAINERS: add an entry for max77650 mfd driver

 .../bindings/input/max77650-onkey.txt |  26 ++
 .../bindings/leds/leds-max77650.txt   |  57 +++
 .../devicetree/bindings/mfd/max77650.txt  |  46 +++
 .../power/supply/max77650-charger.txt |  27 ++
 MAINTAINERS   |  14 +
 drivers/gpio/Kconfig  |   7 +
 drivers/gpio/Makefile |   1 +
 drivers/gpio/gpio-max77650.c  | 190 +
 

[PATCH v7 01/11] dt-bindings: mfd: add DT bindings for max77650

2019-03-26 Thread Bartosz Golaszewski
From: Bartosz Golaszewski 

Add a DT binding document for max77650 ultra-low power PMIC. This
describes the core mfd device and the GPIO module.

Signed-off-by: Bartosz Golaszewski 
Reviewed-by: Rob Herring 
Acked-by: Pavel Machek 
---
 .../devicetree/bindings/mfd/max77650.txt  | 46 +++
 1 file changed, 46 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/mfd/max77650.txt

diff --git a/Documentation/devicetree/bindings/mfd/max77650.txt 
b/Documentation/devicetree/bindings/mfd/max77650.txt
new file mode 100644
index ..b529d8d19335
--- /dev/null
+++ b/Documentation/devicetree/bindings/mfd/max77650.txt
@@ -0,0 +1,46 @@
+MAX77650 ultra low-power PMIC from Maxim Integrated.
+
+Required properties:
+---
+- compatible:  Must be "maxim,max77650"
+- reg: I2C device address.
+- interrupts:  The interrupt on the parent the controller is
+   connected to.
+- interrupt-controller: Marks the device node as an interrupt controller.
+- #interrupt-cells:Must be <2>.
+
+- gpio-controller: Marks the device node as a gpio controller.
+- #gpio-cells: Must be <2>. The first cell is the pin number and
+   the second cell is used to specify the gpio active
+   state.
+
+Optional properties:
+
+gpio-line-names:   Single string containing the name of the GPIO line.
+
+The GPIO-controller module is represented as part of the top-level PMIC
+node. The device exposes a single GPIO line.
+
+For device-tree bindings of other sub-modules (regulator, power supply,
+LEDs and onkey) refer to the binding documents under the respective
+sub-system directories.
+
+For more details on GPIO bindings, please refer to the generic GPIO DT
+binding document .
+
+Example:
+
+
+   pmic@48 {
+   compatible = "maxim,max77650";
+   reg = <0x48>;
+
+   interrupt-controller;
+   interrupt-parent = <>;
+   #interrupt-cells = <2>;
+   interrupts = <3 IRQ_TYPE_LEVEL_LOW>;
+
+   gpio-controller;
+   #gpio-cells = <2>;
+   gpio-line-names = "max77650-charger";
+   };
-- 
2.20.1



Re: [PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-03-26 Thread Joel Savitz
Ping!

Does anyone have any comments or concerns about this patch?

Best,
Joel Savitz

Best,
Joel Savitz


On Thu, Mar 7, 2019 at 9:42 AM Joel Savitz  wrote:
>
> On Wed, Mar 6, 2019 at 7:55 PM Joel Savitz  wrote:
> >
> > If a process is limited by taskset (i.e. cpuset) to only be allowed to
> > run on cpu N, and then cpu N is offlined via hotplug, the process will
> > be assigned the current value of its cpuset cgroup's effective_cpus field
> > in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
> > This argument's value does not make sense for this case, because
> > task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn()
> > to reflect the new value of cpu_active_mask after cpu N is removed from
> > the mask. While this may make sense for the cgroup affinity mask, it
> > does not make sense on a per-task basis, as a task that was previously
> > limited to only be run on cpu N will be limited to every cpu _except_ for
> > cpu N after it is offlined/onlined via hotplug.
> >
> > Pre-patch behavior:
> >
> > $ grep Cpus /proc/$$/status
> > Cpus_allowed:   ff
> > Cpus_allowed_list:  0-7
> >
> > $ taskset -p 4 $$
> > pid 19202's current affinity mask: f
> > pid 19202's new affinity mask: 4
> >
> > $ grep Cpus /proc/self/status
> > Cpus_allowed:   04
> > Cpus_allowed_list:  2
> >
> > # echo off > /sys/devices/system/cpu/cpu2/online
> > $ grep Cpus /proc/$$/status
> > Cpus_allowed:   0b
> > Cpus_allowed_list:  0-1,3
> >
> > # echo on > /sys/devices/system/cpu/cpu2/online
> > $ grep Cpus /proc/$$/status
> > Cpus_allowed:   0b
> > Cpus_allowed_list:  0-1,3
> >
> > On a patched system, the final grep produces the following
> > output instead:
> >
> > $ grep Cpus /proc/$$/status
> > Cpus_allowed:   ff
> > Cpus_allowed_list:  0-7
> >
> > This patch changes the above behavior by instead simply resetting the mask
> > to cpu_possible_mask.
> >
> > Signed-off-by: Joel Savitz 
> > ---
> >  kernel/cgroup/cpuset.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 479743db6c37..5f65a2167bdf 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -3243,7 +3243,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, 
> > struct cpumask *pmask)
> >  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> >  {
> > rcu_read_lock();
> > -   do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
> > +   do_set_cpus_allowed(tsk, cpu_possible_mask);
> > rcu_read_unlock();
> >
> > /*
> > --
> > 2.20.1
> >


Re: [RESEND PATCH v1] moduleparam: Save information about built-in modules in separate file

2019-03-26 Thread Alexey Gladkov
On Fri, Mar 22, 2019 at 02:34:12PM +0900, Masahiro Yamada wrote:
> Hi.
> 
> (added some people to CC)
> 
> 
> On Fri, Mar 15, 2019 at 7:10 PM Alexey Gladkov  
> wrote:
> >
> > Problem:
> >
> > When a kernel module is compiled as a separate module, some important
> > information about the kernel module is available via .modinfo section of
> > the module.  In contrast, when the kernel module is compiled into the
> > kernel, that information is not available.
> 
> 
> I might be missing something, but
> vmlinux provides info of builtin modules
> in /sys/module/.

No. Definitely not all modules are there. I have a builtin sha256_generic,
but I can't find it in /sys/module.

> (Looks like currently only module_param and MODULE_VERSION)
> 
> This patch is not exactly the same, but I see a kind of overwrap.
> I'd like to be sure if we want this new scheme.

The /sys/module directory only covers the running kernel. One of my use
cases is to create an initrd for a new kernel.

> 
> > Information about built-in modules is necessary in the following cases:
> >
> > 1. When it is necessary to find out what additional parameters can be
> > passed to the kernel at boot time.
> 
> 
> Actually, /sys/module//parameters/
> exposes this information.
> 
> Doesn't it work for your purpose?

No, since creating an initrd requires knowing all the module aliases before
I have the sysfs of the new kernel. Besides, there are no modalias entries
there at all.

> > 2. When you need to know which module names and their aliases are in
> > the kernel. This is very useful for creating an initrd image.
> >
> > Proposal:
> >
> > The proposed patch does not remove .modinfo section with module
> > information from the vmlinux at the build time and saves it into a
> > separate file after kernel linking. So, the kernel does not increase in
> > size and no additional information remains in it. Information is stored
> > in the same format as in the separate modules (null-terminated string
> > array). Because the .modinfo section is already exported with a separate
> > modules, we are not creating a new API.
> >
> > It can be easily read in the userspace:
> >
> > $ tr '\0' '\n' < kernel.builtin.modinfo
> > ext4.softdep=pre: crc32c
> > ext4.license=GPL
> > ext4.description=Fourth Extended Filesystem
> > ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, 
> > Theodore Ts'o and others
> > ext4.alias=fs-ext4
> > ext4.alias=ext3
> > ext4.alias=fs-ext3
> > ext4.alias=ext2
> > ext4.alias=fs-ext2
> > md_mod.alias=block-major-9-*
> > md_mod.alias=md
> > md_mod.description=MD RAID framework
> > md_mod.license=GPL
> > md_mod.parmtype=create_on_open:bool
> > md_mod.parmtype=start_dirty_degraded:int
> > ...
> >
> > Co-Developed-by: Gleb Fotengauer-Malinovskiy 
> > Signed-off-by: Gleb Fotengauer-Malinovskiy 
> > Signed-off-by: Alexey Gladkov 
> > ---
> >  Makefile|  1 +
> >  include/linux/moduleparam.h | 12 +---
> >  scripts/link-vmlinux.sh |  8 
> >  3 files changed, 14 insertions(+), 7 deletions(-)
> >
> > diff --git a/Makefile b/Makefile
> > index d5713e7b1e50..971102194c92 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -1288,6 +1288,7 @@ _modinst_:
> > fi
> > @cp -f $(objtree)/modules.order $(MODLIB)/
> > @cp -f $(objtree)/modules.builtin $(MODLIB)/
> > +   @cp -f $(objtree)/kernel.builtin.modinfo $(MODLIB)/
> > $(Q)$(MAKE) -f $(srctree)/scripts/Makefile.modinst
> >
> >  # This depmod is only for convenience to give the initial
> > diff --git a/include/linux/moduleparam.h b/include/linux/moduleparam.h
> > index ba36506db4fb..5ba250d9172a 100644
> > --- a/include/linux/moduleparam.h
> > +++ b/include/linux/moduleparam.h
> > @@ -10,23 +10,21 @@
> > module name. */
> >  #ifdef MODULE
> >  #define MODULE_PARAM_PREFIX /* empty */
> > +#define __MODULE_INFO_PREFIX /* empty */
> >  #else
> >  #define MODULE_PARAM_PREFIX KBUILD_MODNAME "."
> > +/* We cannot use MODULE_PARAM_PREFIX because some modules override it. */
> > +#define __MODULE_INFO_PREFIX KBUILD_MODNAME "."
> >  #endif
> >
> >  /* Chosen so that structs with an unsigned long line up. */
> >  #define MAX_PARAM_PREFIX_LEN (64 - sizeof(unsigned long))
> >
> > -#ifdef MODULE
> >  #define __MODULE_INFO(tag, name, info)   \
> >  static const char __UNIQUE_ID(name)[]\
> >__used __attribute__((section(".modinfo"), unused, aligned(1)))\
> > -  = __stringify(tag) "=" info
> > -#else  /* !MODULE */
> > -/* This struct is here for syntactic coherency, it is not used */
> > -#define __MODULE_INFO(tag, name, info)   \
> > -  struct __UNIQUE_ID(name) {}
> > -#endif
> > +  = __MODULE_INFO_PREFIX __stringify(tag) "=" info
> > +
> >  #define __MODULE_PARM_TYPE(name, _type)
> >   \
> >__MODULE_INFO(parmtype, name##type, #name ":" _type)
> >
> > diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
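
For reference, a minimal userspace sketch of consuming the kernel.builtin.modinfo
format described above (a flat blob of NUL-terminated "modname.key=value" records).
The path and kernel release below are placeholders, not taken from the patch; the
file is installed to $(MODLIB)/ per the Makefile hunk.

#include <stdio.h>
#include <string.h>

/*
 * Illustrative only: print each NUL-terminated "modname.key=value"
 * record on its own line, equivalent to the `tr '\0' '\n'` example.
 */
static void print_builtin_modinfo(const char *buf, size_t len)
{
	const char *p = buf;

	while (p < buf + len) {
		if (*p)
			puts(p);		/* one "modname.key=value" record */
		p += strlen(p) + 1;		/* step over the record and its NUL */
	}
}

int main(void)
{
	/* Placeholder path; adjust to the installed $(MODLIB) location. */
	FILE *f = fopen("/lib/modules/5.1.0/kernel.builtin.modinfo", "rb");
	static char buf[1 << 20];
	size_t len;

	if (!f)
		return 1;
	len = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[len] = '\0';
	print_builtin_modinfo(buf, len);
	return 0;
}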

Re: [PATCH 08/17] x86/speculation: Fix __initconst in bugs.c

2019-03-26 Thread Thomas Gleixner
On Thu, 21 Mar 2019, Andi Kleen wrote:

Finally found a fix in the pile of unrelated crap.

Subject: x86/speculation: Fix __initconst in bugs.c

So how does this fix __initconst?

Again from Documentation:

  "The summary phrase in the email’s Subject should concisely describe the
  patch which that email contains. The summary phrase should not be a
  filename."

Also instead of having the file name in the subject line, which is
completely uninteresting the prefix should be 'x86/cpu/bugs:'

> Fix some of the recently added const tables to use __initconst
> for const data instead of __initdata which causes section attribute
> conflicts.
> 
> Signed-off-by: Andi Kleen 

This lacks a 'Fixes:' tag.

Thanks,

tglx

[PATCH v2 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2019-03-26 Thread Alexandru Gagniuc
This started as a nudge from Keith, who pointed out that it doesn't make sense
to disable AER services when only one device has a FIRMWARE_FIRST HEST.

I won't re-phrase the points in the original patch [1]. The patch started a
long discussion in the ACPI Software Working Group (ASWG). The nearly unanimous
conclusion is that my original interpretation is correct.

I'd like to quote one of the tables that was produced as part of that
conversation:

(_OSC AER Control, HEST AER Structure FFS) = (0, 0)
* OSPM is prevented from writing to the PCI Express AER registers.
* OSPM has no guidance on how AER errors are being handled – but it
  does know that it is not in control of AER registers. PCI-e errors
  that make it to the OS (via NMI, etc) would be treated as spurious
  since access to the AER registers isn’t allowed for proper sourcing.


(_OSC AER Control, HEST AER Structure FFS) = (0, 1)
* OSPM is prevented from writing to the PCI Express AER registers.
* OSPM is being given guidance that Firmware is handling AER errors and
  those interrupts are routed to the platform. Firmware may pass along
  error information via GHES


(_OSC AER Control, HEST AER Structure FFS) = (0, Does not exist)
* OSPM is prevented from writing to the PCI Express AER registers.
* OSPM has no guidance on how AER errors are being handled – but it
  does know that it is not in control of AER registers. PCI-e errors
  that make it to the OS (via NMI, etc) would be treated as spurious
  since access to the AER registers isn’t allowed for proper sourcing.

(_OSC AER Control, HEST AER Structure FFS) = (1, 0)
* OSPM is in control of writing to the PCI Express AER registers.
* OSPM is being given guidance that AER errors will interrupt the OS
  directly and that the OS is expected to handle all AER capability
  structure read/clears for the devices with this attribute (or all if
  the Global Bit is set.)

(_OSC AER Control, HEST AER Structure FFS) = (1, 1)
* OSPM is in control of writing to the PCI Express AER registers.
* OSPM is being given guidance that although OS is in control of AER
  read/writes – the actual interrupt is being routed to the platform
  first.
* Subsequent fields with masks/enables should be performed by the OS
  during initialization on behalf of firmware. These are to be honoured
  in this mode because with FF, the firmware needs to be able to handle
  the errors it expects and not be given errors it was not expecting to
  handle.
* Firmware may pass along error information via GHES, or generate an OS
  interrupt and allow the OS to interrogate AER status directly via the
  AER capability structures.


(_OSC AER Control, HEST AER Structure FFS) = (0, Does not exist)
* OSPM is in control of writing to the PCI Express AER registers.
* OSPM has no guidance from the platform and is in complete control of
  AER error handling.


There may be one caveat. Someone mentioned in the original discussions that
there may exist machines which make the assumption that HEST is authoritative,
but did not identify any such machine. We should keep in mind that they may
require a quirk.
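
Distilled into code, the table above amounts to a very small rule. A sketch only;
the names here are invented for illustration and are not kernel symbols:

#include <stdbool.h>

enum aer_signal { AER_SIGNAL_NATIVE, AER_SIGNAL_PLATFORM_FIRST };

/*
 * _OSC is the negotiation that decides ownership: if the OS was not
 * granted AER control it must not touch the AER registers, no matter
 * what HEST says.
 */
static bool os_owns_aer(bool osc_aer_control_granted)
{
	return osc_aer_control_granted;
}

/*
 * The HEST FFS bit only describes how errors are signalled: with FFS
 * set, errors are routed to the platform first and may be relayed to
 * the OS via GHES.
 */
static enum aer_signal aer_signal_path(bool hest_ffs)
{
	return hest_ffs ? AER_SIGNAL_PLATFORM_FIRST : AER_SIGNAL_NATIVE;
}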

Alex


[1] https://lkml.org/lkml/2018/11/16/202

Changes since v1:
 * Started 6-month conversation in ASWG
 * Re-phrased commit message to reflect some of the points in ASWG discussion

Alexandru Gagniuc (2):
  PCI/AER: Do not use APEI/HEST to disable AER services globally
  PCI/AER: Determine AER ownership based on _OSC instead of HEST

 drivers/acpi/pci_root.c  |  9 +
 drivers/pci/pcie/aer.c   | 82 ++--
 include/linux/pci-acpi.h |  6 ---
 3 files changed, 5 insertions(+), 92 deletions(-)

-- 
2.19.2



[PATCH] ASoC: core: Fix use-after-free after deferred card registration

2019-03-26 Thread Guenter Roeck
If snd_soc_register_card() fails because one of its links fails
to instantiate with -EPROBE_DEFER, and the to-be-registered link
is a legacy link, a subsequent retry will trigger a use-after-free
and quite often a system crash.

Example:

byt-max98090 byt-max98090: ASoC: failed to init link Baytrail Audio
byt-max98090 byt-max98090: snd_soc_register_card failed -517

BUG: KASAN: use-after-free in snd_soc_init_platform+0x233/0x312
Read of size 8 at addr 888067c43070 by task kworker/1:1/23

snd_soc_init_platform() allocates memory attached to the card device.
This memory is released when the card device is released. However,
the pointer to the memory (dai_link->platforms) is only cleared from
soc_cleanup_platform(), which is called from soc_cleanup_card_resources(),
but not if snd_soc_register_card() fails early.

Add the missing call to soc_cleanup_platform() in the error handling
code of snd_soc_register_card() to fix the problem.

Fixes: 78a24e10cd94 ("ASoC: soc-core: clear platform pointers on error")
Cc: Curtis Malainey 
Signed-off-by: Guenter Roeck 
---
 sound/soc/soc-core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
index 93d316d5bf8e..6bf9884d0863 100644
--- a/sound/soc/soc-core.c
+++ b/sound/soc/soc-core.c
@@ -2799,6 +2799,7 @@ int snd_soc_register_card(struct snd_soc_card *card)
if (ret) {
dev_err(card->dev, "ASoC: failed to init link %s\n",
link->name);
+   soc_cleanup_platform(card);
mutex_unlock(_mutex);
return ret;
}
-- 
2.7.4



Re: [PATCH v6 3/3] arm64: dts: Add USB DT nodes for Stingray SoC

2019-03-26 Thread Florian Fainelli
On Tue, 19 Mar 2019 14:45:44 +0530, Srinath Mannam 
 wrote:
> Add DT nodes for
>   - Two xHCI host controllers
>   - Two BDC Broadcom USB device controller
>   - Five USB PHY controllers
> 
> [xHCI0]  [BDC0][xHCI1][BDC1]
>|   |  | |
>   ---   ---
>|   | | | |
> [SS-PHY0]   [HS-PHY0][SS-PHY1] [HS-PHY2] [HS-PHY1]
> 
> [SS-PHY0/HS-PHY0] and [SS-PHY1/HS-PHY1] are combo PHYs has one SS and
> one HS PHYs. [HS-PHY2] is a single HS PHY.
> 
> xHCI use SS-PHY to detect SS devices and HS-PHY to detect HS/FS/LS
> devices. BDC use SS-PHY in SS mode and HS-PHY in HS mode.
> 
> xHCI0 port1 is SS-PHY0, port2 is HS-PHY0.
> xHCI1 port1 is SS-PHY1, port2 is HS-PHY2 and port3 is HS-PHY1.
> 
> Signed-off-by: Srinath Mannam 
> ---

Applied to devicetree-arm64/next, thanks!
--
Florian


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Christian Brauner
On Tue, Mar 26, 2019 at 01:06:01PM -0400, Joel Fernandes wrote:
> On Tue, Mar 26, 2019 at 04:55:11PM +0100, Christian Brauner wrote:
> > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > I quote Konstantins original patchset first that has already been acked and
> > picked up by Eric before and whose functionality is preserved in this
> > syscall:
> > 
> > "Each process have different pids, one for each pid namespace it belongs.
> >  When interaction happens within single pid-ns translation isn't required.
> >  More complicated scenarios needs special handling.
> > 
> >  For example:
> >  - reading pid-files or logs written inside container with pid namespace
> >  - writing logs with internal pids outside container for pushing them into
> >  - attaching with ptrace to tasks from different pid namespace
> > 
> >  Generally speaking, any cross pid-ns API with pids needs translation.
> > 
> >  Currently there are several interfaces that could be used here:
> > 
> >  Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.
> > 
> >  Pids for nested pid namespaces are shown in file /proc/[pid]/status.
> >  In some cases pid translation could be easily done using this information.
> >  Backward translation requires scanning all tasks and becomes really
> >  complicated for deeper namespace nesting.
> > 
> >  Unix socket automatically translates pid attached to SCM_CREDENTIALS.
> >  This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
> >  into pid namespace, this expose process and could be insecure."
> > 
> > The original patchset allowed two distinct operations implicitly:
> > - discovering whether pid namespaces (pidns) have a parent-child
> >   relationship
> > - translating a pid from a source pidns into a target pidns
> > 
> > Both tasks are accomplished in the original patchset by passing a pid
> > along. If the pid argument is passed as 1 the relationship between two pid
> > namespaces can be discovered.
> > The syscall will gain a lot clearer syntax and will be easier to use for
> > userspace if the task it is asked to perform is passed through a
> > command argument. Additionally, it allows us to remove an intrinsic race
> > caused by using the pid argument as a way to discover the relationship
> > between pid namespaces.
> > This patch introduces three commands:
> > 
> > /* PIDCMD_QUERY_PID */
> > PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
> > Given a source pid namespace fd return the pid of the process in the target
> > namespace:
> 
> Could we call this PIDCMD_TRANSLATE_PID please? QUERY is confusing/ambigious
> IMO (see below).

Yes, doesn't matter to me too much what we call it.

> 
> > 1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> >   - retrieve callers pidns
> >   - return pid in callers pidns
> > 
> > 2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
> >   - retrieve callers pidns
> >   - retrieve struct pid identifed by pid in callers pidns
> >   - retrieve pidns identified by target_fd
> >   - return pid in pidns identified by target_fd
> > 
> > 3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by init task in pidns identified by 
> > source_fd
> >   - retrieve callers pidns
> >   - return pid of init task of pidns identified by source_fd in callers 
> > pidns
> > 
> > 4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> >   - retrieve pidns identified by target_fd
> >   - check whether struct pid can be found in pidns identified by target_fd
> >   - return pid in pidns identified by target_fd
> > 
> > /* PIDCMD_QUERY_PIDNS */
> > PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
> > namespaces.
> > In the original version of the patchset passing pid as 1 would allow to
> > determine the relationship between the pid namespaces. This is inherently
> > racy. If pid 1 inside a pid namespace has died it would report false
> > negatives. For example, if pid 1 inside of the target pid namespace already
> > died, it would report that the target pid namespace cannot be reached from
> > the source pid namespace because it couldn't find the pid inside of the
> > target pid namespace and thus falsely report to the user that the two pid
> > namespaces are not related. This problem is simple to avoid. In the new
> > version we simply walk the list of ancestors and check whether the
> > namespace are related to each other. By doing it this way we can reliably
> > report what the relationship between two pid namespace file descriptors
> > looks like.
> > 
> > 1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
> >- pidns_of(ns_fd1) and pidns_of(ns_fd2) 

Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Joel Fernandes
On Tue, Mar 26, 2019 at 1:17 PM Christian Brauner  wrote:
>
> On Tue, Mar 26, 2019 at 01:15:25PM -0400, Joel Fernandes wrote:
> > On Tue, Mar 26, 2019 at 06:08:28PM +0100, Christian Brauner wrote:
> > [snip]
> > > >
> > > > > +
> > > > > +   if (!result)
> > > > > +   result = -ENOENT;
> > > > > +
> > > > > +   put_pid(struct_pid);
> > > >
> > > > so on error you would put_pid twice which seems odd..  I would suggest, 
> > > > don't
> > > > release the pid ref from within pidfd_create_fd, release the ref from 
> > > > the
> > > > caller. Speaking of which, I added to my list to convert the pid->count 
> > > > to
> > > > refcount_t at some point :)
> > >
> > > as i said, pidfd_create_fd takes its own reference
> >
> > Oh. That was easy to miss. Fair enough. I take that comment back.
> >
> > Please also reply to the other comments I posted, thanks. Generally on LKML,
> > I have seen there is an expectation to reply to all reviewer's review
> > comments even if you agree with them. This helps keep the review going
> > smoothly. Just my 2 cents.
>
> I tend to do it in multiple mails depending on whether or not I need to
> think about a comment or not.

Ok, that's also fine with me. thanks,

 - Joel


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Christian Brauner
On Tue, Mar 26, 2019 at 01:15:25PM -0400, Joel Fernandes wrote:
> On Tue, Mar 26, 2019 at 06:08:28PM +0100, Christian Brauner wrote:
> [snip]
> > > > +   struct pid *struct_pid;
> > > > +   pid_t result;
> > > > +
> > > > +   if (flags)
> > > > +   return -EINVAL;
> > > > +
> > > > +   switch (cmd) {
> > > > +   case PIDCMD_QUERY_PID:
> > > > +   break;
> > > > +   case PIDCMD_QUERY_PIDNS:
> > > > +   if (pid)
> > > > +   return -EINVAL;
> > > > +   break;
> > > > +   case PIDCMD_GET_PIDFD:
> > > > +   break;
> > > > +   default:
> > > > +   return -EOPNOTSUPP;
> > > > +   }
> > > > +
> > > > +   source_ns = get_pid_ns_by_fd(source);
> > > > +   if (IS_ERR(source_ns))
> > > > +   return PTR_ERR(source_ns);
> > > > +
> > > > +   target_ns = get_pid_ns_by_fd(target);
> > > > +   if (IS_ERR(target_ns)) {
> > > > +   put_pid_ns(source_ns);
> > > > +   return PTR_ERR(target_ns);
> > > > +   }
> > > > +
> > > > +   if (cmd == PIDCMD_QUERY_PIDNS) {
> > > > +   result = pidns_related(source_ns, target_ns);
> > > > +   } else {
> > > > +   rcu_read_lock();
> > > > +   struct_pid = get_pid(find_pid_ns(pid, source_ns));
> > > > +   rcu_read_unlock();
> > > > +
> > > > +   if (struct_pid)
> > > > +   result = pid_nr_ns(struct_pid, target_ns);
> > > > +   else
> > > > +   result = -ESRCH;
> > > > +
> > > > +   if (cmd == PIDCMD_GET_PIDFD && (result > 0))
> > > > +   result = pidfd_create_fd(struct_pid, O_CLOEXEC);
> > > 
> > > pidfd_create_fd already does put_pid on errors..
> > 
> > it also takes its own reference
> > 
> > > 
> > > > +
> > > > +   if (!result)
> > > > +   result = -ENOENT;
> > > > +
> > > > +   put_pid(struct_pid);
> > > 
> > > so on error you would put_pid twice which seems odd..  I would suggest, 
> > > don't
> > > release the pid ref from within pidfd_create_fd, release the ref from the
> > > caller. Speaking of which, I added to my list to convert the pid->count to
> > > refcount_t at some point :)
> > 
> > as i said, pidfd_create_fd takes its own reference
> 
> Oh. That was easy to miss. Fair enough. I take that comment back.
> 
> Please also reply to the other comments I posted, thanks. Generally on LKML,
> I have seen there is an expectation to reply to all reviewer's review
> comments even if you agree with them. This helps keep the review going
> smoothly. Just my 2 cents.

I tend to do it in multiple mails depending on whether or not I need to
think about a comment or not.


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Joel Fernandes
On Tue, Mar 26, 2019 at 06:08:28PM +0100, Christian Brauner wrote:
[snip]
> > > + struct pid *struct_pid;
> > > + pid_t result;
> > > +
> > > + if (flags)
> > > + return -EINVAL;
> > > +
> > > + switch (cmd) {
> > > + case PIDCMD_QUERY_PID:
> > > + break;
> > > + case PIDCMD_QUERY_PIDNS:
> > > + if (pid)
> > > + return -EINVAL;
> > > + break;
> > > + case PIDCMD_GET_PIDFD:
> > > + break;
> > > + default:
> > > + return -EOPNOTSUPP;
> > > + }
> > > +
> > > + source_ns = get_pid_ns_by_fd(source);
> > > + if (IS_ERR(source_ns))
> > > + return PTR_ERR(source_ns);
> > > +
> > > + target_ns = get_pid_ns_by_fd(target);
> > > + if (IS_ERR(target_ns)) {
> > > + put_pid_ns(source_ns);
> > > + return PTR_ERR(target_ns);
> > > + }
> > > +
> > > + if (cmd == PIDCMD_QUERY_PIDNS) {
> > > + result = pidns_related(source_ns, target_ns);
> > > + } else {
> > > + rcu_read_lock();
> > > + struct_pid = get_pid(find_pid_ns(pid, source_ns));
> > > + rcu_read_unlock();
> > > +
> > > + if (struct_pid)
> > > + result = pid_nr_ns(struct_pid, target_ns);
> > > + else
> > > + result = -ESRCH;
> > > +
> > > + if (cmd == PIDCMD_GET_PIDFD && (result > 0))
> > > + result = pidfd_create_fd(struct_pid, O_CLOEXEC);
> > 
> > pidfd_create_fd already does put_pid on errors..
> 
> it also takes its own reference
> 
> > 
> > > +
> > > + if (!result)
> > > + result = -ENOENT;
> > > +
> > > + put_pid(struct_pid);
> > 
> > so on error you would put_pid twice which seems odd..  I would suggest, 
> > don't
> > release the pid ref from within pidfd_create_fd, release the ref from the
> > caller. Speaking of which, I added to my list to convert the pid->count to
> > refcount_t at some point :)
> 
> as i said, pidfd_create_fd takes its own reference

Oh. That was easy to miss. Fair enough. I take that comment back.

Please also reply to the other comments I posted, thanks. Generally on LKML,
I have seen there is an expectation to reply to all reviewer's review
comments even if you agree with them. This helps keep the review going
smoothly. Just my 2 cents.

thanks,

 - Joel



Re: [PATCH] arm64: dts: msm8998: Add UFS phy reset

2019-03-26 Thread Marc Gonzalez
On 26/03/2019 18:05, Evan Green wrote:

> With the new refactoring at [1], the UFS phy now controls its own
> destiny in toggling the phy reset bit within the UFS host controller.
> Add the DT pieces needed to 1) expose the reset controller from the
> HC, and 2) use it from the PHY. This series is based atop linux-next
> plus Marc's series at [2].
> 
> Signed-off-by: Evan Green 
> 
> [1] https://lore.kernel.org/lkml/20190321171800.104681-1-evgr...@chromium.org/
> [2] https://lore.kernel.org/lkml/43768d77-80b7-9cdc-b6e0-08ec4a026...@free.fr/
> 
> ---
> I haven't tested this. Marc, I'm hoping you'll test this out and hijack this
> patch if it needs any fixups.
> 
>  arch/arm64/boot/dts/qcom/msm8998.dtsi | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/qcom/msm8998.dtsi 
> b/arch/arm64/boot/dts/qcom/msm8998.dtsi
> index 3d0aeb3211de..d59a2c5fe83a 100644
> --- a/arch/arm64/boot/dts/qcom/msm8998.dtsi
> +++ b/arch/arm64/boot/dts/qcom/msm8998.dtsi
> @@ -990,6 +990,7 @@
>   interrupts = ;
>   phys = <_lanes>;
>   phy-names = "ufsphy";
> + #reset-cells = <1>;
>   lanes-per-direction = <2>;
>   power-domains = < UFS_GDSC>;
>  
> @@ -1039,6 +1040,7 @@
>   < GCC_UFS_CLKREF_CLK>,
>   < GCC_UFS_PHY_AUX_CLK>;
>  
> + resets = < 0>;
>   ufsphy_lanes: lanes@1da7400 {
>   reg = <0x01da7400 0x128>,
> <0x01da7600 0x1fc>,
> 

If it's OK with you, I plan to test this patch tomorrow, and simply squash it
into my UFS DT submission.

Regards.


Re: [PATCH v19,RESEND 24/27] x86/vdso: Add __vdso_sgx_enter_enclave() to wrap SGX enclave transitions

2019-03-26 Thread Andy Lutomirski
On Mon, Mar 25, 2019 at 9:53 PM Xing, Cedric  wrote:
>
> > On Mon, Mar 25, 2019 at 11:03 AM Sean Christopherson
> >  wrote:
> > >
> > > On Sun, Mar 24, 2019 at 01:59:48AM -0700, Xing, Cedric wrote:
> > > > As said in my previous email, this vDSO API isn't even compliant to
> > > > x86_64 ABI and is absolutely NOT for average developers. Instead,
> > > > host/enclave communications are expected to be handled by SDKs and
> > > > those developers will be very aware of the limitations of their
> > > > targeted environments, and will need the freedom to deploy optimal
> > solutions.
> >
> > > I fully realize that the above approach saddles Cedric and the SDK
> > > team with the extra task of justifying the need for two vDSO
> > > interfaces, and likely reduces the probability of their proposal being
> > > accepted.  But, we don't *force* the SDK to be rewritten, and we gain
> > > a vDSO interface that many people want and is acceptable to the
> > > maintainers (unless I've horribly misread Andy's position).
> >
> > I don't think you've horribly misread it.  I would like to keep the
> > stuff in the vDSO as minimal as possible.  If we need to add a fancier
> > interface down the line, then that's fine.
>
> Andy, I don't know "many people" is how many in Sean's email. I couldn't tell 
> you how long it took us to settle on the current SGX ISA because you would 
> just LAUGH! Why? Because it took insanely ridiculously long. Why that long? 
> Because the h/w and u-code teams would like to trim down the ISA as much as 
> possible. The fact is, whatever new is released, Intel will have to maintain 
> it on all future processors FOREVER! That means significant/on-going cost to 
> Intel. So any addition to ISA has to undergo extensive reviews that involve 
> all kinds of experts from both within Intel and externally, and would usually 
> take years, before you can see what you are seeing today. As I said in my 
> earlier emails, RBP is NOT needed for interrupt/exception handlers, then how 
> did RBP end up being restored at AEX? You can guess how many people were 
> standing behind it! Sean has no clue! I can assure you!
>
> Guess we've talked enough on the technical front. So let's talk about it on 
> the business front. Let me take a step back. Let's say you are right, all 
> enclaves would eventually be coded in the way you want. We (Intel SDK team) 
> were convinced to follow your approach. But there were existing enclaves and 
> a migration path would be needed. Would you like to support us? It'd be only 
> 9 LOC on your side but how much would incur on our side? If you believe you 
> are doing right thing, then acceptance is the next thing you should think of. 
> You should offer an easy path for those who did "wrong" to get on your 
> "right" boat. Don't you think so?
>

I suppose the real question is: are there a significant number of
users who will want to run enclaves created using an old SDK on Linux?
 And will there actually be support for doing this in the software
stack?

If the answer to both questions is yes, then it seems like it could be
reasonable to support it in the vDSO.  But I still think it should
probably be a different vDSO entry point so that the normal case
doesn't become more complicated.


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Christian Brauner
On Tue, Mar 26, 2019 at 01:06:01PM -0400, Joel Fernandes wrote:
> On Tue, Mar 26, 2019 at 04:55:11PM +0100, Christian Brauner wrote:
> > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > I quote Konstantins original patchset first that has already been acked and
> > picked up by Eric before and whose functionality is preserved in this
> > syscall:
> > 
> > "Each process have different pids, one for each pid namespace it belongs.
> >  When interaction happens within single pid-ns translation isn't required.
> >  More complicated scenarios needs special handling.
> > 
> >  For example:
> >  - reading pid-files or logs written inside container with pid namespace
> >  - writing logs with internal pids outside container for pushing them into
> >  - attaching with ptrace to tasks from different pid namespace
> > 
> >  Generally speaking, any cross pid-ns API with pids needs translation.
> > 
> >  Currently there are several interfaces that could be used here:
> > 
> >  Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.
> > 
> >  Pids for nested pid namespaces are shown in file /proc/[pid]/status.
> >  In some cases pid translation could be easily done using this information.
> >  Backward translation requires scanning all tasks and becomes really
> >  complicated for deeper namespace nesting.
> > 
> >  Unix socket automatically translates pid attached to SCM_CREDENTIALS.
> >  This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
> >  into pid namespace, this expose process and could be insecure."
> > 
> > The original patchset allowed two distinct operations implicitly:
> > - discovering whether pid namespaces (pidns) have a parent-child
> >   relationship
> > - translating a pid from a source pidns into a target pidns
> > 
> > Both tasks are accomplished in the original patchset by passing a pid
> > along. If the pid argument is passed as 1 the relationship between two pid
> > namespaces can be discovered.
> > The syscall will gain a lot clearer syntax and will be easier to use for
> > userspace if the task it is asked to perform is passed through a
> > command argument. Additionally, it allows us to remove an intrinsic race
> > caused by using the pid argument as a way to discover the relationship
> > between pid namespaces.
> > This patch introduces three commands:
> > 
> > /* PIDCMD_QUERY_PID */
> > PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
> > Given a source pid namespace fd return the pid of the process in the target
> > namespace:
> 
> Could we call this PIDCMD_TRANSLATE_PID please? QUERY is confusing/ambigious
> IMO (see below).
> 
> > 1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> >   - retrieve callers pidns
> >   - return pid in callers pidns
> > 
> > 2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
> >   - retrieve callers pidns
> >   - retrieve struct pid identifed by pid in callers pidns
> >   - retrieve pidns identified by target_fd
> >   - return pid in pidns identified by target_fd
> > 
> > 3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by init task in pidns identified by 
> > source_fd
> >   - retrieve callers pidns
> >   - return pid of init task of pidns identified by source_fd in callers 
> > pidns
> > 
> > 4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
> >   - retrieve pidns identified by source_fd
> >   - retrieve struct pid identifed by pid in pidns identified by source_fd
> >   - retrieve pidns identified by target_fd
> >   - check whether struct pid can be found in pidns identified by target_fd
> >   - return pid in pidns identified by target_fd
> > 
> > /* PIDCMD_QUERY_PIDNS */
> > PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
> > namespaces.
> > In the original version of the patchset passing pid as 1 would allow to
> > determine the relationship between the pid namespaces. This is inherently
> > racy. If pid 1 inside a pid namespace has died it would report false
> > negatives. For example, if pid 1 inside of the target pid namespace already
> > died, it would report that the target pid namespace cannot be reached from
> > the source pid namespace because it couldn't find the pid inside of the
> > target pid namespace and thus falsely report to the user that the two pid
> > namespaces are not related. This problem is simple to avoid. In the new
> > version we simply walk the list of ancestors and check whether the
> > namespace are related to each other. By doing it this way we can reliably
> > report what the relationship between two pid namespace file descriptors
> > looks like.
> > 
> > 1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
> >- pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other
> > 
> > 2. 

Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Christian Brauner
On Tue, Mar 26, 2019 at 09:50:28AM -0700, Daniel Colascione wrote:
> On Tue, Mar 26, 2019 at 9:44 AM Christian Brauner  
> wrote:
> >
> > On Tue, Mar 26, 2019 at 09:38:31AM -0700, Daniel Colascione wrote:
> > > On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner  
> > > wrote:
> > > >
> > > > On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > > > > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > > > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > > > > Thanks for the patch.
> > > > > > >
> > > > > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > The pidctl() syscalls builds on, extends, and improves 
> > > > > > > > translate_pid() [4].
> > > > > > > > I quote Konstantins original patchset first that has already 
> > > > > > > > been acked and
> > > > > > > > picked up by Eric before and whose functionality is preserved 
> > > > > > > > in this
> > > > > > > > syscall:
> > > > > > >
> > > > > > > We still haven't had a much-needed conversation about splitting 
> > > > > > > this
> > > > > > > system call into smaller logical operations. It's important that 
> > > > > > > we
> > > > > > > address this point before this patch is merged and becomes 
> > > > > > > permanent
> > > > > > > kernel ABI.
> > > > > >
> > > > > > I don't particularly mind splitting this into an additional syscall 
> > > > > > like
> > > > > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > > > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > > > > about this rn is that it connects both apis in a single syscall
> > > > > > and allows pidfd retrieval across pid namespaces. So I guess we'll 
> > > > > > see
> > > > > > what other people think.
> > > > >
> > > > > There's something to be said for
> > > > >
> > > > > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> > > > >
> > > > > /* get pidfd */
> > > > > int pidfd = pidfd_open(1234, -1, 0);
> > > > >
> > > > > /* convert to procfd */
> > > > > int procfd = pidfd_open(-1, 4, 0);
> > > > >
> > > > > /* convert to pidfd */
> > > > > int pidfd = pidfd_open(4, -1, 0);
> > > >
> > > > probably rather:
> > > >
> > > > int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> > > > int procfd = pidfd_open(-1, 4, PROCFD_TO_PIDFD);
> > > > int pidfd = pidfd_open(1234, -1, 0);
> > >
> > > These three operations look like three related but distinct functions
> > > to me, and in the second case, the "pidfd_open" name is a bit of a
> > > misnomer. IMHO, the presence of an "operation name" field in any API
> > > is usually a good indication that we're looking at a family of related
> > > APIs, not a single coherent operation.
> >
> > So I'm happy to accommodate the need for a clean api even though I
> > disagree that what we have in pidctl() is unclean.
> > But I will not start sending a pile of syscalls. There is nothing
> > necessarily wrong with grouping related APIs together.
> 
> In the email I sent just now, I identified several specific technical
> disadvantages arising from unnecessary grouping of system calls. We
> have historical evidence in the form of socketcall that this grouping
> tends to be regrettable. I don't recall your identifying any
> offsetting technical advantages. Did I miss something?
> 
> > By these standards the
> > new mount API would need to be like 30 different syscalls, same for
> > keyring management.
> 
> Can you please point out the problem that would arise from splitting
> the mount and keyring APIs this way? One could have made the same
> argument about grouping socket operations, and this socket-operation
> grouping ended up being a mistake.

The main reason why I am not responding to such mails is that I don't
want long tangents about very generic issues. If you can find support
from people that prefer to split this into three separate syscalls:

pidfd_open()
pidfd_procfd()
procfd_pidfd()

I'm happy to do it this way. But it seems we can find a compromise, e.g.
by having

pidfd_open(pid_t pid, int fd, int fd, unsigned int flags)

and avoid that whole email waterfall.


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Joel Fernandes
On Tue, Mar 26, 2019 at 04:55:11PM +0100, Christian Brauner wrote:
> The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> I quote Konstantins original patchset first that has already been acked and
> picked up by Eric before and whose functionality is preserved in this
> syscall:
> 
> "Each process have different pids, one for each pid namespace it belongs.
>  When interaction happens within single pid-ns translation isn't required.
>  More complicated scenarios needs special handling.
> 
>  For example:
>  - reading pid-files or logs written inside container with pid namespace
>  - writing logs with internal pids outside container for pushing them into
>  - attaching with ptrace to tasks from different pid namespace
> 
>  Generally speaking, any cross pid-ns API with pids needs translation.
> 
>  Currently there are several interfaces that could be used here:
> 
>  Pid namespaces are identified by device and inode of /proc/[pid]/ns/pid.
> 
>  Pids for nested pid namespaces are shown in file /proc/[pid]/status.
>  In some cases pid translation could be easily done using this information.
>  Backward translation requires scanning all tasks and becomes really
>  complicated for deeper namespace nesting.
> 
>  Unix socket automatically translates pid attached to SCM_CREDENTIALS.
>  This requires CAP_SYS_ADMIN for sending arbitrary pids and entering
>  into pid namespace, this expose process and could be insecure."
> 
> The original patchset allowed two distinct operations implicitly:
> - discovering whether pid namespaces (pidns) have a parent-child
>   relationship
> - translating a pid from a source pidns into a target pidns
> 
> Both tasks are accomplished in the original patchset by passing a pid
> along. If the pid argument is passed as 1 the relationship between two pid
> namespaces can be discovered.
> The syscall will gain a lot clearer syntax and will be easier to use for
> userspace if the task it is asked to perform is passed through a
> command argument. Additionally, it allows us to remove an intrinsic race
> caused by using the pid argument as a way to discover the relationship
> between pid namespaces.
> This patch introduces three commands:
> 
> /* PIDCMD_QUERY_PID */
> PIDCMD_QUERY_PID allows to translate a pid between pid namespaces.
> Given a source pid namespace fd return the pid of the process in the target
> namespace:

Could we call this PIDCMD_TRANSLATE_PID please? QUERY is confusing/ambigious
IMO (see below).

> 1. pidctl(PIDCMD_QUERY_PID, pid, source_fd, -1, 0)
>   - retrieve pidns identified by source_fd
>   - retrieve struct pid identifed by pid in pidns identified by source_fd
>   - retrieve callers pidns
>   - return pid in callers pidns
> 
> 2. pidctl(PIDCMD_QUERY_PID, pid, -1, target_fd, 0)
>   - retrieve callers pidns
>   - retrieve struct pid identifed by pid in callers pidns
>   - retrieve pidns identified by target_fd
>   - return pid in pidns identified by target_fd
> 
> 3. pidctl(PIDCMD_QUERY_PID, 1, source_fd, -1, 0)
>   - retrieve pidns identified by source_fd
>   - retrieve struct pid identifed by init task in pidns identified by 
> source_fd
>   - retrieve callers pidns
>   - return pid of init task of pidns identified by source_fd in callers pidns
> 
> 4. pidctl(PIDCMD_QUERY_PID, pid, source_fd, target_fd, 0)
>   - retrieve pidns identified by source_fd
>   - retrieve struct pid identifed by pid in pidns identified by source_fd
>   - retrieve pidns identified by target_fd
>   - check whether struct pid can be found in pidns identified by target_fd
>   - return pid in pidns identified by target_fd
> 
> /* PIDCMD_QUERY_PIDNS */
> PIDCMD_QUERY_PIDNS allows to determine the relationship between pid
> namespaces.
> In the original version of the patchset passing pid as 1 would allow to
> determine the relationship between the pid namespaces. This is inherently
> racy. If pid 1 inside a pid namespace has died it would report false
> negatives. For example, if pid 1 inside of the target pid namespace already
> died, it would report that the target pid namespace cannot be reached from
> the source pid namespace because it couldn't find the pid inside of the
> target pid namespace and thus falsely report to the user that the two pid
> namespaces are not related. This problem is simple to avoid. In the new
> version we simply walk the list of ancestors and check whether the
> namespace are related to each other. By doing it this way we can reliably
> report what the relationship between two pid namespace file descriptors
> looks like.
> 
> 1. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd1, 0) == 0
>- pidns_of(ns_fd1) and pidns_of(ns_fd2) are unrelated to each other
> 
> 2. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 1
>- pidns_of(ns_fd1) == pidns_of(ns_fd2)
> 
> 3. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 2
>- pidns_of(ns_fd1) is ancestor of pidns_of(ns_fd2)
> 
> 4. pidctl(PIDCMD_QUERY_PIDNS, 0, ns_fd1, ns_fd2, 0) == 3
>  
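
To make the calling convention above concrete, a hypothetical userspace sketch. No
syscall number is allocated for pidctl() here and the PIDCMD_* values are invented
for illustration, so as written this would only ever fail with ENOSYS:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Invented values; the real ones would come from the patch's uapi header. */
#define PIDCMD_QUERY_PID	0
#define PIDCMD_QUERY_PIDNS	1
#define PIDCMD_GET_PIDFD	2

/* -1 stands in for the unassigned __NR_pidctl, so the call fails with ENOSYS. */
static long pidctl(unsigned int cmd, pid_t pid, int source, int target,
		   unsigned int flags)
{
	return syscall(-1 /* __NR_pidctl */, cmd, pid, source, target, flags);
}

int main(void)
{
	/* Case 1 above: translate pid 1234 from the pidns behind source_fd
	 * into the caller's pid namespace. */
	int source_fd = open("/proc/1234/ns/pid", O_RDONLY);
	long pid_here = pidctl(PIDCMD_QUERY_PID, 1234, source_fd, -1, 0);

	if (pid_here < 0)
		perror("pidctl");
	else
		printf("pid in caller's namespace: %ld\n", pid_here);
	return 0;
}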

Re: [PATCH 3/4] scripts/gdb: Add rb tree iterating utilities

2019-03-26 Thread Stephen Boyd
Quoting Kieran Bingham (2019-03-26 01:52:10)
> Hi Stephen,
> 
> On 25/03/2019 18:45, Stephen Boyd wrote:
> > Implement gdb functions for rb_first(), rb_last(), rb_next(), and
> > rb_prev(). These can be useful to iterate through the kernel's red-black
> > trees.
> 
> I definitely approve of getting data-structure helpers into scripts/gdb,
> as it will greatly assist debug options but my last attempt to do this
> was with the radix-tree which I had to give up on as the internals were
> changing rapidly and caused continuous breakage to the helpers.

Thanks for the background on radix-tree. I haven't looked at that yet,
but I suppose I'll want to have that too at some point.

> 
> Do you foresee any similar issue here? Or is the corresponding RB code
> in the kernel fairly 'stable'?
> 
> 
> Please could we make sure whomever maintains the RBTree code is aware of
> the python implementation?
> 
> That said, MAINTAINERS doesn't actually seem to list any ownership over
> the rb-tree code, and get_maintainers.pl [0] seems to be pointing at
> Andrew as the probable route in for that code so perhaps that's already
> in place :D

I don't think that the rb tree implementation is going to change. It
feels similar to the list API. I suppose this problem of keeping things
in sync is a more general problem than just data-structures changing.
The only solution I can offer is to have more testing and usage of these
scripts. Unless gdb can "simulate" or run arbitrary code for us then I
think we're stuck reimplementing kernel internal code in gdb scripts so
that we can get debug info out.
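
For context, the in-kernel traversal these gdb helpers have to mirror looks roughly
like this (struct foo and walk_foo_tree are made-up names for the sketch;
rb_first()/rb_next()/rb_entry() are the real API):

#include <linux/printk.h>
#include <linux/rbtree.h>

struct foo {
	struct rb_node node;
	int key;
};

/* Walk the tree in sorted order, recovering the containing object with
 * rb_entry() -- the same pointer arithmetic the gdb scripts reimplement. */
static void walk_foo_tree(struct rb_root *root)
{
	struct rb_node *n;

	for (n = rb_first(root); n; n = rb_next(n)) {
		struct foo *f = rb_entry(n, struct foo, node);

		pr_info("key=%d\n", f->key);
	}
}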



[PATCH] arm64: dts: msm8998: Add UFS phy reset

2019-03-26 Thread Evan Green
With the new refactoring at [1], the UFS phy now controls its own
destiny in toggling the phy reset bit within the UFS host controller.
Add the DT pieces needed to 1) expose the reset controller from the
HC, and 2) use it from the PHY. This series is based atop linux-next
plus Marc's series at [2].

Signed-off-by: Evan Green 

[1] https://lore.kernel.org/lkml/20190321171800.104681-1-evgr...@chromium.org/
[2] https://lore.kernel.org/lkml/43768d77-80b7-9cdc-b6e0-08ec4a026...@free.fr/

---
I haven't tested this. Marc, I'm hoping you'll test this out and hijack this
patch if it needs any fixups.

 arch/arm64/boot/dts/qcom/msm8998.dtsi | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/msm8998.dtsi 
b/arch/arm64/boot/dts/qcom/msm8998.dtsi
index 3d0aeb3211de..d59a2c5fe83a 100644
--- a/arch/arm64/boot/dts/qcom/msm8998.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8998.dtsi
@@ -990,6 +990,7 @@
interrupts = ;
phys = <_lanes>;
phy-names = "ufsphy";
+   #reset-cells = <1>;
lanes-per-direction = <2>;
power-domains = < UFS_GDSC>;
 
@@ -1039,6 +1040,7 @@
< GCC_UFS_CLKREF_CLK>,
< GCC_UFS_PHY_AUX_CLK>;
 
+   resets = < 0>;
ufsphy_lanes: lanes@1da7400 {
reg = <0x01da7400 0x128>,
  <0x01da7600 0x1fc>,
-- 
2.20.1



Re: [PATCH 02/17] x86, lto: Mark all top level asm statements as .text

2019-03-26 Thread Thomas Gleixner
Andi,

On Thu, 21 Mar 2019, Andi Kleen wrote:

> With gcc 8 toplevel assembler statements that do not mark themselves
> as .text may end up in other sections.

Which is clearly a change in behaviour. Is that intended or just yet
another feature of GCC?

Your subject says: 'x86, lto:'

So is this a LTO related problem or is the section randomization
independent of LTO?

This wants to be clearly documented in the changelog.

Aside of that the proper Subject prefix is either:

x86/asm/lto:

or

x86/asm:

dependent on the nature. Like it or not, but this has been the prefix x86
uses for a very long time already.

> I had boot crashes because
> various assembler statements ended up in the middle of the initcall
> section.
> 
> Always mark all the top level assembler statements as text
> so that they switch to the right section.
> 
> For AMD "vide", which is only used on 32bit kernels, I also
> marked it as 32bit only.

Once more. See

  
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes

  "Describe your changes in imperative mood, e.g. “make xyzzy do frotz”
  instead of “[This patch] makes xyzzy do frotz” or “[I] changed xyzzy to
  do frotz”, as if you are giving orders to the codebase to change its
  behaviour."

This is the last time, I'm asking for this.
 
Thanks,

tglx
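
For readers who haven't seen the problem: the kind of change the quoted description
refers to is pinning file-scope asm to .text, roughly like below (illustrative only,
not the actual hunks from the series; example_stub is a placeholder label):

/*
 * Without an explicit section, gcc 8 may emit a top-level asm statement
 * into whatever section is current at that point (e.g. an initcall
 * table), which is what caused the boot crashes described above.
 */
asm(".pushsection .text, \"ax\"\n"
    "example_stub:\n"
    "	ret\n"
    ".popsection\n");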

[PATCH fixes] MIPS: perf: Fix build with CONFIG_CPU_BMIPS5000 enabled

2019-03-26 Thread Florian Fainelli
The 'event' variable may be unused in case only CONFIG_CPU_BMIPS5000
being enabled:

arch/mips/kernel/perf_event_mipsxx.c: In function 'mipsxx_pmu_enable_event':
arch/mips/kernel/perf_event_mipsxx.c:326:21: error: unused variable 'event' 
[-Werror=unused-variable]
  struct perf_event *event = container_of(evt, struct perf_event, hw);
 ^

Fixes: 84002c88599d ("MIPS: perf: Fix perf with MT counting other threads")
Signed-off-by: Florian Fainelli 
---
 arch/mips/kernel/perf_event_mipsxx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/mips/kernel/perf_event_mipsxx.c 
b/arch/mips/kernel/perf_event_mipsxx.c
index 413863508f6f..739b7ff9fdab 100644
--- a/arch/mips/kernel/perf_event_mipsxx.c
+++ b/arch/mips/kernel/perf_event_mipsxx.c
@@ -323,7 +323,9 @@ static int mipsxx_pmu_alloc_counter(struct cpu_hw_events 
*cpuc,
 
 static void mipsxx_pmu_enable_event(struct hw_perf_event *evt, int idx)
 {
+#ifndef CONFIG_CPU_BMIPS5000
struct perf_event *event = container_of(evt, struct perf_event, hw);
+#endif
struct cpu_hw_events *cpuc = this_cpu_ptr(_hw_events);
 #ifdef CONFIG_MIPS_MT_SMP
unsigned int range = evt->event_base >> 24;
-- 
2.17.1



RE: [PATCH v7 2/4] perf/smmuv3: Add arm64 smmuv3 pmu driver

2019-03-26 Thread Shameerali Kolothum Thodi
Hi Robin,

> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: 26 March 2019 16:58
> To: Shameerali Kolothum Thodi ;
> lorenzo.pieral...@arm.com
> Cc: andrew.mur...@arm.com; jean-philippe.bruc...@arm.com;
> will.dea...@arm.com; mark.rutl...@arm.com; Guohanjun (Hanjun Guo)
> ; John Garry ;
> pa...@codeaurora.org; vkil...@codeaurora.org; rruig...@codeaurora.org;
> linux-a...@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-arm-ker...@lists.infradead.org; Linuxarm ;
> neil.m.lee...@gmail.com
> Subject: Re: [PATCH v7 2/4] perf/smmuv3: Add arm64 smmuv3 pmu driver
> 
> Hi Shameer,
> 
> On 26/03/2019 15:17, Shameer Kolothum wrote:
> [...]
> > +static int smmu_pmu_apply_event_filter(struct smmu_pmu *smmu_pmu,
> > +  struct perf_event *event, int idx)
> > +{
> > +   u32 span, sid;
> > +   unsigned int num_ctrs = smmu_pmu->num_counters;
> > +   bool filter_en = !!get_filter_enable(event);
> > +
> > +   span = filter_en ? get_filter_span(event) :
> > +  SMMU_PMCG_DEFAULT_FILTER_SPAN;
> > +   sid = filter_en ? get_filter_stream_id(event) :
> > +  SMMU_PMCG_DEFAULT_FILTER_SID;
> > +
> > +   /* Support individual filter settings */
> > +   if (!smmu_pmu->global_filter) {
> > +   smmu_pmu_set_event_filter(event, idx, span, sid);
> > +   return 0;
> > +   }
> > +
> > +   /* Requested settings same as current global settings*/
> > +   if (span == smmu_pmu->global_filter_span &&
> > +   sid == smmu_pmu->global_filter_sid)
> > +   return 0;
> > +
> > +   if (!bitmap_empty(smmu_pmu->used_counters, num_ctrs))
> > +   return -EAGAIN;
> > +
> > +   if (idx == 0) {
> > +   smmu_pmu_set_event_filter(event, idx, span, sid);
> > +   smmu_pmu->global_filter_span = span;
> > +   smmu_pmu->global_filter_sid = sid;
> > +   return 0;
> > +   }
> 
> When I suggested dropping the check of idx, I did mean removing it
> entirely, not just moving it further down ;)

Ah.. I must confess that I was slightly confused by that suggestion and
thought you were making a case for the code being clearer to read :)
 
> Nothing to worry about though, I'll just leave this here for Will to
> consider applying on top or squashing.

Thanks for that.

Cheers,
Shameer

> Thanks,
> Robin.
> 
> ->8-
> From: Robin Murphy 
> Subject: [PATCH] perf/smmuv3: Relax global filter constraint a little
> 
> Although the current behaviour of smmu_pmu_get_event_idx() effectively
> ensures that the first-allocated counter will be counter 0, there's no
> need to strictly enforce that in smmu_pmu_apply_event_filter(). All that
> matters is that we only ever touch the global filter settings in
> SMMU_PMCG_SMR0 and SMMU_PMCG_EVTYPER0 while no counters are
> active.
> 
> Signed-off-by: Robin Murphy 
> ---
>   drivers/perf/arm_smmuv3_pmu.c | 11 ---
>   1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/perf/arm_smmuv3_pmu.c
> b/drivers/perf/arm_smmuv3_pmu.c
> index 6b3c0ed7ad71..23045ead6de1 100644
> --- a/drivers/perf/arm_smmuv3_pmu.c
> +++ b/drivers/perf/arm_smmuv3_pmu.c
> @@ -286,14 +286,11 @@ static int smmu_pmu_apply_event_filter(struct
> smmu_pmu *smmu_pmu,
>   if (!bitmap_empty(smmu_pmu->used_counters, num_ctrs))
>   return -EAGAIN;
> 
> - if (idx == 0) {
> - smmu_pmu_set_event_filter(event, idx, span, sid);
> - smmu_pmu->global_filter_span = span;
> - smmu_pmu->global_filter_sid = sid;
> - return 0;
> - }
> + smmu_pmu_set_event_filter(event, 0, span, sid);
> + smmu_pmu->global_filter_span = span;
> + smmu_pmu->global_filter_sid = sid;
> 
> - return -EAGAIN;
> + return 0;
>   }
> 
>   static int smmu_pmu_get_event_idx(struct smmu_pmu *smmu_pmu,
> --
> 2.20.1.dirty


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Daniel Colascione
On Tue, Mar 26, 2019 at 9:46 AM Christian Brauner  wrote:
>
> On Tue, Mar 26, 2019 at 09:42:59AM -0700, Andy Lutomirski wrote:
> > On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner  
> > wrote:
> > >
> > > On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > > > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > > > Thanks for the patch.
> > > > > >
> > > > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner 
> > > > > >  wrote:
> > > > > > >
> > > > > > > The pidctl() syscalls builds on, extends, and improves 
> > > > > > > translate_pid() [4].
> > > > > > > I quote Konstantins original patchset first that has already been 
> > > > > > > acked and
> > > > > > > picked up by Eric before and whose functionality is preserved in 
> > > > > > > this
> > > > > > > syscall:
> > > > > >
> > > > > > We still haven't had a much-needed conversation about splitting this
> > > > > > system call into smaller logical operations. It's important that we
> > > > > > address this point before this patch is merged and becomes permanent
> > > > > > kernel ABI.
> > > > >
> > > > > I don't particularly mind splitting this into an additional syscall 
> > > > > like
> > > > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > > > about this rn is that it connects both apis in a single syscall
> > > > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > > > what other people think.
> > > >
> > > > There's something to be said for
> > > >
> > > > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> > > >
> > > > /* get pidfd */
> > > > int pidfd = pidfd_open(1234, -1, 0);
> > > >
> > > > /* convert to procfd */
> > > > int procfd = pidfd_open(-1, 4, 0);
> > > >
> > > > /* convert to pidfd */
> > > > int pidfd = pidfd_open(4, -1, 0);
> > >
> > > probably rather:
> > >
> > > int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> >
> > Do you mean:
> >
> > int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
> > int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD);
> >
> > or do you have some other solution in mind to avoid the security problem?
>
> Yes, we need the proc root obviously. I just jotted this down.
>
> We probably would need where one of the fds can refer to the proc root.
>
> pidfd_open(pid_t, int fd, int fd, 0)

Indeed. This is precisely the pidfd-procfd translation API I proposed
in the last paragraph of [1].

[1] 
https://lore.kernel.org/lkml/cakozuetcfgu0b53+mgmq3+539mpt_tiu-pacx2atvihhrrm...@mail.gmail.com/


Re: Fixes and cleanup from LTO tree

2019-03-26 Thread Thomas Gleixner
Andi,

On Thu, 21 Mar 2019, Andi Kleen wrote:

> Here are a range of bug fixes and cleanups that have accumulated in my
> gcc Link Time Optimization (LTO) branches; for issues found
> by the compiler when doing global optimization and a few
> other issues.
> 
> (https://github.com/andikleen/linux-misc lto-*)
> 
> IMNSHO they are all useful improvements even without LTO support.
> 
> About half of it is in x86 specific code, but the others are
> random all over. I tried to always copy the respective maintainers,
> but since it's (nearly) a tree sweep I'm also copying Andrew.

Can you please once and forever stop sending a random pile of patches which
are:

  - fixes independent of LTO
  - LTO required changes
  - RFC material

It's very clear where x86 related patches go through and it's also clear
that fixes have to be separate from features and other material.

You complain about maintainers being unresponsive and slow, but you are not
even trying to make their work easier by following the general process.

Thanks,

tglx


[PATCH] lib/lzo: fix bugs for very short or empty input

2019-03-26 Thread Dave Rodgman
For very short input data (0 - 1 bytes), lzo-rle was not behaving
correctly. Fix this behaviour and update documentation accordingly.

For zero-length input, lzo v0 outputs an end-of-stream marker only,
which was misinterpreted by lzo-rle as a bitstream version number.
Ensure bitstream versions > 0 require a minimum stream length of 5.

Also fixes a bug in handling the tail for very short inputs when a
bitstream version is present.

Change-Id: Ifcf7a1b9acc46a25cb3ef746eccfe26937209560
Signed-off-by: Dave Rodgman 
---
 Documentation/lzo.txt   | 8 +---
 lib/lzo/lzo1x_compress.c| 9 ++---
 lib/lzo/lzo1x_decompress_safe.c | 4 +---
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/Documentation/lzo.txt b/Documentation/lzo.txt
index f79934225d8d..ca983328976b 100644
--- a/Documentation/lzo.txt
+++ b/Documentation/lzo.txt
@@ -102,9 +102,11 @@ Byte sequences
 dictionary which is empty, and that it will always be
 invalid at this place.
 
-  17  : bitstream version. If the first byte is 17, the next byte
-gives the bitstream version (version 1 only). If the first byte
-is not 17, the bitstream version is 0.
+  17  : bitstream version. If the first byte is 17, and compressed
+stream length is at least 5 bytes (length of shortest possible
+versioned bitstream), the next byte gives the bitstream version
+(version 1 only).
+Otherwise, the bitstream version is 0.
 
   18..21  : copy 0..3 literals
 state = (byte - 17) = 0..3  [ copy  literals ]
diff --git a/lib/lzo/lzo1x_compress.c b/lib/lzo/lzo1x_compress.c
index 4525fb094844..a8ede77afe0d 100644
--- a/lib/lzo/lzo1x_compress.c
+++ b/lib/lzo/lzo1x_compress.c
@@ -291,13 +291,14 @@ int lzogeneric1x_1_compress(const unsigned char *in, 
size_t in_len,
 {
const unsigned char *ip = in;
unsigned char *op = out;
+   unsigned char *data_start;
size_t l = in_len;
size_t t = 0;
signed char state_offset = -2;
unsigned int m4_max_offset;
 
-   // LZO v0 will never write 17 as first byte,
-   // so this is used to version the bitstream
+   // LZO v0 will never write 17 as first byte (except for zero-length
+   // input), so this is used to version the bitstream
if (bitstream_version > 0) {
*op++ = 17;
*op++ = bitstream_version;
@@ -306,6 +307,8 @@ int lzogeneric1x_1_compress(const unsigned char *in, size_t 
in_len,
m4_max_offset = M4_MAX_OFFSET_V0;
}
 
+   data_start = op;
+
while (l > 20) {
size_t ll = l <= (m4_max_offset + 1) ? l : (m4_max_offset + 1);
uintptr_t ll_end = (uintptr_t) ip + ll;
@@ -324,7 +327,7 @@ int lzogeneric1x_1_compress(const unsigned char *in, size_t 
in_len,
if (t > 0) {
const unsigned char *ii = in + in_len - t;
 
-   if (op == out && t <= 238) {
+   if (op == data_start && t <= 238) {
*op++ = (17 + t);
} else if (t <= 3) {
op[state_offset] |= t;
diff --git a/lib/lzo/lzo1x_decompress_safe.c b/lib/lzo/lzo1x_decompress_safe.c
index 6d2600ea3b55..9e07e9ef1aad 100644
--- a/lib/lzo/lzo1x_decompress_safe.c
+++ b/lib/lzo/lzo1x_decompress_safe.c
@@ -54,11 +54,9 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t 
in_len,
if (unlikely(in_len < 3))
goto input_overrun;
 
-   if (likely(*ip == 17)) {
+   if (likely(in_len >= 5) && likely(*ip == 17)) {
bitstream_version = ip[1];
ip += 2;
-   if (unlikely(in_len < 5))
-   goto input_overrun;
} else {
bitstream_version = 0;
}
-- 
2.17.1



Re: [PATCH v2] genirq: Respect IRQCHIP_SKIP_SET_WAKE in irq_chip_set_wake_parent()

2019-03-26 Thread Stephen Boyd
Quoting Marc Zyngier (2019-03-26 04:11:56)
> Hi Stephen,
> 
> On 25/03/2019 18:10, Stephen Boyd wrote:
> > This function returns an error if a child irqchip calls
> > irq_chip_set_wake_parent() but its parent irqchip has the
> > IRQCHIP_SKIP_SET_WAKE flag set. Let's return 0 for success here instead
> > because there isn't anything to do.
> > 
> > This keeps the behavior consistent with how set_irq_wake_real() is
> > implemented. That function returns 0 when the irqchip has the
> > IRQCHIP_SKIP_SET_WAKE flag set. It doesn't attempt to walk the chain of
> > parents and set irq wake on any chips that don't have the flag set
> > either. If the intent is to call the .irq_set_wake() callback of the
> > parent irqchip, then we expect irqchip implementations to omit the
> > IRQCHIP_SKIP_SET_WAKE flag and implement an .irq_set_wake() function
> > that calls irq_chip_set_wake_parent().
> > 
> > This fixes a problem on my Qualcomm sdm845 device where I can't set wake
> > on any GPIO interrupts after I apply work in progress wakeup irq patches
> > to the GPIO driver. The chain of chips looks like this:
> > 
> >  ARM GIC (skip) -> QCOM PDC (skip) -> QCOM GPIO
> 
> nit: the parenting chain is actually built the other way around (we
> don't express the 'child' relationship). This doesn't change anything to
> the patch, but would make the reasoning a bit easier to understand.

I take it you want the sentence below to say 'parent' instead of 'child'
then?

> 
> > 
> > The GPIO controller is a child of the QCOM PDC irqchip which is a child
> > of the ARM GIC irqchip. The QCOM PDC irqchip has the
> > IRQCHIP_SKIP_SET_WAKE flag set, and so does the grandparent ARM GIC.
> > 
> > The GPIO driver doesn't know if the parent needs to set wake or not, so
> > it unconditionally calls irq_chip_set_wake_parent() causing this
> > function to return a failure because the parent irqchip (PDC) doesn't
> > have the .irq_set_wake() callback set. Returning 0 instead makes
> > everything work and irqs from the GPIO controller can be configured for
> > wakeup.
> > 
> > Cc: Lina Iyer 
> > Cc: Marc Zyngier 
> > Signed-off-by: Stephen Boyd 
> 
> Fixes: 08b55e2a9208e ("genirq: Add irqchip_set_wake_parent")
> Acked-by: Marc Zyngier 
> 

I'm happy to resend with the commit text clarified more and the above
tags added.
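
For illustration, a minimal sketch of the semantics being discussed, treating a
parent irqchip that sets IRQCHIP_SKIP_SET_WAKE as "nothing to do, report
success" rather than as an error. This is only a sketch, not the actual v2
patch text:

	int irq_chip_set_wake_parent(struct irq_data *data, unsigned int on)
	{
		data = data->parent_data;

		/* Parent opted out of set_wake handling: nothing to do */
		if (data->chip->flags & IRQCHIP_SKIP_SET_WAKE)
			return 0;

		if (data->chip->irq_set_wake)
			return data->chip->irq_set_wake(data, on);

		return -ENOSYS;
	}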



Re: [PATCH v6 04/19] powerpc: mm: Add p?d_large() definitions

2019-03-26 Thread Christophe Leroy




On 26/03/2019 at 17:26, Steven Price wrote:

walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.

For powerpc pmd_large() was already implemented, so hoist it out of the
CONFIG_TRANSPARENT_HUGEPAGE condition and implement the other levels.

Also, since pmd_large() is now always implemented, we can drop the
pmd_is_leaf() function.


Wouldn't it be better to drop the pmd_is_leaf() in a second patch ?

Christophe



CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: linuxppc-...@lists.ozlabs.org
CC: kvm-...@vger.kernel.org
Signed-off-by: Steven Price 
---
  arch/powerpc/include/asm/book3s/64/pgtable.h | 30 ++--
  arch/powerpc/kvm/book3s_64_mmu_radix.c   | 12 ++--
  2 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 581f91be9dd4..f6d1ac8b832e 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -897,6 +897,12 @@ static inline int pud_present(pud_t pud)
return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PRESENT));
  }
  
+#define pud_large	pud_large

+static inline int pud_large(pud_t pud)
+{
+   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
+}
+
  extern struct page *pud_page(pud_t pud);
  extern struct page *pmd_page(pmd_t pmd);
  static inline pte_t pud_pte(pud_t pud)
@@ -940,6 +946,12 @@ static inline int pgd_present(pgd_t pgd)
return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PRESENT));
  }
  
+#define pgd_large	pgd_large

+static inline int pgd_large(pgd_t pgd)
+{
+   return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
+}
+
  static inline pte_t pgd_pte(pgd_t pgd)
  {
return __pte_raw(pgd_raw(pgd));
@@ -1093,6 +1105,15 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool 
write)
return pte_access_permitted(pmd_pte(pmd), write);
  }
  
+#define pmd_large	pmd_large

+/*
+ * returns true for pmd migration entries, THP, devmap, hugetlb
+ */
+static inline int pmd_large(pmd_t pmd)
+{
+   return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
+}
+
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
  extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
@@ -1119,15 +1140,6 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long 
addr, pmd_t *pmdp,
return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set);
  }
  
-/*

- * returns true for pmd migration entries, THP, devmap, hugetlb
- * But compile time dependent on THP config
- */
-static inline int pmd_large(pmd_t pmd)
-{
-   return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
-}
-
  static inline pmd_t pmd_mknotpresent(pmd_t pmd)
  {
return __pmd(pmd_val(pmd) & ~_PAGE_PRESENT);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index f55ef071883f..1b57b4e3f819 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -363,12 +363,6 @@ static void kvmppc_pte_free(pte_t *ptep)
kmem_cache_free(kvm_pte_cache, ptep);
  }
  
-/* Like pmd_huge() and pmd_large(), but works regardless of config options */

-static inline int pmd_is_leaf(pmd_t pmd)
-{
-   return !!(pmd_val(pmd) & _PAGE_PTE);
-}
-
  static pmd_t *kvmppc_pmd_alloc(void)
  {
return kmem_cache_alloc(kvm_pmd_cache, GFP_KERNEL);
@@ -460,7 +454,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t 
*pmd, bool full,
for (im = 0; im < PTRS_PER_PMD; ++im, ++p) {
if (!pmd_present(*p))
continue;
-   if (pmd_is_leaf(*p)) {
+   if (pmd_large(*p)) {
if (full) {
pmd_clear(p);
} else {
@@ -593,7 +587,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, 
pte_t pte,
else if (level <= 1)
new_pmd = kvmppc_pmd_alloc();
  
-	if (level == 0 && !(pmd && pmd_present(*pmd) && !pmd_is_leaf(*pmd)))

+   if (level == 0 && !(pmd && pmd_present(*pmd) && !pmd_large(*pmd)))
new_ptep = kvmppc_pte_alloc();
  
  	/* Check if we might have been invalidated; let the guest retry if so */

@@ -662,7 +656,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, 
pte_t pte,
new_pmd = NULL;
}
pmd = pmd_offset(pud, gpa);
-   if (pmd_is_leaf(*pmd)) {
+   if (pmd_large(*pmd)) {
unsigned long lgpa = gpa & PMD_MASK;
  
  		/* Check if we raced and someone else has set the same thing */




Re: [PATCH v7 2/4] perf/smmuv3: Add arm64 smmuv3 pmu driver

2019-03-26 Thread Robin Murphy

Hi Shameer,

On 26/03/2019 15:17, Shameer Kolothum wrote:
[...]

+static int smmu_pmu_apply_event_filter(struct smmu_pmu *smmu_pmu,
+  struct perf_event *event, int idx)
+{
+   u32 span, sid;
+   unsigned int num_ctrs = smmu_pmu->num_counters;
+   bool filter_en = !!get_filter_enable(event);
+
+   span = filter_en ? get_filter_span(event) :
+  SMMU_PMCG_DEFAULT_FILTER_SPAN;
+   sid = filter_en ? get_filter_stream_id(event) :
+  SMMU_PMCG_DEFAULT_FILTER_SID;
+
+   /* Support individual filter settings */
+   if (!smmu_pmu->global_filter) {
+   smmu_pmu_set_event_filter(event, idx, span, sid);
+   return 0;
+   }
+
+   /* Requested settings same as current global settings*/
+   if (span == smmu_pmu->global_filter_span &&
+   sid == smmu_pmu->global_filter_sid)
+   return 0;
+
+   if (!bitmap_empty(smmu_pmu->used_counters, num_ctrs))
+   return -EAGAIN;
+
+   if (idx == 0) {
+   smmu_pmu_set_event_filter(event, idx, span, sid);
+   smmu_pmu->global_filter_span = span;
+   smmu_pmu->global_filter_sid = sid;
+   return 0;
+   }


When I suggested dropping the check of idx, I did mean removing it 
entirely, not just moving it further down ;)


Nothing to worry about though, I'll just leave this here for Will to 
consider applying on top or squashing.


Thanks,
Robin.

->8-
From: Robin Murphy 
Subject: [PATCH] perf/smmuv3: Relax global filter constraint a little

Although the current behaviour of smmu_pmu_get_event_idx() effectively
ensures that the first-allocated counter will be counter 0, there's no
need to strictly enforce that in smmu_pmu_apply_event_filter(). All that
matters is that we only ever touch the global filter settings in
SMMU_PMCG_SMR0 and SMMU_PMCG_EVTYPER0 while no counters are active.

Signed-off-by: Robin Murphy 
---
 drivers/perf/arm_smmuv3_pmu.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c
index 6b3c0ed7ad71..23045ead6de1 100644
--- a/drivers/perf/arm_smmuv3_pmu.c
+++ b/drivers/perf/arm_smmuv3_pmu.c
@@ -286,14 +286,11 @@ static int smmu_pmu_apply_event_filter(struct 
smmu_pmu *smmu_pmu,

if (!bitmap_empty(smmu_pmu->used_counters, num_ctrs))
return -EAGAIN;

-   if (idx == 0) {
-   smmu_pmu_set_event_filter(event, idx, span, sid);
-   smmu_pmu->global_filter_span = span;
-   smmu_pmu->global_filter_sid = sid;
-   return 0;
-   }
+   smmu_pmu_set_event_filter(event, 0, span, sid);
+   smmu_pmu->global_filter_span = span;
+   smmu_pmu->global_filter_sid = sid;

-   return -EAGAIN;
+   return 0;
 }

 static int smmu_pmu_get_event_idx(struct smmu_pmu *smmu_pmu,
--
2.20.1.dirty


[PATCH] HID: quirks: Fix keyboard + touchpad on Lenovo Miix 630

2019-03-26 Thread Jeffrey Hugo
Similar to commit edfc3722cfef ("HID: quirks: Fix keyboard + touchpad on
Toshiba Click Mini not working"), the Lenovo Miix 630 has a combo
keyboard/touchpad device with vid:pid of 04F3:0400, which is shared with
Elan touchpads.  The combo on the Miix 630 has an ACPI id of QTEC0001,
which is not claimed by the elan_i2c driver, so key on that similar to
what was done for the Toshiba Click Mini.

Signed-off-by: Jeffrey Hugo 
---
 drivers/hid/hid-quirks.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/hid/hid-quirks.c b/drivers/hid/hid-quirks.c
index 1148d8c0816a..77ffba48cc73 100644
--- a/drivers/hid/hid-quirks.c
+++ b/drivers/hid/hid-quirks.c
@@ -715,7 +715,6 @@ static const struct hid_device_id hid_ignore_list[] = {
{ HID_USB_DEVICE(USB_VENDOR_ID_DEALEXTREAME, 
USB_DEVICE_ID_DEALEXTREAME_RADIO_SI4701) },
{ HID_USB_DEVICE(USB_VENDOR_ID_DELORME, 
USB_DEVICE_ID_DELORME_EARTHMATE) },
{ HID_USB_DEVICE(USB_VENDOR_ID_DELORME, USB_DEVICE_ID_DELORME_EM_LT20) 
},
-   { HID_I2C_DEVICE(USB_VENDOR_ID_ELAN, 0x0400) },
{ HID_USB_DEVICE(USB_VENDOR_ID_ESSENTIAL_REALITY, 
USB_DEVICE_ID_ESSENTIAL_REALITY_P5) },
{ HID_USB_DEVICE(USB_VENDOR_ID_ETT, USB_DEVICE_ID_TC5UH) },
{ HID_USB_DEVICE(USB_VENDOR_ID_ETT, USB_DEVICE_ID_TC4UM) },
@@ -996,6 +995,10 @@ bool hid_ignore(struct hid_device *hdev)
if (hdev->product == 0x0401 &&
strncmp(hdev->name, "ELAN0800", 8) != 0)
return true;
+   /* Same with product id 0x0400 */
+   if (hdev->product == 0x0400 &&
+   strncmp(hdev->name, "QTEC0001", 8) != 0)
+   return true;
break;
}
 
-- 
2.17.1



Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Daniel Colascione
On Tue, Mar 26, 2019 at 9:44 AM Christian Brauner  wrote:
>
> On Tue, Mar 26, 2019 at 09:38:31AM -0700, Daniel Colascione wrote:
> > On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner  
> > wrote:
> > >
> > > On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > > > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > > > Thanks for the patch.
> > > > > >
> > > > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner 
> > > > > >  wrote:
> > > > > > >
> > > > > > > The pidctl() syscall builds on, extends, and improves
> > > > > > > translate_pid() [4]. I quote Konstantin's original patchset first,
> > > > > > > which has already been acked and picked up by Eric before and whose
> > > > > > > functionality is preserved in this syscall:
> > > > > >
> > > > > > We still haven't had a much-needed conversation about splitting this
> > > > > > system call into smaller logical operations. It's important that we
> > > > > > address this point before this patch is merged and becomes permanent
> > > > > > kernel ABI.
> > > > >
> > > > > I don't particularly mind splitting this into an additional syscall 
> > > > > like
> > > > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > > > about this rn is that it connects both apis in a single syscall
> > > > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > > > what other people think.
> > > >
> > > > There's something to be said for
> > > >
> > > > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> > > >
> > > > /* get pidfd */
> > > > int pidfd = pidfd_open(1234, -1, 0);
> > > >
> > > > /* convert to procfd */
> > > > int procfd = pidfd_open(-1, 4, 0);
> > > >
> > > > /* convert to pidfd */
> > > > int pidfd = pidfd_open(4, -1, 0);
> > >
> > > probably rather:
> > >
> > > int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> > > int procfd = pidfd_open(-1, 4, PROCFD_TO_PIDFD);
> > > int pidfd = pidfd_open(1234, -1, 0);
> >
> > These three operations look like three related but distinct functions
> > to me, and in the second case, the "pidfd_open" name is a bit of a
> > misnomer. IMHO, the presence of an "operation name" field in any API
> > is usually a good indication that we're looking at a family of related
> > APIs, not a single coherent operation.
>
> So I'm happy to accommodate the need for a clean api even though I
> disagree that what we have in pidctl() is unclean.
> But I will not start sending a pile of syscalls. There is nothing
> necessarily wrong to group related APIs together.

In the email I sent just now, I identified several specific technical
disadvantages arising from unnecessary grouping of system calls. We
have historical evidence in the form of socketcall that this grouping
tends to be regrettable. I don't recall your identifying any
offsetting technical advantages. Did I miss something?

> By these standards the
> new mount API would need to be like 30 different syscalls, same for
> keyring management.

Can you please point out the problem that would arise from splitting
the mount and keyring APIs this way? One could have made the same
argument about grouping socket operations, and this socket-operation
grouping ended up being a mistake.


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Christian Brauner
On Tue, Mar 26, 2019 at 09:42:59AM -0700, Andy Lutomirski wrote:
> On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner  
> wrote:
> >
> > On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > > Thanks for the patch.
> > > > >
> > > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner 
> > > > >  wrote:
> > > > > >
> > > > > > The pidctl() syscall builds on, extends, and improves
> > > > > > translate_pid() [4]. I quote Konstantin's original patchset first,
> > > > > > which has already been acked and picked up by Eric before and whose
> > > > > > functionality is preserved in this syscall:
> > > > >
> > > > > We still haven't had a much-needed conversation about splitting this
> > > > > system call into smaller logical operations. It's important that we
> > > > > address this point before this patch is merged and becomes permanent
> > > > > kernel ABI.
> > > >
> > > > I don't particularly mind splitting this into an additional syscall like
> > > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > > about this rn is that it connects both apis in a single syscall
> > > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > > what other people think.
> > >
> > > There's something to be said for
> > >
> > > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> > >
> > > /* get pidfd */
> > > int pidfd = pidfd_open(1234, -1, 0);
> > >
> > > /* convert to procfd */
> > > int procfd = pidfd_open(-1, 4, 0);
> > >
> > > /* convert to pidfd */
> > > int pidfd = pidfd_open(4, -1, 0);
> >
> > probably rather:
> >
> > int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> 
> Do you mean:
> 
> int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
> int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD);
> 
> or do you have some other solution in mind to avoid the security problem?

Yes, we need the proc root obviously. I just jotted this down.

We probably would need a form where one of the fds can refer to the proc root.

pidfd_open(pid_t, int fd, int fd, 0)


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Christian Brauner
On Tue, Mar 26, 2019 at 09:38:31AM -0700, Daniel Colascione wrote:
> On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner  
> wrote:
> >
> > On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > > Thanks for the patch.
> > > > >
> > > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner 
> > > > >  wrote:
> > > > > >
> > > > > > The pidctl() syscall builds on, extends, and improves
> > > > > > translate_pid() [4]. I quote Konstantin's original patchset first,
> > > > > > which has already been acked and picked up by Eric before and whose
> > > > > > functionality is preserved in this syscall:
> > > > >
> > > > > We still haven't had a much-needed conversation about splitting this
> > > > > system call into smaller logical operations. It's important that we
> > > > > address this point before this patch is merged and becomes permanent
> > > > > kernel ABI.
> > > >
> > > > I don't particularly mind splitting this into an additional syscall like
> > > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > > about this rn is that it connects both apis in a single syscall
> > > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > > what other people think.
> > >
> > > There's something to be said for
> > >
> > > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> > >
> > > /* get pidfd */
> > > int pidfd = pidfd_open(1234, -1, 0);
> > >
> > > /* convert to procfd */
> > > int procfd = pidfd_open(-1, 4, 0);
> > >
> > > /* convert to pidfd */
> > > int pidfd = pidfd_open(4, -1, 0);
> >
> > probably rather:
> >
> > int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> > int procfd = pidfd_open(-1, 4, PROCFD_TO_PIDFD);
> > int pidfd = pidfd_open(1234, -1, 0);
> 
> These three operations look like three related but distinct functions
> to me, and in the second case, the "pidfd_open" name is a bit of a
> misnomer. IMHO, the presence of an "operation name" field in any API
> is usually a good indication that we're looking at a family of related
> APIs, not a single coherent operation.

So I'm happy to accommodate the need for a clean api even though I
disagree that what we have in pidctl() is unclean.
But I will not start sending a pile of syscalls. There is nothing
necessarily wrong to group related APIs together. By these standards the
new mount API would need to be like 30 different syscalls, same for
keyring management.


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Andy Lutomirski
On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner  wrote:
>
> On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > Thanks for the patch.
> > > >
> > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner 
> > > >  wrote:
> > > > >
> > > > > The pidctl() syscall builds on, extends, and improves
> > > > > translate_pid() [4]. I quote Konstantin's original patchset first,
> > > > > which has already been acked and picked up by Eric before and whose
> > > > > functionality is preserved in this syscall:
> > > >
> > > > We still haven't had a much-needed conversation about splitting this
> > > > system call into smaller logical operations. It's important that we
> > > > address this point before this patch is merged and becomes permanent
> > > > kernel ABI.
> > >
> > > I don't particularly mind splitting this into an additional syscall like
> > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > about this rn is that it connects both apis in a single syscall
> > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > what other people think.
> >
> > There's something to be said for
> >
> > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> >
> > /* get pidfd */
> > int pidfd = pidfd_open(1234, -1, 0);
> >
> > /* convert to procfd */
> > int procfd = pidfd_open(-1, 4, 0);
> >
> > /* convert to pidfd */
> > int pidfd = pidfd_open(4, -1, 0);
>
> probably rather:
>
> int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);

Do you mean:

int procrootfd = open("/proc", O_DIRECTORY | O_RDONLY);
int procfd = pidfd_open(procrootfd, pidfd, PIDFD_TO_PROCFD);

or do you have some other solution in mind to avoid the security problem?


[patch 2/2] x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y

2019-03-26 Thread Thomas Gleixner
The SMT disable 'nosmt' command line argument is not working properly when
CONFIG_HOTPLUG_CPU is disabled. The teardown of the sibling CPUs, which are
required to be brought up due to the MCE issues, cannot work. The CPUs are
then kept in a half dead state.

As the 'nosmt' functionality has become popular due to the speculative
hardware vulnerabilities, the half torn down state is not a proper solution
to the problem.

Enforce CONFIG_HOTPLUG_CPU=y when SMP is enabled so the full operation is
possible.

Reported-by: Tianyu Lan 
Signed-off-by: Thomas Gleixner 
Cc: Konrad Wilk 
Cc: Josh Poimboeuf 
Cc: Mukesh Ojha 
Cc: Peter Zijlstra 
Cc: Jiri Kosina 
Cc: Rik van Riel 
Cc: Andy Lutomirski 
Cc: Micheal Kelley 
Cc: K. Y. Srinivasan 
Cc: Greg KH 
Cc: Linus Torvalds 
Cc: Borislav Petkov 
Cc: sta...@vger.kernel.org
---
 arch/x86/Kconfig |8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2217,14 +2217,8 @@ config RANDOMIZE_MEMORY_PHYSICAL_PADDING
   If unsure, leave at the default value.
 
 config HOTPLUG_CPU
-   bool "Support for hot-pluggable CPUs"
+   def_bool y
depends on SMP
-   ---help---
- Say Y here to allow turning CPUs off and on. CPUs can be
- controlled through /sys/devices/system/cpu.
- ( Note: power management support will enable this option
-   automatically on SMP systems. )
- Say N if you want to disable CPU hotplug.
 
 config BOOTPARAM_HOTPLUG_CPU0
bool "Set default setting of cpu0_hotpluggable"




[patch 1/2] cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n

2019-03-26 Thread Thomas Gleixner
Tianyu reported a crash in a CPU hotplug teardown callback when booting a
kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot
parameter.

It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken
forever when a bringup callback fails. Unfortunately this issue was
not recognized when the CPU hotplug code was reworked, so the shortcoming
just stayed in place.

When a bringup callback fails, the CPU hotplug code rolls back the
operation and takes the CPU offline.

The 'nosmt' command line argument uses a bringup failure to abort the
bringup of SMT sibling CPUs. This partial bringup is required due to the
MCE misdesign on Intel CPUs.

With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but
CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level
teardown of a CPU including the synchronizations in various facilities like
RCU, NOHZ and others.

As a consequence the teardown callbacks which must be executed on the
outgoing CPU within stop machine with interrupts disabled are executed on
the control CPU in interrupt enabled and preemptible context causing the
kernel to crash and burn. The pre state machine code has a different
failure mode which is more subtle, resulting in a less obvious use-after-free
crash because the control side frees resources which are still in use
by the undead CPU.

But this is not a x86 only problem. Any architecture which supports the
SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less
likely to be triggered because in 99.9% of the cases all bringup
callbacks succeed.

The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on
all architectures as the following architectures have either no hotplug
support at all or not all subarchitectures support it:

 alpha, arc, hexagon, openrisc, riscv, sparc (32bit), mips (partial).

Crashing the kernel in such a situation is not an acceptable state
either.

Implement a minimal rollback variant by limiting the teardown to the point
where all regular teardown callbacks have been invoked and leave the CPU in
the 'dead' idle state. This has the following consequences:

 - the CPU is brought down to the point where the stop_machine takedown
   would happen.

 - the CPU stays there forever and is idle

 - The CPU is cleared in the CPU active mask, but not in the CPU online
   mask which is a legit state.

 - Interrupts are not forced away from the CPU

 - All facilities which only look at online mask would still see it, but
   that is the case during normal hotplug/unplug operations as well. It's
   just a (way) longer time frame.

This will expose issues which haven't been exposed before, or only seldom,
because now the normally transient state of being non-active but online is
a permanent state. In testing this already exposed an issue vs. work queues
where the vmstat code schedules work on the almost dead CPU, which ends up
in an unbound workqueue and triggers 'preemptible context' warnings. This is
not a problem of this change; it merely exposes an already existing issue.
Still this is better than crashing fully without a chance to debug it.

This is mainly meant as a workaround for those architectures which do not
support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP.

Fixes: 2e1a3483ce74 ("cpu/hotplug: Split out the state walk into functions")
Reported-by: Tianyu Lan 
Signed-off-by: Thomas Gleixner 
Tested-by: Tianyu Lan 
Cc: Konrad Wilk 
Cc: Josh Poimboeuf 
Cc: Mukesh Ojha 
Cc: Peter Zijlstra 
Cc: Jiri Kosina 
Cc: Rik van Riel 
Cc: Andy Lutomirski 
Cc: Micheal Kelley 
Cc: K. Y. Srinivasan 
Cc: Greg KH 
Cc: Linus Torvalds 
Cc: Borislav Petkov 
Cc: sta...@vger.kernel.org
---
 kernel/cpu.c |   20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -564,6 +564,20 @@ static void undo_cpu_up(unsigned int cpu
cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
 }
 
+static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st)
+{
+   if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
+   return true;
+   /*
+* When CPU hotplug is disabled, then taking the CPU down is not
+* possible because takedown_cpu() and the architecture and
+* subsystem specific mechanisms are not available. So the CPU
+* which would be completely unplugged again needs to stay around
+* in the current state.
+*/
+   return st->state <= CPUHP_BRINGUP_CPU;
+}
+
 static int cpuhp_up_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,
  enum cpuhp_state target)
 {
@@ -574,8 +588,10 @@ static int cpuhp_up_callbacks(unsigned i
st->state++;
ret = cpuhp_invoke_callback(cpu, st->state, true, NULL, NULL);
if (ret) {
-   st->target = prev_state;
-   undo_cpu_up(cpu, st);
+   if 

[patch 0/2] cpu/hotplug: Prevent damage with SMP=y and HOTPLUG_CPU=n

2019-03-26 Thread Thomas Gleixner
Tianyu reported a crash with SMP=y and HOTPLUG_CPU=n plus 'nosmt' on the
kernel command line.

  
https://lkml.kernel.org/r/1553521883-20868-1-git-send-email-tianyu@microsoft.com

The reason is a bug in the hotplug code which does not handle the fact
that HOTPLUG_CPU=n cannot tear down a CPU completely.

Unfortunately HOTPLUG_CPU cannot be enforced as some architectures do not
support it at all.

The fix is only a workaround because a full solution is not possible due to
the limitations of HOTPLUG_CPU=n. So the CPU stays around in an undead state.

As 'nosmt' has become popular recently, the proper solution for X86 is to
enforce HOTPLUG_CPU when SMP is enabled.

Thanks,

tglx

 arch/x86/Kconfig |8 +---
 kernel/cpu.c |   20 ++--
 2 files changed, 19 insertions(+), 9 deletions(-)





Re: [PATCH 4.14 00/41] 4.14.109-stable review

2019-03-26 Thread Naresh Kamboju
On Tue, 26 Mar 2019 at 12:04, Greg Kroah-Hartman
 wrote:
>
> This is the start of the stable review cycle for the 4.14.109 release.
> There are 41 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu Mar 28 04:26:32 UTC 2019.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
> 
> https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.109-rc1.gz
> or in the git tree and branch at:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-4.14.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h
>

Results from Linaro’s test farm.
No regressions on arm64, arm, x86_64, and i386.

Summary


kernel: 4.14.109-rc1
git repo: 
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
git branch: linux-4.14.y
git commit: 4bb6d9c67e49d5301e6341b32ab5d72354e821d6
git describe: v4.14.108-42-g4bb6d9c67e49
Test details: 
https://qa-reports.linaro.org/lkft/linux-stable-rc-4.14-oe/build/v4.14.108-42-g4bb6d9c67e49

No regressions (compared to build v4.14.108)


No fixes (compared to build v4.14.108)

Ran 22897 total tests in the following environments and test suites.

Environments
--
- dragonboard-410c - arm64
- hi6220-hikey - arm64
- i386
- juno-r2 - arm64
- qemu_arm
- qemu_arm64
- qemu_i386
- qemu_x86_64
- x15 - arm
- x86_64

Test Suites
---
* boot
* install-android-platform-tools-r2600
* kselftest
* libhugetlbfs
* ltp-cap_bounds-tests
* ltp-commands-tests
* ltp-containers-tests
* ltp-cpuhotplug-tests
* ltp-cve-tests
* ltp-dio-tests
* ltp-fcntl-locktests-tests
* ltp-filecaps-tests
* ltp-fs-tests
* ltp-fs_bind-tests
* ltp-fs_perms_simple-tests
* ltp-fsx-tests
* ltp-hugetlb-tests
* ltp-io-tests
* ltp-ipc-tests
* ltp-math-tests
* ltp-mm-tests
* ltp-nptl-tests
* ltp-pty-tests
* ltp-sched-tests
* ltp-securebits-tests
* ltp-syscalls-tests
* ltp-timers-tests
* perf
* spectre-meltdown-checker-test
* ltp-open-posix-tests
* kselftest-vsyscall-mode-native
* kselftest-vsyscall-mode-none

-- 
Linaro LKFT
https://lkft.linaro.org


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Daniel Colascione
On Tue, Mar 26, 2019 at 9:34 AM Christian Brauner  wrote:
>
> On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> > On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > > Thanks for the patch.
> > > >
> > > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner 
> > > >  wrote:
> > > > >
> > > > > The pidctl() syscall builds on, extends, and improves
> > > > > translate_pid() [4]. I quote Konstantin's original patchset first,
> > > > > which has already been acked and picked up by Eric before and whose
> > > > > functionality is preserved in this syscall:
> > > >
> > > > We still haven't had a much-needed conversation about splitting this
> > > > system call into smaller logical operations. It's important that we
> > > > address this point before this patch is merged and becomes permanent
> > > > kernel ABI.
> > >
> > > I don't particularly mind splitting this into an additional syscall like
> > > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > > about this rn is that it connects both apis in a single syscall
> > > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > > what other people think.
> >
> > There's something to be said for
> >
> > pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> >
> > /* get pidfd */
> > int pidfd = pidfd_open(1234, -1, 0);
> >
> > /* convert to procfd */
> > int procfd = pidfd_open(-1, 4, 0);
> >
> > /* convert to pidfd */
> > int pidfd = pidfd_open(4, -1, 0);
>
> probably rather:
>
> int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
> int procfd = pidfd_open(-1, 4, PROCFD_TO_PIDFD);
> int pidfd = pidfd_open(1234, -1, 0);

These three operations look like three related but distinct functions
to me, and in the second case, the "pidfd_open" name is a bit of a
misnomer. IMHO, the presence of an "operation name" field in any API
is usually a good indication that we're looking at a family of related
APIs, not a single coherent operation.


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Daniel Colascione
On Tue, Mar 26, 2019 at 9:23 AM Christian Brauner  wrote:
>
> On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > Thanks for the patch.
> >
> > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner  
> > wrote:
> > >
> > > The pidctl() syscall builds on, extends, and improves translate_pid()
> > > [4]. I quote Konstantin's original patchset first, which has already been
> > > acked and picked up by Eric before and whose functionality is preserved in
> > > this syscall:
> >
> > We still haven't had a much-needed conversation about splitting this
> > system call into smaller logical operations. It's important that we
> > address this point before this patch is merged and becomes permanent
> > kernel ABI.
>
> I don't particularly mind splitting this into an additional syscall like
> e.g.  pidfd_open() but then we have - and yes, I know you'll say
> syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> about this rn is that it connects both apis in a single syscall
> and allows pidfd retrieval across pid namespaces. So I guess we'll see
> what other people think.

Thanks. I also appreciate a clean unification of related
functionality, but I'm concerned that this API in particular --- due
in part to its *ctl() name --- will become a catch-all facility for
doing *anything* with processes. (Granted, heavy use of a new, good,
and clean API would be a good problem to have.) This
single-system-call state of affairs would make it more awkward than
necessary to do system-call level logging (say, strace -e), enable or
disable tracing of specific operations with ftrace, apply some kinds
of SELinux policy, and so on, and the only advantage of the single
system call design that I can see right now is the logical
cleanliness.

I'd propose splitting the call, or if we can't do that, renaming it to
something else --- pidfd_query --- so that it's less likely to become
a catch-all operation holder.


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Christian Brauner
On Tue, Mar 26, 2019 at 05:31:42PM +0100, Christian Brauner wrote:
> On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> > On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > > Thanks for the patch.
> > > 
> > > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner  
> > > wrote:
> > > >
> > > > The pidctl() syscall builds on, extends, and improves translate_pid()
> > > > [4]. I quote Konstantin's original patchset first, which has already been
> > > > acked and picked up by Eric before and whose functionality is preserved
> > > > in this syscall:
> > > 
> > > We still haven't had a much-needed conversation about splitting this
> > > system call into smaller logical operations. It's important that we
> > > address this point before this patch is merged and becomes permanent
> > > kernel ABI.
> > 
> > I don't particularly mind splitting this into an additional syscall like
> > e.g.  pidfd_open() but then we have - and yes, I know you'll say
> > syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> > about this rn is that it connects both apis in a single syscall
> > and allows pidfd retrieval across pid namespaces. So I guess we'll see
> > what other people think.
> 
> There's something to be said for
> 
> pidfd_open(pid_t pid, int pidfd, unsigned int flags);
> 
> /* get pidfd */
> int pidfd = pidfd_open(1234, -1, 0);
> 
> /* convert to procfd */
> int procfd = pidfd_open(-1, 4, 0);
> 
> /* convert to pidfd */
> int pidfd = pidfd_open(4, -1, 0);

probably rather:

int pidfd = pidfd_open(-1, 4, PIDFD_TO_PROCFD);
int procfd = pidfd_open(-1, 4, PROCFD_TO_PIDFD);
int pidfd = pidfd_open(1234, -1, 0);


Re: [RFC PATCH v2 1/3] resource: Request IO port regions from children of ioport_resource

2019-03-26 Thread John Garry

On 25/03/2019 23:32, Bjorn Helgaas wrote:

Hi John,



Hi Bjorn,

Thanks for reviewing this.


On Thu, Mar 21, 2019 at 02:14:08AM +0800, John Garry wrote:

Currently when we request an IO port region, the request is made directly
to the top resource, ioport_resource.


Let's be explicit here, e.g.,

  Currently request_region() requests an IO port region directly from the
  top resource, ioport_resource.


ok




There is an issue here, in that drivers may successfully request an IO
port region even if the IO port region has not even been mapped in
(in pci_remap_iospace()).

This may lead to crashes when the system has no PCI host, or, has a host
but it has failed enumeration, while drivers still attempt to access PCI
IO ports, as below:


I don't understand the strategy here.  f71882fg is not a driver for a
PCI device, so it should work even if there is no PCI host in the
system.


From my checking, the f71882fg hwmon is accessed via the super-io 
interface on the PCH on x86. The super-io interface is at fixed 
addresses, those being 0x2e and 0x4e.


Please see the following:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/f71805f.c?h=v5.1-rc2#n1621

and

https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/8-series-chipset-pch-datasheet.pdf 
(Table 9.2).


On x86 systems, these PCH IO ports will be mapped on a PCI bus, like:

$more /proc/ioports
-0cf7 : PCI Bus :00
  -001f : dma1
  0020-0021 : pic1
  0040-0043 : timer0
  0050-0053 : timer1
  0060-0060 : keyboard
  0064-0064 : keyboard
  0070-0077 : rtc0
  0080-008f : dma page reg
  00a0-00a1 : pic2
  00c0-00df : dma2
  00f0-00ff : fpu

So, the idea in the patch is that if PCI Bus :00 does not exist 
because of no PCI host, then we should fail a request to an IO port region.




On x86, I think inb/inw/inl from a port where nothing responds
probably just returns ~0, and outb/outw/outl just get dropped.
Shouldn't arm64 do the same, without crashing?


That would be ideal and we're doing something similar in patch 2/3.

So on ARM64 we have to IO remap the PCI IO resource. If this mapping is 
not done (due to no PCI host), then any inb/inw/inl calls will crash the 
system.


So in patch 2/3, I am also making the change to the logical PIO 
inb/inw/inl accessors to discard accesses when no PCI MMIO regions are 
registered in logical PIO space.


This is really a second line of defense (this patch being the first).




root@(none)$root@(none)$ insmod f71882fg.ko
[  152.215377] Unable to handle kernel paging request at virtual address 
7dfffee0002e
[  152.231299] Mem abort info:
[  152.236898]   ESR = 0x9646
[  152.243019]   Exception class = DABT (current EL), IL = 32 bits
[  152.254905]   SET = 0, FnV = 0
[  152.261024]   EA = 0, S1PTW = 0
[  152.267320] Data abort info:
[  152.273091]   ISV = 0, ISS = 0x0046
[  152.280784]   CM = 0, WnR = 1
[  152.286730] swapper pgtable: 4k pages, 48-bit VAs, pgdp = (ptrval)
[  152.300537] [7dfffee0002e] pgd=0141c003, pud=0141d003, 
pmd=
[  152.318016] Internal error: Oops: 9646 [#1] PREEMPT SMP
[  152.329199] Modules linked in: f71882fg(+)
[  152.337415] CPU: 8 PID: 2732 Comm: insmod Not tainted 
5.1.0-rc1-2-gab1a0e9200b8-dirty #102
[  152.354712] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 
Nemo 2.0 RC0 04/18/2018
[  152.373058] pstate: 8005 (Nzcv daif -PAN -UAO)
[  152.382675] pc : logic_outb+0x54/0xb8
[  152.390017] lr : f71882fg_find+0x64/0x390 [f71882fg]
[  152.399977] sp : 13393aa0
[  152.406618] x29: 13393aa0 x28: 08b98b10
[  152.417278] x27: 13393df0 x26: 0100
[  152.427938] x25: 801f8c872d30 x24: 1142
[  152.438598] x23: 801fb49d2940 x22: 11291000
[  152.449257] x21: 002e x20: 0087
[  152.459917] x19: 13393b44 x18: 
[  152.470577] x17:  x16: 
[  152.481236] x15: 1127d6c8 x14: 801f8cfd691c
[  152.491896] x13:  x12: 
[  152.502555] x11: 0003 x10: 801feace2000
[  152.513215] x9 :  x8 : 841fa654f280
[  152.523874] x7 :  x6 : 00ffc0e3
[  152.534534] x5 : 11291360 x4 : 801fb4949f00
[  152.545194] x3 : 00ffbffe x2 : 76e767a63713d500
[  152.555853] x1 : 7dfffee0002e x0 : 7dfffee0
[  152.566514] Process insmod (pid: 2732, stack limit = 0x(ptrval))
[  152.579968] Call trace:
[  152.584863]  logic_outb+0x54/0xb8
[  152.591506]  f71882fg_find+0x64/0x390 [f71882fg]
[  152.600768]  f71882fg_init+0x38/0xc70 [f71882fg]
[  152.610031]  do_one_initcall+0x5c/0x198
[  152.617723]  do_init_module+0x54/0x1b0
[  152.625237]  load_module+0x1dc4/0x2158
[  152.632752]  __se_sys_init_module+0x14c/0x1e8
[  152.641490]  __arm64_sys_init_module+0x18/0x20
[  152.650404]  

Re: [PATCH v3] kmemleaak: survive in a low-memory situation

2019-03-26 Thread Michal Hocko
On Tue 26-03-19 16:20:41, Catalin Marinas wrote:
> On Tue, Mar 26, 2019 at 09:05:36AM -0700, Matthew Wilcox wrote:
> > On Tue, Mar 26, 2019 at 11:43:38AM -0400, Qian Cai wrote:
> > > Unless there is a brave soul to reimplement kmemleak to embed its
> > > metadata into the tracked memory itself in the foreseeable future, this
> > > provides a good balance between enabling kmemleak in a low-memory
> > > situation and not introducing too much hackiness into the existing
> > > code for now.
> > 
> > I don't understand kmemleak.  Kirill pointed me at this a few days ago:
> > 
> > https://gist.github.com/kiryl/3225e235fea390aa2e49bf625bbe83ec
> > 
> > It's caused by the XArray allocating memory using GFP_NOWAIT | __GFP_NOWARN.
> > kmemleak then decides it needs to allocate memory to track this memory.
> > So it calls kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
> > 
> > #define gfp_kmemleak_mask(gfp)  (((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
> >  __GFP_NORETRY | __GFP_NOMEMALLOC | \
> >  __GFP_NOWARN | __GFP_NOFAIL)
> > 
> > then the page allocator gets to see GFP_NOFAIL | GFP_NOWAIT and gets angry.
> > 
> > But I don't understand why kmemleak needs to mess with the GFP flags at
> > all.
> 
> Originally, it was just preserving GFP_KERNEL | GFP_ATOMIC. Starting
> with commit 6ae4bd1f0bc4 ("kmemleak: Allow kmemleak metadata allocations
> to fail"), this mask changed, aimed at making kmemleak allocation
> failures less verbose (i.e. just disable it since it's a debug tool).
> 
> Commit d9570ee3bd1d ("kmemleak: allow to coexist with fault injection")
> introduced __GFP_NOFAIL but this came with its own problems which have
> been previously reported (the warning you mentioned is another one of
> these). We didn't get to any clear conclusion on how best to allow
> allocations to fail with fault injection but not for the kmemleak
> metadata. Your suggestion below would probably do the trick.

I have objected to that on several occasions. An implicit __GFP_NOFAIL
is simply broken and __GFP_NOWAIT allocations are a shiny example of
that. You cannot loop inside the allocator for an unbound amount of time
potentially with locks held. I have heard that there are some plans to
deal with that but nothing has really materialized AFAIK. d9570ee3bd1d
should be reverted I believe.

The proper way around is to keep a pool of objects and keep spare objects
for restricted allocation contexts.
-- 
Michal Hocko
SUSE Labs
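
As an illustration of the spare-object-pool idea mentioned above, here is a
small sketch in plain C. All names are made up and locking for the pool is
omitted; this is not kmemleak code, just the general pattern: allocate normally
when the context allows it, otherwise fall back to a pool that was pre-filled
from a safe context.

	#include <stddef.h>
	#include <stdlib.h>

	#define SPARE_POOL_SIZE 16

	struct tracked_object { void *ptr; size_t size; };

	static struct tracked_object *spare_pool[SPARE_POOL_SIZE];
	static int spare_count;

	/* Refill from a context where sleeping allocations are allowed. */
	static void refill_spare_pool(void)
	{
		while (spare_count < SPARE_POOL_SIZE) {
			struct tracked_object *obj = malloc(sizeof(*obj));

			if (!obj)
				break;
			spare_pool[spare_count++] = obj;
		}
	}

	/* In a restricted (atomic) context, take a pre-allocated spare object
	 * instead of looping in the allocator with __GFP_NOFAIL. */
	static struct tracked_object *alloc_object(int atomic_context)
	{
		struct tracked_object *obj = NULL;

		if (!atomic_context)
			obj = malloc(sizeof(*obj));
		if (!obj && spare_count > 0)
			obj = spare_pool[--spare_count];
		return obj;
	}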


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Christian Brauner
On Tue, Mar 26, 2019 at 05:23:37PM +0100, Christian Brauner wrote:
> On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> > Thanks for the patch.
> > 
> > On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner  
> > wrote:
> > >
> > > The pidctl() syscall builds on, extends, and improves translate_pid()
> > > [4]. I quote Konstantin's original patchset first, which has already been
> > > acked and picked up by Eric before and whose functionality is preserved in
> > > this syscall:
> > 
> > We still haven't had a much-needed conversation about splitting this
> > system call into smaller logical operations. It's important that we
> > address this point before this patch is merged and becomes permanent
> > kernel ABI.
> 
> I don't particularly mind splitting this into an additional syscall like
> e.g.  pidfd_open() but then we have - and yes, I know you'll say
> syscalls are cheap - translate_pid(), and pidfd_open(). What I like
> about this rn is that it connects both apis in a single syscall
> and allows pidfd retrieval across pid namespaces. So I guess we'll see
> what other people think.

There's something to be said for

pidfd_open(pid_t pid, int pidfd, unsigned int flags);

/* get pidfd */
int pidfd = pidfd_open(1234, -1, 0);

/* convert to procfd */
int procfd = pidfd_open(-1, 4, 0);

/* convert to pidfd */
int pidfd = pidfd_open(4, -1, 0);


Re: linux-next: Fixes tag needs some work in the btrfs-kdave tree

2019-03-26 Thread David Sterba
On Tue, Mar 26, 2019 at 07:09:24AM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> In commit
> 
>   167ab7e5ebbf ("btrfs: reloc: Fix NULL pointer dereference due to expanded 
> reloc_root lifespan")
> 
> Fixes tag
> 
>   Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after 
> merge_reloc_roots()")
> 
> has these problem(s):
> 
>   - Subject does not match target commit subject
> Just use
>   git log -1 --format='Fixes: %h (%s)'

Will be fixed in the next update, thanks for the report. There was a
reference from another repository that had the wrong commit id, which I fixed,
but I did not notice the slightly different subject with ().


Re: [PATCH -next] x86/apic: Reduce print level of CPU limit announcement

2019-03-26 Thread Rafael J. Wysocki
On Tue, Mar 26, 2019 at 4:32 PM Leon Romanovsky  wrote:
>
> On Tue, Mar 26, 2019 at 04:12:27PM +0100, Rafael J. Wysocki wrote:
> > On Tue, Mar 26, 2019 at 3:41 PM Leon Romanovsky  wrote:
> > >
> > > On Tue, Mar 26, 2019 at 01:29:54PM +0100, Rafael J. Wysocki wrote:
> > > > On Tue, Mar 26, 2019 at 1:02 PM Leon Romanovsky  wrote:
> > > > >
> > > > > From: Leon Romanovsky 
> > > > >
> > > > > A kernel booted with fewer possible CPUs (possible_cpus kernel boot
> > > > > option) than available CPUs will have prints like this:
> > > > >
> > > > > [1.131039] APIC: NR_CPUS/possible_cpus limit of 8 reached. 
> > > > > Processor 55/0x1f ignored.
> > > > > [1.132228] ACPI: Unable to map lapic to logical cpu number
> > > > >
> > > > > Those warnings are printed for every not-enabled CPU, and on systems
> > > > > with a large number of such CPUs we see a lot of those prints at the
> > > > > default print level.
> > > > >
> > > > > Simple conversion of those prints to be in debug level removes them
> > > > > while leaving the option to debug the system.
> > > >
> > > > But generally dynamic debug must be enabled in order for pr_debug()
> > > > prints to be visible which is kind of cumbersome to do via the command
> > > > line.
> > >
> > > It is doable and documented pretty well, which is uncommon :)
> > > https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html#debug-messages-during-boot-process
> >
> > I know.
> >
> > That's what I mean by "kind of cumbersome", because you need to know
> > which debug messages to enable upfront.
> >
> > > >
> > > > > Signed-off-by: Leon Romanovsky 
> > > > > ---
> > > > >  arch/x86/kernel/acpi/boot.c | 2 +-
> > > > >  arch/x86/kernel/apic/apic.c | 6 +++---
> > > > >  2 files changed, 4 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> > > > > index 8dcbf6890714..3ef8ab89c02d 100644
> > > > > --- a/arch/x86/kernel/acpi/boot.c
> > > > > +++ b/arch/x86/kernel/acpi/boot.c
> > > > > @@ -770,7 +770,7 @@ int acpi_map_cpu(acpi_handle handle, phys_cpuid_t 
> > > > > physid, u32 acpi_id,
> > > > >
> > > > > cpu = acpi_register_lapic(physid, acpi_id, ACPI_MADT_ENABLED);
> > > > > if (cpu < 0) {
> > > > > -   pr_info(PREFIX "Unable to map lapic to logical cpu 
> > > > > number\n");
> > > > > +   pr_debug(PREFIX "Unable to map lapic to logical cpu 
> > > > > number\n");
> > > >
> > > > And this one is printed sometimes when something really goes wrong
> > > > which may be really hard to debug otherwise, so there is value in the
> > > > info level here.
> > > >
> > > > Would it be possible to avoid printing it just in some cases?
> > >
> > > This can do the trick:
> > >
> > > diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> > > index 3ef8ab89c02d..00212b3991e0 100644
> > > --- a/arch/x86/kernel/acpi/boot.c
> > > +++ b/arch/x86/kernel/acpi/boot.c
> > > @@ -770,7 +770,10 @@ int acpi_map_cpu(acpi_handle handle, phys_cpuid_t 
> > > physid, u32 acpi_id,
> > >
> > > cpu = acpi_register_lapic(physid, acpi_id, ACPI_MADT_ENABLED);
> > > if (cpu < 0) {
> > > -   pr_debug(PREFIX "Unable to map lapic to logical cpu 
> > > number\n");
> > > +   if (cpu == -ENOENT)
> > > +   pr_debug(PREFIX "Unable to map lapic to logical 
> > > cpu number\n");
> >
> > I don't think it is necessary to print this in the -ENOENT case, as
> > there is a message for that case that will be printed anyway.
>
> Agree, how do you want me to progress? Should I resend patch?

Yes, please.


Re: [PATCH v3] kmemleaak: survive in a low-memory situation

2019-03-26 Thread Qian Cai



On 3/26/19 12:00 PM, Christopher Lameter wrote:
>> + */
>> +gfp = (in_atomic() || irqs_disabled()) ? GFP_ATOMIC :
>> +   gfp_kmemleak_mask(gfp) | __GFP_DIRECT_RECLAIM;
>> +object = kmem_cache_alloc(object_cache, gfp);
>> +}
>> +
>>  if (!object) {
> 
> If the alloc must succeed then this check is no longer necessary.

Well, GFP_ATOMIC could still fail. It looks like the only thing that will never
fail is (__GFP_DIRECT_RECLAIM | __GFP_NOFAIL) as it keeps retrying in
__alloc_pages_slowpath().


Re: [PATCH 2/5] ARM: dts: imx50: Add Kobo Aura DTS

2019-03-26 Thread Jonathan Neuschäfer
Hi, thanks for your comments. I'll address them in v2.

On Fri, Mar 22, 2019 at 09:31:53AM +0800, Shawn Guo wrote:
> On Tue, Mar 19, 2019 at 04:24:17PM +0100, Jonathan Neuschäfer wrote:
> > The Kobo Aura is an e-book reader released in 2013.
[...]
> > +   sd2_pwrseq: pwrseq {
> > +   compatible = "mmc-pwrseq-simple";
> > +   pinctrl-names = "default";
> > +   pinctrl-0 = <_sd2_reset>;
> > +
> 
> Please do not have random newlines.

Does that apply to all empty lines between properties?

> 
> > +   reset-gpios = < 17 GPIO_ACTIVE_LOW>;
> > +   };
> > +
[...]
> > + {
> > +   pinctrl_uart2: uart2 {
> > +   fsl,pins = <
> > +   MX50_PAD_UART2_TXD__UART2_TXD_MUX   0x1e4
> > +   MX50_PAD_UART2_RXD__UART2_RXD_MUX   0x1e4
> > +   >;
> > +   };
> > +
> > +   pinctrl_i2c1: i2c1 {
> 
> Please sort these pinctrl nodes alphabetically.

It doesn't make a difference here, but should I generally sort by name
or by label in cases like this one?


Jonathan Neuschäfer


signature.asc
Description: PGP signature


[PATCH v6 14/19] x86: mm: Don't display pages which aren't present in debugfs

2019-03-26 Thread Steven Price
For the /sys/kernel/debug/page_tables/ files, rather than outputting a
mostly empty line when a block of memory isn't present, just skip the
line. This keeps the output shorter and will help with a future change
switching to using the generic page walk code as we no longer care about
the 'level' that the page table holes are at.

Signed-off-by: Steven Price 
---
 arch/x86/mm/dump_pagetables.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index ca270fb00805..e2b53db92c34 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -304,8 +304,8 @@ static void note_page(struct seq_file *m, struct pg_state 
*st,
/*
 * Now print the actual finished series
 */
-   if (!st->marker->max_lines ||
-   st->lines < st->marker->max_lines) {
+   if ((cur & _PAGE_PRESENT) && (!st->marker->max_lines ||
+   st->lines < st->marker->max_lines)) {
pt_dump_seq_printf(m, st->to_dmesg,
   "0x%0*lx-0x%0*lx   ",
   width, st->start_address,
@@ -321,7 +321,8 @@ static void note_page(struct seq_file *m, struct pg_state 
*st,
printk_prot(m, st->current_prot, st->level,
st->to_dmesg);
}
-   st->lines++;
+   if (cur & _PAGE_PRESENT)
+   st->lines++;
 
/*
 * We print markers for special areas of address space,
-- 
2.20.1



[PATCH v6 19/19] x86: mm: Convert dump_pagetables to use walk_page_range

2019-03-26 Thread Steven Price
Make use of the new functionality in walk_page_range to remove the
arch page walking code and use the generic code to walk the page tables.

The effective permissions are passed down the chain using new fields
in struct pg_state.

The KASAN optimisation is implemented by including test_p?d callbacks
which can decide to skip an entire tree of entries.

Signed-off-by: Steven Price 
---
 arch/x86/mm/dump_pagetables.c | 280 ++
 1 file changed, 146 insertions(+), 134 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index c0fbb9e5a790..f6b814aaddf7 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -33,6 +33,10 @@ struct pg_state {
int level;
pgprot_t current_prot;
pgprotval_t effective_prot;
+   pgprotval_t effective_prot_pgd;
+   pgprotval_t effective_prot_p4d;
+   pgprotval_t effective_prot_pud;
+   pgprotval_t effective_prot_pmd;
unsigned long start_address;
unsigned long current_address;
const struct addr_marker *marker;
@@ -356,22 +360,21 @@ static inline pgprotval_t effective_prot(pgprotval_t 
prot1, pgprotval_t prot2)
   ((prot1 | prot2) & _PAGE_NX);
 }
 
-static void walk_pte_level(struct pg_state *st, pmd_t addr, pgprotval_t eff_in,
-  unsigned long P)
+static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
+   unsigned long next, struct mm_walk *walk)
 {
-   int i;
-   pte_t *pte;
-   pgprotval_t prot, eff;
-
-   for (i = 0; i < PTRS_PER_PTE; i++) {
-   st->current_address = normalize_addr(P + i * PTE_LEVEL_MULT);
-   pte = pte_offset_map(, st->current_address);
-   prot = pte_flags(*pte);
-   eff = effective_prot(eff_in, prot);
-   note_page(st, __pgprot(prot), eff, 5);
-   pte_unmap(pte);
-   }
+   struct pg_state *st = walk->private;
+   pgprotval_t eff, prot;
+
+   st->current_address = normalize_addr(addr);
+
+   prot = pte_flags(*pte);
+   eff = effective_prot(st->effective_prot_pmd, prot);
+   note_page(st, __pgprot(prot), eff, 5);
+
+   return 0;
 }
+
 #ifdef CONFIG_KASAN
 
 /*
@@ -400,131 +403,152 @@ static inline bool kasan_page_table(struct pg_state 
*st, void *pt)
 }
 #endif
 
-#if PTRS_PER_PMD > 1
-
-static void walk_pmd_level(struct pg_state *st, pud_t addr,
-  pgprotval_t eff_in, unsigned long P)
+static int ptdump_test_pmd(unsigned long addr, unsigned long next,
+  pmd_t *pmd, struct mm_walk *walk)
 {
-   int i;
-   pmd_t *start, *pmd_start;
-   pgprotval_t prot, eff;
-
-   pmd_start = start = (pmd_t *)pud_page_vaddr(addr);
-   for (i = 0; i < PTRS_PER_PMD; i++) {
-   st->current_address = normalize_addr(P + i * PMD_LEVEL_MULT);
-   if (!pmd_none(*start)) {
-   prot = pmd_flags(*start);
-   eff = effective_prot(eff_in, prot);
-   if (pmd_large(*start) || !pmd_present(*start)) {
-   note_page(st, __pgprot(prot), eff, 4);
-   } else if (!kasan_page_table(st, pmd_start)) {
-   walk_pte_level(st, *start, eff,
-  P + i * PMD_LEVEL_MULT);
-   }
-   } else
-   note_page(st, __pgprot(0), 0, 4);
-   start++;
-   }
+   struct pg_state *st = walk->private;
+
+   st->current_address = normalize_addr(addr);
+
+   if (kasan_page_table(st, pmd))
+   return 1;
+   return 0;
 }
 
-#else
-#define walk_pmd_level(s,a,e,p) walk_pte_level(s,__pmd(pud_val(a)),e,p)
-#undef pud_large
-#define pud_large(a) pmd_large(__pmd(pud_val(a)))
-#define pud_none(a)  pmd_none(__pmd(pud_val(a)))
-#endif
+static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
+   unsigned long next, struct mm_walk *walk)
+{
+   struct pg_state *st = walk->private;
+   pgprotval_t eff, prot;
+
+   prot = pmd_flags(*pmd);
+   eff = effective_prot(st->effective_prot_pud, prot);
+
+   st->current_address = normalize_addr(addr);
+
+   if (pmd_large(*pmd))
+   note_page(st, __pgprot(prot), eff, 4);
 
-#if PTRS_PER_PUD > 1
+   st->effective_prot_pmd = eff;
 
-static void walk_pud_level(struct pg_state *st, p4d_t addr, pgprotval_t eff_in,
-  unsigned long P)
+   return 0;
+}
+
+static int ptdump_test_pud(unsigned long addr, unsigned long next,
+  pud_t *pud, struct mm_walk *walk)
 {
-   int i;
-   pud_t *start, *pud_start;
-   pgprotval_t prot, eff;
-
-   pud_start = start = (pud_t *)p4d_page_vaddr(addr);
-
-   for (i = 0; i < PTRS_PER_PUD; i++) {
-   st->current_address = normalize_addr(P + 

[PATCH v6 15/19] x86: mm: Point to struct seq_file from struct pg_state

2019-03-26 Thread Steven Price
mm/dump_pagetables.c passes both struct seq_file and struct pg_state
down the chain of walk_*_level() functions to be passed to note_page().
Instead place the struct seq_file in struct pg_state and access it from
struct pg_state (which is private to this file) in note_page().

Signed-off-by: Steven Price 
---
 arch/x86/mm/dump_pagetables.c | 69 ++-
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index e2b53db92c34..3d12ac031144 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -40,6 +40,7 @@ struct pg_state {
bool to_dmesg;
bool check_wx;
unsigned long wx_pages;
+   struct seq_file *seq;
 };
 
 struct addr_marker {
@@ -268,11 +269,12 @@ static void note_wx(struct pg_state *st)
  * of PTE entries; the next one is different so we need to
  * print what we collected so far.
  */
-static void note_page(struct seq_file *m, struct pg_state *st,
- pgprot_t new_prot, pgprotval_t new_eff, int level)
+static void note_page(struct pg_state *st, pgprot_t new_prot,
+ pgprotval_t new_eff, int level)
 {
pgprotval_t prot, cur, eff;
static const char units[] = "BKMGTPE";
+   struct seq_file *m = st->seq;
 
/*
 * If we have a "break" in the series, we need to flush the state that
@@ -358,8 +360,8 @@ static inline pgprotval_t effective_prot(pgprotval_t prot1, 
pgprotval_t prot2)
   ((prot1 | prot2) & _PAGE_NX);
 }
 
-static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr,
-  pgprotval_t eff_in, unsigned long P)
+static void walk_pte_level(struct pg_state *st, pmd_t addr, pgprotval_t eff_in,
+  unsigned long P)
 {
int i;
pte_t *pte;
@@ -370,7 +372,7 @@ static void walk_pte_level(struct seq_file *m, struct 
pg_state *st, pmd_t addr,
pte = pte_offset_map(&addr, st->current_address);
prot = pte_flags(*pte);
eff = effective_prot(eff_in, prot);
-   note_page(m, st, __pgprot(prot), eff, 5);
+   note_page(st, __pgprot(prot), eff, 5);
pte_unmap(pte);
}
 }
@@ -383,22 +385,20 @@ static void walk_pte_level(struct seq_file *m, struct 
pg_state *st, pmd_t addr,
  * us dozens of seconds (minutes for 5-level config) while checking for
  * W+X mapping or reading kernel_page_tables debugfs file.
  */
-static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,
-   void *pt)
+static inline bool kasan_page_table(struct pg_state *st, void *pt)
 {
if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
(pgtable_l5_enabled() &&
__pa(pt) == __pa(kasan_early_shadow_p4d)) ||
__pa(pt) == __pa(kasan_early_shadow_pud)) {
pgprotval_t prot = pte_flags(kasan_early_shadow_pte[0]);
-   note_page(m, st, __pgprot(prot), 0, 5);
+   note_page(st, __pgprot(prot), 0, 5);
return true;
}
return false;
 }
 #else
-static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,
-   void *pt)
+static inline bool kasan_page_table(struct pg_state *st, void *pt)
 {
return false;
 }
@@ -406,7 +406,7 @@ static inline bool kasan_page_table(struct seq_file *m, 
struct pg_state *st,
 
 #if PTRS_PER_PMD > 1
 
-static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,
+static void walk_pmd_level(struct pg_state *st, pud_t addr,
   pgprotval_t eff_in, unsigned long P)
 {
int i;
@@ -420,19 +420,19 @@ static void walk_pmd_level(struct seq_file *m, struct 
pg_state *st, pud_t addr,
prot = pmd_flags(*start);
eff = effective_prot(eff_in, prot);
if (pmd_large(*start) || !pmd_present(*start)) {
-   note_page(m, st, __pgprot(prot), eff, 4);
-   } else if (!kasan_page_table(m, st, pmd_start)) {
-   walk_pte_level(m, st, *start, eff,
+   note_page(st, __pgprot(prot), eff, 4);
+   } else if (!kasan_page_table(st, pmd_start)) {
+   walk_pte_level(st, *start, eff,
   P + i * PMD_LEVEL_MULT);
}
} else
-   note_page(m, st, __pgprot(0), 0, 4);
+   note_page(st, __pgprot(0), 0, 4);
start++;
}
 }
 
 #else
-#define walk_pmd_level(m,s,a,e,p) walk_pte_level(m,s,__pmd(pud_val(a)),e,p)
+#define walk_pmd_level(s,a,e,p) walk_pte_level(s,__pmd(pud_val(a)),e,p)
 #undef pud_large
 #define pud_large(a) pmd_large(__pmd(pud_val(a)))
 #define pud_none(a) 

[PATCH v6 13/19] arm64: mm: Convert mm/dump.c to use walk_page_range()

2019-03-26 Thread Steven Price
Now walk_page_range() can walk kernel page tables, we can switch the
arm64 ptdump code over to using it, simplifying the code.
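
For orientation, the shape of the change is that walk_pgd() just registers
the pud_entry/pmd_entry/pte_entry/pte_hole callbacks from this patch with
the generic walker; a simplified sketch (the explicit end parameter and the
missing locking are liberties taken here, not the literal hunk):

        static void walk_pgd(struct pg_state *st, struct mm_struct *mm,
                             unsigned long start, unsigned long end)
        {
                struct mm_walk walk = {
                        .mm        = mm,
                        .private   = st,
                        .pud_entry = pud_entry,
                        .pmd_entry = pmd_entry,
                        .pte_entry = pte_entry,
                        .pte_hole  = pte_hole,
                };

                /* the hand-rolled pgd/pud/pmd/pte loops collapse into this */
                walk_page_range(start, end, &walk);
        }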

Signed-off-by: Steven Price 
---
 arch/arm64/mm/dump.c | 117 ++-
 1 file changed, 59 insertions(+), 58 deletions(-)

diff --git a/arch/arm64/mm/dump.c b/arch/arm64/mm/dump.c
index 14fe23cd5932..ea20c1213498 100644
--- a/arch/arm64/mm/dump.c
+++ b/arch/arm64/mm/dump.c
@@ -72,7 +72,7 @@ struct pg_state {
struct seq_file *seq;
const struct addr_marker *marker;
unsigned long start_address;
-   unsigned level;
+   int level;
u64 current_prot;
bool check_wx;
unsigned long wx_pages;
@@ -234,11 +234,14 @@ static void note_prot_wx(struct pg_state *st, unsigned 
long addr)
st->wx_pages += (addr - st->start_address) / PAGE_SIZE;
 }
 
-static void note_page(struct pg_state *st, unsigned long addr, unsigned level,
+static void note_page(struct pg_state *st, unsigned long addr, int level,
u64 val)
 {
static const char units[] = "KMGTPE";
-   u64 prot = val & pg_level[level].mask;
+   u64 prot = 0;
+
+   if (level >= 0)
+   prot = val & pg_level[level].mask;
 
if (!st->level) {
st->level = level;
@@ -286,73 +289,71 @@ static void note_page(struct pg_state *st, unsigned long 
addr, unsigned level,
 
 }
 
-static void walk_pte(struct pg_state *st, pmd_t *pmdp, unsigned long start,
-unsigned long end)
+static int pud_entry(pud_t *pud, unsigned long addr,
+   unsigned long next, struct mm_walk *walk)
 {
-   unsigned long addr = start;
-   pte_t *ptep = pte_offset_kernel(pmdp, start);
+   struct pg_state *st = walk->private;
+   pud_t val = READ_ONCE(*pud);
+
+   if (pud_table(val))
+   return 0;
+
+   note_page(st, addr, 2, pud_val(val));
 
-   do {
-   note_page(st, addr, 4, READ_ONCE(pte_val(*ptep)));
-   } while (ptep++, addr += PAGE_SIZE, addr != end);
+   return 0;
 }
 
-static void walk_pmd(struct pg_state *st, pud_t *pudp, unsigned long start,
-unsigned long end)
+static int pmd_entry(pmd_t *pmd, unsigned long addr,
+   unsigned long next, struct mm_walk *walk)
 {
-   unsigned long next, addr = start;
-   pmd_t *pmdp = pmd_offset(pudp, start);
-
-   do {
-   pmd_t pmd = READ_ONCE(*pmdp);
-   next = pmd_addr_end(addr, end);
-
-   if (pmd_none(pmd) || pmd_sect(pmd)) {
-   note_page(st, addr, 3, pmd_val(pmd));
-   } else {
-   BUG_ON(pmd_bad(pmd));
-   walk_pte(st, pmdp, addr, next);
-   }
-   } while (pmdp++, addr = next, addr != end);
+   struct pg_state *st = walk->private;
+   pmd_t val = READ_ONCE(*pmd);
+
+   if (pmd_table(val))
+   return 0;
+
+   note_page(st, addr, 3, pmd_val(val));
+
+   return 0;
 }
 
-static void walk_pud(struct pg_state *st, pgd_t *pgdp, unsigned long start,
-unsigned long end)
+static int pte_entry(pte_t *pte, unsigned long addr,
+   unsigned long next, struct mm_walk *walk)
 {
-   unsigned long next, addr = start;
-   pud_t *pudp = pud_offset(pgdp, start);
-
-   do {
-   pud_t pud = READ_ONCE(*pudp);
-   next = pud_addr_end(addr, end);
-
-   if (pud_none(pud) || pud_sect(pud)) {
-   note_page(st, addr, 2, pud_val(pud));
-   } else {
-   BUG_ON(pud_bad(pud));
-   walk_pmd(st, pudp, addr, next);
-   }
-   } while (pudp++, addr = next, addr != end);
+   struct pg_state *st = walk->private;
+   pte_t val = READ_ONCE(*pte);
+
+   note_page(st, addr, 4, pte_val(val));
+
+   return 0;
+}
+
+static int pte_hole(unsigned long addr, unsigned long next,
+   struct mm_walk *walk)
+{
+   struct pg_state *st = walk->private;
+
+   note_page(st, addr, -1, 0);
+
+   return 0;
 }
 
 static void walk_pgd(struct pg_state *st, struct mm_struct *mm,
-unsigned long start)
+   unsigned long start)
 {
-   unsigned long end = (start < TASK_SIZE_64) ? TASK_SIZE_64 : 0;
-   unsigned long next, addr = start;
-   pgd_t *pgdp = pgd_offset(mm, start);
-
-   do {
-   pgd_t pgd = READ_ONCE(*pgdp);
-   next = pgd_addr_end(addr, end);
-
-   if (pgd_none(pgd)) {
-   note_page(st, addr, 1, pgd_val(pgd));
-   } else {
-   BUG_ON(pgd_bad(pgd));
-   walk_pud(st, pgdp, addr, next);
-   }
-   } while (pgdp++, addr = next, addr != end);
+   struct mm_walk walk = {
+   .mm = mm,
+   .private = st,
+   .pud_entry = 

[PATCH v6 16/19] x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct

2019-03-26 Thread Steven Price
To enable x86 to use the generic walk_page_range() function, the
callers of ptdump_walk_pgd_level() need to pass an mm_struct rather
than the raw pgd_t pointer. Luckily since commit 7e904a91bf60
("efi: Use efi_mm in x86 as well as ARM") we now have an mm_struct
for EFI on x86.

Signed-off-by: Steven Price 
---
 arch/x86/include/asm/pgtable.h | 2 +-
 arch/x86/mm/dump_pagetables.c  | 4 ++--
 arch/x86/platform/efi/efi_32.c | 2 +-
 arch/x86/platform/efi/efi_64.c | 4 ++--
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0dd04cf6ebeb..579959750f34 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -27,7 +27,7 @@
 extern pgd_t early_top_pgt[PTRS_PER_PGD];
 int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
 
-void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd);
+void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
 void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user);
 void ptdump_walk_pgd_level_checkwx(void);
 void ptdump_walk_user_pgd_level_checkwx(void);
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 3d12ac031144..ddf8ea6b059d 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -574,9 +574,9 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, 
pgd_t *pgd,
pr_info("x86/mm: Checked W+X mappings: passed, no W+X pages 
found.\n");
 }
 
-void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
+void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
 {
-   ptdump_walk_pgd_level_core(m, pgd, false, true);
+   ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
 }
 
 void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
diff --git a/arch/x86/platform/efi/efi_32.c b/arch/x86/platform/efi/efi_32.c
index 9959657127f4..9175ceaa6e72 100644
--- a/arch/x86/platform/efi/efi_32.c
+++ b/arch/x86/platform/efi/efi_32.c
@@ -49,7 +49,7 @@ void efi_sync_low_kernel_mappings(void) {}
 void __init efi_dump_pagetable(void)
 {
 #ifdef CONFIG_EFI_PGT_DUMP
-   ptdump_walk_pgd_level(NULL, swapper_pg_dir);
+   ptdump_walk_pgd_level(NULL, &init_mm);
 #endif
 }
 
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index cf0347f61b21..a2e0f9800190 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -611,9 +611,9 @@ void __init efi_dump_pagetable(void)
 {
 #ifdef CONFIG_EFI_PGT_DUMP
if (efi_enabled(EFI_OLD_MEMMAP))
-   ptdump_walk_pgd_level(NULL, swapper_pg_dir);
+   ptdump_walk_pgd_level(NULL, &init_mm);
else
-   ptdump_walk_pgd_level(NULL, efi_mm.pgd);
+   ptdump_walk_pgd_level(NULL, &efi_mm);
 #endif
 }
 
-- 
2.20.1



[PATCH v6 18/19] x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct

2019-03-26 Thread Steven Price
An mm_struct is needed to enable x86 to make use of the generic
walk_page_range() function.

In the case of walking the user page tables (when
CONFIG_PAGE_TABLE_ISOLATION is enabled), it is necessary to create a
fake_mm structure because there isn't an mm_struct with a pointer
to the pgd of the user page tables. This fake_mm structure is
initialised with the minimum necessary for the generic page walk code.

Signed-off-by: Steven Price 
---
 arch/x86/mm/dump_pagetables.c | 36 ---
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 40b3f1da6e15..c0fbb9e5a790 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -111,8 +111,6 @@ static struct addr_marker address_markers[] = {
[END_OF_SPACE_NR]   = { -1, NULL }
 };
 
-#define INIT_PGD   ((pgd_t *) _top_pgt)
-
 #else /* CONFIG_X86_64 */
 
 enum address_markers_idx {
@@ -147,8 +145,6 @@ static struct addr_marker address_markers[] = {
[END_OF_SPACE_NR]   = { -1, NULL }
 };
 
-#define INIT_PGD   (swapper_pg_dir)
-
 #endif /* !CONFIG_X86_64 */
 
 /* Multipliers for offsets within the PTEs */
@@ -522,10 +518,10 @@ static inline bool is_hypervisor_range(int idx)
 #endif
 }
 
-static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
+static void ptdump_walk_pgd_level_core(struct seq_file *m, struct mm_struct 
*mm,
   bool checkwx, bool dmesg)
 {
-   pgd_t *start = pgd;
+   pgd_t *start = mm->pgd;
pgprotval_t prot, eff;
int i;
struct pg_state st = {};
@@ -572,39 +568,49 @@ static void ptdump_walk_pgd_level_core(struct seq_file 
*m, pgd_t *pgd,
 
 void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
 {
-   ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
+   ptdump_walk_pgd_level_core(m, mm, false, true);
 }
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+static void ptdump_walk_pgd_level_user_core(struct seq_file *m,
+   struct mm_struct *mm,
+   bool checkwx, bool dmesg)
+{
+   struct mm_struct fake_mm = {
+   .pgd = kernel_to_user_pgdp(mm->pgd)
+   };
+   init_rwsem(&fake_mm.mmap_sem);
+   ptdump_walk_pgd_level_core(m, &fake_mm, checkwx, dmesg);
+}
+#endif
+
 void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
   bool user)
 {
-   pgd_t *pgd = mm->pgd;
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
if (user && static_cpu_has(X86_FEATURE_PTI))
-   pgd = kernel_to_user_pgdp(pgd);
+   ptdump_walk_pgd_level_user_core(m, mm, false, false);
+   else
 #endif
-   ptdump_walk_pgd_level_core(m, pgd, false, false);
+   ptdump_walk_pgd_level_core(m, mm, false, false);
 }
 EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);
 
 void ptdump_walk_user_pgd_level_checkwx(void)
 {
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
-   pgd_t *pgd = INIT_PGD;
-
if (!(__supported_pte_mask & _PAGE_NX) ||
!static_cpu_has(X86_FEATURE_PTI))
return;
 
pr_info("x86/mm: Checking user space page tables\n");
-   pgd = kernel_to_user_pgdp(pgd);
-   ptdump_walk_pgd_level_core(NULL, pgd, true, false);
+   ptdump_walk_pgd_level_user_core(NULL, &init_mm, true, false);
 #endif
 }
 
 void ptdump_walk_pgd_level_checkwx(void)
 {
-   ptdump_walk_pgd_level_core(NULL, INIT_PGD, true, false);
+   ptdump_walk_pgd_level_core(NULL, &init_mm, true, false);
 }
 
 static int __init pt_dump_init(void)
-- 
2.20.1



[PATCH v6 17/19] x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct

2019-03-26 Thread Steven Price
To enable x86 to use the generic walk_page_range() function, the
callers of ptdump_walk_pgd_level_debugfs() need to pass in the mm_struct.

This means that ptdump_walk_pgd_level_core() is now always passed a
valid pgd, so drop the support for pgd==NULL.

Signed-off-by: Steven Price 
---
 arch/x86/include/asm/pgtable.h |  3 ++-
 arch/x86/mm/debug_pagetables.c |  8 
 arch/x86/mm/dump_pagetables.c  | 14 ++
 3 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 579959750f34..5abf693dc9b2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -28,7 +28,8 @@ extern pgd_t early_top_pgt[PTRS_PER_PGD];
 int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
 
 void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user);
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
+  bool user);
 void ptdump_walk_pgd_level_checkwx(void);
 void ptdump_walk_user_pgd_level_checkwx(void);
 
diff --git a/arch/x86/mm/debug_pagetables.c b/arch/x86/mm/debug_pagetables.c
index cd84f067e41d..824131052574 100644
--- a/arch/x86/mm/debug_pagetables.c
+++ b/arch/x86/mm/debug_pagetables.c
@@ -6,7 +6,7 @@
 
 static int ptdump_show(struct seq_file *m, void *v)
 {
-   ptdump_walk_pgd_level_debugfs(m, NULL, false);
+   ptdump_walk_pgd_level_debugfs(m, &init_mm, false);
return 0;
 }
 
@@ -16,7 +16,7 @@ static int ptdump_curknl_show(struct seq_file *m, void *v)
 {
if (current->mm->pgd) {
down_read(&current->mm->mmap_sem);
-   ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, false);
+   ptdump_walk_pgd_level_debugfs(m, current->mm, false);
up_read(&current->mm->mmap_sem);
}
return 0;
@@ -31,7 +31,7 @@ static int ptdump_curusr_show(struct seq_file *m, void *v)
 {
if (current->mm->pgd) {
down_read(&current->mm->mmap_sem);
-   ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, true);
+   ptdump_walk_pgd_level_debugfs(m, current->mm, true);
up_read(&current->mm->mmap_sem);
}
return 0;
@@ -46,7 +46,7 @@ static struct dentry *pe_efi;
 static int ptdump_efi_show(struct seq_file *m, void *v)
 {
if (efi_mm.pgd)
-   ptdump_walk_pgd_level_debugfs(m, efi_mm.pgd, false);
+   ptdump_walk_pgd_level_debugfs(m, &efi_mm, false);
return 0;
 }
 
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index ddf8ea6b059d..40b3f1da6e15 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -525,16 +525,12 @@ static inline bool is_hypervisor_range(int idx)
 static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
   bool checkwx, bool dmesg)
 {
-   pgd_t *start = INIT_PGD;
+   pgd_t *start = pgd;
pgprotval_t prot, eff;
int i;
struct pg_state st = {};
 
-   if (pgd) {
-   start = pgd;
-   st.to_dmesg = dmesg;
-   }
-
+   st.to_dmesg = dmesg;
st.check_wx = checkwx;
st.seq = m;
if (checkwx)
@@ -579,8 +575,10 @@ void ptdump_walk_pgd_level(struct seq_file *m, struct 
mm_struct *mm)
ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
 }
 
-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
+  bool user)
 {
+   pgd_t *pgd = mm->pgd;
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
if (user && static_cpu_has(X86_FEATURE_PTI))
pgd = kernel_to_user_pgdp(pgd);
@@ -606,7 +604,7 @@ void ptdump_walk_user_pgd_level_checkwx(void)
 
 void ptdump_walk_pgd_level_checkwx(void)
 {
-   ptdump_walk_pgd_level_core(NULL, NULL, true, false);
+   ptdump_walk_pgd_level_core(NULL, INIT_PGD, true, false);
 }
 
 static int __init pt_dump_init(void)
-- 
2.20.1



[PATCH v6 11/19] mm: pagewalk: Allow walking without vma

2019-03-26 Thread Steven Price
Since 48684a65b4e3: "mm: pagewalk: fix misbehavior of walk_page_range
for vma(VM_PFNMAP)", walk_page_range() will report any kernel area as
a hole, because it lacks a vma.

This means each arch has re-implemented page table walking when needed,
for example in the per-arch ptdump walker.

Remove the requirement to have a vma except when trying to split huge
pages.
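
To make the effect concrete, a minimal hypothetical user after this change
(show_pte(), show_hole() and dump_kernel_range() are made-up names) can walk
a kernel range by supplying only an mm and callbacks; no vma is involved:

        static int show_pte(pte_t *pte, unsigned long addr,
                            unsigned long next, struct mm_walk *walk)
        {
                pr_info("pte  %lx: %llx\n", addr,
                        (unsigned long long)pte_val(*pte));
                return 0;
        }

        static int show_hole(unsigned long addr, unsigned long next,
                             struct mm_walk *walk)
        {
                pr_info("hole %lx-%lx\n", addr, next);
                return 0;
        }

        static void dump_kernel_range(unsigned long start, unsigned long end)
        {
                struct mm_walk walk = {
                        .mm        = &init_mm, /* kernel page tables, no vma */
                        .pte_entry = show_pte,
                        .pte_hole  = show_hole,
                };

                down_read(&init_mm.mmap_sem);
                walk_page_range(start, end, &walk);
                up_read(&init_mm.mmap_sem);
        }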

Signed-off-by: Steven Price 
---
 mm/pagewalk.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 98373a9f88b8..dac0c848b458 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -36,7 +36,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, 
unsigned long end,
do {
 again:
next = pmd_addr_end(addr, end);
-   if (pmd_none(*pmd) || !walk->vma) {
+   if (pmd_none(*pmd)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
if (err)
@@ -59,9 +59,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, 
unsigned long end,
if (!walk->pte_entry)
continue;
 
-   split_huge_pmd(walk->vma, pmd, addr);
-   if (pmd_trans_unstable(pmd))
-   goto again;
+   if (walk->vma) {
+   split_huge_pmd(walk->vma, pmd, addr);
+   if (pmd_trans_unstable(pmd))
+   goto again;
+   } else if (pmd_large(*pmd)) {
+   continue;
+   }
+
err = walk_pte_range(pmd, addr, next, walk);
if (err)
break;
@@ -81,7 +86,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, 
unsigned long end,
do {
  again:
next = pud_addr_end(addr, end);
-   if (pud_none(*pud) || !walk->vma) {
+   if (pud_none(*pud)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
if (err)
@@ -95,9 +100,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, 
unsigned long end,
break;
}
 
-   split_huge_pud(walk->vma, pud, addr);
-   if (pud_none(*pud))
-   goto again;
+   if (walk->vma) {
+   split_huge_pud(walk->vma, pud, addr);
+   if (pud_none(*pud))
+   goto again;
+   } else if (pud_large(*pud)) {
+   continue;
+   }
 
if (walk->pmd_entry || walk->pte_entry)
err = walk_pmd_range(pud, addr, next, walk);
-- 
2.20.1



[PATCH v6 10/19] mm: pagewalk: Add p4d_entry() and pgd_entry()

2019-03-26 Thread Steven Price
pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were
no users. We're about to add users so reintroduce them, along with
p4d_entry() as we now have 5 levels of tables.

Note that commit a00cc7d9dd93d66a ("mm, x86: add support for
PUD-sized transparent hugepages") already re-added pud_entry() but with
different semantics to the other callbacks. Since there have never
been upstream users of this, revert the semantics back to match the
other callbacks. This means pud_entry() is called for all entries, not
just transparent huge pages.

Signed-off-by: Steven Price 
---
 include/linux/mm.h | 15 +--
 mm/pagewalk.c  | 27 ---
 2 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 76769749b5a5..f6de08c116e6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1367,15 +1367,14 @@ void unmap_vmas(struct mmu_gather *tlb, struct 
vm_area_struct *start_vma,
 
 /**
  * mm_walk - callbacks for walk_page_range
- * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry
- *this handler should only handle pud_trans_huge() puds.
- *the pmd_entry or pte_entry callbacks will be used for
- *regular PUDs.
- * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry
+ * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
+ * @p4d_entry: if set, called for each non-empty P4D entry
+ * @pud_entry: if set, called for each non-empty PUD entry
+ * @pmd_entry: if set, called for each non-empty PMD entry
  *this handler is required to be able to handle
  *pmd_trans_huge() pmds.  They may simply choose to
  *split_huge_page() instead of handling it explicitly.
- * @pte_entry: if set, called for each non-empty PTE (4th-level) entry
+ * @pte_entry: if set, called for each non-empty PTE (lowest-level) entry
  * @pte_hole: if set, called for each hole at all levels
  * @hugetlb_entry: if set, called for each hugetlb entry
  * @test_walk: caller specific callback function to determine whether
@@ -1390,6 +1389,10 @@ void unmap_vmas(struct mmu_gather *tlb, struct 
vm_area_struct *start_vma,
  * (see the comment on walk_page_range() for more details)
  */
 struct mm_walk {
+   int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
+unsigned long next, struct mm_walk *walk);
+   int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
+unsigned long next, struct mm_walk *walk);
int (*pud_entry)(pud_t *pud, unsigned long addr,
 unsigned long next, struct mm_walk *walk);
int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index c3084ff2569d..98373a9f88b8 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -90,15 +90,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, 
unsigned long end,
}
 
if (walk->pud_entry) {
-   spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);
-
-   if (ptl) {
-   err = walk->pud_entry(pud, addr, next, walk);
-   spin_unlock(ptl);
-   if (err)
-   break;
-   continue;
-   }
+   err = walk->pud_entry(pud, addr, next, walk);
+   if (err)
+   break;
}
 
split_huge_pud(walk->vma, pud, addr);
@@ -131,7 +125,12 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, 
unsigned long end,
break;
continue;
}
-   if (walk->pmd_entry || walk->pte_entry)
+   if (walk->p4d_entry) {
+   err = walk->p4d_entry(p4d, addr, next, walk);
+   if (err)
+   break;
+   }
+   if (walk->pud_entry || walk->pmd_entry || walk->pte_entry)
err = walk_pud_range(p4d, addr, next, walk);
if (err)
break;
@@ -157,7 +156,13 @@ static int walk_pgd_range(unsigned long addr, unsigned 
long end,
break;
continue;
}
-   if (walk->pmd_entry || walk->pte_entry)
+   if (walk->pgd_entry) {
+   err = walk->pgd_entry(pgd, addr, next, walk);
+   if (err)
+   break;
+   }
+   if (walk->p4d_entry || walk->pud_entry || walk->pmd_entry ||
+   walk->pte_entry)
err = walk_p4d_range(pgd, addr, next, walk);
if (err)
 

[PATCH v6 12/19] mm: pagewalk: Add test_p?d callbacks

2019-03-26 Thread Steven Price
It is useful to be able to skip parts of the page table tree even when
walking without VMAs. Add test_p?d callbacks similar to test_walk but
which are called just before a table at that level is walked. If the
callback returns a positive value, the entire table is skipped (a
negative return value aborts the walk).
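
A hypothetical test_pud() user (my_state, skip_start and skip_end are
made-up names), wired in via .test_pud in struct mm_walk:

        static int my_test_pud(unsigned long addr, unsigned long next,
                               pud_t *pud_start, struct mm_walk *walk)
        {
                struct my_state *st = walk->private;

                /* skip the whole PUD table covering an uninteresting region */
                if (addr >= st->skip_start && next <= st->skip_end)
                        return 1;

                return 0;
        }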

Signed-off-by: Steven Price 
---
 include/linux/mm.h | 11 +++
 mm/pagewalk.c  | 24 
 2 files changed, 35 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6de08c116e6..a4c1ed255455 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1382,6 +1382,11 @@ void unmap_vmas(struct mmu_gather *tlb, struct 
vm_area_struct *start_vma,
  * value means "do page table walk over the current vma,"
  * and a negative one means "abort current page table walk
  * right now." 1 means "skip the current vma."
+ * @test_pmd:  similar to test_walk(), but called for every pmd.
+ * @test_pud:  similar to test_walk(), but called for every pud.
+ * @test_p4d:  similar to test_walk(), but called for every p4d.
+ * Returning 0 means walk this part of the page tables,
+ * returning 1 means to skip this range.
  * @mm:mm_struct representing the target process of page table walk
  * @vma:   vma currently walked (NULL if walking outside vmas)
  * @private:   private data for callbacks' usage
@@ -1406,6 +1411,12 @@ struct mm_walk {
 struct mm_walk *walk);
int (*test_walk)(unsigned long addr, unsigned long next,
struct mm_walk *walk);
+   int (*test_pmd)(unsigned long addr, unsigned long next,
+   pmd_t *pmd_start, struct mm_walk *walk);
+   int (*test_pud)(unsigned long addr, unsigned long next,
+   pud_t *pud_start, struct mm_walk *walk);
+   int (*test_p4d)(unsigned long addr, unsigned long next,
+   p4d_t *p4d_start, struct mm_walk *walk);
struct mm_struct *mm;
struct vm_area_struct *vma;
void *private;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index dac0c848b458..231655db1295 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -32,6 +32,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, 
unsigned long end,
unsigned long next;
int err = 0;
 
+   if (walk->test_pmd) {
+   err = walk->test_pmd(addr, end, pmd_offset(pud, 0), walk);
+   if (err < 0)
+   return err;
+   if (err > 0)
+   return 0;
+   }
+
pmd = pmd_offset(pud, addr);
do {
 again:
@@ -82,6 +90,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, 
unsigned long end,
unsigned long next;
int err = 0;
 
+   if (walk->test_pud) {
+   err = walk->test_pud(addr, end, pud_offset(p4d, 0), walk);
+   if (err < 0)
+   return err;
+   if (err > 0)
+   return 0;
+   }
+
pud = pud_offset(p4d, addr);
do {
  again:
@@ -124,6 +140,14 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, 
unsigned long end,
unsigned long next;
int err = 0;
 
+   if (walk->test_p4d) {
+   err = walk->test_p4d(addr, end, p4d_offset(pgd, 0), walk);
+   if (err < 0)
+   return err;
+   if (err > 0)
+   return 0;
+   }
+
p4d = p4d_offset(pgd, addr);
do {
next = p4d_addr_end(addr, end);
-- 
2.20.1



[PATCH v6 09/19] mm: Add generic p?d_large() macros

2019-03-26 Thread Steven Price
Exposing the pud/pgd levels of the page tables to walk_page_range() means
we may come across the exotic large mappings that come with large areas
of contiguous memory (such as the kernel's linear map).

For architectures that don't provide p?d_large() macros, provide generic
do-nothing defaults.
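
The point of the defaults is that generic walkers can ask "is this entry a
leaf?" at any level; a sketch of a pmd_entry() callback using it
(my_pmd_entry() and handle_leaf() are hypothetical):

        static int my_pmd_entry(pmd_t *pmd, unsigned long addr,
                                unsigned long next, struct mm_walk *walk)
        {
                if (pmd_large(*pmd)) {
                        /* final mapping: there is no PTE level below this */
                        handle_leaf(walk->private, addr, next);
                        return 0;
                }

                /* not a leaf: the generic walker descends to the PTEs */
                return 0;
        }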

Signed-off-by: Steven Price 
---
 include/asm-generic/pgtable.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index fa782fba51ee..9c5d0f73db67 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1186,4 +1186,23 @@ static inline bool arch_has_pfn_modify_check(void)
 #define mm_pmd_folded(mm)  __is_defined(__PAGETABLE_PMD_FOLDED)
 #endif
 
+/*
+ * p?d_large() - true if this entry is a final mapping to a physical address.
+ * This differs from p?d_huge() by the fact that they are always available (if
+ * the architecture supports large pages at the appropriate level) even
+ * if CONFIG_HUGETLB_PAGE is not defined.
+ */
+#ifndef pgd_large
+#define pgd_large(x)   0
+#endif
+#ifndef p4d_large
+#define p4d_large(x)   0
+#endif
+#ifndef pud_large
+#define pud_large(x)   0
+#endif
+#ifndef pmd_large
+#define pmd_large(x)   0
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
-- 
2.20.1



[PATCH v6 08/19] x86: mm: Add p?d_large() definitions

2019-03-26 Thread Steven Price
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.

For x86 we already have static inline functions, so simply add #defines
to prevent the generic versions (added in a later patch) from being
picked up.

We also need to add corresponding #undefs in dump_pagetables.c. This
code will be removed when x86 is switched over to using the generic
pagewalk code in a later patch.

Signed-off-by: Steven Price 
---
 arch/x86/include/asm/pgtable.h | 5 +
 arch/x86/mm/dump_pagetables.c  | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2779ace16d23..0dd04cf6ebeb 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -222,6 +222,7 @@ static inline unsigned long pgd_pfn(pgd_t pgd)
return (pgd_val(pgd) & PTE_PFN_MASK) >> PAGE_SHIFT;
 }
 
+#define p4d_large  p4d_large
 static inline int p4d_large(p4d_t p4d)
 {
/* No 512 GiB pages yet */
@@ -230,6 +231,7 @@ static inline int p4d_large(p4d_t p4d)
 
 #define pte_page(pte)  pfn_to_page(pte_pfn(pte))
 
+#define pmd_large  pmd_large
 static inline int pmd_large(pmd_t pte)
 {
return pmd_flags(pte) & _PAGE_PSE;
@@ -857,6 +859,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long 
address)
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
 }
 
+#define pud_large  pud_large
 static inline int pud_large(pud_t pud)
 {
return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
@@ -868,6 +871,7 @@ static inline int pud_bad(pud_t pud)
return (pud_flags(pud) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
 }
 #else
+#define pud_large  pud_large
 static inline int pud_large(pud_t pud)
 {
return 0;
@@ -1213,6 +1217,7 @@ static inline bool pgdp_maps_userspace(void *__ptr)
return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < PGD_KERNEL_START);
 }
 
+#define pgd_large  pgd_large
 static inline int pgd_large(pgd_t pgd) { return 0; }
 
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index ee8f8ab46941..ca270fb00805 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -432,6 +432,7 @@ static void walk_pmd_level(struct seq_file *m, struct 
pg_state *st, pud_t addr,
 
 #else
 #define walk_pmd_level(m,s,a,e,p) walk_pte_level(m,s,__pmd(pud_val(a)),e,p)
+#undef pud_large
 #define pud_large(a) pmd_large(__pmd(pud_val(a)))
 #define pud_none(a)  pmd_none(__pmd(pud_val(a)))
 #endif
@@ -467,6 +468,7 @@ static void walk_pud_level(struct seq_file *m, struct 
pg_state *st, p4d_t addr,
 
 #else
 #define walk_pud_level(m,s,a,e,p) walk_pmd_level(m,s,__pud(p4d_val(a)),e,p)
+#undef p4d_large
 #define p4d_large(a) pud_large(__pud(p4d_val(a)))
 #define p4d_none(a)  pud_none(__pud(p4d_val(a)))
 #endif
@@ -501,6 +503,7 @@ static void walk_p4d_level(struct seq_file *m, struct 
pg_state *st, pgd_t addr,
}
 }
 
+#undef pgd_large
 #define pgd_large(a) (pgtable_l5_enabled() ? pgd_large(a) : 
p4d_large(__p4d(pgd_val(a))))
 #define pgd_none(a)  (pgtable_l5_enabled() ? pgd_none(a) : 
p4d_none(__p4d(pgd_val(a))))
 
-- 
2.20.1



[PATCH v6 07/19] sparc: mm: Add p?d_large() definitions

2019-03-26 Thread Steven Price
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.

For sparc 64 bit, pmd_large() and pud_large() are already provided, so
add #defines to prevent the generic versions (added in a later patch)
from being used.

CC: "David S. Miller" 
CC: sparcli...@vger.kernel.org
Signed-off-by: Steven Price 
Acked-by: David S. Miller 
---
 arch/sparc/include/asm/pgtable_64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h 
b/arch/sparc/include/asm/pgtable_64.h
index 1393a8ac596b..f502e937c8fe 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -713,6 +713,7 @@ static inline unsigned long pte_special(pte_t pte)
return pte_val(pte) & _PAGE_SPECIAL;
 }
 
+#define pmd_large  pmd_large
 static inline unsigned long pmd_large(pmd_t pmd)
 {
pte_t pte = __pte(pmd_val(pmd));
@@ -894,6 +895,7 @@ static inline unsigned long pud_page_vaddr(pud_t pud)
 #define pgd_present(pgd)   (pgd_val(pgd) != 0U)
 #define pgd_clear(pgdp)(pgd_val(*(pgdp)) = 0UL)
 
+#define pud_large  pud_large
 static inline unsigned long pud_large(pud_t pud)
 {
pte_t pte = __pte(pud_val(pud));
-- 
2.20.1



[PATCH v6 06/19] s390: mm: Add p?d_large() definitions

2019-03-26 Thread Steven Price
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.

For s390, pud_large() and pmd_large() are already implemented as static
inline functions. Add a #define so we don't pick up the generic version
introduced in a later patch.

CC: Martin Schwidefsky 
CC: Heiko Carstens 
CC: linux-s...@vger.kernel.org
Signed-off-by: Steven Price 
---
 arch/s390/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 76dc344edb8c..3ad4c69e1f2d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -679,6 +679,7 @@ static inline int pud_none(pud_t pud)
return pud_val(pud) == _REGION3_ENTRY_EMPTY;
 }
 
+#define pud_large  pud_large
 static inline int pud_large(pud_t pud)
 {
if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) != _REGION_ENTRY_TYPE_R3)
@@ -696,6 +697,7 @@ static inline unsigned long pud_pfn(pud_t pud)
return (pud_val(pud) & origin_mask) >> PAGE_SHIFT;
 }
 
+#define pmd_large  pmd_large
 static inline int pmd_large(pmd_t pmd)
 {
return (pmd_val(pmd) & _SEGMENT_ENTRY_LARGE) != 0;
-- 
2.20.1



[PATCH v6 05/19] riscv: mm: Add p?d_large() definitions

2019-03-26 Thread Steven Price
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.

For riscv a page is large when it has a read, write or execute bit
set on it.

CC: Palmer Dabbelt 
CC: Albert Ou 
CC: linux-ri...@lists.infradead.org
Signed-off-by: Steven Price 
---
 arch/riscv/include/asm/pgtable-64.h | 7 +++
 arch/riscv/include/asm/pgtable.h| 7 +++
 2 files changed, 14 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable-64.h 
b/arch/riscv/include/asm/pgtable-64.h
index 7aa0ea9bd8bb..73747d9d7c66 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -51,6 +51,13 @@ static inline int pud_bad(pud_t pud)
return !pud_present(pud);
 }
 
+#define pud_large  pud_large
+static inline int pud_large(pud_t pud)
+{
+   return pud_present(pud)
+   && (pud_val(pud) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC));
+}
+
 static inline void set_pud(pud_t *pudp, pud_t pud)
 {
*pudp = pud;
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 1141364d990e..9570883c79e7 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -111,6 +111,13 @@ static inline int pmd_bad(pmd_t pmd)
return !pmd_present(pmd);
 }
 
+#define pmd_large  pmd_large
+static inline int pmd_large(pmd_t pmd)
+{
+   return pmd_present(pmd)
+   && (pmd_val(pmd) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC));
+}
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
*pmdp = pmd;
-- 
2.20.1



[PATCH v6 03/19] mips: mm: Add p?d_large() definitions

2019-03-26 Thread Steven Price
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information is provided by the
p?d_large() functions/macros.

For mips, we only support large pages on 64 bit.

For 64 bit if _PAGE_HUGE is defined we can simply look for it. When not
defined we can be confident that there are no large pages in existence
and fall back on the generic implementation (added in a later patch)
which returns 0.

CC: Ralf Baechle 
CC: Paul Burton 
CC: James Hogan 
CC: linux-m...@vger.kernel.org
Signed-off-by: Steven Price 
Acked-by: Paul Burton 
---
 arch/mips/include/asm/pgtable-64.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/mips/include/asm/pgtable-64.h 
b/arch/mips/include/asm/pgtable-64.h
index 93a9dce31f25..42162877ac62 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -273,6 +273,10 @@ static inline int pmd_present(pmd_t pmd)
return pmd_val(pmd) != (unsigned long) invalid_pte_table;
 }
 
+#ifdef _PAGE_HUGE
+#define pmd_large(pmd) ((pmd_val(pmd) & _PAGE_HUGE) != 0)
+#endif
+
 static inline void pmd_clear(pmd_t *pmdp)
 {
pmd_val(*pmdp) = ((unsigned long) invalid_pte_table);
@@ -297,6 +301,10 @@ static inline int pud_present(pud_t pud)
return pud_val(pud) != (unsigned long) invalid_pmd_table;
 }
 
+#ifdef _PAGE_HUGE
+#define pud_large(pud) ((pud_val(pud) & _PAGE_HUGE) != 0)
+#endif
+
 static inline void pud_clear(pud_t *pudp)
 {
pud_val(*pudp) = ((unsigned long) invalid_pmd_table);
-- 
2.20.1



[PATCH v6 02/19] arm64: mm: Add p?d_large() definitions

2019-03-26 Thread Steven Price
walk_page_range() is going to be allowed to walk page tables other than
those of user space. For this it needs to know when it has reached a
'leaf' entry in the page tables. This information will be provided by the
p?d_large() functions/macros.

For arm64, we already have p?d_sect() macros which we can reuse for
p?d_large().

CC: Catalin Marinas 
CC: Will Deacon 
Signed-off-by: Steven Price 
---
 arch/arm64/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index de70c1eabf33..6eef345dbaf4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -428,6 +428,7 @@ extern pgprot_t phys_mem_access_prot(struct file *file, 
unsigned long pfn,
 PMD_TYPE_TABLE)
 #define pmd_sect(pmd)  ((pmd_val(pmd) & PMD_TYPE_MASK) == \
 PMD_TYPE_SECT)
+#define pmd_large(pmd) pmd_sect(pmd)
 
 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
 #define pud_sect(pud)  (0)
@@ -511,6 +512,7 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
 #define pud_none(pud)  (!pud_val(pud))
 #define pud_bad(pud)   (!(pud_val(pud) & PUD_TABLE_BIT))
 #define pud_present(pud)   pte_present(pud_pte(pud))
+#define pud_large(pud) pud_sect(pud)
 #define pud_valid(pud) pte_valid(pud_pte(pud))
 
 static inline void set_pud(pud_t *pudp, pud_t pud)
-- 
2.20.1



[PATCH v6 00/19] Convert x86 & arm64 to use generic page walk

2019-03-26 Thread Steven Price
Most architectures currently have a debugfs file for dumping the kernel
page tables. Currently each architecture has to implement custom
functions for walking the page tables because the generic
walk_page_range() function is unable to walk the page tables used by the
kernel.

This series extends the capabilities of walk_page_range() so that it can
deal with the page tables of the kernel (which have no VMAs and can
contain larger huge pages than exist for user space). x86 and arm64 are
then converted to make use of walk_page_range() removing the custom page
table walkers.

To enable a generic page table walker to walk the unusual mappings of
the kernel we need to implement a set of functions which let us know
when the walker has reached the leaf entry. Since arm, powerpc, s390,
sparc and x86 all have p?d_large macros, let's standardise on that and
implement those that are missing.

Potentially future changes could unify the implementations of the
debugfs walkers further, moving the common functionality into common
code. This would require a common way of handling the effective
permissions (currently implemented only for x86) along with a per-arch
way of formatting the page table information for debugfs. One
immediate benefit would be getting the KASAN speed up optimisation in
arm64 (and other arches) which is currently only implemented for x86.

Also available as a git tree:
git://linux-arm.org/linux-sp.git walk_page_range/v6

Changes since v5:
 * Updated comment for struct mm_walk based on Mike Rapoport's
   suggestion

Changes since v4:
 * Correctly force result to a boolean in p?d_large for powerpc.
 * Added Acked-bys
 * Rebased onto v5.1-rc1

Changes since v3:
 * Restored the generic macros, only implement p?d_large() for
   architectures that have support for large pages. This also means
   adding dummy #defines for architectures that define p?d_large as
   static inline to avoid picking up the generic macro.
 * Drop the 'depth' argument from pte_hole
 * Because we no longer have the depth for holes, we also drop support
   in x86 for showing missing pages in debugfs. See discussion below:
   https://lore.kernel.org/lkml/26df02dd-c54e-ea91-bdd1-0a4aad3a3...@arm.com/
 * mips: only define p?d_large when _PAGE_HUGE is defined.

Changes since v2:
 * Rather than attemping to provide generic macros, actually implement
   p?d_large() for each architecture.

Changes since v1:
 * Added p4d_large() macro
 * Comments to explain p?d_large() macro semantics
 * Expanded comment for pte_hole() callback to explain mapping between
   depth and P?D
 * Handle folded page tables at all levels, so depth from pte_hole()
   ignores folding at any level (see real_depth() function in
   mm/pagewalk.c)

Steven Price (19):
  arc: mm: Add p?d_large() definitions
  arm64: mm: Add p?d_large() definitions
  mips: mm: Add p?d_large() definitions
  powerpc: mm: Add p?d_large() definitions
  riscv: mm: Add p?d_large() definitions
  s390: mm: Add p?d_large() definitions
  sparc: mm: Add p?d_large() definitions
  x86: mm: Add p?d_large() definitions
  mm: Add generic p?d_large() macros
  mm: pagewalk: Add p4d_entry() and pgd_entry()
  mm: pagewalk: Allow walking without vma
  mm: pagewalk: Add test_p?d callbacks
  arm64: mm: Convert mm/dump.c to use walk_page_range()
  x86: mm: Don't display pages which aren't present in debugfs
  x86: mm: Point to struct seq_file from struct pg_state
  x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct
  x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
  x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct
  x86: mm: Convert dump_pagetables to use walk_page_range

 arch/arc/include/asm/pgtable.h   |   1 +
 arch/arm64/include/asm/pgtable.h |   2 +
 arch/arm64/mm/dump.c | 117 +++
 arch/mips/include/asm/pgtable-64.h   |   8 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |  30 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  12 +-
 arch/riscv/include/asm/pgtable-64.h  |   7 +
 arch/riscv/include/asm/pgtable.h |   7 +
 arch/s390/include/asm/pgtable.h  |   2 +
 arch/sparc/include/asm/pgtable_64.h  |   2 +
 arch/x86/include/asm/pgtable.h   |  10 +-
 arch/x86/mm/debug_pagetables.c   |   8 +-
 arch/x86/mm/dump_pagetables.c| 347 ++-
 arch/x86/platform/efi/efi_32.c   |   2 +-
 arch/x86/platform/efi/efi_64.c   |   4 +-
 include/asm-generic/pgtable.h|  19 +
 include/linux/mm.h   |  26 +-
 mm/pagewalk.c|  76 +++-
 18 files changed, 407 insertions(+), 273 deletions(-)

-- 
2.20.1



Re: Bad file pattern in MAINTAINERS section 'KEYS-TRUSTED'

2019-03-26 Thread James Bottomley
On Tue, 2019-03-26 at 09:59 -0500, Denis Kenzior wrote:
> Hi James,
> 
> On 03/26/2019 09:25 AM, James Bottomley wrote:
> > Looking at the contents of linux/keys/trusted.h, it looks like the
> > wrong decision to move it.  The contents are way too improperly
> > named
> > and duplicative to be in a standard header.  It's mostly actually
> > TPM
> > code including a redefinition of the tpm_buf structure, so it
> > doesn't
> > even seem to be necessary for trusted keys.
> 
> The reason this was done was because asym_tpm.c needed a bunch of
> the same functionality already provided by trusted.c, e.g.
> TSS_authmac and  friends.

So make a header which only includes those.  We can't have things like
this:

struct tpm_buf {
int len;
unsigned char data[MAX_BUF_SIZE];
};

Which means you can't include drivers/char/tpm/tpm.h with this file. 
The storeX functions are also way too generically named and are, in
fact, duplicating the tpm buffer functions in tpm.h

The solution looks to be to elevate agreed tpm_buf functions into
linux/tpm.h and use them.
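
For a sense of what that would look like (purely illustrative: these helpers
and constants currently live in drivers/char/tpm/tpm.h and would have to
move along with the page-backed tpm_buf; chip and num_bytes are assumed from
the surrounding code), a caller could then build a command without its own
buffer structure:

        struct tpm_buf buf;
        int rc;

        rc = tpm_buf_init(&buf, TPM2_ST_NO_SESSIONS, TPM2_CC_GET_RANDOM);
        if (rc)
                return rc;

        tpm_buf_append_u16(&buf, num_bytes);    /* command parameter */
        rc = tpm_send(chip, buf.data, tpm_buf_length(&buf));
        tpm_buf_destroy(&buf);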

> > If you want to fix this as a bug, I'd move it back again, but long
> > term I think it should simply be combined with trusted.c because
> > nothing else can include it sanely anyway.
> 
> Ideally I'd like to see the TPM subsystem expose these functions
> using some proper API / library abstraction.  David Howells had an
> RFC patch set that tried to address some of this a while back.  Not
> sure if that went anywhere.

I'm not actually sure I saw it but the solution seems pretty simple:
The TSS functions you want can be renamed tpm1_whatever and we can put
them in tpm1-cmd.c ... tpm2-cmd.c is where all the TPM 2.0 trusted key
stuff is anyway.

James



Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Christian Brauner
On Tue, Mar 26, 2019 at 09:17:07AM -0700, Daniel Colascione wrote:
> Thanks for the patch.
> 
> On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner  
> wrote:
> >
> > The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> > I quote Konstantins original patchset first that has already been acked and
> > picked up by Eric before and whose functionality is preserved in this
> > syscall:
> 
> We still haven't had a much-needed conversation about splitting this
> system call into smaller logical operations. It's important that we
> address this point before this patch is merged and becomes permanent
> kernel ABI.

I don't particularly mind splitting this into an additional syscall like
e.g. pidfd_open(), but then we have - and yes, I know you'll say
syscalls are cheap - translate_pid() and pidfd_open(). What I like
about this right now is that it connects both APIs in a single syscall
and allows pidfd retrieval across pid namespaces. So I guess we'll see
what other people think.


Re: [PATCH v3] kmemleak: survive in a low-memory situation

2019-03-26 Thread Catalin Marinas
On Tue, Mar 26, 2019 at 09:05:36AM -0700, Matthew Wilcox wrote:
> On Tue, Mar 26, 2019 at 11:43:38AM -0400, Qian Cai wrote:
> > Unless there is a brave soul to reimplement kmemleak to embed its
> > metadata into the tracked memory itself in the foreseeable future, this
> > provides a good balance between enabling kmemleak in a low-memory
> > situation and not introducing too much hackiness into the existing
> > code for now.
> 
> I don't understand kmemleak.  Kirill pointed me at this a few days ago:
> 
> https://gist.github.com/kiryl/3225e235fea390aa2e49bf625bbe83ec
> 
> It's caused by the XArray allocating memory using GFP_NOWAIT | __GFP_NOWARN.
> kmemleak then decides it needs to allocate memory to track this memory.
> So it calls kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
> 
> #define gfp_kmemleak_mask(gfp)  (((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
>  __GFP_NORETRY | __GFP_NOMEMALLOC | \
>  __GFP_NOWARN | __GFP_NOFAIL)
> 
> then the page allocator gets to see GFP_NOFAIL | GFP_NOWAIT and gets angry.
> 
> But I don't understand why kmemleak needs to mess with the GFP flags at
> all.

Originally, it was just preserving GFP_KERNEL | GFP_ATOMIC. Starting
with commit 6ae4bd1f0bc4 ("kmemleak: Allow kmemleak metadata allocations
to fail"), this mask changed, aimed at making kmemleak allocation
failures less verbose (i.e. just disable it since it's a debug tool).

Commit d9570ee3bd1d ("kmemleak: allow to coexist with fault injection")
introduced __GFP_NOFAIL but this came with its own problems which have
been previously reported (the warning you mentioned is another one of
these). We didn't get to any clear conclusion on how best to allow
allocations to fail with fault injection but not for the kmemleak
metadata. Your suggestion below would probably do the trick.

> Just allocate using the same flags as the caller, and fail the original
> allocation if the kmemleak allocation fails.  Like this:
> 
> +++ b/mm/slab.h
> @@ -435,12 +435,22 @@ static inline void slab_post_alloc_hook(struct 
> kmem_cache *s, gfp_t flags,
> for (i = 0; i < size; i++) {
> p[i] = kasan_slab_alloc(s, p[i], flags);
> /* As p[i] might get tagged, call kmemleak hook after KASAN. 
> */
> -   kmemleak_alloc_recursive(p[i], s->object_size, 1,
> -s->flags, flags);
> +   if (kmemleak_alloc_recursive(p[i], s->object_size, 1,
> +s->flags, flags))
> +   goto fail;
> }
>  
> if (memcg_kmem_enabled())
> memcg_kmem_put_cache(s);
> +   return;
> +
> +fail:
> +   while (i > 0) {
> +   kasan_blah(...);
> +   kmemleak_blah();
> +   i--;
> +   }
> + free_blah(p);
> +   *p = NULL;
>  }
>  
>  #ifndef CONFIG_SLOB
> 
> 
> and if we had something like this, we wouldn't need kmemleak to have this
> self-disabling or must-succeed property.

We'd still need the self-disabling in place since there are a few other
places where we call kmemleak_alloc() from.
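
For completeness, letting the hook report failure would mean a signature
change along these lines (hypothetical sketch: kmemleak_alloc_recursive() is
void today, and __kmemleak_alloc() is an invented name for an error-returning
backend):

        static inline int kmemleak_alloc_recursive(const void *ptr, size_t size,
                                                   int min_count,
                                                   unsigned long flags, gfp_t gfp)
        {
                if (flags & SLAB_NOLEAKTRACE)
                        return 0;

                /* fails with -ENOMEM if the metadata cannot be allocated */
                return __kmemleak_alloc(ptr, size, min_count, gfp);
        }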

-- 
Catalin


Re: [PATCH v2] x86/syscalls: Mark expected switch fall-throughs

2019-03-26 Thread Thomas Gleixner
On Tue, 26 Mar 2019, Steven Rostedt wrote:

> On Tue, 26 Mar 2019 17:09:44 +0100 (CET)
> Thomas Gleixner  wrote:
> 
> > > >  1) The third argument of get/set(), i.e. the argument offset, is 0 on 
> > > > all
> > > > call sites. Do we need it at all?  
> > > 
> > > Probably "maxargs" can be removed too, Steven sent the patches a long 
> > > ago, see
> > > https://lore.kernel.org/lkml/20161107212634.529267...@goodmis.org/  
> > 
> > Indeed. We should resurrect them.
> > 
> > > >  2) syscall_set_arguments() has been introduced in 2008 and we still 
> > > > have
> > > > no caller. Instead of polishing it, can it be removed completely or 
> > > > are
> > > > there plans to actually use it?  
> > > 
> > > I think it can die.  
> > 
> > Good. Removed code is the least buggy code :)
> > 
> > Gustavo, it would be really appreciated if you could take care of that,
> > unless Steven wants to polish his old set up himself. If you have no
> > cycles, please let us know.
> 
> I still have those patches in my quilt queue. I can polish them up and
> resend.

Appreciated.


Re: [PATCH v1 2/4] pid: add pidctl()

2019-03-26 Thread Daniel Colascione
Thanks for the patch.

On Tue, Mar 26, 2019 at 8:55 AM Christian Brauner  wrote:
>
> The pidctl() syscalls builds on, extends, and improves translate_pid() [4].
> I quote Konstantins original patchset first that has already been acked and
> picked up by Eric before and whose functionality is preserved in this
> syscall:

We still haven't had a much-needed conversation about splitting this
system call into smaller logical operations. It's important that we
address this point before this patch is merged and becomes permanent
kernel ABI.


Re: [PATCH v2] x86/syscalls: Mark expected switch fall-throughs

2019-03-26 Thread Steven Rostedt
On Tue, 26 Mar 2019 17:09:44 +0100 (CET)
Thomas Gleixner  wrote:

> > >  1) The third argument of get/set(), i.e. the argument offset, is 0 on all
> > > call sites. Do we need it at all?  
> > 
> > Probably "maxargs" can be removed too, Steven sent the patches a long ago, 
> > see
> > https://lore.kernel.org/lkml/20161107212634.529267...@goodmis.org/  
> 
> Indeed. We should resurrect them.
> 
> > >  2) syscall_set_arguments() has been introduced in 2008 and we still have
> > > no caller. Instead of polishing it, can it be removed completely or 
> > > are
> > > there plans to actually use it?  
> > 
> > I think it can die.  
> 
> Good. Removed code is the least buggy code :)
> 
> Gustavo, it would be really appreciated if you could take care of that,
> unless Steven wants to polish his old set up himself. If you have no
> cycles, please let us know.

I still have those patches in my quilt queue. I can polish them up and
resend.

-- Steve


Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame (*Reproducible*)

2019-03-26 Thread Thomas Gleixner
Alex,

On Mon, 25 Mar 2019, 573149609 wrote:

Thanks for the report.

> I think I found a reproducible kernel bug in version 5.0.4.
> Source file: arch/x86/kernel/unwind_orc.c:505
> The KASAN output is as following:
> [   26.095365] BUG: KASAN: stack-out-of-bounds in 
> unwind_next_frame+0x1403/0x19e0
> [   26.095365] Read of size 8 at addr 88805cc67d18 by task 
> syz-executor.0/2296
...

Can you please verify whether this problem persists in 5.1-rc2? If not, then
we missed a fix to backport. If yes, then we rather fix it there.

Thanks,

tglx


Re: [PATCH 2/4] pid: add pidctl()

2019-03-26 Thread Christian Brauner
> Agreed, I also was going to say the same, about the flags.

Please review the updated version I just sent out.

Christian


[PATCH V4 22/23] perf, tools: Add documentation for topdown metrics

2019-03-26 Thread kan . liang
From: Andi Kleen 

Add some documentation how to use the topdown metrics in ring 3.

Signed-off-by: Andi Kleen 
Signed-off-by: Kan Liang 
---

No changes since V3.

 tools/perf/Documentation/topdown.txt | 223 +++
 1 file changed, 223 insertions(+)
 create mode 100644 tools/perf/Documentation/topdown.txt

diff --git a/tools/perf/Documentation/topdown.txt 
b/tools/perf/Documentation/topdown.txt
new file mode 100644
index ..167393225641
--- /dev/null
+++ b/tools/perf/Documentation/topdown.txt
@@ -0,0 +1,223 @@
+Using TopDown metrics in user space
+-----------------------------------
+
+Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown
+methodology to break down CPU pipeline execution into 4 bottlenecks:
+frontend bound, backend bound, bad speculation, retiring.
+
+For more details on Topdown see [1][5]
+
+Traditionally this was implemented by events in generic counters
+and specific formulas to compute the bottlenecks.
+
+perf stat --topdown implements this.
+
+% perf stat -a --topdown -I1000
+#   time counts unit events
+ 1.000373951  8,460,978,609  topdown-retiring  # 22.9% 
retiring
+ 1.000373951  3,445,383,303  topdown-bad-spec  #  9.3% 
bad speculation
+ 1.000373951 15,886,483,355  topdown-fe-bound  # 43.0% 
frontend bound
+ 1.000373951  9,163,488,720  topdown-be-bound  # 24.8% 
backend bound
+ 2.000782154  8,477,925,431  topdown-retiring  # 22.9% 
retiring
+ 2.000782154  3,459,151,256  topdown-bad-spec  #  9.3% 
bad speculation
+ 2.000782154 15,947,224,725  topdown-fe-bound  # 43.0% 
frontend bound
+ 2.000782154  9,145,551,695  topdown-be-bound  # 24.7% 
backend bound
+ 3.001155967  8,487,323,125  topdown-retiring  # 22.9% 
retiring
+ 3.001155967  3,451,808,066  topdown-bad-spec  #  9.3% 
bad speculation
+ 3.001155967 15,959,068,902  topdown-fe-bound  # 43.0% 
frontend bound
+ 3.001155967  9,172,479,784  topdown-be-bound  # 24.7% 
backend bound
+...
+
+Full Top Down includes more levels that can break down the
+bottlenecks further. This is not directly implemented in perf,
+but available in other tools that can run on top of perf,
+such as toplev[2] or vtune[3]
+
+New Topdown features in Icelake
+===============================
+
+With Icelake (2018 Core) CPUs the TopDown metrics are directly available as
+fixed counters and do not require generic counters. This allows
+TopDown to be collected at all times, in addition to other events.
+
+This also enables measuring TopDown per thread/process instead
+of only per core.
+
+Using TopDown through RDPMC in applications on Icelake
+======================================================
+
+For more fine grained measurements it can be useful to
+access the new counters directly from user space. This is more complicated,
+but drastically lowers overhead.
+
+On Icelake, there is a new fixed counter 3: SLOTS, which reports
+"pipeline SLOTS" (cycles multiplied by core issue width) and a
+metric register that reports slots ratios for the different bottleneck
+categories.
+
+The metrics counter is CPU model specific and is not available
+on older CPUs.
+
+Example code
+============
+
+Library functions providing the functionality described below
+are also available in libjevents [4]
+
+The application opens a perf_event file descriptor
+and sets up fixed counter 3 (SLOTS) to start and
+allow user programs to read the performance counters.
+
+Fixed counter 3 is mapped to a pseudo event event=0x00, umask=04,
+so the perf_event_attr structure should be initialized with
+{ .config = 0x0400, .type = PERF_TYPE_RAW }
+
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+/* Provide own perf_event_open stub because glibc doesn't */
+__attribute__((weak))
+int perf_event_open(struct perf_event_attr *attr, pid_t pid,
+   int cpu, int group_fd, unsigned long flags)
+{
+   return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
+}
+
+/* open slots counter file descriptor for current task */
+struct perf_event_attr slots = {
+   .type = PERF_TYPE_RAW,
+   .size = sizeof(struct perf_event_attr),
+   .config = 0x400,
+   .exclude_kernel = 1,
+};
+
+int fd = perf_event_open(&slots, 0, -1, -1, 0);
+if (fd < 0)
+   ... error ...
+
+The RDPMC instruction (or _rdpmc compiler intrinsic) can now be used
+to read slots and the topdown metrics at different points of the program:
+
+#include <stdint.h>
+#include <x86intrin.h>
+
+#define RDPMC_FIXED    (1 << 30)   /* return fixed counters */
+#define RDPMC_METRIC   (1 << 29)   /* return metric counters */
+
+#define FIXED_COUNTER_SLOTS    3
+#define METRIC_COUNTER_TOPDOWN_L1  0
+
+static inline uint64_t read_slots(void)
+{
+   return _rdpmc(RDPMC_FIXED | FIXED_COUNTER_SLOTS);
+}
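
For reference, a minimal sketch (not part of the quoted patch) of how the
metric register could be read and decoded, using the RDPMC_METRIC and
METRIC_COUNTER_TOPDOWN_L1 definitions above and assuming the four level-1
ratios are packed into PERF_METRICS as consecutive 8-bit fields where
0xff corresponds to 100%:

#include <stdint.h>
#include <x86intrin.h>

#define RDPMC_METRIC               (1 << 29)   /* return metric counters */
#define METRIC_COUNTER_TOPDOWN_L1  0

/* Read the raw 64-bit PERF_METRICS value via RDPMC. */
static inline uint64_t read_metrics(void)
{
	return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1);
}

/* Decode one assumed 8-bit ratio field (category 0..3) into a fraction. */
static inline double metric_ratio(uint64_t metrics, int category)
{
	return (double)((metrics >> (category * 8)) & 0xff) / 0xff;
}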

Re: [PATCH v2 2/8] kbuild: Support for Symbols.list creation

2019-03-26 Thread Joe Lawrence

On 3/26/19 10:40 AM, Joao Moreira wrote:



On 3/20/19 4:08 PM, Miroslav Benes wrote:

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index fd03d60f6c5a..1e28ad21314c 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -247,6 +247,11 @@ cmd_gen_ksymdeps = \
$(CONFIG_SHELL) $(srctree)/scripts/gen_ksymdeps.sh $@ >> 
$(dot-target).cmd
   endif
   
+ifdef CONFIG_LIVEPATCH

+cmd_livepatch = $(if $(LIVEPATCH_$(basetarget).o), \
+   $(shell touch $(MODVERDIR)/$(basetarget).livepatch))
+endif
+
   define rule_cc_o_c
$(call cmd,checksrc)
$(call cmd_and_fixdep,cc_o_c)
@@ -283,6 +288,7 @@ $(single-used-m): $(obj)/%.o: $(src)/%.c 
$(recordmcount_source) $(objtool_dep) F
$(call if_changed_rule,cc_o_c)
@{ echo $(@:.o=.ko); echo $@; \
   $(cmd_undef_syms); } > $(MODVERDIR)/$(@F:.o=.mod)
+   $(call cmd_livepatch)
   
   quiet_cmd_cc_lst_c = MKLST   $@

 cmd_cc_lst_c = $(CC) $(c_flags) -g -c -o $*.o $< && \


Since cmd_livepatch is only called for single-used-m, does this mean
that we can only klp-convert single object file livepatch modules?

I stumbled upon this when trying to create a self-test module that
incorporated two object files.  I tried adding a $(call cmd_livepatch)
in the recipe for $(obj)/%.o, but that didn't help.  My kbuild foo
wasn't good enough to figure this one out.


I looked at my original code and it is a bit different there. I placed it
under rule_cc_o_c right after objtool command. If I remember correctly
this is the correct recipe for .c->.o. Unfortunately I forgot the details
and there is of course nothing about it in my notes.

Does it help?

Joao, is there a reason you moved it elsewhere?


Hi,

Unfortunately I can't remember why the chunk was moved to where it is in
this version of the patch, sorry. Yet, I did try to move this into the
rule cc_o_c and it seemed to work with no damage.

Joe, would you kindly verify and properly squash the patch below, which
places cmd_livepatch in rule_cc_o_c?

Thank you.

Subject: [PATCH] Move cmd_klp_convert to the right place

  


Signed-off-by: Joao Moreira 

---

   scripts/Makefile.build | 2 +-

   1 file changed, 1 insertion(+), 1 deletion(-)

  


diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 1e28ad21314c..5f66106a47d6 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -260,6 +260,7 @@ define rule_cc_o_c
  $(call cmd,objtool)
  $(call cmd,modversions_c)
  $(call cmd,record_mcount)
+   $(call cmd,livepatch)
   endef
  
   define rule_as_o_S

@@ -288,7 +289,6 @@ $(single-used-m): $(obj)/%.o: $(src)/%.c
$(recordmcount_source) $(objtool_dep) F
  $(call if_changed_rule,cc_o_c)
  @{ echo $(@:.o=.ko); echo $@; \
 $(cmd_undef_syms); } > $(MODVERDIR)/$(@F:.o=.mod)
-   $(call cmd_livepatch)
  
   quiet_cmd_cc_lst_c = MKLST   $@

 cmd_cc_lst_c = $(CC) $(c_flags) -g -c -o $*.o $< && \


Hi Joao,

This change seems to work okay for (again) single object modules, but 
I'm having issues with multi-object modules.


Here are my sources:

% head -n100 *
==> Makefile <==
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)

LIVEPATCH_test_mod_a.o := y
LIVEPATCH_test_mod_b.o := y

obj-m += test_mod.o

test_mod-y := \
test_mod_a.o \
test_mod_b.o

default:
	$(MAKE) -C $(KDIR) M=$(PWD)
clean:
	@rm -rf .tmp_versions/
	@rm -f .*.cmd *.o *.mod.* *.ko modules.order Module.symvers

==> test_mod_a.c <==
#include <linux/module.h>
__used static void function(void) { }
MODULE_LICENSE("GPL");

==> test_mod_b.c <==
__used static void function(void) { }



But when I build, I don't see klp-convert invoked for any of the object 
files:


% make
make -C /lib/modules/5.0.0+/build M=/home/cloud-user/klp-convert-modtest
make[1]: Entering directory '/home/cloud-user/disk/linux'
  CC [M]  /home/cloud-user/klp-convert-modtest/test_mod_a.o
  CC [M]  /home/cloud-user/klp-convert-modtest/test_mod_b.o
  LD [M]  /home/cloud-user/klp-convert-modtest/test_mod.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC  /home/cloud-user/klp-convert-modtest/test_mod.mod.o
  LD [M]  /home/cloud-user/klp-convert-modtest/test_mod.ko
make[1]: Leaving directory '/home/cloud-user/disk/linux'

However, if I modify the Makefile to build test_mod_a.o into its own 
module, I see "KLP 
/home/cloud-user/klp-convert-modtest/test_mod_a.ko" in the build output.


-- Joe


Re: [PATCH v2] x86/realmode: don't leak kernel addresses

2019-03-26 Thread Borislav Petkov
On Sun, Mar 24, 2019 at 08:05:04PM +0100, Matteo Croce wrote:
> Since commit ad67b74d2469d9b8 ("printk: hash addresses printed with %p"),
> at boot "ptrval" is printed instead of the trampoline addresses:
> 
> Base memory trampoline at [(ptrval)] 99000 size 24576
> 
> Remove the address from the print as we don't want to leak kernel
> addresses.
> 
> Fixes: ad67b74d2469d9b8 ("printk: hash addresses printed with %p")
> Signed-off-by: Matteo Croce 
> ---
>  arch/x86/realmode/init.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
> index d10105825d57..d76a1380ec37 100644
> --- a/arch/x86/realmode/init.c
> +++ b/arch/x86/realmode/init.c
> @@ -20,8 +20,8 @@ void __init set_real_mode_mem(phys_addr_t mem, size_t size)
>   void *base = __va(mem);
>  
>   real_mode_header = (struct real_mode_header *) base;
> - printk(KERN_DEBUG "Base memory trampoline at [%p] %llx size %zu\n",
> -base, (unsigned long long)mem, size);
> + printk(KERN_DEBUG "Base memory trampoline at %llx size %zu\n",
> +(unsigned long long)mem, size);

In case this wasn't clear, please remove the whole printk. And don't
forget to CC lkml on your submissions. CCed now.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


[PATCH V4 02/23] perf/x86/intel: Extract memory code PEBS parser for reuse

2019-03-26 Thread kan . liang
From: Andi Kleen 

Extract some code related to memory profiling from the PEBS record
parser into separate functions. It can be reused by the upcoming
adaptive PEBS parser. No functional changes.
Rename intel_hsw_weight to intel_get_tsx_weight, and
intel_hsw_transaction to intel_get_tsx_transaction, because the input is
no longer the HSW PEBS format.

Signed-off-by: Andi Kleen 
Signed-off-by: Kan Liang 
---

No changes since V3.

 arch/x86/events/intel/ds.c | 63 --
 1 file changed, 34 insertions(+), 29 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 10c99ce1fead..c02cd19fe640 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1125,34 +1125,50 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
return 0;
 }
 
-static inline u64 intel_hsw_weight(struct pebs_record_skl *pebs)
+static inline u64 intel_get_tsx_weight(u64 tsx_tuning)
 {
-   if (pebs->tsx_tuning) {
-   union hsw_tsx_tuning tsx = { .value = pebs->tsx_tuning };
+   if (tsx_tuning) {
+   union hsw_tsx_tuning tsx = { .value = tsx_tuning };
return tsx.cycles_last_block;
}
return 0;
 }
 
-static inline u64 intel_hsw_transaction(struct pebs_record_skl *pebs)
+static inline u64 intel_get_tsx_transaction(u64 tsx_tuning, u64 ax)
 {
-   u64 txn = (pebs->tsx_tuning & PEBS_HSW_TSX_FLAGS) >> 32;
+   u64 txn = (tsx_tuning & PEBS_HSW_TSX_FLAGS) >> 32;
 
/* For RTM XABORTs also log the abort code from AX */
-   if ((txn & PERF_TXN_TRANSACTION) && (pebs->ax & 1))
-   txn |= ((pebs->ax >> 24) & 0xff) << PERF_TXN_ABORT_SHIFT;
+   if ((txn & PERF_TXN_TRANSACTION) && (ax & 1))
+   txn |= ((ax >> 24) & 0xff) << PERF_TXN_ABORT_SHIFT;
return txn;
 }
 
+#define PERF_X86_EVENT_PEBS_HSW_PREC \
+   (PERF_X86_EVENT_PEBS_ST_HSW | \
+PERF_X86_EVENT_PEBS_LD_HSW | \
+PERF_X86_EVENT_PEBS_NA_HSW)
+
+static u64 get_data_src(struct perf_event *event, u64 aux)
+{
+   u64 val = PERF_MEM_NA;
+   int fl = event->hw.flags;
+   bool fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC);
+
+   if (fl & PERF_X86_EVENT_PEBS_LDLAT)
+   val = load_latency_data(aux);
+   else if (fst && (fl & PERF_X86_EVENT_PEBS_HSW_PREC))
+   val = precise_datala_hsw(event, aux);
+   else if (fst)
+   val = precise_store_data(aux);
+   return val;
+}
+
 static void setup_pebs_sample_data(struct perf_event *event,
   struct pt_regs *iregs, void *__pebs,
   struct perf_sample_data *data,
   struct pt_regs *regs)
 {
-#define PERF_X86_EVENT_PEBS_HSW_PREC \
-   (PERF_X86_EVENT_PEBS_ST_HSW | \
-PERF_X86_EVENT_PEBS_LD_HSW | \
-PERF_X86_EVENT_PEBS_NA_HSW)
/*
 * We cast to the biggest pebs_record but are careful not to
 * unconditionally access the 'extra' entries.
@@ -1160,17 +1176,13 @@ static void setup_pebs_sample_data(struct perf_event 
*event,
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct pebs_record_skl *pebs = __pebs;
u64 sample_type;
-   int fll, fst, dsrc;
-   int fl = event->hw.flags;
+   int fll;
 
if (pebs == NULL)
return;
 
sample_type = event->attr.sample_type;
-   dsrc = sample_type & PERF_SAMPLE_DATA_SRC;
-
-   fll = fl & PERF_X86_EVENT_PEBS_LDLAT;
-   fst = fl & (PERF_X86_EVENT_PEBS_ST | PERF_X86_EVENT_PEBS_HSW_PREC);
+   fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
 
perf_sample_data_init(data, 0, event->hw.last_period);
 
@@ -1185,16 +1197,8 @@ static void setup_pebs_sample_data(struct perf_event 
*event,
/*
 * data.data_src encodes the data source
 */
-   if (dsrc) {
-   u64 val = PERF_MEM_NA;
-   if (fll)
-   val = load_latency_data(pebs->dse);
-   else if (fst && (fl & PERF_X86_EVENT_PEBS_HSW_PREC))
-   val = precise_datala_hsw(event, pebs->dse);
-   else if (fst)
-   val = precise_store_data(pebs->dse);
-   data->data_src.val = val;
-   }
+   if (sample_type & PERF_SAMPLE_DATA_SRC)
+   data->data_src.val = get_data_src(event, pebs->dse);
 
/*
 * We must however always use iregs for the unwinder to stay sane; the
@@ -1281,10 +1285,11 @@ static void setup_pebs_sample_data(struct perf_event 
*event,
if (x86_pmu.intel_cap.pebs_format >= 2) {
/* Only set the TSX weight when no memory weight. */
if ((sample_type & PERF_SAMPLE_WEIGHT) && !fll)
-   data->weight = intel_hsw_weight(pebs);
+   data->weight = intel_get_tsx_weight(pebs->tsx_tuning);

[PATCH V4 03/23] perf/x86/intel/ds: Extract code of event update in short period

2019-03-26 Thread kan . liang
From: Kan Liang 

The drain_pebs() could be called twice in a short period for auto-reload
event in pmu::read(). The intel_pmu_save_and_restart_reload() should be
called to update the event->count.
This case should also be handled on Icelake. Extract the codes for reuse
later.

Signed-off-by: Kan Liang 
---

No changes since V3.

 arch/x86/events/intel/ds.c | 34 +-
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index c02cd19fe640..efc054aee3c1 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1491,6 +1491,26 @@ static void intel_pmu_drain_pebs_core(struct pt_regs 
*iregs)
__intel_pmu_pebs_event(event, iregs, at, top, 0, n);
 }
 
+static void intel_pmu_pebs_event_update_no_drain(struct cpu_hw_events *cpuc,
+int size)
+{
+   struct perf_event *event;
+   int bit;
+
+   /*
+* The drain_pebs() could be called twice in a short period
+* for auto-reload event in pmu::read(). There are no
+* overflows have happened in between.
+* It needs to call intel_pmu_save_and_restart_reload() to
+* update the event->count for this case.
+*/
+   for_each_set_bit(bit, (unsigned long *)&cpuc->pebs_enabled, size) {
+   event = cpuc->events[bit];
+   if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
+   intel_pmu_save_and_restart_reload(event, 0);
+   }
+}
+
 static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
 {
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1518,19 +1538,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs 
*iregs)
}
 
if (unlikely(base >= top)) {
-   /*
-* The drain_pebs() could be called twice in a short period
-* for auto-reload event in pmu::read(). There are no
-* overflows have happened in between.
-* It needs to call intel_pmu_save_and_restart_reload() to
-* update the event->count for this case.
-*/
-   for_each_set_bit(bit, (unsigned long *)&cpuc->pebs_enabled,
-size) {
-   event = cpuc->events[bit];
-   if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
-   intel_pmu_save_and_restart_reload(event, 0);
-   }
+   intel_pmu_pebs_event_update_no_drain(cpuc, size);
return;
}
 
-- 
2.17.1



[PATCH V4 04/23] perf/x86/intel: Support adaptive PEBSv4

2019-03-26 Thread kan . liang
From: Kan Liang 

Adaptive PEBS is a new way to report PEBS sampling information. Instead
of a fixed size record for all PEBS events it allows to configure the
PEBS record to only include the information needed. Events can then opt
in to use such an extended record, or stay with a basic record which
only contains the IP.

The major new feature is to support LBRs in PEBS record.
Besides normal LBR, this allows (much faster) large PEBS, while still
supporting callstacks through callstack LBR. So essentially a lot of
profiling can now be done without frequent interrupts, dropping the
overhead significantly.

The main requirement still is to use a period, and not use frequency
mode, because frequency mode requires reevaluating the frequency on each
overflow.

The floating point state (XMM) is also supported, which allows efficient
profiling of FP function arguments.

Introduce specific drain function to handle variable length records.
Use a new callback to parse the new record format, and also handle the
STATUS field now being at a different offset.

Add code to set up the configuration register. Since there is only a
single register, all events either get the full super set of all events,
or only the basic record.

Originally-by: Andi Kleen 
Signed-off-by: Kan Liang 
---

No changes since V3.

 arch/x86/events/intel/core.c  |   2 +
 arch/x86/events/intel/ds.c| 373 --
 arch/x86/events/intel/lbr.c   |  22 ++
 arch/x86/events/perf_event.h  |   9 +
 arch/x86/include/asm/msr-index.h  |   1 +
 arch/x86/include/asm/perf_event.h |  42 
 6 files changed, 429 insertions(+), 20 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8baa441d8000..620beae035a0 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3507,6 +3507,8 @@ static struct intel_excl_cntrs *allocate_excl_cntrs(int 
cpu)
 
 int intel_cpuc_prepare(struct cpu_hw_events *cpuc, int cpu)
 {
+   cpuc->pebs_record_size = x86_pmu.pebs_record_size;
+
if (x86_pmu.extra_regs || x86_pmu.lbr_sel_map) {
cpuc->shared_regs = allocate_shared_regs(cpu);
if (!cpuc->shared_regs)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index efc054aee3c1..1a076beb5fb1 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -906,17 +906,85 @@ static inline void pebs_update_threshold(struct 
cpu_hw_events *cpuc)
 
if (cpuc->n_pebs == cpuc->n_large_pebs) {
threshold = ds->pebs_absolute_maximum -
-   reserved * x86_pmu.pebs_record_size;
+   reserved * cpuc->pebs_record_size;
} else {
-   threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+   threshold = ds->pebs_buffer_base + cpuc->pebs_record_size;
}
 
ds->pebs_interrupt_threshold = threshold;
 }
 
+static void adaptive_pebs_record_size_update(void)
+{
+   struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+   u64 pebs_data_cfg = cpuc->pebs_data_cfg;
+   int sz = sizeof(struct pebs_basic);
+
+   if (pebs_data_cfg & PEBS_DATACFG_MEMINFO)
+   sz += sizeof(struct pebs_meminfo);
+   if (pebs_data_cfg & PEBS_DATACFG_GPRS)
+   sz += sizeof(struct pebs_gprs);
+   if (pebs_data_cfg & PEBS_DATACFG_XMMS)
+   sz += sizeof(struct pebs_xmm);
+   if (pebs_data_cfg & PEBS_DATACFG_LBRS)
+   sz += x86_pmu.lbr_nr * sizeof(struct pebs_lbr_entry);
+
+   cpuc->pebs_record_size = sz;
+}
+
+#define PERF_PEBS_MEMINFO_TYPE (PERF_SAMPLE_ADDR | PERF_SAMPLE_DATA_SRC |   \
+   PERF_SAMPLE_PHYS_ADDR | PERF_SAMPLE_WEIGHT | \
+   PERF_SAMPLE_TRANSACTION)
+
+static u64 pebs_update_adaptive_cfg(struct perf_event *event)
+{
+   struct perf_event_attr *attr = &event->attr;
+   u64 sample_type = attr->sample_type;
+   u64 pebs_data_cfg = 0;
+   bool gprs, tsx_weight;
+
+   if ((sample_type & ~(PERF_SAMPLE_IP|PERF_SAMPLE_TIME)) ||
+   attr->precise_ip < 2) {
+
+   if (sample_type & PERF_PEBS_MEMINFO_TYPE)
+   pebs_data_cfg |= PEBS_DATACFG_MEMINFO;
+
+   /*
+* Cases we need the registers:
+* + user requested registers
+* + precise_ip < 2 for the non event IP
+* + For RTM TSX weight we need GPRs too for the abort
+* code. But we don't want to force GPRs for all other
+* weights.  So collect it only for the RTM abort event.
+*/
+   gprs = (sample_type & PERF_SAMPLE_REGS_INTR) &&
+ (attr->sample_regs_intr & 0x);
+   tsx_weight = (sample_type & PERF_SAMPLE_WEIGHT) &&
+((attr->config & 0x) == 
x86_pmu.force_gpr_event);
+   if (gprs || 

[PATCH V4 06/23] perf/x86: Support constraint ranges

2019-03-26 Thread kan . liang
From: Peter Zijlstra 

Icelake extended the general counters to 8, even when SMT is enabled.
However only a (large) subset of the events can be used on all 8
counters.

The events that can or cannot be used on all counters are organized
in ranges.

A lot of scheduler constraints are required to handle all this.

To avoid blowing up the tables add event code ranges to the constraint
tables, and a new inline function to match them.

Originally-by: Andi Kleen 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Kan Liang 
---

No changes since V3.

 arch/x86/events/intel/core.c |  2 +-
 arch/x86/events/intel/ds.c   |  2 +-
 arch/x86/events/perf_event.h | 42 ++--
 3 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 620beae035a0..d5d796e114a1 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2688,7 +2688,7 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, int 
idx,
 
if (x86_pmu.event_constraints) {
for_each_event_constraint(c, x86_pmu.event_constraints) {
-   if ((event->hw.config & c->cmask) == c->code) {
+   if (constraint_match(c, event->hw.config)) {
event->hw.flags |= c->flags;
return c;
}
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 1a076beb5fb1..3ee1a0198c13 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -858,7 +858,7 @@ struct event_constraint *intel_pebs_constraints(struct 
perf_event *event)
 
if (x86_pmu.pebs_constraints) {
for_each_event_constraint(c, x86_pmu.pebs_constraints) {
-   if ((event->hw.config & c->cmask) == c->code) {
+   if (constraint_match(c, event->hw.config)) {
event->hw.flags |= c->flags;
return c;
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index f2351e47de3d..a502e9bb02bb 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -49,11 +49,12 @@ struct event_constraint {
unsigned long   idxmsk[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
u64 idxmsk64;
};
-   u64 code;
-   u64 cmask;
-   int weight;
-   int overlap;
-   int flags;
+   u64 code;
+   u64 cmask;
+   int weight;
+   int overlap;
+   int flags;
+   unsigned intsize;
 };
 /*
  * struct hw_perf_event.flags flags
@@ -71,6 +72,10 @@ struct event_constraint {
 #define PERF_X86_EVENT_AUTO_RELOAD 0x0400 /* use PEBS auto-reload */
 #define PERF_X86_EVENT_LARGE_PEBS  0x0800 /* use large PEBS */
 
+static inline bool constraint_match(struct event_constraint *c, u64 ecode)
+{
+   return ((ecode & c->cmask) - c->code) <= (u64)c->size;
+}
 
 struct amd_nb {
int nb_id;  /* NorthBridge id */
@@ -263,18 +268,29 @@ struct cpu_hw_events {
void*kfree_on_online[X86_PERF_KFREE_MAX];
 };
 
-#define __EVENT_CONSTRAINT(c, n, m, w, o, f) {\
+#define __EVENT_CONSTRAINT_RANGE(c, e, n, m, w, o, f) {\
{ .idxmsk64 = (n) },\
.code = (c),\
+   .size = (e) - (c),  \
.cmask = (m),   \
.weight = (w),  \
.overlap = (o), \
.flags = f, \
 }
 
+#define __EVENT_CONSTRAINT(c, n, m, w, o, f) \
+   __EVENT_CONSTRAINT_RANGE(c, c, n, m, w, o, f)
+
 #define EVENT_CONSTRAINT(c, n, m)  \
__EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0, 0)
 
+/*
+ * Only works for Intel events, which have 'small' event codes.
+ * Need to fix the range compare for 'big' event codes, e.g. AMD64_EVENTSEL_EVENT
+ */
+#define EVENT_CONSTRAINT_RANGE(c, e, n, m) \
+   __EVENT_CONSTRAINT_RANGE(c, e, n, m, HWEIGHT(n), 0, 0)
+
 #define INTEL_EXCLEVT_CONSTRAINT(c, n) \
__EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT, HWEIGHT(n),\
   0, PERF_X86_EVENT_EXCL)
@@ -309,6 +325,12 @@ struct cpu_hw_events {
 #define INTEL_EVENT_CONSTRAINT(c, n)   \
EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT)
 
+/*
+ * Constraint on a range of Event codes
+ */
+#define INTEL_EVENT_CONSTRAINT_RANGE(c, e, n)  \
+   EVENT_CONSTRAINT_RANGE(c, e, n, ARCH_PERFMON_EVENTSEL_EVENT)
+
 /*
  * Constraint on the Event code + UMask + fixed-mask
  *
@@ -356,6 +378,9 @@ struct cpu_hw_events {
 #define INTEL_FLAGS_EVENT_CONSTRAINT(c, n) \
EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)
 
+#define INTEL_FLAGS_EVENT_CONSTRAINT_RANGE(c, e, n)\
+   EVENT_CONSTRAINT_RANGE(c, e, n, INTEL_ARCH_EVENT_MASK|X86_ALL_EVENT_FLAGS)

[PATCH V4 10/23] perf/x86/msr: Add Icelake support

2019-03-26 Thread kan . liang
From: Kan Liang 

Icelake is the same as the existing Skylake parts.

Signed-off-by: Kan Liang 
---

No changes since V3.

 arch/x86/events/msr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/events/msr.c b/arch/x86/events/msr.c
index a878e6286e4a..f3f4c2263501 100644
--- a/arch/x86/events/msr.c
+++ b/arch/x86/events/msr.c
@@ -89,6 +89,7 @@ static bool test_intel(int idx)
case INTEL_FAM6_SKYLAKE_X:
case INTEL_FAM6_KABYLAKE_MOBILE:
case INTEL_FAM6_KABYLAKE_DESKTOP:
+   case INTEL_FAM6_ICELAKE_MOBILE:
if (idx == PERF_MSR_SMI || idx == PERF_MSR_PPERF)
return true;
break;
-- 
2.17.1



[PATCH V4 07/23] perf/x86/intel: Add Icelake support

2019-03-26 Thread kan . liang
From: Kan Liang 

Add Icelake core PMU perf code, including constraint tables and the main
enable code.

Icelake expanded the generic counters to always 8 even with HT on, but a
range of events cannot be scheduled on the extra 4 counters.
Add new constraint ranges to describe this to the scheduler.
The number of constraints that need to be checked is larger now than
with earlier CPUs.
At some point we may need a new data structure to look them up more
efficiently than with linear search. So far it still seems to be
acceptable however.

Icelake added a new fixed counter SLOTS. Full support for it is added
later in the patch series.

The cache events table is identical to Skylake.

Compared to the PEBS instruction event on a generic counter, fixed counter 0
has less skid. Always force instruction:ppp into fixed counter 0.

Originally-by: Andi Kleen 
Signed-off-by: Kan Liang 
---

No changes since V3.

 arch/x86/events/intel/core.c  | 112 ++
 arch/x86/events/intel/ds.c|  26 ++-
 arch/x86/events/perf_event.h  |   2 +
 arch/x86/include/asm/intel_ds.h   |   2 +-
 arch/x86/include/asm/perf_event.h |   2 +-
 5 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d5d796e114a1..ef95d73ef4f0 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -239,6 +239,35 @@ static struct extra_reg intel_skl_extra_regs[] 
__read_mostly = {
EVENT_EXTRA_END
 };
 
+static struct event_constraint intel_icl_event_constraints[] = {
+   FIXED_EVENT_CONSTRAINT(0x00c0, 0),  /* INST_RETIRED.ANY */
+   INTEL_UEVENT_CONSTRAINT(0x1c0, 0),  /* INST_RETIRED.PREC_DIST */
+   FIXED_EVENT_CONSTRAINT(0x003c, 1),  /* CPU_CLK_UNHALTED.CORE */
+   FIXED_EVENT_CONSTRAINT(0x0300, 2),  /* CPU_CLK_UNHALTED.REF */
+   FIXED_EVENT_CONSTRAINT(0x0400, 3),  /* SLOTS */
+   INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf),
+   INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf),
+   INTEL_EVENT_CONSTRAINT(0x32, 0xf),  /* SW_PREFETCH_ACCESS.* */
+   INTEL_EVENT_CONSTRAINT_RANGE(0x48, 0x54, 0xf),
+   INTEL_EVENT_CONSTRAINT_RANGE(0x60, 0x8b, 0xf),
+   INTEL_UEVENT_CONSTRAINT(0x04a3, 0xff),  /* CYCLE_ACTIVITY.STALLS_TOTAL 
*/
+   INTEL_UEVENT_CONSTRAINT(0x10a3, 0xff),  /* 
CYCLE_ACTIVITY.STALLS_MEM_ANY */
+   INTEL_EVENT_CONSTRAINT(0xa3, 0xf),  /* CYCLE_ACTIVITY.* */
+   INTEL_EVENT_CONSTRAINT_RANGE(0xa8, 0xb0, 0xf),
+   INTEL_EVENT_CONSTRAINT_RANGE(0xb7, 0xbd, 0xf),
+   INTEL_EVENT_CONSTRAINT_RANGE(0xd0, 0xe6, 0xf),
+   INTEL_EVENT_CONSTRAINT_RANGE(0xf0, 0xf4, 0xf),
+   EVENT_CONSTRAINT_END
+};
+
+static struct extra_reg intel_icl_extra_regs[] __read_mostly = {
+   INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3f9fffull, 
RSP_0),
+   INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3f9fffull, 
RSP_1),
+   INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
+   INTEL_UEVENT_EXTRA_REG(0x01c6, MSR_PEBS_FRONTEND, 0x7fff17, FE),
+   EVENT_EXTRA_END
+};
+
 EVENT_ATTR_STR(mem-loads,  mem_ld_nhm, 
"event=0x0b,umask=0x10,ldlat=3");
 EVENT_ATTR_STR(mem-loads,  mem_ld_snb, "event=0xcd,umask=0x1,ldlat=3");
 EVENT_ATTR_STR(mem-stores, mem_st_snb, "event=0xcd,umask=0x2");
@@ -3366,6 +3395,9 @@ static struct event_constraint counter0_constraint =
 static struct event_constraint counter2_constraint =
EVENT_CONSTRAINT(0, 0x4, 0);
 
+static struct event_constraint fixed_counter0_constraint =
+   FIXED_EVENT_CONSTRAINT(0x00c0, 0);
+
 static struct event_constraint *
 hsw_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
  struct perf_event *event)
@@ -3384,6 +3416,21 @@ hsw_get_event_constraints(struct cpu_hw_events *cpuc, 
int idx,
return c;
 }
 
+static struct event_constraint *
+icl_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
+ struct perf_event *event)
+{
+   /*
+* Fixed counter 0 has less skid.
+* Force instruction:ppp in Fixed counter 0
+*/
+   if ((event->attr.precise_ip == 3) &&
+   ((event->hw.config & X86_RAW_EVENT_MASK) == 0x00c0))
+   return &fixed_counter0_constraint;
+
+   return hsw_get_event_constraints(cpuc, idx, event);
+}
+
 static struct event_constraint *
 glp_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
  struct perf_event *event)
@@ -4110,6 +4157,42 @@ static struct attribute *hsw_tsx_events_attrs[] = {
NULL
 };
 
+EVENT_ATTR_STR(tx-capacity-read,  tx_capacity_read,  "event=0x54,umask=0x80");
+EVENT_ATTR_STR(tx-capacity-write, tx_capacity_write, "event=0x54,umask=0x2");
+EVENT_ATTR_STR(el-capacity-read,  el_capacity_read,  "event=0x54,umask=0x80");
+EVENT_ATTR_STR(el-capacity-write, el_capacity_write, "event=0x54,umask=0x2");
+
+static struct attribute 

[PATCH V4 15/23] perf/x86/intel: Support hardware TopDown metrics

2019-03-26 Thread kan . liang
From: Kan Liang 

Intro
=

Icelake has support for measuring the four top level TopDown metrics
directly in hardware. This is implemented by an additional "metrics"
register, and a new Fixed Counter 3 that measures pipeline "slots".

Events
==

We export four metric events as separate perf events, which map to
internal "metrics" counter register. Those events do not exist in
hardware, but can be allocated by the scheduler.

For the event mapping we use a special 0xff event code, which is
reserved for software.

When setting up such events they point to the slots counter, and a
special callback, metric_update_event(), reads the additional metrics
msr to generate the metrics. Then the metric is reported by multiplying
the metric (percentage) with slots.

This multiplication makes it easy to keep a running count, for example
when the slots counter overflows, and makes all the standard tools, such
as perf stat, work. They can compute deltas of the values without needing
to know about percentages. It also simplifies accumulating the counts
of child events, which otherwise would need to know how to average
percent values.
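
As a rough illustration (not code from this patch), turning one 8-bit
metric ratio plus the SLOTS delta of the same period into a per-category
count that can simply be summed across periods might look like:

#include <stdint.h>

/*
 * Hypothetical helper: metric_byte is one 8-bit ratio field of
 * PERF_METRICS (0xff == 100%), slots_delta is the SLOTS increment over
 * the same period.  The result is a slot count, so deltas and
 * child-event sums work like for any other counter value.
 */
static inline uint64_t metric_to_slots(uint64_t metric_byte, uint64_t slots_delta)
{
	return (slots_delta * metric_byte) / 0xff;
}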

Groups
==

To avoid reading the METRICS register multiple times, the metrics and
slots value are cached. This only works when multiple sub-events are in
the same group.

Resetting1
==

The 8bit metrics ratio values lose precision when the measurement period
gets longer.

To avoid this we always reset the metric value when reading, as we
already accumulate the count in the perf count value.

For a long period read, low precision is acceptable.
For a short period read, the register will be reset often enough that it
is not a problem.

This implies that to read more than one submetric, a group always needs
to be used, so that the caching above still gives the correct value.

We also need to support this in the NMI, so that it's possible to
collect all top down metrics as part of leader sampling. To avoid races
with the normal transactions use a special nmi_metric cache that is only
used during the NMI.

Resetting2
==

PERF_METRICS may report a wrong value if its delta is less than 1/255
of SLOTS (fixed counter 3).

To avoid this, the PERF_METRICS and SLOTS registers have to be reset
simultaneously. The slots value has to be cached as well.

In counting mode, -max_period is the initial value of SLOTS. This huge
initial value will definitely trigger the issue mentioned above, so
force the initial value to 0 for topdown and slots event counting.

RDPMC
=
The TopDown events can be collected per thread/process. To use TopDown
through RDPMC in applications on Icelake, the metrics and slots values
have to be saved/restored during context switching.

Add specific set_period() to specially handle the slots and metrics
event. Because,
 - The initial value must be 0.
 - Only need to restore the value in context switch. For other cases,
   the counters have been cleared after read.

Originally-by: Andi Kleen 
Signed-off-by: Kan Liang 
---

No changes since V3.

 arch/x86/events/core.c   |  40 +--
 arch/x86/events/intel/core.c | 192 +++
 arch/x86/events/perf_event.h |  14 +++
 arch/x86/include/asm/msr-index.h |   2 +
 include/linux/perf_event.h   |   5 +
 5 files changed, 242 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d24f8d009529..7d4d56f76436 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -91,16 +91,20 @@ u64 x86_perf_event_update(struct perf_event *event)
new_raw_count) != prev_raw_count)
goto again;
 
-   /*
-* Now we have the new raw value and have updated the prev
-* timestamp already. We can now calculate the elapsed delta
-* (event-)time and add that to the generic event.
-*
-* Careful, not all hw sign-extends above the physical width
-* of the count.
-*/
-   delta = (new_raw_count << shift) - (prev_raw_count << shift);
-   delta >>= shift;
+   if (unlikely(hwc->flags & PERF_X86_EVENT_UPDATE))
+   delta = x86_pmu.metric_update_event(event, new_raw_count);
+   else {
+   /*
+* Now we have the new raw value and have updated the prev
+* timestamp already. We can now calculate the elapsed delta
+* (event-)time and add that to the generic event.
+*
+* Careful, not all hw sign-extends above the physical width
+* of the count.
+*/
+   delta = (new_raw_count << shift) - (prev_raw_count << shift);
+   delta >>= shift;
+   }
 
	local64_add(delta, &event->count);
	local64_sub(delta, &hwc->period_left);
@@ -974,6 +978,10 @@ static int collect_events(struct cpu_hw_events *cpuc, 
struct perf_event *leader,
 
	max_count = x86_pmu.num_counters + x86_pmu.num_counters_fixed;
