date:20170417

[PATCH V3 2/2] ARM64: dts: hi6220-hikey: Add clock binding for the pmic mfd

2017-04-17 Thread Daniel Lezcano

Signed-off-by: Daniel Lezcano 
---
 Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt | 6 ++
 arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts | 1 +
 2 files changed, 7 insertions(+)

diff --git a/Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt 
b/Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt
index 0548569..9630ac0 100644
--- a/Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt
+++ b/Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt
@@ -16,6 +16,11 @@ Required properties:
 - reg:  Base address of PMIC on Hi6220 SoC.
 - interrupt-controller: Hi655x has internal IRQs (has own IRQ domain).
 - pmic-gpios:   The GPIO used by PMIC IRQ.
+- #clock-cells:From common clock binding; shall be set to 0
+
+Optional properties:
+- clock-output-names: From common clock binding to override the
+  default output clock name
 
 Example:
pmic: pmic@f800 {
@@ -24,4 +29,5 @@ Example:
interrupt-controller;
#interrupt-cells = <2>;
pmic-gpios = < 2 GPIO_ACTIVE_HIGH>;
+   #clock-cells = <0>;
}
diff --git a/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts 
b/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
index dba3c13..e0496f7 100644
--- a/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
+++ b/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
@@ -325,6 +325,7 @@
pmic: pmic@f800 {
compatible = "hisilicon,hi655x-pmic";
reg = <0x0 0xf800 0x0 0x1000>;
+   #clock-cells = <0>;
interrupt-controller;
#interrupt-cells = <2>;
pmic-gpios = < 2 GPIO_ACTIVE_HIGH>;
-- 
1.9.1

[PATCH V3 2/2] ARM64: dts: hi6220-hikey: Add clock binding for the pmic mfd

2017-04-17 Thread Daniel Lezcano

Signed-off-by: Daniel Lezcano 
---
 Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt | 6 ++
 arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts | 1 +
 2 files changed, 7 insertions(+)

diff --git a/Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt 
b/Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt
index 0548569..9630ac0 100644
--- a/Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt
+++ b/Documentation/devicetree/bindings/mfd/hisilicon,hi655x.txt
@@ -16,6 +16,11 @@ Required properties:
 - reg:  Base address of PMIC on Hi6220 SoC.
 - interrupt-controller: Hi655x has internal IRQs (has own IRQ domain).
 - pmic-gpios:   The GPIO used by PMIC IRQ.
+- #clock-cells:From common clock binding; shall be set to 0
+
+Optional properties:
+- clock-output-names: From common clock binding to override the
+  default output clock name
 
 Example:
pmic: pmic@f800 {
@@ -24,4 +29,5 @@ Example:
interrupt-controller;
#interrupt-cells = <2>;
pmic-gpios = < 2 GPIO_ACTIVE_HIGH>;
+   #clock-cells = <0>;
}
diff --git a/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts 
b/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
index dba3c13..e0496f7 100644
--- a/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
+++ b/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
@@ -325,6 +325,7 @@
pmic: pmic@f800 {
compatible = "hisilicon,hi655x-pmic";
reg = <0x0 0xf800 0x0 0x1000>;
+   #clock-cells = <0>;
interrupt-controller;
#interrupt-cells = <2>;
pmic-gpios = < 2 GPIO_ACTIVE_HIGH>;
-- 
1.9.1

[PATCH V3 1/2] clk: hi6220: Add the hi655x's pmic clock

2017-04-17 Thread Daniel Lezcano

The hi655x multi function device is a PMIC providing regulators.

The PMIC also provides a clock for the WiFi and the Bluetooth, let's implement
this clock in order to add it in the hi655x MFD and allow proper wireless
initialization.

Signed-off-by: Daniel Lezcano 
---

Changelog:

 V3:
- Removed mfd hi655x fold from V2
- Fixed #clock-cells
- Used struct clk_init_data on the stack
- Swapped lookup name for devm_clk_hw_register
 V2:
- Added COMPILE_TEST option, compiled on x86
- Removed useless parenthesis
- Used of_clk_hw_simple_get() instead of deref dance
- Do bailout if the clock-names is not specified
- Rollback on error
- Folded mfd line change and binding
- Added #clock-cells binding documentation
- Added #clock-cells in the DT

 V1: initial post
---
 drivers/clk/Kconfig  |   8 +++
 drivers/clk/Makefile |   1 +
 drivers/clk/clk-hi655x.c | 136 +++
 3 files changed, 145 insertions(+)
 create mode 100644 drivers/clk/clk-hi655x.c

diff --git a/drivers/clk/Kconfig b/drivers/clk/Kconfig
index 9356ab4..36cfea3 100644
--- a/drivers/clk/Kconfig
+++ b/drivers/clk/Kconfig
@@ -47,6 +47,14 @@ config COMMON_CLK_RK808
  clocked at 32KHz each. Clkout1 is always on, Clkout2 can off
  by control register.
 
+config COMMON_CLK_HI655X
+   tristate "Clock driver for Hi655x"
+   depends on MFD_HI655X_PMIC || COMPILE_TEST
+   ---help---
+ This driver supports the hi655x PMIC clock. This
+ multi-function device has one fixed-rate oscillator, clocked
+ at 32KHz.
+
 config COMMON_CLK_SCPI
tristate "Clock driver controlled via SCPI interface"
depends on ARM_SCPI_PROTOCOL || COMPILE_TEST
diff --git a/drivers/clk/Makefile b/drivers/clk/Makefile
index 92c12b8..c19983a 100644
--- a/drivers/clk/Makefile
+++ b/drivers/clk/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_COMMON_CLK_PALMAS)   += clk-palmas.o
 obj-$(CONFIG_COMMON_CLK_PWM)   += clk-pwm.o
 obj-$(CONFIG_CLK_QORIQ)+= clk-qoriq.o
 obj-$(CONFIG_COMMON_CLK_RK808) += clk-rk808.o
+obj-$(CONFIG_COMMON_CLK_HI655X)+= clk-hi655x.o
 obj-$(CONFIG_COMMON_CLK_S2MPS11)   += clk-s2mps11.o
 obj-$(CONFIG_COMMON_CLK_SCPI)   += clk-scpi.o
 obj-$(CONFIG_COMMON_CLK_SI5351)+= clk-si5351.o
diff --git a/drivers/clk/clk-hi655x.c b/drivers/clk/clk-hi655x.c
new file mode 100644
index 000..4384c6f
--- /dev/null
+++ b/drivers/clk/clk-hi655x.c
@@ -0,0 +1,136 @@
+/*
+ * Clock driver for Hi655x
+ *
+ * Copyright (c) 2017, Linaro Ltd.
+ *
+ * Author: Daniel Lezcano 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define HI655X_CLK_BASEHI655X_BUS_ADDR(0x1c)
+#define HI655X_CLK_SET BIT(6)
+
+struct hi655x_clk {
+   struct hi655x_pmic *hi655x;
+   struct clk_hw   clk_hw;
+};
+
+static unsigned long hi655x_clk_recalc_rate(struct clk_hw *hw,
+   unsigned long parent_rate)
+{
+   return 32768;
+}
+
+static int hi655x_clk_enable(struct clk_hw *hw, bool enable)
+{
+   struct hi655x_clk *hi655x_clk =
+   container_of(hw, struct hi655x_clk, clk_hw);
+
+   struct hi655x_pmic *hi655x = hi655x_clk->hi655x;
+
+   return regmap_update_bits(hi655x->regmap, HI655X_CLK_BASE,
+ HI655X_CLK_SET, enable ? HI655X_CLK_SET : 0);
+}
+
+static int hi655x_clk_prepare(struct clk_hw *hw)
+{
+   return hi655x_clk_enable(hw, true);
+}
+
+static void hi655x_clk_unprepare(struct clk_hw *hw)
+{
+   hi655x_clk_enable(hw, false);
+}
+
+static int hi655x_clk_is_prepared(struct clk_hw *hw)
+{
+   struct hi655x_clk *hi655x_clk =
+   container_of(hw, struct hi655x_clk, clk_hw);
+   struct hi655x_pmic *hi655x = hi655x_clk->hi655x;
+   int ret;
+   uint32_t val;
+
+   ret = regmap_read(hi655x->regmap, HI655X_CLK_BASE, );
+   if (ret < 0)
+   return ret;
+
+   return val & HI655X_CLK_BASE;
+}
+
+static const struct clk_ops hi655x_clk_ops = {
+   .prepare = hi655x_clk_prepare,
+   .unprepare   = hi655x_clk_unprepare,
+   .is_prepared = hi655x_clk_is_prepared,
+   .recalc_rate = hi655x_clk_recalc_rate,
+};
+
+static int hi655x_clk_probe(struct platform_device *pdev)
+{
+   struct device *parent = pdev->dev.parent;
+   struct hi655x_pmic

[PATCH V3 1/2] clk: hi6220: Add the hi655x's pmic clock

2017-04-17 Thread Daniel Lezcano

The hi655x multi function device is a PMIC providing regulators.

The PMIC also provides a clock for the WiFi and the Bluetooth, let's implement
this clock in order to add it in the hi655x MFD and allow proper wireless
initialization.

Signed-off-by: Daniel Lezcano 
---

Changelog:

 V3:
- Removed mfd hi655x fold from V2
- Fixed #clock-cells
- Used struct clk_init_data on the stack
- Swapped lookup name for devm_clk_hw_register
 V2:
- Added COMPILE_TEST option, compiled on x86
- Removed useless parenthesis
- Used of_clk_hw_simple_get() instead of deref dance
- Do bailout if the clock-names is not specified
- Rollback on error
- Folded mfd line change and binding
- Added #clock-cells binding documentation
- Added #clock-cells in the DT

 V1: initial post
---
 drivers/clk/Kconfig  |   8 +++
 drivers/clk/Makefile |   1 +
 drivers/clk/clk-hi655x.c | 136 +++
 3 files changed, 145 insertions(+)
 create mode 100644 drivers/clk/clk-hi655x.c

diff --git a/drivers/clk/Kconfig b/drivers/clk/Kconfig
index 9356ab4..36cfea3 100644
--- a/drivers/clk/Kconfig
+++ b/drivers/clk/Kconfig
@@ -47,6 +47,14 @@ config COMMON_CLK_RK808
  clocked at 32KHz each. Clkout1 is always on, Clkout2 can off
  by control register.
 
+config COMMON_CLK_HI655X
+   tristate "Clock driver for Hi655x"
+   depends on MFD_HI655X_PMIC || COMPILE_TEST
+   ---help---
+ This driver supports the hi655x PMIC clock. This
+ multi-function device has one fixed-rate oscillator, clocked
+ at 32KHz.
+
 config COMMON_CLK_SCPI
tristate "Clock driver controlled via SCPI interface"
depends on ARM_SCPI_PROTOCOL || COMPILE_TEST
diff --git a/drivers/clk/Makefile b/drivers/clk/Makefile
index 92c12b8..c19983a 100644
--- a/drivers/clk/Makefile
+++ b/drivers/clk/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_COMMON_CLK_PALMAS)   += clk-palmas.o
 obj-$(CONFIG_COMMON_CLK_PWM)   += clk-pwm.o
 obj-$(CONFIG_CLK_QORIQ)+= clk-qoriq.o
 obj-$(CONFIG_COMMON_CLK_RK808) += clk-rk808.o
+obj-$(CONFIG_COMMON_CLK_HI655X)+= clk-hi655x.o
 obj-$(CONFIG_COMMON_CLK_S2MPS11)   += clk-s2mps11.o
 obj-$(CONFIG_COMMON_CLK_SCPI)   += clk-scpi.o
 obj-$(CONFIG_COMMON_CLK_SI5351)+= clk-si5351.o
diff --git a/drivers/clk/clk-hi655x.c b/drivers/clk/clk-hi655x.c
new file mode 100644
index 000..4384c6f
--- /dev/null
+++ b/drivers/clk/clk-hi655x.c
@@ -0,0 +1,136 @@
+/*
+ * Clock driver for Hi655x
+ *
+ * Copyright (c) 2017, Linaro Ltd.
+ *
+ * Author: Daniel Lezcano 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define HI655X_CLK_BASEHI655X_BUS_ADDR(0x1c)
+#define HI655X_CLK_SET BIT(6)
+
+struct hi655x_clk {
+   struct hi655x_pmic *hi655x;
+   struct clk_hw   clk_hw;
+};
+
+static unsigned long hi655x_clk_recalc_rate(struct clk_hw *hw,
+   unsigned long parent_rate)
+{
+   return 32768;
+}
+
+static int hi655x_clk_enable(struct clk_hw *hw, bool enable)
+{
+   struct hi655x_clk *hi655x_clk =
+   container_of(hw, struct hi655x_clk, clk_hw);
+
+   struct hi655x_pmic *hi655x = hi655x_clk->hi655x;
+
+   return regmap_update_bits(hi655x->regmap, HI655X_CLK_BASE,
+ HI655X_CLK_SET, enable ? HI655X_CLK_SET : 0);
+}
+
+static int hi655x_clk_prepare(struct clk_hw *hw)
+{
+   return hi655x_clk_enable(hw, true);
+}
+
+static void hi655x_clk_unprepare(struct clk_hw *hw)
+{
+   hi655x_clk_enable(hw, false);
+}
+
+static int hi655x_clk_is_prepared(struct clk_hw *hw)
+{
+   struct hi655x_clk *hi655x_clk =
+   container_of(hw, struct hi655x_clk, clk_hw);
+   struct hi655x_pmic *hi655x = hi655x_clk->hi655x;
+   int ret;
+   uint32_t val;
+
+   ret = regmap_read(hi655x->regmap, HI655X_CLK_BASE, );
+   if (ret < 0)
+   return ret;
+
+   return val & HI655X_CLK_BASE;
+}
+
+static const struct clk_ops hi655x_clk_ops = {
+   .prepare = hi655x_clk_prepare,
+   .unprepare   = hi655x_clk_unprepare,
+   .is_prepared = hi655x_clk_is_prepared,
+   .recalc_rate = hi655x_clk_recalc_rate,
+};
+
+static int hi655x_clk_probe(struct platform_device *pdev)
+{
+   struct device *parent = pdev->dev.parent;
+   struct hi655x_pmic *hi655x = dev_get_drvdata(parent);
+   struct

Re: [PATCH v3 1/8] trace: ras: add ARM processor error information trace event

2017-04-17 Thread Baicar, Tyler


On 4/16/2017 9:16 PM, Xie XiuQi wrote:

On 2017/4/17 11:08, Xie XiuQi wrote:

On 3/30/2017 4:31 AM, Xie XiuQi wrote:

Add a new trace event for ARM processor error information, so that
the user will know what error occurred. With this information the
user may take appropriate action.

These trace events are consistent with the ARM processor error
information table which defined in UEFI 2.6 spec section N.2.4.4.1.

---
v2: add trace enabled condition as Steven's suggestion.
  fix a typo.
---

Cc: Steven Rostedt 
Cc: Tyler Baicar 
Signed-off-by: Xie XiuQi 
---



...

+/*
+ * First define the enums in MM_ACTION_RESULT to be exported to userspace
+ * via TRACE_DEFINE_ENUM().
+ */
+#undef EM
+#undef EMe
+#define EM(a, b) TRACE_DEFINE_ENUM(a);
+#define EMe(a, b)TRACE_DEFINE_ENUM(a);
+
+ARM_PROC_ERR_TYPE
+ARM_PROC_ERR_FLAGS

Are the above two lines supposed to be here?

+
+/*
+ * Now redefine the EM() and EMe() macros to map the enums to the strings
+ * that will be printed in the output.
+ */
+#undef EM
+#undef EMe
+#define EM(a, b){ a, b },
+#define EMe(a, b){ a, b }
+
+TRACE_EVENT(arm_proc_err,

I think it would be better to keep this similar to the naming of the current RAS trace 
events (right now we have mc_event, arm_event, aer_event, etc.). I would suggest using 
"arm_err_info_event" since this is handling the error information structures of 
the arm errors.

+
+TP_PROTO(const struct cper_arm_err_info *err),
+
+TP_ARGS(err),
+
+TP_STRUCT__entry(
+__field(u8, type)
+__field(u16, multiple_error)
+__field(u8, flags)
+__field(u64, error_info)
+__field(u64, virt_fault_addr)
+__field(u64, physical_fault_addr)

Validation bits should also be a part of this structure that way user space 
tools will know which of these fields are valid.

Could we use the default value to check the validation which we have checked in 
TP_fast_assign?

Yes, true...I guess we really don't need the validation bits then.

+),
+
+TP_fast_assign(
+__entry->type = err->type;
+
+if (err->validation_bits & CPER_ARM_INFO_VALID_MULTI_ERR)
+__entry->multiple_error = err->multiple_error;
+else
+__entry->multiple_error = ~0;
+
+if (err->validation_bits & CPER_ARM_INFO_VALID_FLAGS)
+__entry->flags = err->flags;
+else
+__entry->flags = ~0;
+
+if (err->validation_bits & CPER_ARM_INFO_VALID_ERR_INFO)
+__entry->error_info = err->error_info;
+else
+__entry->error_info = 0ULL;
+
+if (err->validation_bits & CPER_ARM_INFO_VALID_VIRT_ADDR)
+__entry->virt_fault_addr = err->virt_fault_addr;
+else
+__entry->virt_fault_addr = 0ULL;
+
+if (err->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR)
+__entry->physical_fault_addr = err->physical_fault_addr;
+else
+__entry->physical_fault_addr = 0ULL;
+),
+
+TP_printk("ARM Processor Error: type %s; count: %u; flags: %s;"

I think the "ARM Processor Error:" part of this should just be removed. Here's 
the output with this removed and the trace event renamed to arm_err_info_event. I think 
this looks much cleaner and matches the style used with the arm_event.

   -0 [020] .ns.   366.592434: arm_event: affinity level: 2; 
MPIDR: ; MIDR: 510f8000; running state: 1; PSCI state: 0
   -0 [020] .ns.   366.592437: arm_err_info_event: type cache 
error; count: 0; flags: 0x3; error info: 00c20058; virtual address: 
; physical address: 

As this section is ARM Processor Error Section, how about use 
arm_proc_err_event?
This is not for the ARM Processor Error Section, that is what the 
arm_event is handling. What you are adding this trace support for here 
is called the ARM Processor Error Information (UEFI 2.6 spec section 
N.2.4.4.1). So I think your trace event here should be called 
arm_err_info_event. This will also be consistent with the other two 
trace events that I'm planning on adding:


arm_ctx_info_event: ARM Processor Context Information (UEFI 2.6 section 
N.2.4.4.2)
arm_vendor_info_event: This is the "Vendor Specific Error Information" 
in the ARM Processor Error Section (Table 260). It's possible I may just 
add this into the arm_event trace event, but I haven't looked into it 
enough yet.


Thanks,
Tyler

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

Re: [PATCH v3 1/8] trace: ras: add ARM processor error information trace event

2017-04-17 Thread Baicar, Tyler


On 4/16/2017 9:16 PM, Xie XiuQi wrote:

On 2017/4/17 11:08, Xie XiuQi wrote:

On 3/30/2017 4:31 AM, Xie XiuQi wrote:

Add a new trace event for ARM processor error information, so that
the user will know what error occurred. With this information the
user may take appropriate action.

These trace events are consistent with the ARM processor error
information table which defined in UEFI 2.6 spec section N.2.4.4.1.

---
v2: add trace enabled condition as Steven's suggestion.
  fix a typo.
---

Cc: Steven Rostedt 
Cc: Tyler Baicar 
Signed-off-by: Xie XiuQi 
---



...

+/*
+ * First define the enums in MM_ACTION_RESULT to be exported to userspace
+ * via TRACE_DEFINE_ENUM().
+ */
+#undef EM
+#undef EMe
+#define EM(a, b) TRACE_DEFINE_ENUM(a);
+#define EMe(a, b)TRACE_DEFINE_ENUM(a);
+
+ARM_PROC_ERR_TYPE
+ARM_PROC_ERR_FLAGS

Are the above two lines supposed to be here?

+
+/*
+ * Now redefine the EM() and EMe() macros to map the enums to the strings
+ * that will be printed in the output.
+ */
+#undef EM
+#undef EMe
+#define EM(a, b){ a, b },
+#define EMe(a, b){ a, b }
+
+TRACE_EVENT(arm_proc_err,

I think it would be better to keep this similar to the naming of the current RAS trace 
events (right now we have mc_event, arm_event, aer_event, etc.). I would suggest using 
"arm_err_info_event" since this is handling the error information structures of 
the arm errors.

+
+TP_PROTO(const struct cper_arm_err_info *err),
+
+TP_ARGS(err),
+
+TP_STRUCT__entry(
+__field(u8, type)
+__field(u16, multiple_error)
+__field(u8, flags)
+__field(u64, error_info)
+__field(u64, virt_fault_addr)
+__field(u64, physical_fault_addr)

Validation bits should also be a part of this structure that way user space 
tools will know which of these fields are valid.

Could we use the default value to check the validation which we have checked in 
TP_fast_assign?

Yes, true...I guess we really don't need the validation bits then.

+),
+
+TP_fast_assign(
+__entry->type = err->type;
+
+if (err->validation_bits & CPER_ARM_INFO_VALID_MULTI_ERR)
+__entry->multiple_error = err->multiple_error;
+else
+__entry->multiple_error = ~0;
+
+if (err->validation_bits & CPER_ARM_INFO_VALID_FLAGS)
+__entry->flags = err->flags;
+else
+__entry->flags = ~0;
+
+if (err->validation_bits & CPER_ARM_INFO_VALID_ERR_INFO)
+__entry->error_info = err->error_info;
+else
+__entry->error_info = 0ULL;
+
+if (err->validation_bits & CPER_ARM_INFO_VALID_VIRT_ADDR)
+__entry->virt_fault_addr = err->virt_fault_addr;
+else
+__entry->virt_fault_addr = 0ULL;
+
+if (err->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR)
+__entry->physical_fault_addr = err->physical_fault_addr;
+else
+__entry->physical_fault_addr = 0ULL;
+),
+
+TP_printk("ARM Processor Error: type %s; count: %u; flags: %s;"

I think the "ARM Processor Error:" part of this should just be removed. Here's 
the output with this removed and the trace event renamed to arm_err_info_event. I think 
this looks much cleaner and matches the style used with the arm_event.

   -0 [020] .ns.   366.592434: arm_event: affinity level: 2; 
MPIDR: ; MIDR: 510f8000; running state: 1; PSCI state: 0
   -0 [020] .ns.   366.592437: arm_err_info_event: type cache 
error; count: 0; flags: 0x3; error info: 00c20058; virtual address: 
; physical address: 

As this section is ARM Processor Error Section, how about use 
arm_proc_err_event?
This is not for the ARM Processor Error Section, that is what the 
arm_event is handling. What you are adding this trace support for here 
is called the ARM Processor Error Information (UEFI 2.6 spec section 
N.2.4.4.1). So I think your trace event here should be called 
arm_err_info_event. This will also be consistent with the other two 
trace events that I'm planning on adding:


arm_ctx_info_event: ARM Processor Context Information (UEFI 2.6 section 
N.2.4.4.2)
arm_vendor_info_event: This is the "Vendor Specific Error Information" 
in the ARM Processor Error Section (Table 260). It's possible I may just 
add this into the arm_event trace event, but I haven't looked into it 
enough yet.


Thanks,
Tyler

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

Re: [PATCH v5 2/2] thermal: core: Add a back up thermal shutdown mechanism

2017-04-17 Thread Eduardo Valentin

On Sat, Apr 15, 2017 at 08:38:29AM +0530, Keerthy wrote:
> orderly_poweroff is triggered when a graceful shutdown
> of system is desired. This may be used in many critical states of the
> kernel such as when subsystems detects conditions such as critical
> temperature conditions. However, in certain conditions in system
> boot up sequences like those in the middle of driver probes being
> initiated, userspace will be unable to power off the system in a clean
> manner and leaves the system in a critical state. In cases like these,
> the /sbin/poweroff will return success (having forked off to attempt
> powering off the system. However, the system overall will fail to
> completely poweroff (since other modules will be probed) and the system
> is still functional with no userspace (since that would have shut itself
> off).
> 
> However, there is no clean way of detecting such failure of userspace
> powering off the system. In such scenarios, it is necessary for a backup
> workqueue to be able to force a shutdown of the system when orderly
> shutdown is not successful after a configurable time period.
> 
> Reported-by: Nishanth Menon 
> Signed-off-by: Keerthy 
> ---
> 
> Changes in v5:
> 
>   * Mandated delay for thermal emergency poweroff to be a non-zero value.
> 
> Changes in v4:
> 
>   * Updated documentation
>   * changed emergency_poweroff_func to thermal_emergency_poweroff_func
> 
> Changes in v3:
> 
>   * Removed unnecessary mutex init.
>   * Added WARN messages instead of a simple warning message.
>   * Added Documentation.
> 
>  Documentation/thermal/sysfs-api.txt | 21 +++
>  drivers/thermal/Kconfig | 15 +++
>  drivers/thermal/thermal_core.c  | 53 
> +
>  3 files changed, 89 insertions(+)
> 
> diff --git a/Documentation/thermal/sysfs-api.txt 
> b/Documentation/thermal/sysfs-api.txt
> index ef473dc..98dc04f 100644
> --- a/Documentation/thermal/sysfs-api.txt
> +++ b/Documentation/thermal/sysfs-api.txt
> @@ -582,3 +582,24 @@ platform data is provided, this uses the step_wise 
> throttling policy.
>  This function serves as an arbitrator to set the state of a cooling
>  device. It sets the cooling device to the deepest cooling state if
>  possible.
> +
> +6. thermal_emergency_poweroff:
> +
> +On an event of critical trip temperature crossing. Thermal framework
> +allows the system to shutdown gracefully by calling orderly_poweroff().
> +In the event of a failure of orderly_poweroff() to shut down the system
> +we are in danger of keeping the system alive at undesirably high
> +temperatures. To mitigate this high risk scenario we program a work
> +queue to fire after a pre-determined number of seconds to start
> +an emergency shutdown of the device using the kernel_power_off()
> +function. In case kernel_power_off() fails then finally
> +emergency_restart() is called in the worst case.
> +
> +The delay should be carefully profiled so as to give adequate time for
> +orderly_poweroff(). In case of failure of an orderly_poweroff() the
> +emergency poweroff kicks in after the delay has elapsed and shuts down
> +the system.
> +
> +If set to 0 emergency poweroff will not be supported. So a carefully
> +profiled non-zero positive value is a must for emergerncy poweroff to be
> +triggered.
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 9347401..2a748a6 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -15,6 +15,21 @@ menuconfig THERMAL
>  
>  if THERMAL
>  
> +config THERMAL_EMERGENCY_POWEROFF_DELAY_MS
> + int "Emergency poweroff delay in milli-seconds"
> + depends on THERMAL
> + default 0
> + help
> +   The number of milliseconds to delay before emergency
> +   poweroff kicks in. The delay should be carefully profiled
> +   so as to give adequate time for orderly_poweroff(). In case
> +   of failure of an orderly_poweroff() the emergency poweroff
> +   kicks in after the delay has elapsed and shuts down the system.
> +
> +   If set to 0 emergency poweroff will not be supported. So a carefully
> +   profiled non-zero positive value is a must for emergerncy poweroff to 
> be
> +   triggered.

Here is a suggestion for rephrase the above:

 +  help
 +Thermal subsystem will issue a graceful shutdown when
 +critical temperatures are reached using orderly_poweroff(). In
 +case of failure of an orderly_poweroff(), the thermal emergency 
poweroff
 +kicks in after a delay has elapsed and shuts down the system.
 +This config is number of milliseconds to delay before emergency
 +poweroff kicks in. Similarly to the critical trip point,
 +the delay should be carefully profiled so as to give adequate
 +time for orderly_poweroff() to finish on regular execution.
 +If set to 0 emergency poweroff will not be supported.
 +
 +In doubt, leave as 0.

Re: [PATCH v5 2/2] thermal: core: Add a back up thermal shutdown mechanism

2017-04-17 Thread Eduardo Valentin

On Sat, Apr 15, 2017 at 08:38:29AM +0530, Keerthy wrote:
> orderly_poweroff is triggered when a graceful shutdown
> of system is desired. This may be used in many critical states of the
> kernel such as when subsystems detects conditions such as critical
> temperature conditions. However, in certain conditions in system
> boot up sequences like those in the middle of driver probes being
> initiated, userspace will be unable to power off the system in a clean
> manner and leaves the system in a critical state. In cases like these,
> the /sbin/poweroff will return success (having forked off to attempt
> powering off the system. However, the system overall will fail to
> completely poweroff (since other modules will be probed) and the system
> is still functional with no userspace (since that would have shut itself
> off).
> 
> However, there is no clean way of detecting such failure of userspace
> powering off the system. In such scenarios, it is necessary for a backup
> workqueue to be able to force a shutdown of the system when orderly
> shutdown is not successful after a configurable time period.
> 
> Reported-by: Nishanth Menon 
> Signed-off-by: Keerthy 
> ---
> 
> Changes in v5:
> 
>   * Mandated delay for thermal emergency poweroff to be a non-zero value.
> 
> Changes in v4:
> 
>   * Updated documentation
>   * changed emergency_poweroff_func to thermal_emergency_poweroff_func
> 
> Changes in v3:
> 
>   * Removed unnecessary mutex init.
>   * Added WARN messages instead of a simple warning message.
>   * Added Documentation.
> 
>  Documentation/thermal/sysfs-api.txt | 21 +++
>  drivers/thermal/Kconfig | 15 +++
>  drivers/thermal/thermal_core.c  | 53 
> +
>  3 files changed, 89 insertions(+)
> 
> diff --git a/Documentation/thermal/sysfs-api.txt 
> b/Documentation/thermal/sysfs-api.txt
> index ef473dc..98dc04f 100644
> --- a/Documentation/thermal/sysfs-api.txt
> +++ b/Documentation/thermal/sysfs-api.txt
> @@ -582,3 +582,24 @@ platform data is provided, this uses the step_wise 
> throttling policy.
>  This function serves as an arbitrator to set the state of a cooling
>  device. It sets the cooling device to the deepest cooling state if
>  possible.
> +
> +6. thermal_emergency_poweroff:
> +
> +On an event of critical trip temperature crossing. Thermal framework
> +allows the system to shutdown gracefully by calling orderly_poweroff().
> +In the event of a failure of orderly_poweroff() to shut down the system
> +we are in danger of keeping the system alive at undesirably high
> +temperatures. To mitigate this high risk scenario we program a work
> +queue to fire after a pre-determined number of seconds to start
> +an emergency shutdown of the device using the kernel_power_off()
> +function. In case kernel_power_off() fails then finally
> +emergency_restart() is called in the worst case.
> +
> +The delay should be carefully profiled so as to give adequate time for
> +orderly_poweroff(). In case of failure of an orderly_poweroff() the
> +emergency poweroff kicks in after the delay has elapsed and shuts down
> +the system.
> +
> +If set to 0 emergency poweroff will not be supported. So a carefully
> +profiled non-zero positive value is a must for emergerncy poweroff to be
> +triggered.
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 9347401..2a748a6 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -15,6 +15,21 @@ menuconfig THERMAL
>  
>  if THERMAL
>  
> +config THERMAL_EMERGENCY_POWEROFF_DELAY_MS
> + int "Emergency poweroff delay in milli-seconds"
> + depends on THERMAL
> + default 0
> + help
> +   The number of milliseconds to delay before emergency
> +   poweroff kicks in. The delay should be carefully profiled
> +   so as to give adequate time for orderly_poweroff(). In case
> +   of failure of an orderly_poweroff() the emergency poweroff
> +   kicks in after the delay has elapsed and shuts down the system.
> +
> +   If set to 0 emergency poweroff will not be supported. So a carefully
> +   profiled non-zero positive value is a must for emergerncy poweroff to 
> be
> +   triggered.

Here is a suggestion for rephrase the above:

 +  help
 +Thermal subsystem will issue a graceful shutdown when
 +critical temperatures are reached using orderly_poweroff(). In
 +case of failure of an orderly_poweroff(), the thermal emergency 
poweroff
 +kicks in after a delay has elapsed and shuts down the system.
 +This config is number of milliseconds to delay before emergency
 +poweroff kicks in. Similarly to the critical trip point,
 +the delay should be carefully profiled so as to give adequate
 +time for orderly_poweroff() to finish on regular execution.
 +If set to 0 emergency poweroff will not be supported.
 +
 +In doubt, leave as 0.

> +
>  config THERMAL_HWMON
>

Re: [PATCH 00/16] Intel FPGA Device Drivers

2017-04-17 Thread Jerome Glisse

On Mon, Apr 17, 2017 at 11:22:08AM -0500, Alan Tull wrote:
> On Mon, Apr 17, 2017 at 10:57 AM, Jerome Glisse  wrote:
> > On Mon, Apr 17, 2017 at 10:35:01AM -0500, Alan Tull wrote:
> >> On Fri, Apr 14, 2017 at 3:49 PM, Jerome Glisse  wrote:
> >>
> >> Hi Jerome,
> >>
> >> > On Fri, Apr 14, 2017 at 12:48:17PM -0700, Luebbers, Enno wrote:
> >> >> On Wed, Apr 12, 2017 at 11:37:49AM -0400, Jerome Glisse wrote:
> >> >> > On Wed, Apr 12, 2017 at 07:46:19AM -0700, Moritz Fischer wrote:
> >> >> > > On Wed, Apr 12, 2017 at 6:29 AM, Jerome Glisse  
> >> >> > > wrote:
> >> >> > >
> >> >> > > > It is like if on GPU we only had close source compiler for the GPU
> >> >> > > > instructions set. So FPGA is definitly following different rules 
> >> >> > > > than
> >> >> > > > open source upstream GPU kernel driver abides to.
> >>
> >> Sorry, not a GPU guy, can you point me to something that documents
> >> this policy of 'only opensource compilers for GPU'?  I looked under
> >> linux/Documentation and didn't see anything.
> >
> > https://lists.freedesktop.org/archives/dri-devel/2010-July/001828.html
> 
> This starts out saying:
> 
> "Now this is just my opinion as maintainer of the drm, and doesn't
> reflect anyone or any official policy"
> 
> > There is no explicit mention about compiler
> 
> You are right about that, there is no mention about compiler.
> 
> > but trust me it is included
> > in everyones mind. You can ask Dave i am sure he would reject a driver
> > with everything open except the shader compiler.
> 
> How would that work?  Before the GPU driver is accepted, an open
> toolchain also needs to be submitted?
> 
> It's worth it to check out the responses since they not overwhelmingly
> positive and tend to rather be outlining the complicating factors.
> Many if not most say essentially that his stance was simplistic and
> unproductive, slamming the door on the people from whom the solution
> would come.  And keep in mind, this wasn't about what you've made it
> out to be in the first place; this is about open/closed source GPU
> drivers, not toolchains.

You are mixing thing. I say no driver without open source userspace.
For GPU this means something like a mesa driver and mesa driver include
a compiler but not in the form you expect (it is does not provide what
i would call a toolchain). But it can be something else. It could be
a standalone opencl implementation with what we call a finalizer (ie
something that goes from some intermediate representation down to the
hardare ISA).

So if you go look back at new driver submission like Adreno or Vivante.
They provided link to the open source driver they submited to mesa along
their kernel patchset posting.

For FPGA this would mean a tool that can map something like VHDL,Verilog
or even something lower level like a list of lut equation with a netlist
between them (and flip-flop and other blocks). So mapping this down to
the bitstream. It is call a toolchain for FPGA so that's the word i did
use.

> >> The current patchset doesn't have anything to do with FPGA toolchains
> >> but you're using this patchset as a platform to talk about toolchain
> >> issues.
> >
> > Well Intel inclusion of FPGA triggered my curiosity and when that patchset
> > came accross my inbox i did wonder where the open source userspace was and
> > went looking for it to no avail. So this isn't against a specific patchset
> > but more broadly against the whole drivers/fpga/ story. Sorry if this was
> > not clear.
> >
> >
> >> It sounds like you are opposed to any kernel support of loading images
> >> on FPGAs until all vendors have opensource toolchains.
> >
> > Yes that is what i am saying. They are different standard in the kernel
> > and i would rather have one clear standard about driver needing proper
> > open source userspace to go along with any upstream driver.
> 
> Deleting drivers/fpga wouldn't be a step forward to the openness you seek.

I am not saying let delete this. Ok it is there, i dislike that fact, but
it is there. I am asking: is this allowed for FPGA ? If so why ? Do everyone
understand the risks of accepting this ? ...

Now if community believe that we should only accept kernel code with open
source userspace bits then community can say no more changes to drivers/fpga
until you have said open source bits. That would send a clear signal.

Silently accepting drivers/fpga/ with no clear rules leave everyone in the
dark. People that wish to see only driver with open source userspace knows
that company will be happy to stay in this status quo and won't do anything
toward open source. Thinking differently would be utterly naive. A company
will not do something unless there is a clear incentive to do so.

Cheers,
Jérôme

Re: [PATCH 00/16] Intel FPGA Device Drivers

2017-04-17 Thread Jerome Glisse

On Mon, Apr 17, 2017 at 11:22:08AM -0500, Alan Tull wrote:
> On Mon, Apr 17, 2017 at 10:57 AM, Jerome Glisse  wrote:
> > On Mon, Apr 17, 2017 at 10:35:01AM -0500, Alan Tull wrote:
> >> On Fri, Apr 14, 2017 at 3:49 PM, Jerome Glisse  wrote:
> >>
> >> Hi Jerome,
> >>
> >> > On Fri, Apr 14, 2017 at 12:48:17PM -0700, Luebbers, Enno wrote:
> >> >> On Wed, Apr 12, 2017 at 11:37:49AM -0400, Jerome Glisse wrote:
> >> >> > On Wed, Apr 12, 2017 at 07:46:19AM -0700, Moritz Fischer wrote:
> >> >> > > On Wed, Apr 12, 2017 at 6:29 AM, Jerome Glisse  
> >> >> > > wrote:
> >> >> > >
> >> >> > > > It is like if on GPU we only had close source compiler for the GPU
> >> >> > > > instructions set. So FPGA is definitly following different rules 
> >> >> > > > than
> >> >> > > > open source upstream GPU kernel driver abides to.
> >>
> >> Sorry, not a GPU guy, can you point me to something that documents
> >> this policy of 'only opensource compilers for GPU'?  I looked under
> >> linux/Documentation and didn't see anything.
> >
> > https://lists.freedesktop.org/archives/dri-devel/2010-July/001828.html
> 
> This starts out saying:
> 
> "Now this is just my opinion as maintainer of the drm, and doesn't
> reflect anyone or any official policy"
> 
> > There is no explicit mention about compiler
> 
> You are right about that, there is no mention about compiler.
> 
> > but trust me it is included
> > in everyones mind. You can ask Dave i am sure he would reject a driver
> > with everything open except the shader compiler.
> 
> How would that work?  Before the GPU driver is accepted, an open
> toolchain also needs to be submitted?
> 
> It's worth it to check out the responses since they not overwhelmingly
> positive and tend to rather be outlining the complicating factors.
> Many if not most say essentially that his stance was simplistic and
> unproductive, slamming the door on the people from whom the solution
> would come.  And keep in mind, this wasn't about what you've made it
> out to be in the first place; this is about open/closed source GPU
> drivers, not toolchains.

You are mixing thing. I say no driver without open source userspace.
For GPU this means something like a mesa driver and mesa driver include
a compiler but not in the form you expect (it is does not provide what
i would call a toolchain). But it can be something else. It could be
a standalone opencl implementation with what we call a finalizer (ie
something that goes from some intermediate representation down to the
hardare ISA).

So if you go look back at new driver submission like Adreno or Vivante.
They provided link to the open source driver they submited to mesa along
their kernel patchset posting.

For FPGA this would mean a tool that can map something like VHDL,Verilog
or even something lower level like a list of lut equation with a netlist
between them (and flip-flop and other blocks). So mapping this down to
the bitstream. It is call a toolchain for FPGA so that's the word i did
use.

> >> The current patchset doesn't have anything to do with FPGA toolchains
> >> but you're using this patchset as a platform to talk about toolchain
> >> issues.
> >
> > Well Intel inclusion of FPGA triggered my curiosity and when that patchset
> > came accross my inbox i did wonder where the open source userspace was and
> > went looking for it to no avail. So this isn't against a specific patchset
> > but more broadly against the whole drivers/fpga/ story. Sorry if this was
> > not clear.
> >
> >
> >> It sounds like you are opposed to any kernel support of loading images
> >> on FPGAs until all vendors have opensource toolchains.
> >
> > Yes that is what i am saying. They are different standard in the kernel
> > and i would rather have one clear standard about driver needing proper
> > open source userspace to go along with any upstream driver.
> 
> Deleting drivers/fpga wouldn't be a step forward to the openness you seek.

I am not saying let delete this. Ok it is there, i dislike that fact, but
it is there. I am asking: is this allowed for FPGA ? If so why ? Do everyone
understand the risks of accepting this ? ...

Now if community believe that we should only accept kernel code with open
source userspace bits then community can say no more changes to drivers/fpga
until you have said open source bits. That would send a clear signal.

Silently accepting drivers/fpga/ with no clear rules leave everyone in the
dark. People that wish to see only driver with open source userspace knows
that company will be happy to stay in this status quo and won't do anything
toward open source. Thinking differently would be utterly naive. A company
will not do something unless there is a clear incentive to do so.

Cheers,
Jérôme

[PATCH] powerpc/mm/hugetlb: Add support for 1G huge pages

2017-04-17 Thread Aneesh Kumar K.V

POWER9 supports hugepages of size 2M and 1G in radix MMU mode. This patch
enables the usage of 1G page size for hugetlbfs. This also update the helper
such we can do 1G page allocation at runtime.

Since we can do this only when radix translation mode is enabled, we can't use
the generic gigantic_page_supported helper. Hence provide a way for architecture
to override gigantic_page_supported helper.

We still don't enable 1G page size on DD1 version. This is to avoid doing
workaround mentioned in commit: 6d3a0379ebdc8 (powerpc/mm: Add
radix__tlb_flush_pte_p9_dd1()

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hugetlb.h | 13 +
 arch/powerpc/mm/hugetlbpage.c|  7 +--
 arch/powerpc/platforms/Kconfig.cputype   |  1 +
 mm/hugetlb.c |  4 
 4 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h 
b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index cd366596..86f27cc8ec61 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -50,4 +50,17 @@ static inline pte_t arch_make_huge_pte(pte_t entry, struct 
vm_area_struct *vma,
else
return entry;
 }
+
+#if defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) &&  \
+   ((defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || \
+defined(CONFIG_CMA))
+#define gigantic_page_supported gigantic_page_supported
+static inline bool gigantic_page_supported(void)
+{
+   if (radix_enabled())
+   return true;
+   return false;
+}
+#endif
+
 #endif
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a4f33de4008e..80f6d2ed551a 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -763,8 +763,11 @@ static int __init add_huge_page_size(unsigned long long 
size)
 * Hash: 16M and 16G
 */
if (radix_enabled()) {
-   if (mmu_psize != MMU_PAGE_2M)
-   return -EINVAL;
+   if (mmu_psize != MMU_PAGE_2M) {
+   if (cpu_has_feature(CPU_FTR_POWER9_DD1) ||
+   (mmu_psize != MMU_PAGE_1G))
+   return -EINVAL;
+   }
} else {
if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G)
return -EINVAL;
diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index ef4c4b8fc547..f4ba4bf0d762 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -343,6 +343,7 @@ config PPC_STD_MMU_64
 config PPC_RADIX_MMU
bool "Radix MMU Support"
depends on PPC_BOOK3S_64
+   select ARCH_HAS_GIGANTIC_PAGE
default y
help
  Enable support for the Power ISA 3.0 Radix style MMU. Currently this
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d0aab9ee80d..2c090189f314 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1158,7 +1158,11 @@ static int alloc_fresh_gigantic_page(struct hstate *h,
return 0;
 }
 
+#ifndef gigantic_page_supported
 static inline bool gigantic_page_supported(void) { return true; }
+#define gigantic_page_supported gigantic_page_supported
+#endif
+
 #else
 static inline bool gigantic_page_supported(void) { return false; }
 static inline void free_gigantic_page(struct page *page, unsigned int order) { 
}
-- 
2.7.4

[PATCH] powerpc/mm/hugetlb: Add support for 1G huge pages

2017-04-17 Thread Aneesh Kumar K.V

POWER9 supports hugepages of size 2M and 1G in radix MMU mode. This patch
enables the usage of 1G page size for hugetlbfs. This also update the helper
such we can do 1G page allocation at runtime.

Since we can do this only when radix translation mode is enabled, we can't use
the generic gigantic_page_supported helper. Hence provide a way for architecture
to override gigantic_page_supported helper.

We still don't enable 1G page size on DD1 version. This is to avoid doing
workaround mentioned in commit: 6d3a0379ebdc8 (powerpc/mm: Add
radix__tlb_flush_pte_p9_dd1()

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hugetlb.h | 13 +
 arch/powerpc/mm/hugetlbpage.c|  7 +--
 arch/powerpc/platforms/Kconfig.cputype   |  1 +
 mm/hugetlb.c |  4 
 4 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h 
b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index cd366596..86f27cc8ec61 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -50,4 +50,17 @@ static inline pte_t arch_make_huge_pte(pte_t entry, struct 
vm_area_struct *vma,
else
return entry;
 }
+
+#if defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) &&  \
+   ((defined(CONFIG_MEMORY_ISOLATION) && defined(CONFIG_COMPACTION)) || \
+defined(CONFIG_CMA))
+#define gigantic_page_supported gigantic_page_supported
+static inline bool gigantic_page_supported(void)
+{
+   if (radix_enabled())
+   return true;
+   return false;
+}
+#endif
+
 #endif
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a4f33de4008e..80f6d2ed551a 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -763,8 +763,11 @@ static int __init add_huge_page_size(unsigned long long 
size)
 * Hash: 16M and 16G
 */
if (radix_enabled()) {
-   if (mmu_psize != MMU_PAGE_2M)
-   return -EINVAL;
+   if (mmu_psize != MMU_PAGE_2M) {
+   if (cpu_has_feature(CPU_FTR_POWER9_DD1) ||
+   (mmu_psize != MMU_PAGE_1G))
+   return -EINVAL;
+   }
} else {
if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G)
return -EINVAL;
diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index ef4c4b8fc547..f4ba4bf0d762 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -343,6 +343,7 @@ config PPC_STD_MMU_64
 config PPC_RADIX_MMU
bool "Radix MMU Support"
depends on PPC_BOOK3S_64
+   select ARCH_HAS_GIGANTIC_PAGE
default y
help
  Enable support for the Power ISA 3.0 Radix style MMU. Currently this
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3d0aab9ee80d..2c090189f314 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1158,7 +1158,11 @@ static int alloc_fresh_gigantic_page(struct hstate *h,
return 0;
 }
 
+#ifndef gigantic_page_supported
 static inline bool gigantic_page_supported(void) { return true; }
+#define gigantic_page_supported gigantic_page_supported
+#endif
+
 #else
 static inline bool gigantic_page_supported(void) { return false; }
 static inline void free_gigantic_page(struct page *page, unsigned int order) { 
}
-- 
2.7.4

RE: [PATCH] ACPICA: Export mutex functions

2017-04-17 Thread Moore, Robert

There is a model for the drivers to directly acquire an AML mutex object. That 
is why the acquire/release public interfaces were added to ACPICA.

I forget all of the details, but the model was developed with MS and others 
during the ACPI 6.0 timeframe.


> -Original Message-
> From: Guenter Roeck [mailto:li...@roeck-us.net]
> Sent: Monday, April 17, 2017 8:57 AM
> To: Zheng, Lv
> Cc: Moore, Robert; Wysocki, Rafael J; Len Brown; linux-
> a...@vger.kernel.org; de...@acpica.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] ACPICA: Export mutex functions
> 
> Hi,
> 
> On Mon, Apr 17, 2017 at 09:39:35AM +, Zheng, Lv wrote:
> > Hi,
> >
> > > From: Guenter Roeck [mailto:li...@roeck-us.net]
> > > Subject: Re: [PATCH] ACPICA: Export mutex functions
> > >
> > > On Wed, Apr 12, 2017 at 03:29:55PM +, Moore, Robert wrote:
> > > > The ACPICA mutex functions are based on the host OS functions, so
> they don't really buy you anything.
> > > You should just use the native Linux functions.
> > > >
> > >
> > > You mean they don't really acquire the requested ACPI mutex, and the
> > > underlying DSDT which declares and uses the mutex just ignores if
> > > the mutex was acquired by acpi_acquire_mutex() ?
> > >
> > > To clarify: You are saying that code such as
> > >
> > >   acpi_status status;
> > >
> > >   status = acpi_acquire_mutex(NULL, "\\_SB.PCI0.SBRG.SIO1.MUT0",
> 0x10);
> > >   if (ACPI_FAILURE(status)) {
> > >   pr_err("Failed to acquire ACPI mutex\n");
> > >   return -EBUSY;
> > >   }
> >
> > Why do you need to access \_SB.PCI0.SBRG.SIO1.MUT0?
> > OSPM should only invoke entry methods predefined by ACPI spec or
> whatever specs.
> > There shouldn't be any needs that a driver acquires an arbitrary AML
> mutex.
> > You do not seem to have justified the usage model, IMO.
> >
> 
> I am sorry, I have no idea how to do that. I can see that the resource in
> question (IO address 0x2e/0x2f) is accessed from the DSDT, that the
> resource is mutex protected, and that accesses to the same IO address from
> the Linux kernel are unreliable unless I acquire the mutex in question. At
> the same time, I can see that request_muxed_region() succeeds, so
> presumably ACPI does not reserve the region for its exclusive use.
> 
> It may well be that the "official" response to this problem is "you must
> not instantiate a watchdog, environmental monitor, or gpio driver (or
> anything else provided by the Super-IO chip that requires access to those
> ports) on this platform in Linux". Is that what you are suggesting ?
> 
> Thanks,
> Guenter

[PATCH 6/7] powerpc/hugetlb: Add follow_huge_pd implementation for ppc64.

2017-04-17 Thread Aneesh Kumar K.V

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hugetlbpage.c | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 80f6d2ed551a..5c829a83a4cc 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -17,6 +17,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -618,6 +620,46 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 }
 
 /*
+ * 64 bit book3s use generic follow_page_mask
+ */
+#ifdef CONFIG_PPC_BOOK3S_64
+
+struct page *follow_huge_pd(struct vm_area_struct *vma,
+   unsigned long address, hugepd_t hpd,
+   int flags, int pdshift)
+{
+   pte_t *ptep;
+   spinlock_t *ptl;
+   struct page *page = NULL;
+   unsigned long mask;
+   int shift = hugepd_shift(hpd);
+   struct mm_struct *mm = vma->vm_mm;
+
+retry:
+   ptl = >page_table_lock;
+   spin_lock(ptl);
+
+   ptep = hugepte_offset(hpd, address, pdshift);
+   if (pte_present(*ptep)) {
+   mask = (1UL << shift) - 1;
+   page = pte_page(*ptep);
+   page += ((address & mask) >> PAGE_SHIFT);
+   if (flags & FOLL_GET)
+   get_page(page);
+   } else {
+   if (is_hugetlb_entry_migration(*ptep)) {
+   spin_unlock(ptl);
+   __migration_entry_wait(mm, ptep, ptl);
+   goto retry;
+   }
+   }
+   spin_unlock(ptl);
+   return page;
+}
+
+#else /* !CONFIG_PPC_BOOK3S_64 */
+
+/*
  * We are holding mmap_sem, so a parallel huge page collapse cannot run.
  * To prevent hugepage split, disable irq.
  */
@@ -672,6 +714,7 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
BUG();
return NULL;
 }
+#endif
 
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  unsigned long sz)
-- 
2.7.4

RE: [PATCH] ACPICA: Export mutex functions

2017-04-17 Thread Moore, Robert

There is a model for the drivers to directly acquire an AML mutex object. That 
is why the acquire/release public interfaces were added to ACPICA.

I forget all of the details, but the model was developed with MS and others 
during the ACPI 6.0 timeframe.


> -Original Message-
> From: Guenter Roeck [mailto:li...@roeck-us.net]
> Sent: Monday, April 17, 2017 8:57 AM
> To: Zheng, Lv
> Cc: Moore, Robert; Wysocki, Rafael J; Len Brown; linux-
> a...@vger.kernel.org; de...@acpica.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] ACPICA: Export mutex functions
> 
> Hi,
> 
> On Mon, Apr 17, 2017 at 09:39:35AM +, Zheng, Lv wrote:
> > Hi,
> >
> > > From: Guenter Roeck [mailto:li...@roeck-us.net]
> > > Subject: Re: [PATCH] ACPICA: Export mutex functions
> > >
> > > On Wed, Apr 12, 2017 at 03:29:55PM +, Moore, Robert wrote:
> > > > The ACPICA mutex functions are based on the host OS functions, so
> they don't really buy you anything.
> > > You should just use the native Linux functions.
> > > >
> > >
> > > You mean they don't really acquire the requested ACPI mutex, and the
> > > underlying DSDT which declares and uses the mutex just ignores if
> > > the mutex was acquired by acpi_acquire_mutex() ?
> > >
> > > To clarify: You are saying that code such as
> > >
> > >   acpi_status status;
> > >
> > >   status = acpi_acquire_mutex(NULL, "\\_SB.PCI0.SBRG.SIO1.MUT0",
> 0x10);
> > >   if (ACPI_FAILURE(status)) {
> > >   pr_err("Failed to acquire ACPI mutex\n");
> > >   return -EBUSY;
> > >   }
> >
> > Why do you need to access \_SB.PCI0.SBRG.SIO1.MUT0?
> > OSPM should only invoke entry methods predefined by ACPI spec or
> whatever specs.
> > There shouldn't be any needs that a driver acquires an arbitrary AML
> mutex.
> > You do not seem to have justified the usage model, IMO.
> >
> 
> I am sorry, I have no idea how to do that. I can see that the resource in
> question (IO address 0x2e/0x2f) is accessed from the DSDT, that the
> resource is mutex protected, and that accesses to the same IO address from
> the Linux kernel are unreliable unless I acquire the mutex in question. At
> the same time, I can see that request_muxed_region() succeeds, so
> presumably ACPI does not reserve the region for its exclusive use.
> 
> It may well be that the "official" response to this problem is "you must
> not instantiate a watchdog, environmental monitor, or gpio driver (or
> anything else provided by the Super-IO chip that requires access to those
> ports) on this platform in Linux". Is that what you are suggesting ?
> 
> Thanks,
> Guenter

[PATCH 6/7] powerpc/hugetlb: Add follow_huge_pd implementation for ppc64.

2017-04-17 Thread Aneesh Kumar K.V

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hugetlbpage.c | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 80f6d2ed551a..5c829a83a4cc 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -17,6 +17,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -618,6 +620,46 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 }
 
 /*
+ * 64 bit book3s use generic follow_page_mask
+ */
+#ifdef CONFIG_PPC_BOOK3S_64
+
+struct page *follow_huge_pd(struct vm_area_struct *vma,
+   unsigned long address, hugepd_t hpd,
+   int flags, int pdshift)
+{
+   pte_t *ptep;
+   spinlock_t *ptl;
+   struct page *page = NULL;
+   unsigned long mask;
+   int shift = hugepd_shift(hpd);
+   struct mm_struct *mm = vma->vm_mm;
+
+retry:
+   ptl = >page_table_lock;
+   spin_lock(ptl);
+
+   ptep = hugepte_offset(hpd, address, pdshift);
+   if (pte_present(*ptep)) {
+   mask = (1UL << shift) - 1;
+   page = pte_page(*ptep);
+   page += ((address & mask) >> PAGE_SHIFT);
+   if (flags & FOLL_GET)
+   get_page(page);
+   } else {
+   if (is_hugetlb_entry_migration(*ptep)) {
+   spin_unlock(ptl);
+   __migration_entry_wait(mm, ptep, ptl);
+   goto retry;
+   }
+   }
+   spin_unlock(ptl);
+   return page;
+}
+
+#else /* !CONFIG_PPC_BOOK3S_64 */
+
+/*
  * We are holding mmap_sem, so a parallel huge page collapse cannot run.
  * To prevent hugepage split, disable irq.
  */
@@ -672,6 +714,7 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
BUG();
return NULL;
 }
+#endif
 
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  unsigned long sz)
-- 
2.7.4

[PATCH 7/7] powerpc/hugetlb: Enable hugetlb migration for ppc64

2017-04-17 Thread Aneesh Kumar K.V

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/Kconfig.cputype | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index f4ba4bf0d762..9fb075745c7f 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -350,6 +350,11 @@ config PPC_RADIX_MMU
  is only implemented by IBM Power9 CPUs, if you don't have one of them
  you can probably disable this.
 
+config ARCH_ENABLE_HUGEPAGE_MIGRATION
+   def_bool y
+   depends on PPC_BOOK3S_64 && HUGETLB_PAGE && MIGRATION
+
+
 config PPC_MMU_NOHASH
def_bool y
depends on !PPC_STD_MMU
-- 
2.7.4

[PATCH 5/7] mm/follow_page_mask: Add support for hugepage directory entry

2017-04-17 Thread Aneesh Kumar K.V

Architectures like ppc64 supports hugepage size that is not mapped to any of
of the page table levels. Instead they add an alternate page table entry format
called hugepage directory (hugepd). hugepd indicates that the page table entry 
maps
to a set of hugetlb pages. Add support for this in generic follow_page_mask
code. We already support this format in the generic gup code.

The defaul implementation prints warning and returns NULL. We will add ppc64
support in later patches

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h |  4 
 mm/gup.c| 33 +
 mm/hugetlb.c|  8 
 3 files changed, 45 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edab98f0a7b8..7a5917d190f2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -117,6 +117,9 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long 
addr);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
  int write);
+struct page *follow_huge_pd(struct vm_area_struct *vma,
+   unsigned long address, hugepd_t hpd,
+   int flags, int pdshift);
 struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int flags);
 struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
@@ -151,6 +154,7 @@ static inline void hugetlb_report_meminfo(struct seq_file 
*m)
 static inline void hugetlb_show_meminfo(void)
 {
 }
+#define follow_huge_pd(vma, addr, hpd, flags, pdshift) NULL
 #define follow_huge_pmd(mm, addr, pmd, flags)  NULL
 #define follow_huge_pud(mm, addr, pud, flags)  NULL
 #define follow_huge_pgd(mm, addr, pgd, flags)  NULL
diff --git a/mm/gup.c b/mm/gup.c
index 65255389620a..a7f5b82e15f3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -226,6 +226,14 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
return page;
return no_page_table(vma, flags);
}
+   if (is_hugepd(__hugepd(pmd_val(*pmd {
+   page = follow_huge_pd(vma, address,
+ __hugepd(pmd_val(*pmd)), flags,
+ PMD_SHIFT);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
if (pmd_devmap(*pmd)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -292,6 +300,14 @@ static struct page *follow_pud_mask(struct vm_area_struct 
*vma,
return page;
return no_page_table(vma, flags);
}
+   if (is_hugepd(__hugepd(pud_val(*pud {
+   page = follow_huge_pd(vma, address,
+ __hugepd(pud_val(*pud)), flags,
+ PUD_SHIFT);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
if (pud_devmap(*pud)) {
ptl = pud_lock(mm, pud);
page = follow_devmap_pud(vma, address, pud, flags);
@@ -311,6 +327,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct 
*vma,
unsigned int flags, unsigned int *page_mask)
 {
p4d_t *p4d;
+   struct page *page;
 
p4d = p4d_offset(pgdp, address);
if (p4d_none(*p4d))
@@ -319,6 +336,14 @@ static struct page *follow_p4d_mask(struct vm_area_struct 
*vma,
if (unlikely(p4d_bad(*p4d)))
return no_page_table(vma, flags);
 
+   if (is_hugepd(__hugepd(p4d_val(*p4d {
+   page = follow_huge_pd(vma, address,
+ __hugepd(p4d_val(*p4d)), flags,
+ P4D_SHIFT);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
return follow_pud_mask(vma, address, p4d, flags, page_mask);
 }
 
@@ -363,6 +388,14 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
return page;
return no_page_table(vma, flags);
}
+   if (is_hugepd(__hugepd(pgd_val(*pgd {
+   page = follow_huge_pd(vma, address,
+ __hugepd(pgd_val(*pgd)), flags,
+ PGDIR_SHIFT);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
 
return follow_p4d_mask(vma, address, pgd, flags, page_mask);
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83f39cf5162a..64ad00d97094 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4650,6 +4650,14 @@ follow_huge_addr(struct

[PATCH 7/7] powerpc/hugetlb: Enable hugetlb migration for ppc64

2017-04-17 Thread Aneesh Kumar K.V

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/Kconfig.cputype | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index f4ba4bf0d762..9fb075745c7f 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -350,6 +350,11 @@ config PPC_RADIX_MMU
  is only implemented by IBM Power9 CPUs, if you don't have one of them
  you can probably disable this.
 
+config ARCH_ENABLE_HUGEPAGE_MIGRATION
+   def_bool y
+   depends on PPC_BOOK3S_64 && HUGETLB_PAGE && MIGRATION
+
+
 config PPC_MMU_NOHASH
def_bool y
depends on !PPC_STD_MMU
-- 
2.7.4

[PATCH 5/7] mm/follow_page_mask: Add support for hugepage directory entry

2017-04-17 Thread Aneesh Kumar K.V

Architectures like ppc64 supports hugepage size that is not mapped to any of
of the page table levels. Instead they add an alternate page table entry format
called hugepage directory (hugepd). hugepd indicates that the page table entry 
maps
to a set of hugetlb pages. Add support for this in generic follow_page_mask
code. We already support this format in the generic gup code.

The defaul implementation prints warning and returns NULL. We will add ppc64
support in later patches

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h |  4 
 mm/gup.c| 33 +
 mm/hugetlb.c|  8 
 3 files changed, 45 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edab98f0a7b8..7a5917d190f2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -117,6 +117,9 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long 
addr);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
  int write);
+struct page *follow_huge_pd(struct vm_area_struct *vma,
+   unsigned long address, hugepd_t hpd,
+   int flags, int pdshift);
 struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int flags);
 struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
@@ -151,6 +154,7 @@ static inline void hugetlb_report_meminfo(struct seq_file 
*m)
 static inline void hugetlb_show_meminfo(void)
 {
 }
+#define follow_huge_pd(vma, addr, hpd, flags, pdshift) NULL
 #define follow_huge_pmd(mm, addr, pmd, flags)  NULL
 #define follow_huge_pud(mm, addr, pud, flags)  NULL
 #define follow_huge_pgd(mm, addr, pgd, flags)  NULL
diff --git a/mm/gup.c b/mm/gup.c
index 65255389620a..a7f5b82e15f3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -226,6 +226,14 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
return page;
return no_page_table(vma, flags);
}
+   if (is_hugepd(__hugepd(pmd_val(*pmd {
+   page = follow_huge_pd(vma, address,
+ __hugepd(pmd_val(*pmd)), flags,
+ PMD_SHIFT);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
if (pmd_devmap(*pmd)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags);
@@ -292,6 +300,14 @@ static struct page *follow_pud_mask(struct vm_area_struct 
*vma,
return page;
return no_page_table(vma, flags);
}
+   if (is_hugepd(__hugepd(pud_val(*pud {
+   page = follow_huge_pd(vma, address,
+ __hugepd(pud_val(*pud)), flags,
+ PUD_SHIFT);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
if (pud_devmap(*pud)) {
ptl = pud_lock(mm, pud);
page = follow_devmap_pud(vma, address, pud, flags);
@@ -311,6 +327,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct 
*vma,
unsigned int flags, unsigned int *page_mask)
 {
p4d_t *p4d;
+   struct page *page;
 
p4d = p4d_offset(pgdp, address);
if (p4d_none(*p4d))
@@ -319,6 +336,14 @@ static struct page *follow_p4d_mask(struct vm_area_struct 
*vma,
if (unlikely(p4d_bad(*p4d)))
return no_page_table(vma, flags);
 
+   if (is_hugepd(__hugepd(p4d_val(*p4d {
+   page = follow_huge_pd(vma, address,
+ __hugepd(p4d_val(*p4d)), flags,
+ P4D_SHIFT);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
return follow_pud_mask(vma, address, p4d, flags, page_mask);
 }
 
@@ -363,6 +388,14 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
return page;
return no_page_table(vma, flags);
}
+   if (is_hugepd(__hugepd(pgd_val(*pgd {
+   page = follow_huge_pd(vma, address,
+ __hugepd(pgd_val(*pgd)), flags,
+ PGDIR_SHIFT);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
 
return follow_p4d_mask(vma, address, pgd, flags, page_mask);
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83f39cf5162a..64ad00d97094 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4650,6 +4650,14 @@ follow_huge_addr(struct mm_struct *mm, unsigned long

[PATCH 1/7] mm/hugetlb/migration: Use set_huge_pte_at instead of set_pte_at

2017-04-17 Thread Aneesh Kumar K.V

The right interface to use to set a hugetlb pte entry is set_huge_pte_at. Use
that instead of set_pte_at.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/migrate.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 9a0897a14d37..4c272ac6fe53 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -224,25 +224,26 @@ static int remove_migration_pte(struct page *page, struct 
vm_area_struct *vma,
if (is_write_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
 
+   flush_dcache_page(new);
 #ifdef CONFIG_HUGETLB_PAGE
if (PageHuge(new)) {
pte = pte_mkhuge(pte);
pte = arch_make_huge_pte(pte, vma, new, 0);
-   }
-#endif
-   flush_dcache_page(new);
-   set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
-
-   if (PageHuge(new)) {
+   set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, 
pte);
if (PageAnon(new))
hugepage_add_anon_rmap(new, vma, pvmw.address);
else
page_dup_rmap(new, true);
-   } else if (PageAnon(new))
-   page_add_anon_rmap(new, vma, pvmw.address, false);
-   else
-   page_add_file_rmap(new, false);
+   } else
+#endif
+   {
+   set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 
+   if (PageAnon(new))
+   page_add_anon_rmap(new, vma, pvmw.address, 
false);
+   else
+   page_add_file_rmap(new, false);
+   }
if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
mlock_vma_page(new);
 
-- 
2.7.4

[PATCH 1/7] mm/hugetlb/migration: Use set_huge_pte_at instead of set_pte_at

2017-04-17 Thread Aneesh Kumar K.V

The right interface to use to set a hugetlb pte entry is set_huge_pte_at. Use
that instead of set_pte_at.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/migrate.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 9a0897a14d37..4c272ac6fe53 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -224,25 +224,26 @@ static int remove_migration_pte(struct page *page, struct 
vm_area_struct *vma,
if (is_write_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
 
+   flush_dcache_page(new);
 #ifdef CONFIG_HUGETLB_PAGE
if (PageHuge(new)) {
pte = pte_mkhuge(pte);
pte = arch_make_huge_pte(pte, vma, new, 0);
-   }
-#endif
-   flush_dcache_page(new);
-   set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
-
-   if (PageHuge(new)) {
+   set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, 
pte);
if (PageAnon(new))
hugepage_add_anon_rmap(new, vma, pvmw.address);
else
page_dup_rmap(new, true);
-   } else if (PageAnon(new))
-   page_add_anon_rmap(new, vma, pvmw.address, false);
-   else
-   page_add_file_rmap(new, false);
+   } else
+#endif
+   {
+   set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 
+   if (PageAnon(new))
+   page_add_anon_rmap(new, vma, pvmw.address, 
false);
+   else
+   page_add_file_rmap(new, false);
+   }
if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
mlock_vma_page(new);
 
-- 
2.7.4

[PATCH 3/7] mm/hugetlb: export hugetlb_entry_migration helper

2017-04-17 Thread Aneesh Kumar K.V

We will be using this later from the ppc64 code. Change the return type to bool.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c| 8 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b857fc8cc2ec..fddf6cf403d5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -126,6 +126,7 @@ int pud_huge(pud_t pud);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);
 
+bool is_hugetlb_entry_migration(pte_t pte);
 #else /* !CONFIG_HUGETLB_PAGE */
 
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2c090189f314..9b630e2195d5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3189,17 +3189,17 @@ static void set_huge_ptep_writable(struct 
vm_area_struct *vma,
update_mmu_cache(vma, address, ptep);
 }
 
-static int is_hugetlb_entry_migration(pte_t pte)
+bool is_hugetlb_entry_migration(pte_t pte)
 {
swp_entry_t swp;
 
if (huge_pte_none(pte) || pte_present(pte))
-   return 0;
+   return false;
swp = pte_to_swp_entry(pte);
if (non_swap_entry(swp) && is_migration_entry(swp))
-   return 1;
+   return true;
else
-   return 0;
+   return false;
 }
 
 static int is_hugetlb_entry_hwpoisoned(pte_t pte)
-- 
2.7.4

[PATCH 2/7] mm/follow_page_mask: Split follow_page_mask to smaller functions.

2017-04-17 Thread Aneesh Kumar K.V

Makes code reading easy. No functional changes in this patch. In a followup
patch, we will be updating the follow_page_mask to handle hugetlb hugepd format
so that archs like ppc64 can switch to the generic version. This split helps
in doing that nicely.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/gup.c | 148 +++
 1 file changed, 91 insertions(+), 57 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 04aa405350dc..73d46f9f7b81 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -208,68 +208,16 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
return no_page_table(vma, flags);
 }
 
-/**
- * follow_page_mask - look up a page descriptor from a user-virtual address
- * @vma: vm_area_struct mapping @address
- * @address: virtual address to look up
- * @flags: flags modifying lookup behaviour
- * @page_mask: on output, *page_mask is set according to the size of the page
- *
- * @flags can have FOLL_ flags set, defined in 
- *
- * Returns the mapped (struct page *), %NULL if no mapping exists, or
- * an error pointer if there is a mapping to something not represented
- * by a page descriptor (see also vm_normal_page()).
- */
-struct page *follow_page_mask(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags,
- unsigned int *page_mask)
+static struct page *follow_pmd_mask(struct vm_area_struct *vma,
+   unsigned long address, pud_t *pudp,
+   unsigned int flags, unsigned int *page_mask)
 {
-   pgd_t *pgd;
-   p4d_t *p4d;
-   pud_t *pud;
pmd_t *pmd;
spinlock_t *ptl;
struct page *page;
struct mm_struct *mm = vma->vm_mm;
 
-   *page_mask = 0;
-
-   page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
-   if (!IS_ERR(page)) {
-   BUG_ON(flags & FOLL_GET);
-   return page;
-   }
-
-   pgd = pgd_offset(mm, address);
-   if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
-   return no_page_table(vma, flags);
-   p4d = p4d_offset(pgd, address);
-   if (p4d_none(*p4d))
-   return no_page_table(vma, flags);
-   BUILD_BUG_ON(p4d_huge(*p4d));
-   if (unlikely(p4d_bad(*p4d)))
-   return no_page_table(vma, flags);
-   pud = pud_offset(p4d, address);
-   if (pud_none(*pud))
-   return no_page_table(vma, flags);
-   if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
-   page = follow_huge_pud(mm, address, pud, flags);
-   if (page)
-   return page;
-   return no_page_table(vma, flags);
-   }
-   if (pud_devmap(*pud)) {
-   ptl = pud_lock(mm, pud);
-   page = follow_devmap_pud(vma, address, pud, flags);
-   spin_unlock(ptl);
-   if (page)
-   return page;
-   }
-   if (unlikely(pud_bad(*pud)))
-   return no_page_table(vma, flags);
-
-   pmd = pmd_offset(pud, address);
+   pmd = pmd_offset(pudp, address);
if (pmd_none(*pmd))
return no_page_table(vma, flags);
if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
@@ -319,13 +267,99 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
return ret ? ERR_PTR(ret) :
follow_page_pte(vma, address, pmd, flags);
}
-
page = follow_trans_huge_pmd(vma, address, pmd, flags);
spin_unlock(ptl);
*page_mask = HPAGE_PMD_NR - 1;
return page;
 }
 
+
+static struct page *follow_pud_mask(struct vm_area_struct *vma,
+   unsigned long address, p4d_t *p4dp,
+   unsigned int flags, unsigned int *page_mask)
+{
+   pud_t *pud;
+   spinlock_t *ptl;
+   struct page *page;
+   struct mm_struct *mm = vma->vm_mm;
+
+   pud = pud_offset(p4dp, address);
+   if (pud_none(*pud))
+   return no_page_table(vma, flags);
+   if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
+   page = follow_huge_pud(mm, address, pud, flags);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
+   if (pud_devmap(*pud)) {
+   ptl = pud_lock(mm, pud);
+   page = follow_devmap_pud(vma, address, pud, flags);
+   spin_unlock(ptl);
+   if (page)
+   return page;
+   }
+   if (unlikely(pud_bad(*pud)))
+   return no_page_table(vma, flags);
+
+   return follow_pmd_mask(vma, address, pud, flags, page_mask);
+}
+
+
+static struct page *follow_p4d_mask(struct vm_area_struct *vma,
+   unsigned long address, pgd_t *pgdp,
+

[PATCH 4/7] mm/follow_page_mask: Add support for hugetlb pgd entries.

2017-04-17 Thread Aneesh Kumar K.V

From: Anshuman Khandual 

ppc64 supports pgd hugetlb entries. Add code to handle hugetlb pgd entries to
follow_page_mask so that ppc64 can switch to it to handle hugetlbe entries.

Signed-off-by: Anshuman Khandual 
Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h | 4 
 mm/gup.c| 7 +++
 mm/hugetlb.c| 9 +
 3 files changed, 20 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index fddf6cf403d5..edab98f0a7b8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -121,6 +121,9 @@ struct page *follow_huge_pmd(struct mm_struct *mm, unsigned 
long address,
pmd_t *pmd, int flags);
 struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int flags);
+struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
+pgd_t *pgd, int flags);
+
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pud);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
@@ -150,6 +153,7 @@ static inline void hugetlb_show_meminfo(void)
 }
 #define follow_huge_pmd(mm, addr, pmd, flags)  NULL
 #define follow_huge_pud(mm, addr, pud, flags)  NULL
+#define follow_huge_pgd(mm, addr, pgd, flags)  NULL
 #define prepare_hugepage_range(file, addr, len)(-EINVAL)
 #define pmd_huge(x)0
 #define pud_huge(x)0
diff --git a/mm/gup.c b/mm/gup.c
index 73d46f9f7b81..65255389620a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -357,6 +357,13 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
return no_page_table(vma, flags);
 
+   if (pgd_huge(*pgd)) {
+   page = follow_huge_pgd(mm, address, pgd, flags);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
+
return follow_p4d_mask(vma, address, pgd, flags, page_mask);
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9b630e2195d5..83f39cf5162a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4694,6 +4694,15 @@ follow_huge_pud(struct mm_struct *mm, unsigned long 
address,
return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
 }
 
+struct page * __weak
+follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int 
flags)
+{
+   if (flags & FOLL_GET)
+   return NULL;
+
+   return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> 
PAGE_SHIFT);
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 
 /*
-- 
2.7.4

[PATCH 3/7] mm/hugetlb: export hugetlb_entry_migration helper

2017-04-17 Thread Aneesh Kumar K.V

We will be using this later from the ppc64 code. Change the return type to bool.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c| 8 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b857fc8cc2ec..fddf6cf403d5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -126,6 +126,7 @@ int pud_huge(pud_t pud);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);
 
+bool is_hugetlb_entry_migration(pte_t pte);
 #else /* !CONFIG_HUGETLB_PAGE */
 
 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2c090189f314..9b630e2195d5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3189,17 +3189,17 @@ static void set_huge_ptep_writable(struct 
vm_area_struct *vma,
update_mmu_cache(vma, address, ptep);
 }
 
-static int is_hugetlb_entry_migration(pte_t pte)
+bool is_hugetlb_entry_migration(pte_t pte)
 {
swp_entry_t swp;
 
if (huge_pte_none(pte) || pte_present(pte))
-   return 0;
+   return false;
swp = pte_to_swp_entry(pte);
if (non_swap_entry(swp) && is_migration_entry(swp))
-   return 1;
+   return true;
else
-   return 0;
+   return false;
 }
 
 static int is_hugetlb_entry_hwpoisoned(pte_t pte)
-- 
2.7.4

[PATCH 2/7] mm/follow_page_mask: Split follow_page_mask to smaller functions.

2017-04-17 Thread Aneesh Kumar K.V

Makes code reading easy. No functional changes in this patch. In a followup
patch, we will be updating the follow_page_mask to handle hugetlb hugepd format
so that archs like ppc64 can switch to the generic version. This split helps
in doing that nicely.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/gup.c | 148 +++
 1 file changed, 91 insertions(+), 57 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 04aa405350dc..73d46f9f7b81 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -208,68 +208,16 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
return no_page_table(vma, flags);
 }
 
-/**
- * follow_page_mask - look up a page descriptor from a user-virtual address
- * @vma: vm_area_struct mapping @address
- * @address: virtual address to look up
- * @flags: flags modifying lookup behaviour
- * @page_mask: on output, *page_mask is set according to the size of the page
- *
- * @flags can have FOLL_ flags set, defined in 
- *
- * Returns the mapped (struct page *), %NULL if no mapping exists, or
- * an error pointer if there is a mapping to something not represented
- * by a page descriptor (see also vm_normal_page()).
- */
-struct page *follow_page_mask(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags,
- unsigned int *page_mask)
+static struct page *follow_pmd_mask(struct vm_area_struct *vma,
+   unsigned long address, pud_t *pudp,
+   unsigned int flags, unsigned int *page_mask)
 {
-   pgd_t *pgd;
-   p4d_t *p4d;
-   pud_t *pud;
pmd_t *pmd;
spinlock_t *ptl;
struct page *page;
struct mm_struct *mm = vma->vm_mm;
 
-   *page_mask = 0;
-
-   page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
-   if (!IS_ERR(page)) {
-   BUG_ON(flags & FOLL_GET);
-   return page;
-   }
-
-   pgd = pgd_offset(mm, address);
-   if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
-   return no_page_table(vma, flags);
-   p4d = p4d_offset(pgd, address);
-   if (p4d_none(*p4d))
-   return no_page_table(vma, flags);
-   BUILD_BUG_ON(p4d_huge(*p4d));
-   if (unlikely(p4d_bad(*p4d)))
-   return no_page_table(vma, flags);
-   pud = pud_offset(p4d, address);
-   if (pud_none(*pud))
-   return no_page_table(vma, flags);
-   if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
-   page = follow_huge_pud(mm, address, pud, flags);
-   if (page)
-   return page;
-   return no_page_table(vma, flags);
-   }
-   if (pud_devmap(*pud)) {
-   ptl = pud_lock(mm, pud);
-   page = follow_devmap_pud(vma, address, pud, flags);
-   spin_unlock(ptl);
-   if (page)
-   return page;
-   }
-   if (unlikely(pud_bad(*pud)))
-   return no_page_table(vma, flags);
-
-   pmd = pmd_offset(pud, address);
+   pmd = pmd_offset(pudp, address);
if (pmd_none(*pmd))
return no_page_table(vma, flags);
if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
@@ -319,13 +267,99 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
return ret ? ERR_PTR(ret) :
follow_page_pte(vma, address, pmd, flags);
}
-
page = follow_trans_huge_pmd(vma, address, pmd, flags);
spin_unlock(ptl);
*page_mask = HPAGE_PMD_NR - 1;
return page;
 }
 
+
+static struct page *follow_pud_mask(struct vm_area_struct *vma,
+   unsigned long address, p4d_t *p4dp,
+   unsigned int flags, unsigned int *page_mask)
+{
+   pud_t *pud;
+   spinlock_t *ptl;
+   struct page *page;
+   struct mm_struct *mm = vma->vm_mm;
+
+   pud = pud_offset(p4dp, address);
+   if (pud_none(*pud))
+   return no_page_table(vma, flags);
+   if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
+   page = follow_huge_pud(mm, address, pud, flags);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
+   if (pud_devmap(*pud)) {
+   ptl = pud_lock(mm, pud);
+   page = follow_devmap_pud(vma, address, pud, flags);
+   spin_unlock(ptl);
+   if (page)
+   return page;
+   }
+   if (unlikely(pud_bad(*pud)))
+   return no_page_table(vma, flags);
+
+   return follow_pmd_mask(vma, address, pud, flags, page_mask);
+}
+
+
+static struct page *follow_p4d_mask(struct vm_area_struct *vma,
+   unsigned long address, pgd_t *pgdp,
+   unsigned int flags, unsigned int

[PATCH 4/7] mm/follow_page_mask: Add support for hugetlb pgd entries.

2017-04-17 Thread Aneesh Kumar K.V

From: Anshuman Khandual 

ppc64 supports pgd hugetlb entries. Add code to handle hugetlb pgd entries to
follow_page_mask so that ppc64 can switch to it to handle hugetlbe entries.

Signed-off-by: Anshuman Khandual 
Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h | 4 
 mm/gup.c| 7 +++
 mm/hugetlb.c| 9 +
 3 files changed, 20 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index fddf6cf403d5..edab98f0a7b8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -121,6 +121,9 @@ struct page *follow_huge_pmd(struct mm_struct *mm, unsigned 
long address,
pmd_t *pmd, int flags);
 struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int flags);
+struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
+pgd_t *pgd, int flags);
+
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pud);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
@@ -150,6 +153,7 @@ static inline void hugetlb_show_meminfo(void)
 }
 #define follow_huge_pmd(mm, addr, pmd, flags)  NULL
 #define follow_huge_pud(mm, addr, pud, flags)  NULL
+#define follow_huge_pgd(mm, addr, pgd, flags)  NULL
 #define prepare_hugepage_range(file, addr, len)(-EINVAL)
 #define pmd_huge(x)0
 #define pud_huge(x)0
diff --git a/mm/gup.c b/mm/gup.c
index 73d46f9f7b81..65255389620a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -357,6 +357,13 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
return no_page_table(vma, flags);
 
+   if (pgd_huge(*pgd)) {
+   page = follow_huge_pgd(mm, address, pgd, flags);
+   if (page)
+   return page;
+   return no_page_table(vma, flags);
+   }
+
return follow_p4d_mask(vma, address, pgd, flags, page_mask);
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9b630e2195d5..83f39cf5162a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4694,6 +4694,15 @@ follow_huge_pud(struct mm_struct *mm, unsigned long 
address,
return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
 }
 
+struct page * __weak
+follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int 
flags)
+{
+   if (flags & FOLL_GET)
+   return NULL;
+
+   return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> 
PAGE_SHIFT);
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 
 /*
-- 
2.7.4

[PATCH 0/7] HugeTLB migration support for PPC64

2017-04-17 Thread Aneesh Kumar K.V

This patch series add support for hugetlb page migration.

Aneesh Kumar K.V (6):
  mm/hugetlb/migration: Use set_huge_pte_at instead of set_pte_at
  mm/follow_page_mask: Split follow_page_mask to smaller functions.
  mm/hugetlb: export hugetlb_entry_migration helper
  mm/follow_page_mask: Add support for hugepage directory entry
  powerpc/hugetlb: Add follow_huge_pd implementation for ppc64.
  powerpc/hugetlb: Enable hugetlb migration for ppc64

Anshuman Khandual (1):
  mm/follow_page_mask: Add support for hugetlb pgd entries.

 arch/powerpc/mm/hugetlbpage.c  |  43 
 arch/powerpc/platforms/Kconfig.cputype |   5 +
 include/linux/hugetlb.h|   9 ++
 mm/gup.c   | 186 +++--
 mm/hugetlb.c   |  25 -
 mm/migrate.c   |  21 ++--
 6 files changed, 219 insertions(+), 70 deletions(-)

-- 
2.7.4

[PATCH 0/7] HugeTLB migration support for PPC64

2017-04-17 Thread Aneesh Kumar K.V

This patch series add support for hugetlb page migration.

Aneesh Kumar K.V (6):
  mm/hugetlb/migration: Use set_huge_pte_at instead of set_pte_at
  mm/follow_page_mask: Split follow_page_mask to smaller functions.
  mm/hugetlb: export hugetlb_entry_migration helper
  mm/follow_page_mask: Add support for hugepage directory entry
  powerpc/hugetlb: Add follow_huge_pd implementation for ppc64.
  powerpc/hugetlb: Enable hugetlb migration for ppc64

Anshuman Khandual (1):
  mm/follow_page_mask: Add support for hugetlb pgd entries.

 arch/powerpc/mm/hugetlbpage.c  |  43 
 arch/powerpc/platforms/Kconfig.cputype |   5 +
 include/linux/hugetlb.h|   9 ++
 mm/gup.c   | 186 +++--
 mm/hugetlb.c   |  25 -
 mm/migrate.c   |  21 ++--
 6 files changed, 219 insertions(+), 70 deletions(-)

-- 
2.7.4

Re:drivers:watchdog:aspeed_wdt: using msleep instead of mdelay

2017-04-17 Thread Karim Eshapa

On Sun, 16 Apr 2017 12:53:28 -0700,Guenter Roeck wrote:
> On 04/16/2017 09:33 AM, Karim Eshapa wrote:
>>
>> that's useful for the scheduler, power management unless
>> the driver needs to delay in atomic context
>> look at documentation/timers/timers-howto
>
> Possibly, but how can you guarantee that the restart function is called with
> interrupts enabled ? Also, why would it be necessary or even useful for the 
> scheduler to do anything while the system is in the process of restarting ?

>From signaling or interruption point of view msleep() is uninterruptible. 
your process will sleep and won't be waked up until finish the time.

>From the cpu load and power point of view, mdelay() makes your code
stucked doing nothing until the delay finishes so, it's still headache
to the schedular from time slot perspective.
Although it's restating but it's still a long process that takes time. 

In addittion to mdelay() isn't preferable in case of large delays +10 as it 
uses udelay()
 
But the question now what about ptotecting your HW while being accessed
through manipulating the registers. and what about memory reordering may be 
generated
by the compiler or the machine itself! while accessing a sequence of registers.

>> Signed-off-by: Karim Eshapa 
>> ---
>> drivers/watchdog/aspeed_wdt.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)

>>  diff --git a/drivers/watchdog/aspeed_wdt.c b/drivers/watchdog/aspeed_wdt.c
>> index 1c65258..17f06d1 100644
>> --- a/drivers/watchdog/aspeed_wdt.c
>>  +++ b/drivers/watchdog/aspeed_wdt.c
>> @@ -115,7 +115,7 @@ static int aspeed_wdt_restart(struct watchdog_device 
>> *wdd>> , aspeed_wdt_enable(wdt, 128 * WDT_RATE_1MHZ / 1000);
>> -   mdelay(1000);
>> +   msleep(1000);
>>
>> return 0;
>>   }

Thanks,
Karim

Re:drivers:watchdog:aspeed_wdt: using msleep instead of mdelay

2017-04-17 Thread Karim Eshapa

On Sun, 16 Apr 2017 12:53:28 -0700,Guenter Roeck wrote:
> On 04/16/2017 09:33 AM, Karim Eshapa wrote:
>>
>> that's useful for the scheduler, power management unless
>> the driver needs to delay in atomic context
>> look at documentation/timers/timers-howto
>
> Possibly, but how can you guarantee that the restart function is called with
> interrupts enabled ? Also, why would it be necessary or even useful for the 
> scheduler to do anything while the system is in the process of restarting ?

>From signaling or interruption point of view msleep() is uninterruptible. 
your process will sleep and won't be waked up until finish the time.

>From the cpu load and power point of view, mdelay() makes your code
stucked doing nothing until the delay finishes so, it's still headache
to the schedular from time slot perspective.
Although it's restating but it's still a long process that takes time. 

In addittion to mdelay() isn't preferable in case of large delays +10 as it 
uses udelay()
 
But the question now what about ptotecting your HW while being accessed
through manipulating the registers. and what about memory reordering may be 
generated
by the compiler or the machine itself! while accessing a sequence of registers.

>> Signed-off-by: Karim Eshapa 
>> ---
>> drivers/watchdog/aspeed_wdt.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)

>>  diff --git a/drivers/watchdog/aspeed_wdt.c b/drivers/watchdog/aspeed_wdt.c
>> index 1c65258..17f06d1 100644
>> --- a/drivers/watchdog/aspeed_wdt.c
>>  +++ b/drivers/watchdog/aspeed_wdt.c
>> @@ -115,7 +115,7 @@ static int aspeed_wdt_restart(struct watchdog_device 
>> *wdd>> , aspeed_wdt_enable(wdt, 128 * WDT_RATE_1MHZ / 1000);
>> -   mdelay(1000);
>> +   msleep(1000);
>>
>> return 0;
>>   }

Thanks,
Karim

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-17 Thread Dan Williams

On Mon, Apr 17, 2017 at 9:52 AM, Logan Gunthorpe  wrote:
>
>
> On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote:
>> But is it ? For example take a GPU, does it, in your scheme, need an
>> additional "p2pmem" child ? Why can't the GPU driver just use some
>> helper to instantiate the necessary struct pages ? What does having an
>> actual "struct device" child buys you ?
>
> Yes, in this scheme, it needs an additional p2pmem child. Why is that an
> issue? It certainly makes it a lot easier for the user to understand the
> p2pmem memory in the system (through the sysfs tree) and reason about
> the topology and when to use it. This is important.

I think you want to go the other way in the hierarchy and find a
shared *parent* to land the p2pmem capability. Because that same agent
is going to be responsible handling address translation for the peers.

>>> 2) In order to create the struct pages we use the ZONE_DEVICE
>>> infrastructure which requires a struct device. (See
>>> devm_memremap_pages.)
>>
>> Yup, but you already have one in the actual pci_dev ... What is the
>> benefit of adding a second one ?
>
> But that would tie all of this very tightly to be pci only and may get
> hard to differentiate if more users of ZONE_DEVICE crop up who happen to
> be using a pci device. Having a specific class for this makes it very
> clear how this memory would be handled. For example, although I haven't
> looked into it, this could very well be a point of conflict with HMM. If
> they were to use the pci device to populate the dev_pagemap then we
> couldn't also use the pci device. I feel it's much better for users of
> dev_pagemap to have their struct devices they own to avoid such conflicts.

Peer-dma is always going to be a property of the bus and not the end
devices. Requiring each bus implementation to explicitly enable
peer-to-peer support is a feature not a bug.

>>>  This amazingly gets us the get_dev_pagemap
>>> architecture which also uses a struct device. So by using a p2pmem
>>> device we can go from struct page to struct device to p2pmem device
>>> quickly and effortlessly.
>>
>> Which isn't terribly useful in itself right ? What you care about is
>> the "enclosing" pci_dev no ? Or am I missing something ?
>
> Sure it is. What if we want to someday support p2pmem that's on another bus?

We shouldn't design for some future possible use case. Solve it for
pci and when / if another bus comes along then look at a more generic
abstraction.

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-17 Thread Dan Williams

On Mon, Apr 17, 2017 at 9:52 AM, Logan Gunthorpe  wrote:
>
>
> On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote:
>> But is it ? For example take a GPU, does it, in your scheme, need an
>> additional "p2pmem" child ? Why can't the GPU driver just use some
>> helper to instantiate the necessary struct pages ? What does having an
>> actual "struct device" child buys you ?
>
> Yes, in this scheme, it needs an additional p2pmem child. Why is that an
> issue? It certainly makes it a lot easier for the user to understand the
> p2pmem memory in the system (through the sysfs tree) and reason about
> the topology and when to use it. This is important.

I think you want to go the other way in the hierarchy and find a
shared *parent* to land the p2pmem capability. Because that same agent
is going to be responsible handling address translation for the peers.

>>> 2) In order to create the struct pages we use the ZONE_DEVICE
>>> infrastructure which requires a struct device. (See
>>> devm_memremap_pages.)
>>
>> Yup, but you already have one in the actual pci_dev ... What is the
>> benefit of adding a second one ?
>
> But that would tie all of this very tightly to be pci only and may get
> hard to differentiate if more users of ZONE_DEVICE crop up who happen to
> be using a pci device. Having a specific class for this makes it very
> clear how this memory would be handled. For example, although I haven't
> looked into it, this could very well be a point of conflict with HMM. If
> they were to use the pci device to populate the dev_pagemap then we
> couldn't also use the pci device. I feel it's much better for users of
> dev_pagemap to have their struct devices they own to avoid such conflicts.

Peer-dma is always going to be a property of the bus and not the end
devices. Requiring each bus implementation to explicitly enable
peer-to-peer support is a feature not a bug.

>>>  This amazingly gets us the get_dev_pagemap
>>> architecture which also uses a struct device. So by using a p2pmem
>>> device we can go from struct page to struct device to p2pmem device
>>> quickly and effortlessly.
>>
>> Which isn't terribly useful in itself right ? What you care about is
>> the "enclosing" pci_dev no ? Or am I missing something ?
>
> Sure it is. What if we want to someday support p2pmem that's on another bus?

We shouldn't design for some future possible use case. Solve it for
pci and when / if another bus comes along then look at a more generic
abstraction.

Re: [PATCH] net: thunderx: Fix set_max_bgx_per_node for 81xx rgx

2017-04-17 Thread David Miller

From: George Cherian 
Date: Thu, 13 Apr 2017 07:25:01 +

> Add the PCI_SUBSYS_DEVID_81XX_RGX and use the same to set
> the max bgx per node count.
> 
> This fixes the issue intoduced by following commit
> 78aacb6f6 net: thunderx: Fix invalid mac addresses for node1 interfaces
> With this commit the max_bgx_per_node for 81xx is set as 2 instead of 3
> because of which num_vfs is always calculated as zero.
> 
> Signed-off-by: George Cherian 

Applied.

Re: [PATCH] net: thunderx: Fix set_max_bgx_per_node for 81xx rgx

2017-04-17 Thread David Miller

From: George Cherian 
Date: Thu, 13 Apr 2017 07:25:01 +

> Add the PCI_SUBSYS_DEVID_81XX_RGX and use the same to set
> the max bgx per node count.
> 
> This fixes the issue intoduced by following commit
> 78aacb6f6 net: thunderx: Fix invalid mac addresses for node1 interfaces
> With this commit the max_bgx_per_node for 81xx is set as 2 instead of 3
> because of which num_vfs is always calculated as zero.
> 
> Signed-off-by: George Cherian 

Applied.

Re: [PATCH net-next] l2tp: device MTU setup, tunnel socket needs a lock

2017-04-17 Thread David Miller

From: "R. Parameswaran" 
Date: Wed, 12 Apr 2017 18:31:04 -0700 (PDT)

> 
> The MTU overhead calculation in L2TP device set-up
> merged via commit b784e7ebfce8cfb16c6f95e14e8532d0768ab7ff
> needs to be adjusted to lock the tunnel socket while
> referencing the sub-data structures to derive the
> socket's IP overhead.
> 
> Reported-by: Guillaume Nault 
> Tested-by: Guillaume Nault 
> Signed-off-by: R. Parameswaran 

Applied, thanks.

Re: [PATCH net-next] l2tp: device MTU setup, tunnel socket needs a lock

2017-04-17 Thread David Miller

From: "R. Parameswaran" 
Date: Wed, 12 Apr 2017 18:31:04 -0700 (PDT)

> 
> The MTU overhead calculation in L2TP device set-up
> merged via commit b784e7ebfce8cfb16c6f95e14e8532d0768ab7ff
> needs to be adjusted to lock the tunnel socket while
> referencing the sub-data structures to derive the
> socket's IP overhead.
> 
> Reported-by: Guillaume Nault 
> Tested-by: Guillaume Nault 
> Signed-off-by: R. Parameswaran 

Applied, thanks.

Re: [PATCH] clocksource: Use GENMASK_ULL in definition of CLOCKSOURCE_MASK

2017-04-17 Thread Matthias Kaehlcke

Hi Dmitry,

El Fri, Apr 14, 2017 at 09:02:03PM -0700 Dmitry Torokhov ha dit:

> On Tue, Apr 11, 2017 at 12:17 PM, Matthias Kaehlcke  wrote:
> > Besides reusing existing code this removes the special case handling
> > for 64-bit masks, which causes clang to raise a shift count overflow
> > warning due to https://bugs.llvm.org//show_bug.cgi?id=10030.
> >
> > Suggested-by: Dmitry Torokhov 
> > Signed-off-by: Matthias Kaehlcke 
> > ---
> >  include/linux/clocksource.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > index cfc75848a35d..06e604b9e9dc 100644
> > --- a/include/linux/clocksource.h
> > +++ b/include/linux/clocksource.h
> > @@ -120,7 +120,7 @@ struct clocksource {
> >  #define CLOCK_SOURCE_RESELECT  0x100
> >
> >  /* simplify initialization of mask field */
> > -#define CLOCKSOURCE_MASK(bits) (u64)((bits) < 64 ? ((1ULL<<(bits))-1) : -1)
> > +#define CLOCKSOURCE_MASK(bits) (u64)GENMASK_ULL((bits) - 1, 0)
> 
> I do not think cast to u64 is needed for GENMASK_ULL.

Indeed, it is not needed, I will update the patch

Thanks!

Matthias

Re: [PATCH] clocksource: Use GENMASK_ULL in definition of CLOCKSOURCE_MASK

2017-04-17 Thread Matthias Kaehlcke

Hi Dmitry,

El Fri, Apr 14, 2017 at 09:02:03PM -0700 Dmitry Torokhov ha dit:

> On Tue, Apr 11, 2017 at 12:17 PM, Matthias Kaehlcke  wrote:
> > Besides reusing existing code this removes the special case handling
> > for 64-bit masks, which causes clang to raise a shift count overflow
> > warning due to https://bugs.llvm.org//show_bug.cgi?id=10030.
> >
> > Suggested-by: Dmitry Torokhov 
> > Signed-off-by: Matthias Kaehlcke 
> > ---
> >  include/linux/clocksource.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > index cfc75848a35d..06e604b9e9dc 100644
> > --- a/include/linux/clocksource.h
> > +++ b/include/linux/clocksource.h
> > @@ -120,7 +120,7 @@ struct clocksource {
> >  #define CLOCK_SOURCE_RESELECT  0x100
> >
> >  /* simplify initialization of mask field */
> > -#define CLOCKSOURCE_MASK(bits) (u64)((bits) < 64 ? ((1ULL<<(bits))-1) : -1)
> > +#define CLOCKSOURCE_MASK(bits) (u64)GENMASK_ULL((bits) - 1, 0)
> 
> I do not think cast to u64 is needed for GENMASK_ULL.

Indeed, it is not needed, I will update the patch

Thanks!

Matthias

Re: [PATCH resend 3/4] uapi glibc compat: Do not check for __USE_MISC

2017-04-17 Thread David Miller

From: Hauke Mehrtens 
Date: Wed, 12 Apr 2017 22:31:02 +0200

> __USE_MISC is glibc specific and not available in musl libc. Only do
> this check when glibc is used. This fixes a problem with musl libc.
> 
> Acked-by: Mikko Rapeli 
> Signed-off-by: Hauke Mehrtens 

You yourself introduce this problem in patch #1.

The code before patch #1 was perfectly fine, as this code block only
existing in an area protected by __GLIBC__.

So you have to redo these patches such that you deal with all of
the fallout from changing __GLIBC__ into !__KERNEL__ at the same
time that you make that change.

Thanks.

Re: [PATCH resend 3/4] uapi glibc compat: Do not check for __USE_MISC

2017-04-17 Thread David Miller

From: Hauke Mehrtens 
Date: Wed, 12 Apr 2017 22:31:02 +0200

> __USE_MISC is glibc specific and not available in musl libc. Only do
> this check when glibc is used. This fixes a problem with musl libc.
> 
> Acked-by: Mikko Rapeli 
> Signed-off-by: Hauke Mehrtens 

You yourself introduce this problem in patch #1.

The code before patch #1 was perfectly fine, as this code block only
existing in an area protected by __GLIBC__.

So you have to redo these patches such that you deal with all of
the fallout from changing __GLIBC__ into !__KERNEL__ at the same
time that you make that change.

Thanks.

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-17 Thread Logan Gunthorpe

On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote:
> But is it ? For example take a GPU, does it, in your scheme, need an
> additional "p2pmem" child ? Why can't the GPU driver just use some
> helper to instantiate the necessary struct pages ? What does having an
> actual "struct device" child buys you ?

Yes, in this scheme, it needs an additional p2pmem child. Why is that an
issue? It certainly makes it a lot easier for the user to understand the
p2pmem memory in the system (through the sysfs tree) and reason about
the topology and when to use it. This is important.

> 
>> 2) In order to create the struct pages we use the ZONE_DEVICE
>> infrastructure which requires a struct device. (See
>> devm_memremap_pages.)
> 
> Yup, but you already have one in the actual pci_dev ... What is the
> benefit of adding a second one ?

But that would tie all of this very tightly to be pci only and may get
hard to differentiate if more users of ZONE_DEVICE crop up who happen to
be using a pci device. Having a specific class for this makes it very
clear how this memory would be handled. For example, although I haven't
looked into it, this could very well be a point of conflict with HMM. If
they were to use the pci device to populate the dev_pagemap then we
couldn't also use the pci device. I feel it's much better for users of
dev_pagemap to have their struct devices they own to avoid such conflicts.

> 
>>  This amazingly gets us the get_dev_pagemap
>> architecture which also uses a struct device. So by using a p2pmem
>> device we can go from struct page to struct device to p2pmem device
>> quickly and effortlessly.
> 
> Which isn't terribly useful in itself right ? What you care about is
> the "enclosing" pci_dev no ? Or am I missing something ?

Sure it is. What if we want to someday support p2pmem that's on another bus?

>> 3) You wouldn't want to use the pci's struct device because it doesn't
>> really describe what's going on. For example, there may be multiple
>> devices on the pci device in question: eg. an NVME card and some p2pmem.
> 
> What is "some p2pmem" ?
>> Or it could be a NIC with some p2pmem.
> 
> Again what is "some p2pmem" ?

Some device local memory intended for use as a DMA target from a
neighbour device or itself. On a PCI device, this would be a BAR, or a
portion of a BAR with memory behind it.

Keep in mind device classes tend to carve out common use cases and don't
have a one to one mapping with a physical pci card.

> That a device might have some memory-like buffer space is all well and
> good but does it need to be specifically distinguished at the device
> level ? It could be inherent to what the device is... for example again
> take the GPU example, why would you call the FB memory "p2pmem" ? 

Well if you are using it for p2p transactions why wouldn't you call it
p2pmem? There's no technical downside here except some vague argument
over naming. Once registered as p2pmem, that device will handle all the
dma map stuff for you and have a central obvious place to put code which
helps decide whether to use it or not based on topology.

I can certainly see an issue you'd have with the current RFC in that the
p2pmem device currently also handles memory allocation which a GPU would
 want to do itself. There are plenty of solutions to this though: we
could provide hooks for the parent device to override allocation or
something like that. However, the use cases I'm concerned with don't do
their own allocation so that is an important feature for them.

> Again I'm not sure why it needs to "instanciate a p2pmem" device. Maybe
> it's the term "p2pmem" that offputs me. If p2pmem allowed to have a
> standard way to lookup the various offsets etc... I mentioned earlier,
> then yes, it would make sense to have it as a staging point. As-is, I
> don't know. 

Well of course, at some point it would have a standard way to lookup
offsets and figure out what's necessary for a mapping. We wouldn't make
that separate from this, that would make no sense.

I also forgot:

4) We need someway in the kernel to configure drivers that use p2pmem.
That means it needs a unique name that the user can understand, lookup
and pass to other drivers. Then a way for those drivers to find it in
the system. A specific device class gets that for us in a very simple
fashion. We also don't want to have drivers like nvmet having to walk
every pci device to figure out where the p2p memory is and whether it
can use it.

IMO there are many clear benefits here and you haven't really offered an
alternative that provides the same features and potential for future use
cases.

Logan

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-17 Thread Logan Gunthorpe

On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote:
> But is it ? For example take a GPU, does it, in your scheme, need an
> additional "p2pmem" child ? Why can't the GPU driver just use some
> helper to instantiate the necessary struct pages ? What does having an
> actual "struct device" child buys you ?

Yes, in this scheme, it needs an additional p2pmem child. Why is that an
issue? It certainly makes it a lot easier for the user to understand the
p2pmem memory in the system (through the sysfs tree) and reason about
the topology and when to use it. This is important.

> 
>> 2) In order to create the struct pages we use the ZONE_DEVICE
>> infrastructure which requires a struct device. (See
>> devm_memremap_pages.)
> 
> Yup, but you already have one in the actual pci_dev ... What is the
> benefit of adding a second one ?

But that would tie all of this very tightly to be pci only and may get
hard to differentiate if more users of ZONE_DEVICE crop up who happen to
be using a pci device. Having a specific class for this makes it very
clear how this memory would be handled. For example, although I haven't
looked into it, this could very well be a point of conflict with HMM. If
they were to use the pci device to populate the dev_pagemap then we
couldn't also use the pci device. I feel it's much better for users of
dev_pagemap to have their struct devices they own to avoid such conflicts.

> 
>>  This amazingly gets us the get_dev_pagemap
>> architecture which also uses a struct device. So by using a p2pmem
>> device we can go from struct page to struct device to p2pmem device
>> quickly and effortlessly.
> 
> Which isn't terribly useful in itself right ? What you care about is
> the "enclosing" pci_dev no ? Or am I missing something ?

Sure it is. What if we want to someday support p2pmem that's on another bus?

>> 3) You wouldn't want to use the pci's struct device because it doesn't
>> really describe what's going on. For example, there may be multiple
>> devices on the pci device in question: eg. an NVME card and some p2pmem.
> 
> What is "some p2pmem" ?
>> Or it could be a NIC with some p2pmem.
> 
> Again what is "some p2pmem" ?

Some device local memory intended for use as a DMA target from a
neighbour device or itself. On a PCI device, this would be a BAR, or a
portion of a BAR with memory behind it.

Keep in mind device classes tend to carve out common use cases and don't
have a one to one mapping with a physical pci card.

> That a device might have some memory-like buffer space is all well and
> good but does it need to be specifically distinguished at the device
> level ? It could be inherent to what the device is... for example again
> take the GPU example, why would you call the FB memory "p2pmem" ? 

Well if you are using it for p2p transactions why wouldn't you call it
p2pmem? There's no technical downside here except some vague argument
over naming. Once registered as p2pmem, that device will handle all the
dma map stuff for you and have a central obvious place to put code which
helps decide whether to use it or not based on topology.

I can certainly see an issue you'd have with the current RFC in that the
p2pmem device currently also handles memory allocation which a GPU would
 want to do itself. There are plenty of solutions to this though: we
could provide hooks for the parent device to override allocation or
something like that. However, the use cases I'm concerned with don't do
their own allocation so that is an important feature for them.

> Again I'm not sure why it needs to "instanciate a p2pmem" device. Maybe
> it's the term "p2pmem" that offputs me. If p2pmem allowed to have a
> standard way to lookup the various offsets etc... I mentioned earlier,
> then yes, it would make sense to have it as a staging point. As-is, I
> don't know. 

Well of course, at some point it would have a standard way to lookup
offsets and figure out what's necessary for a mapping. We wouldn't make
that separate from this, that would make no sense.

I also forgot:

4) We need someway in the kernel to configure drivers that use p2pmem.
That means it needs a unique name that the user can understand, lookup
and pass to other drivers. Then a way for those drivers to find it in
the system. A specific device class gets that for us in a very simple
fashion. We also don't want to have drivers like nvmet having to walk
every pci device to figure out where the p2p memory is and whether it
can use it.

IMO there are many clear benefits here and you haven't really offered an
alternative that provides the same features and potential for future use
cases.

Logan

Re: [PATCH V2] PM / OPP: Use - instead of @ for DT entries

2017-04-17 Thread Rafael J. Wysocki

On Monday, April 17, 2017 06:35:25 PM Rafael J. Wysocki wrote:
> On Monday, April 17, 2017 11:07:51 AM Masahiro Yamada wrote:
> > 2017-04-15 7:47 GMT+09:00 Rafael J. Wysocki :
> > > On Monday, April 10, 2017 02:51:35 PM Viresh Kumar wrote:
> > >> Compiling the DT file with W=1, DTC warns like follows:
> > >>
> > >> Warning (unit_address_vs_reg): Node /opp_table0/opp@10 has a
> > >> unit name, but no reg property
> > >>
> > >> Fix this by replacing '@' with '-' as the OPP nodes will never have a
> > >> "reg" property.
> > >>
> > >> Reported-by: Krzysztof Kozlowski 
> > >> Reported-by: Masahiro Yamada 
> > >> Suggested-by: Mark Rutland 
> > >> Signed-off-by: Viresh Kumar 
> > >> Acked-by: Maxime Ripard  (sunxi)
> > >> Reviewed-by: Chanwoo Choi 
> > >> Reviewed-by: Krzysztof Kozlowski 
> > >
> > > OK, so any ACKs from the DT side?  Rob?
> > >
> > > Thanks,
> > > Rafael
> > 
> > 
> > 
> > I see Rob's Acked-by.
> > 
> > https://lkml.org/lkml/2017/4/13/648
> 
> Indeed.  Thanks!

But it doesn't apply on top of -rc7 for me, so I guess there is new 
4.12-candidate
material in the DTS tree that conflicts with this, in which case it is better 
to route
it through the DTS tree IMO.

Thanks,
Rafael

Re: [PATCH V2] PM / OPP: Use - instead of @ for DT entries

2017-04-17 Thread Rafael J. Wysocki

On Monday, April 17, 2017 06:35:25 PM Rafael J. Wysocki wrote:
> On Monday, April 17, 2017 11:07:51 AM Masahiro Yamada wrote:
> > 2017-04-15 7:47 GMT+09:00 Rafael J. Wysocki :
> > > On Monday, April 10, 2017 02:51:35 PM Viresh Kumar wrote:
> > >> Compiling the DT file with W=1, DTC warns like follows:
> > >>
> > >> Warning (unit_address_vs_reg): Node /opp_table0/opp@10 has a
> > >> unit name, but no reg property
> > >>
> > >> Fix this by replacing '@' with '-' as the OPP nodes will never have a
> > >> "reg" property.
> > >>
> > >> Reported-by: Krzysztof Kozlowski 
> > >> Reported-by: Masahiro Yamada 
> > >> Suggested-by: Mark Rutland 
> > >> Signed-off-by: Viresh Kumar 
> > >> Acked-by: Maxime Ripard  (sunxi)
> > >> Reviewed-by: Chanwoo Choi 
> > >> Reviewed-by: Krzysztof Kozlowski 
> > >
> > > OK, so any ACKs from the DT side?  Rob?
> > >
> > > Thanks,
> > > Rafael
> > 
> > 
> > 
> > I see Rob's Acked-by.
> > 
> > https://lkml.org/lkml/2017/4/13/648
> 
> Indeed.  Thanks!

But it doesn't apply on top of -rc7 for me, so I guess there is new 
4.12-candidate
material in the DTS tree that conflicts with this, in which case it is better 
to route
it through the DTS tree IMO.

Thanks,
Rafael

Re: [PATCH V2] PM / OPP: Use - instead of @ for DT entries

2017-04-17 Thread Rafael J. Wysocki

On Monday, April 17, 2017 11:07:51 AM Masahiro Yamada wrote:
> 2017-04-15 7:47 GMT+09:00 Rafael J. Wysocki :
> > On Monday, April 10, 2017 02:51:35 PM Viresh Kumar wrote:
> >> Compiling the DT file with W=1, DTC warns like follows:
> >>
> >> Warning (unit_address_vs_reg): Node /opp_table0/opp@10 has a
> >> unit name, but no reg property
> >>
> >> Fix this by replacing '@' with '-' as the OPP nodes will never have a
> >> "reg" property.
> >>
> >> Reported-by: Krzysztof Kozlowski 
> >> Reported-by: Masahiro Yamada 
> >> Suggested-by: Mark Rutland 
> >> Signed-off-by: Viresh Kumar 
> >> Acked-by: Maxime Ripard  (sunxi)
> >> Reviewed-by: Chanwoo Choi 
> >> Reviewed-by: Krzysztof Kozlowski 
> >
> > OK, so any ACKs from the DT side?  Rob?
> >
> > Thanks,
> > Rafael
> 
> 
> 
> I see Rob's Acked-by.
> 
> https://lkml.org/lkml/2017/4/13/648

Indeed.  Thanks!

Regards,
Rafael

Re: [PATCH V2] PM / OPP: Use - instead of @ for DT entries

2017-04-17 Thread Rafael J. Wysocki

On Monday, April 17, 2017 11:07:51 AM Masahiro Yamada wrote:
> 2017-04-15 7:47 GMT+09:00 Rafael J. Wysocki :
> > On Monday, April 10, 2017 02:51:35 PM Viresh Kumar wrote:
> >> Compiling the DT file with W=1, DTC warns like follows:
> >>
> >> Warning (unit_address_vs_reg): Node /opp_table0/opp@10 has a
> >> unit name, but no reg property
> >>
> >> Fix this by replacing '@' with '-' as the OPP nodes will never have a
> >> "reg" property.
> >>
> >> Reported-by: Krzysztof Kozlowski 
> >> Reported-by: Masahiro Yamada 
> >> Suggested-by: Mark Rutland 
> >> Signed-off-by: Viresh Kumar 
> >> Acked-by: Maxime Ripard  (sunxi)
> >> Reviewed-by: Chanwoo Choi 
> >> Reviewed-by: Krzysztof Kozlowski 
> >
> > OK, so any ACKs from the DT side?  Rob?
> >
> > Thanks,
> > Rafael
> 
> 
> 
> I see Rob's Acked-by.
> 
> https://lkml.org/lkml/2017/4/13/648

Indeed.  Thanks!

Regards,
Rafael

Re: [PATCH V8 4/5] PCI/ASPM: save power on values during bridge init

2017-04-17 Thread Bjorn Helgaas

On Fri, Apr 14, 2017 at 5:17 PM, Sinan Kaya  wrote:
> On 4/14/2017 5:44 PM, Bjorn Helgaas wrote:
>> I think there's an argument to be made that if we care about ASPM
>> configuration, we should be using a non-POLICY_DEFAULT setting.  And I
>> think there's value in having POLICY_DEFAULT be the absolute lowest-
>> risk setting, which I think means option 1.
>>
>> What do you think?
>
> I see your point. The counter argument is that most of the users do not
> know what an ASPM kernel command line is unless they understand PCI
> language.

I don't think the answer is using the "pcie_aspm.policy=" boot
argument.  I certainly don't want users to have to deal with that.  I
wish we didn't even have that parameter.

I think we need runtime knobs instead (and I guess we already have
/sys/module/pcie_aspm/parameters/policy and /sys/.../link_state), and
distro userspace should use them.  I'm envisioning something in
"System Settings / Power" or similar.  Basically I think the policy
doesn't *have* to be dictated by a kernel boot-time parameter, so it
should not be.

> I have been using the powersave policy option until now. I recently realized
> that nobody except me is using this option. Therefore, we are wasting
> power by default following a hotplug insertion.
>
> This is the case where I'm trying to avoid. With the introduction of NVMe
> u.2 drives, hotplug is becoming more and more mainstream. I decided to
> take the matters into my hand with this series for this very reason.
>
> Like you said, BIOS is out of the picture with pciehp. There is nobody
> to configure ASPM following a hotplug insertion.
>
> I can also claim that If user wants performance, they should boot with
> the performance policy or pcie_aspm=off parameters.
>
> I saw this recommendation in multiple DPDK tuning documents.
>
> Like you said, what do we do by default is the question. Should we opt
> for safe like we are doing, or try to save some power.

I think safety is paramount.  Every user should be able to boot safely
without any kernel parameters.  We don't want users to have a problem
booting and then have to search for a workaround like booting with
"pcie_aspm=off".  Most users will never do that.

Here's a long-term strawman proposal, see what you think:

  - Deprecate CONFIG_PCIEASPM_DEFAULT, CONFIG_PCIEASPM_POWERSAVE, etc.
  - Default aspm_policy is POLICY_DEFAULT always.
  - POLICY_DEFAULT means Linux doesn't touch anything: if BIOS enabled
ASPM, we leave it that way; we leave ASPM disabled on hot-added
devices.
  - Deprecate kernel boot parameters (possibly keep pcie_aspm=off for
debugging use).
  - Use /sys/module/pcie_aspm/parameters/policy for run-time
system-wide  control, including for future hot-added devices.
  - Remove CONFIG_PCIEASPM_DEBUG and enable that code always, so we
have fine-grained run-time control.

> Maybe, we are missing a HPP option from the PCI spec.

That's an interesting idea.  _HPX does have provision for manipulating
Link Control bits (see ACPI r5.0, sec 6.2.8.3), but I don't think very
many systems implement it.  And there's currently no connection
between program_hpp_type2() and aspm.c, so I'm a little worried that
we might have issues if a system did implement an _HPX that sets any
of the ASPM bits.

Bjorn

Re: [PATCH V8 4/5] PCI/ASPM: save power on values during bridge init

2017-04-17 Thread Bjorn Helgaas

On Fri, Apr 14, 2017 at 5:17 PM, Sinan Kaya  wrote:
> On 4/14/2017 5:44 PM, Bjorn Helgaas wrote:
>> I think there's an argument to be made that if we care about ASPM
>> configuration, we should be using a non-POLICY_DEFAULT setting.  And I
>> think there's value in having POLICY_DEFAULT be the absolute lowest-
>> risk setting, which I think means option 1.
>>
>> What do you think?
>
> I see your point. The counter argument is that most of the users do not
> know what an ASPM kernel command line is unless they understand PCI
> language.

I don't think the answer is using the "pcie_aspm.policy=" boot
argument.  I certainly don't want users to have to deal with that.  I
wish we didn't even have that parameter.

I think we need runtime knobs instead (and I guess we already have
/sys/module/pcie_aspm/parameters/policy and /sys/.../link_state), and
distro userspace should use them.  I'm envisioning something in
"System Settings / Power" or similar.  Basically I think the policy
doesn't *have* to be dictated by a kernel boot-time parameter, so it
should not be.

> I have been using the powersave policy option until now. I recently realized
> that nobody except me is using this option. Therefore, we are wasting
> power by default following a hotplug insertion.
>
> This is the case where I'm trying to avoid. With the introduction of NVMe
> u.2 drives, hotplug is becoming more and more mainstream. I decided to
> take the matters into my hand with this series for this very reason.
>
> Like you said, BIOS is out of the picture with pciehp. There is nobody
> to configure ASPM following a hotplug insertion.
>
> I can also claim that If user wants performance, they should boot with
> the performance policy or pcie_aspm=off parameters.
>
> I saw this recommendation in multiple DPDK tuning documents.
>
> Like you said, what do we do by default is the question. Should we opt
> for safe like we are doing, or try to save some power.

I think safety is paramount.  Every user should be able to boot safely
without any kernel parameters.  We don't want users to have a problem
booting and then have to search for a workaround like booting with
"pcie_aspm=off".  Most users will never do that.

Here's a long-term strawman proposal, see what you think:

  - Deprecate CONFIG_PCIEASPM_DEFAULT, CONFIG_PCIEASPM_POWERSAVE, etc.
  - Default aspm_policy is POLICY_DEFAULT always.
  - POLICY_DEFAULT means Linux doesn't touch anything: if BIOS enabled
ASPM, we leave it that way; we leave ASPM disabled on hot-added
devices.
  - Deprecate kernel boot parameters (possibly keep pcie_aspm=off for
debugging use).
  - Use /sys/module/pcie_aspm/parameters/policy for run-time
system-wide  control, including for future hot-added devices.
  - Remove CONFIG_PCIEASPM_DEBUG and enable that code always, so we
have fine-grained run-time control.

> Maybe, we are missing a HPP option from the PCI spec.

That's an interesting idea.  _HPX does have provision for manipulating
Link Control bits (see ACPI r5.0, sec 6.2.8.3), but I don't think very
many systems implement it.  And there's currently no connection
between program_hpp_type2() and aspm.c, so I'm a little worried that
we might have issues if a system did implement an _HPX that sets any
of the ASPM bits.

Bjorn

Re: [PATCH 0/2] i2c: xgene-slimpro: Add ACPI support

2017-04-17 Thread Hoan Tran

Hi All,

Do you have any comments on this patch set?

Thanks

On Tue, Mar 28, 2017 at 1:46 PM, Hoan Tran  wrote:
> This patch set adds ACPI support by using PCC mailbox communication interface.
>
> Hoan Tran (2):
>   i2c: xgene-slimpro: Use a single function to send command message
>   i2c: xgene-slimpro: Add ACPI support by using PCC mailbox
>
>  drivers/i2c/busses/i2c-xgene-slimpro.c | 246 
> +
>  1 file changed, 192 insertions(+), 54 deletions(-)
>
> --
> 1.9.1
>

Re: [PATCH 0/2] i2c: xgene-slimpro: Add ACPI support

2017-04-17 Thread Hoan Tran

Hi All,

Do you have any comments on this patch set?

Thanks

On Tue, Mar 28, 2017 at 1:46 PM, Hoan Tran  wrote:
> This patch set adds ACPI support by using PCC mailbox communication interface.
>
> Hoan Tran (2):
>   i2c: xgene-slimpro: Use a single function to send command message
>   i2c: xgene-slimpro: Add ACPI support by using PCC mailbox
>
>  drivers/i2c/busses/i2c-xgene-slimpro.c | 246 
> +
>  1 file changed, 192 insertions(+), 54 deletions(-)
>
> --
> 1.9.1
>

Re: [PATCH v2 0/3] perf: xgene: Add support for SoC PMU version 3

2017-04-17 Thread Hoan Tran

Hi All,

Do you have any comments on this patch set?

Thanks

On Mon, Apr 3, 2017 at 9:47 AM, Hoan Tran  wrote:
> This patch set adds support for SoC-wide (AKA uncore) Performance Monitoring
> Unit version 3.
>
> It can support up to
>  - 2 IOB PMU instances
>  - 8 L3C PMU instances
>  - 2 MCB PMU instances
>  - 8 MCU PMU instances
>
> v2:
>  * Split into separate patches
>  * Use the function pointers for the PMU leaf functions
>  * Parse PMU subnode by the match table
>  * Dont allow user change agent id by config1 for SoC PMU v3
>
> Hoan Tran (3):
>   perf: xgene: Parse PMU subnode from the match table
>   perf: xgene: Move PMU leaf functions into function pointer structure
>   perf: xgene: Add support for SoC PMU version 3
>
>  drivers/perf/xgene_pmu.c | 695 
> +--
>  1 file changed, 619 insertions(+), 76 deletions(-)
>
> --
> 1.9.1
>

Re: [PATCH v2 0/3] perf: xgene: Add support for SoC PMU version 3

2017-04-17 Thread Hoan Tran

Hi All,

Do you have any comments on this patch set?

Thanks

On Mon, Apr 3, 2017 at 9:47 AM, Hoan Tran  wrote:
> This patch set adds support for SoC-wide (AKA uncore) Performance Monitoring
> Unit version 3.
>
> It can support up to
>  - 2 IOB PMU instances
>  - 8 L3C PMU instances
>  - 2 MCB PMU instances
>  - 8 MCU PMU instances
>
> v2:
>  * Split into separate patches
>  * Use the function pointers for the PMU leaf functions
>  * Parse PMU subnode by the match table
>  * Dont allow user change agent id by config1 for SoC PMU v3
>
> Hoan Tran (3):
>   perf: xgene: Parse PMU subnode from the match table
>   perf: xgene: Move PMU leaf functions into function pointer structure
>   perf: xgene: Add support for SoC PMU version 3
>
>  drivers/perf/xgene_pmu.c | 695 
> +--
>  1 file changed, 619 insertions(+), 76 deletions(-)
>
> --
> 1.9.1
>

[PATCH] drm/vc4: Fix refcounting of runtime PM get if it errors out.

2017-04-17 Thread Eric Anholt

We were returning without decrementing if the error happened, meaning
that at the next submit we wouldn't try to bring up the power domain.

Signed-off-by: Eric Anholt 
---

I stumbled across this error when testing my CMA patch with a very low
(64MB) CMA area -- we would oops on the submit after the one that
failed.

 drivers/gpu/drm/vc4/vc4_gem.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/vc4/vc4_gem.c b/drivers/gpu/drm/vc4/vc4_gem.c
index 777a8d9afd60..735412e3725a 100644
--- a/drivers/gpu/drm/vc4/vc4_gem.c
+++ b/drivers/gpu/drm/vc4/vc4_gem.c
@@ -1016,13 +1016,16 @@ vc4_submit_cl_ioctl(struct drm_device *dev, void *data,
}
 
mutex_lock(>power_lock);
-   if (vc4->power_refcount++ == 0)
+   if (vc4->power_refcount++ == 0) {
ret = pm_runtime_get_sync(>v3d->pdev->dev);
-   mutex_unlock(>power_lock);
-   if (ret < 0) {
-   kfree(exec);
-   return ret;
+   if (ret < 0) {
+   mutex_unlock(>power_lock);
+   vc4->power_refcount--;
+   kfree(exec);
+   return ret;
+   }
}
+   mutex_unlock(>power_lock);
 
exec->args = args;
INIT_LIST_HEAD(>unref_list);
-- 
2.11.0

[PATCH] drm/vc4: Fix refcounting of runtime PM get if it errors out.

2017-04-17 Thread Eric Anholt

We were returning without decrementing if the error happened, meaning
that at the next submit we wouldn't try to bring up the power domain.

Signed-off-by: Eric Anholt 
---

I stumbled across this error when testing my CMA patch with a very low
(64MB) CMA area -- we would oops on the submit after the one that
failed.

 drivers/gpu/drm/vc4/vc4_gem.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/vc4/vc4_gem.c b/drivers/gpu/drm/vc4/vc4_gem.c
index 777a8d9afd60..735412e3725a 100644
--- a/drivers/gpu/drm/vc4/vc4_gem.c
+++ b/drivers/gpu/drm/vc4/vc4_gem.c
@@ -1016,13 +1016,16 @@ vc4_submit_cl_ioctl(struct drm_device *dev, void *data,
}
 
mutex_lock(>power_lock);
-   if (vc4->power_refcount++ == 0)
+   if (vc4->power_refcount++ == 0) {
ret = pm_runtime_get_sync(>v3d->pdev->dev);
-   mutex_unlock(>power_lock);
-   if (ret < 0) {
-   kfree(exec);
-   return ret;
+   if (ret < 0) {
+   mutex_unlock(>power_lock);
+   vc4->power_refcount--;
+   kfree(exec);
+   return ret;
+   }
}
+   mutex_unlock(>power_lock);
 
exec->args = args;
INIT_LIST_HEAD(>unref_list);
-- 
2.11.0

Re: [PATCH 00/16] Intel FPGA Device Drivers

2017-04-17 Thread Alan Tull

On Mon, Apr 17, 2017 at 10:57 AM, Jerome Glisse  wrote:
> On Mon, Apr 17, 2017 at 10:35:01AM -0500, Alan Tull wrote:
>> On Fri, Apr 14, 2017 at 3:49 PM, Jerome Glisse  wrote:
>>
>> Hi Jerome,
>>
>> > On Fri, Apr 14, 2017 at 12:48:17PM -0700, Luebbers, Enno wrote:
>> >> On Wed, Apr 12, 2017 at 11:37:49AM -0400, Jerome Glisse wrote:
>> >> > On Wed, Apr 12, 2017 at 07:46:19AM -0700, Moritz Fischer wrote:
>> >> > > On Wed, Apr 12, 2017 at 6:29 AM, Jerome Glisse  
>> >> > > wrote:
>> >> > >
>> >> > > > It is like if on GPU we only had close source compiler for the GPU
>> >> > > > instructions set. So FPGA is definitly following different rules 
>> >> > > > than
>> >> > > > open source upstream GPU kernel driver abides to.
>>
>> Sorry, not a GPU guy, can you point me to something that documents
>> this policy of 'only opensource compilers for GPU'?  I looked under
>> linux/Documentation and didn't see anything.
>
> https://lists.freedesktop.org/archives/dri-devel/2010-July/001828.html

This starts out saying:

"Now this is just my opinion as maintainer of the drm, and doesn't
reflect anyone or any official policy"

> There is no explicit mention about compiler

You are right about that, there is no mention about compiler.

> but trust me it is included
> in everyones mind. You can ask Dave i am sure he would reject a driver
> with everything open except the shader compiler.

How would that work?  Before the GPU driver is accepted, an open
toolchain also needs to be submitted?

It's worth it to check out the responses since they not overwhelmingly
positive and tend to rather be outlining the complicating factors.
Many if not most say essentially that his stance was simplistic and
unproductive, slamming the door on the people from whom the solution
would come.  And keep in mind, this wasn't about what you've made it
out to be in the first place; this is about open/closed source GPU
drivers, not toolchains.

>
>
>> The current patchset doesn't have anything to do with FPGA toolchains
>> but you're using this patchset as a platform to talk about toolchain
>> issues.
>
> Well Intel inclusion of FPGA triggered my curiosity and when that patchset
> came accross my inbox i did wonder where the open source userspace was and
> went looking for it to no avail. So this isn't against a specific patchset
> but more broadly against the whole drivers/fpga/ story. Sorry if this was
> not clear.
>
>
>> It sounds like you are opposed to any kernel support of loading images
>> on FPGAs until all vendors have opensource toolchains.
>
> Yes that is what i am saying. They are different standard in the kernel
> and i would rather have one clear standard about driver needing proper
> open source userspace to go along with any upstream driver.

Deleting drivers/fpga wouldn't be a step forward to the openness you seek.

>
> Beside when it comes to FPGA i am still puzzle on why no one release info
> on the bitstream. They all provide details documentation on the internal
> (LUT, flip-flop, logic block layout and connection, memory block, ...).
> So there is nothing hidden in the bitstream. I am guessing the only good
> reason i can think of is to make it harder to map a bitstream back to
> VHDL/Verilog/...
>
> Cheers,
> Jérôme

Re: [PATCH 00/16] Intel FPGA Device Drivers

2017-04-17 Thread Alan Tull

On Mon, Apr 17, 2017 at 10:57 AM, Jerome Glisse  wrote:
> On Mon, Apr 17, 2017 at 10:35:01AM -0500, Alan Tull wrote:
>> On Fri, Apr 14, 2017 at 3:49 PM, Jerome Glisse  wrote:
>>
>> Hi Jerome,
>>
>> > On Fri, Apr 14, 2017 at 12:48:17PM -0700, Luebbers, Enno wrote:
>> >> On Wed, Apr 12, 2017 at 11:37:49AM -0400, Jerome Glisse wrote:
>> >> > On Wed, Apr 12, 2017 at 07:46:19AM -0700, Moritz Fischer wrote:
>> >> > > On Wed, Apr 12, 2017 at 6:29 AM, Jerome Glisse  
>> >> > > wrote:
>> >> > >
>> >> > > > It is like if on GPU we only had close source compiler for the GPU
>> >> > > > instructions set. So FPGA is definitly following different rules 
>> >> > > > than
>> >> > > > open source upstream GPU kernel driver abides to.
>>
>> Sorry, not a GPU guy, can you point me to something that documents
>> this policy of 'only opensource compilers for GPU'?  I looked under
>> linux/Documentation and didn't see anything.
>
> https://lists.freedesktop.org/archives/dri-devel/2010-July/001828.html

This starts out saying:

"Now this is just my opinion as maintainer of the drm, and doesn't
reflect anyone or any official policy"

> There is no explicit mention about compiler

You are right about that, there is no mention about compiler.

> but trust me it is included
> in everyones mind. You can ask Dave i am sure he would reject a driver
> with everything open except the shader compiler.

How would that work?  Before the GPU driver is accepted, an open
toolchain also needs to be submitted?

It's worth it to check out the responses since they not overwhelmingly
positive and tend to rather be outlining the complicating factors.
Many if not most say essentially that his stance was simplistic and
unproductive, slamming the door on the people from whom the solution
would come.  And keep in mind, this wasn't about what you've made it
out to be in the first place; this is about open/closed source GPU
drivers, not toolchains.

>
>
>> The current patchset doesn't have anything to do with FPGA toolchains
>> but you're using this patchset as a platform to talk about toolchain
>> issues.
>
> Well Intel inclusion of FPGA triggered my curiosity and when that patchset
> came accross my inbox i did wonder where the open source userspace was and
> went looking for it to no avail. So this isn't against a specific patchset
> but more broadly against the whole drivers/fpga/ story. Sorry if this was
> not clear.
>
>
>> It sounds like you are opposed to any kernel support of loading images
>> on FPGAs until all vendors have opensource toolchains.
>
> Yes that is what i am saying. They are different standard in the kernel
> and i would rather have one clear standard about driver needing proper
> open source userspace to go along with any upstream driver.

Deleting drivers/fpga wouldn't be a step forward to the openness you seek.

>
> Beside when it comes to FPGA i am still puzzle on why no one release info
> on the bitstream. They all provide details documentation on the internal
> (LUT, flip-flop, logic block layout and connection, memory block, ...).
> So there is nothing hidden in the bitstream. I am guessing the only good
> reason i can think of is to make it harder to map a bitstream back to
> VHDL/Verilog/...
>
> Cheers,
> Jérôme

Re: [kernel-hardening] Re: [PATCH 3/4] restrict unprivileged TIOCSTI tty ioctl

2017-04-17 Thread Matt Brown


On 04/17/2017 10:18 AM, Jann Horn wrote:

On Mon, Apr 17, 2017 at 8:53 AM, Greg KH  wrote:

On Mon, Apr 17, 2017 at 02:07:05AM -0400, Matt Brown wrote:

this patch depends on patch 1 and 2

enforces restrictions on unprivileged users injecting commands
into other processes in the same tty session using the TIOCSTI ioctl

Signed-off-by: Matt Brown 
---
 drivers/tty/tty_io.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index e6d1a65..31894e8 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2296,11 +2296,15 @@ static int tty_fasync(int fd, struct file *filp, int on)
  *   FIXME: may race normal receive processing
  */

+int tiocsti_restrict = IS_ENABLED(CONFIG_SECURITY_TIOCSTI_RESTRICT);
+
 static int tiocsti(struct tty_struct *tty, char __user *p)
 {
  char ch, mbz = 0;
  struct tty_ldisc *ld;

+ if (tiocsti_restrict && !capable(CAP_SYS_ADMIN))
+ return -EPERM;


So, what type of "normal" userspace operations did you just break here?
What type of "not normal" did you break/change?


Looking at
,
it looks like there are a couple users whose behavior would probably
change when run by unprivileged users - in particular agetty, csh, xemacs
and tcsh.



This is why I set this Kconfig to default to no.

This is also why I think this belongs under security and not tty. This
is more of a security feature than a tty feature.




Why tie this to CAP_SYS_ADMIN as well?  That wasn't listed in your
Kconfig help text.  This seems like an additional capabilities
dependancy that odds are, most people do not want...


  if ((current->signal->tty != tty) && !capable(CAP_SYS_ADMIN))
  return -EPERM;


I think CAP_SYS_ADMIN is a logical choice here. AFAIK
CAP_SYS_ADMIN is normally used for random functionality that permits
users to bypass normal access controls without constraints that clearly
map to existing capabilities. See seccomp() without NNP,
proc_dointvec_minmax_sysadmin() and so on.

I guess you could also argue for CAP_SYS_TTY_CONFIG, but I don't
think that fits well, given that this is more about using a TTY in a
privileged way than about configuring it.



I can update the Kconfig help text to mention the use of CAP_SYS_ADMIN.




And finally, why doesn't this original check handle what you want to do
already?

I don't understand your "threat model" you wish to address by this
change series, please be a lot more explicit in your patch changelog
descriptions.


While I don't know what Matt Brown's threat model is here, I like the
patch for the following reason:

For me, there are two usecases for having UIDs on a Linux
system. The first usecase is to isolate human users from each other.
The second usecase are service accounts, where the same human
administrator has access to multiple UIDs, each of which is used for
one specific service, reducing the impact of a compromise of a single
service.

In the "isolated services" usecase, the machine's administrator needs
to be able to access the various service accounts in some way. One
way that integrates well with existing tools is to log in as root, then
use su or sudo to run a shell with the service's UID with reduced
privileges.

The only issue I know of that makes this dangerous in the default
configuration is that the tty file descriptor becomes accessible to the
shell running under the service's UID, allowing the service to e.g.
use ptrace() or .bashrc or so to inject code into the service-privileged
shell, then abuse TIOCSTI to take over root's shell.

So, the reason the additional check is necessary is that there are
usecases where in practice, TTY file descriptors are shared over
privilege boundaries, and thereforce, access to the TTY fd should not
automatically grant the level of access that is normally only available
using a PTY master fd.

sudo can mitigate this using the use_pty config option, which creates
a new PTY and forwards I/O to and from that PTY, but this option is
off by default. I think the version of su that e.g. ships in Debian can't
even be configured to mitigate this.

In my opinion, while it's possible for system administrators or
programmers to avoid this pitfall if they know enough about how TTY
devices work, this behavior is highly unintuitive.

Also, quoting part of the help text for grsecurity's config option
GRKERNSEC_HARDEN_TTY:

| There are very few legitimate uses for this functionality and it
| has made vulnerabilities in several 'su'-like programs possible in
| the past.  Even without these vulnerabilities, it provides an
| attacker with an easy mechanism to move laterally among other
| processes within the same user's compromised session.

Basically, TIOCSTI permits exactly what Yama is designed to
prevent: It lets different processes in one session compromise
each other.


For context, in case someone in this thread

Re: [kernel-hardening] Re: [PATCH 3/4] restrict unprivileged TIOCSTI tty ioctl

2017-04-17 Thread Matt Brown


On 04/17/2017 10:18 AM, Jann Horn wrote:

On Mon, Apr 17, 2017 at 8:53 AM, Greg KH  wrote:

On Mon, Apr 17, 2017 at 02:07:05AM -0400, Matt Brown wrote:

this patch depends on patch 1 and 2

enforces restrictions on unprivileged users injecting commands
into other processes in the same tty session using the TIOCSTI ioctl

Signed-off-by: Matt Brown 
---
 drivers/tty/tty_io.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index e6d1a65..31894e8 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2296,11 +2296,15 @@ static int tty_fasync(int fd, struct file *filp, int on)
  *   FIXME: may race normal receive processing
  */

+int tiocsti_restrict = IS_ENABLED(CONFIG_SECURITY_TIOCSTI_RESTRICT);
+
 static int tiocsti(struct tty_struct *tty, char __user *p)
 {
  char ch, mbz = 0;
  struct tty_ldisc *ld;

+ if (tiocsti_restrict && !capable(CAP_SYS_ADMIN))
+ return -EPERM;


So, what type of "normal" userspace operations did you just break here?
What type of "not normal" did you break/change?


Looking at
,
it looks like there are a couple users whose behavior would probably
change when run by unprivileged users - in particular agetty, csh, xemacs
and tcsh.



This is why I set this Kconfig to default to no.

This is also why I think this belongs under security and not tty. This
is more of a security feature than a tty feature.




Why tie this to CAP_SYS_ADMIN as well?  That wasn't listed in your
Kconfig help text.  This seems like an additional capabilities
dependancy that odds are, most people do not want...


  if ((current->signal->tty != tty) && !capable(CAP_SYS_ADMIN))
  return -EPERM;


I think CAP_SYS_ADMIN is a logical choice here. AFAIK
CAP_SYS_ADMIN is normally used for random functionality that permits
users to bypass normal access controls without constraints that clearly
map to existing capabilities. See seccomp() without NNP,
proc_dointvec_minmax_sysadmin() and so on.

I guess you could also argue for CAP_SYS_TTY_CONFIG, but I don't
think that fits well, given that this is more about using a TTY in a
privileged way than about configuring it.



I can update the Kconfig help text to mention the use of CAP_SYS_ADMIN.




And finally, why doesn't this original check handle what you want to do
already?

I don't understand your "threat model" you wish to address by this
change series, please be a lot more explicit in your patch changelog
descriptions.


While I don't know what Matt Brown's threat model is here, I like the
patch for the following reason:

For me, there are two usecases for having UIDs on a Linux
system. The first usecase is to isolate human users from each other.
The second usecase are service accounts, where the same human
administrator has access to multiple UIDs, each of which is used for
one specific service, reducing the impact of a compromise of a single
service.

In the "isolated services" usecase, the machine's administrator needs
to be able to access the various service accounts in some way. One
way that integrates well with existing tools is to log in as root, then
use su or sudo to run a shell with the service's UID with reduced
privileges.

The only issue I know of that makes this dangerous in the default
configuration is that the tty file descriptor becomes accessible to the
shell running under the service's UID, allowing the service to e.g.
use ptrace() or .bashrc or so to inject code into the service-privileged
shell, then abuse TIOCSTI to take over root's shell.

So, the reason the additional check is necessary is that there are
usecases where in practice, TTY file descriptors are shared over
privilege boundaries, and thereforce, access to the TTY fd should not
automatically grant the level of access that is normally only available
using a PTY master fd.

sudo can mitigate this using the use_pty config option, which creates
a new PTY and forwards I/O to and from that PTY, but this option is
off by default. I think the version of su that e.g. ships in Debian can't
even be configured to mitigate this.

In my opinion, while it's possible for system administrators or
programmers to avoid this pitfall if they know enough about how TTY
devices work, this behavior is highly unintuitive.

Also, quoting part of the help text for grsecurity's config option
GRKERNSEC_HARDEN_TTY:

| There are very few legitimate uses for this functionality and it
| has made vulnerabilities in several 'su'-like programs possible in
| the past.  Even without these vulnerabilities, it provides an
| attacker with an easy mechanism to move laterally among other
| processes within the same user's compromised session.

Basically, TIOCSTI permits exactly what Yama is designed to
prevent: It lets different processes in one session compromise
each other.


For context, in case someone in this thread hasn't seen it yet:

Re: Using ion memory for direct-io

2017-04-17 Thread Laura Abbott

On 04/14/2017 02:18 AM, Zengtao (B) wrote:
> Hi 
> 
> Currently, the ion mapped to userspace will be forced with VM_IO and 
> VM_PFNMAP flags.
> When I use the ion memory to do the direct-io, it will fail when reaching the 
> get_user_pages,
> 
> Back to the VM_IO and VM_PFNMAP flag, there two flags are introduced by the 
> remap_pfn_range called 
> by the ion_heap_mmap_user.
> 
> From my point of view, all ion memory(cma/vmalloc/system heap) are managed by 
> linux vm, it
> is not reasonable to have the VM_IO and VM_PFNMAP flag, but I don't any 
> suitable function
> to replace the remap_pfn_range, any suggestions?
> 
> Thanks && Regards
> 
> Zengtao 
> 

The carveout heap is omitted from your list of 'all ion memory'. At one
time, carveout memory was not backed by struct pages so I suspect
this is a holdover from then. This would probably be better served
by using vm_insert_page and handling higher order pages properly.

Thanks,
Laura

Re: Using ion memory for direct-io

2017-04-17 Thread Laura Abbott

On 04/14/2017 02:18 AM, Zengtao (B) wrote:
> Hi 
> 
> Currently, the ion mapped to userspace will be forced with VM_IO and 
> VM_PFNMAP flags.
> When I use the ion memory to do the direct-io, it will fail when reaching the 
> get_user_pages,
> 
> Back to the VM_IO and VM_PFNMAP flag, there two flags are introduced by the 
> remap_pfn_range called 
> by the ion_heap_mmap_user.
> 
> From my point of view, all ion memory(cma/vmalloc/system heap) are managed by 
> linux vm, it
> is not reasonable to have the VM_IO and VM_PFNMAP flag, but I don't any 
> suitable function
> to replace the remap_pfn_range, any suggestions?
> 
> Thanks && Regards
> 
> Zengtao 
> 

The carveout heap is omitted from your list of 'all ion memory'. At one
time, carveout memory was not backed by struct pages so I suspect
this is a holdover from then. This would probably be better served
by using vm_insert_page and handling higher order pages properly.

Thanks,
Laura

Re: [PATCH 5/6] platform/x86: fujitsu-laptop: do not log LED registration failures

2017-04-17 Thread Darren Hart

On Fri, Apr 07, 2017 at 03:07:12PM +0200, Michał Kępień wrote:
> If acpi_fujitsu_laptop_leds_register() returns an error, the latter will
> become the return value of acpi_fujitsu_laptop_add(), which in turn will
> be reported by driver core.  Simplify code by replacing pr_err() calls
> with return statements.  Return 0 instead of result when no errors occur
> in order to make the code easier to read.

Hi Michał,

Jonathan's comment regarding the information loss of removing the pr_err
statements seems valid to me. Based on the outer if block, I take it each
registration only fails in true error scenarios and not because some laptop
might have one but not another LED in the list. If so, then the pr_err messages
would only appear when there was a legitimate problem. I think they're worth

This seems to introduce a behavior change as well. Previously only the last
LED registered would determine the result - which is wrong of course and I
believe you noted a related bug in an early patch. Previously, however, if
LOGOLAMP_POWERON failed, for example, the KEYBOARD_LAMPS would still be 
attempted.

So the question really comes down to this: Is there a legitimate situation in
which one LEDs registration fails and another succeeds? If so, then this would
constitute a regression for such systems.

> 
> Signed-off-by: Michał Kępień 
> ---
>  drivers/platform/x86/fujitsu-laptop.c | 16 ++--
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/platform/x86/fujitsu-laptop.c 
> b/drivers/platform/x86/fujitsu-laptop.c
> index ce658789e748..177b9b57ac2f 100644
> --- a/drivers/platform/x86/fujitsu-laptop.c
> +++ b/drivers/platform/x86/fujitsu-laptop.c
> @@ -739,22 +739,20 @@ static struct led_classdev eco_led = {
>  
>  static int acpi_fujitsu_laptop_leds_register(struct acpi_device *device)
>  {
> - int result = 0;
> + int result;
>  
>   if (call_fext_func(FUNC_LEDS, 0x0, 0x0, 0x0) & LOGOLAMP_POWERON) {
>   result = devm_led_classdev_register(>dev,
>   _led);
>   if (result)
> - pr_err("Could not register LED handler for logo lamp, 
> error %i\n",
> -result);
> + return result;
>   }
>  
>   if ((call_fext_func(FUNC_LEDS, 0x0, 0x0, 0x0) & KEYBOARD_LAMPS) &&
>   (call_fext_func(FUNC_BUTTONS, 0x0, 0x0, 0x0) == 0x0)) {
>   result = devm_led_classdev_register(>dev, _led);
>   if (result)
> - pr_err("Could not register LED handler for keyboard 
> lamps, error %i\n",
> -result);
> + return result;
>   }
>  
>   /*
> @@ -766,8 +764,7 @@ static int acpi_fujitsu_laptop_leds_register(struct 
> acpi_device *device)
>   if (call_fext_func(FUNC_BUTTONS, 0x0, 0x0, 0x0) & BIT(24)) {
>   result = devm_led_classdev_register(>dev, _led);
>   if (result)
> - pr_err("Could not register LED handler for radio LED, 
> error %i\n",
> -result);
> + return result;
>   }
>  
>   /* Support for eco led is not always signaled in bit corresponding
> @@ -779,11 +776,10 @@ static int acpi_fujitsu_laptop_leds_register(struct 
> acpi_device *device)
>   (call_fext_func(FUNC_LEDS, 0x2, ECO_LED, 0x0) != UNSUPPORTED_CMD)) {
>   result = devm_led_classdev_register(>dev, _led);
>   if (result)
> - pr_err("Could not register LED handler for eco LED, 
> error %i\n",
> -result);
> + return result;
>   }
>  
> - return result;
> + return 0;
>  }
>  
>  static int acpi_fujitsu_laptop_add(struct acpi_device *device)
> -- 
> 2.12.2
> 
> 

-- 
Darren Hart
VMware Open Source Technology Center

Re: [PATCH 5/6] platform/x86: fujitsu-laptop: do not log LED registration failures

2017-04-17 Thread Darren Hart

On Fri, Apr 07, 2017 at 03:07:12PM +0200, Michał Kępień wrote:
> If acpi_fujitsu_laptop_leds_register() returns an error, the latter will
> become the return value of acpi_fujitsu_laptop_add(), which in turn will
> be reported by driver core.  Simplify code by replacing pr_err() calls
> with return statements.  Return 0 instead of result when no errors occur
> in order to make the code easier to read.

Hi Michał,

Jonathan's comment regarding the information loss of removing the pr_err
statements seems valid to me. Based on the outer if block, I take it each
registration only fails in true error scenarios and not because some laptop
might have one but not another LED in the list. If so, then the pr_err messages
would only appear when there was a legitimate problem. I think they're worth

This seems to introduce a behavior change as well. Previously only the last
LED registered would determine the result - which is wrong of course and I
believe you noted a related bug in an early patch. Previously, however, if
LOGOLAMP_POWERON failed, for example, the KEYBOARD_LAMPS would still be 
attempted.

So the question really comes down to this: Is there a legitimate situation in
which one LEDs registration fails and another succeeds? If so, then this would
constitute a regression for such systems.

> 
> Signed-off-by: Michał Kępień 
> ---
>  drivers/platform/x86/fujitsu-laptop.c | 16 ++--
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/platform/x86/fujitsu-laptop.c 
> b/drivers/platform/x86/fujitsu-laptop.c
> index ce658789e748..177b9b57ac2f 100644
> --- a/drivers/platform/x86/fujitsu-laptop.c
> +++ b/drivers/platform/x86/fujitsu-laptop.c
> @@ -739,22 +739,20 @@ static struct led_classdev eco_led = {
>  
>  static int acpi_fujitsu_laptop_leds_register(struct acpi_device *device)
>  {
> - int result = 0;
> + int result;
>  
>   if (call_fext_func(FUNC_LEDS, 0x0, 0x0, 0x0) & LOGOLAMP_POWERON) {
>   result = devm_led_classdev_register(>dev,
>   _led);
>   if (result)
> - pr_err("Could not register LED handler for logo lamp, 
> error %i\n",
> -result);
> + return result;
>   }
>  
>   if ((call_fext_func(FUNC_LEDS, 0x0, 0x0, 0x0) & KEYBOARD_LAMPS) &&
>   (call_fext_func(FUNC_BUTTONS, 0x0, 0x0, 0x0) == 0x0)) {
>   result = devm_led_classdev_register(>dev, _led);
>   if (result)
> - pr_err("Could not register LED handler for keyboard 
> lamps, error %i\n",
> -result);
> + return result;
>   }
>  
>   /*
> @@ -766,8 +764,7 @@ static int acpi_fujitsu_laptop_leds_register(struct 
> acpi_device *device)
>   if (call_fext_func(FUNC_BUTTONS, 0x0, 0x0, 0x0) & BIT(24)) {
>   result = devm_led_classdev_register(>dev, _led);
>   if (result)
> - pr_err("Could not register LED handler for radio LED, 
> error %i\n",
> -result);
> + return result;
>   }
>  
>   /* Support for eco led is not always signaled in bit corresponding
> @@ -779,11 +776,10 @@ static int acpi_fujitsu_laptop_leds_register(struct 
> acpi_device *device)
>   (call_fext_func(FUNC_LEDS, 0x2, ECO_LED, 0x0) != UNSUPPORTED_CMD)) {
>   result = devm_led_classdev_register(>dev, _led);
>   if (result)
> - pr_err("Could not register LED handler for eco LED, 
> error %i\n",
> -result);
> + return result;
>   }
>  
> - return result;
> + return 0;
>  }
>  
>  static int acpi_fujitsu_laptop_add(struct acpi_device *device)
> -- 
> 2.12.2
> 
> 

-- 
Darren Hart
VMware Open Source Technology Center

Re: pinctrl: imx: Checking reuse of grouping functionality

2017-04-17 Thread SF Markus Elfring

> However the imx pinctrl driver _absolutely_ needs this configuration to work.

Thanks for your explanation that the file “drivers/pinctrl/freescale/Kconfig”
will usually provide needed information for this software module.

Would it make sense to document such a requirement also by a preprocessor error
directive in the source file “drivers/pinctrl/freescale/pinctrl-imx.c”
for the special case if the macro “CONFIG_GENERIC_PINCTRL_GROUPS” would
accidentally not be defined?

Regards,
Markus

Re: pinctrl: imx: Checking reuse of grouping functionality

2017-04-17 Thread SF Markus Elfring

> However the imx pinctrl driver _absolutely_ needs this configuration to work.

Thanks for your explanation that the file “drivers/pinctrl/freescale/Kconfig”
will usually provide needed information for this software module.

Would it make sense to document such a requirement also by a preprocessor error
directive in the source file “drivers/pinctrl/freescale/pinctrl-imx.c”
for the special case if the macro “CONFIG_GENERIC_PINCTRL_GROUPS” would
accidentally not be defined?

Regards,
Markus

Re: [RFC PATCH 9/9] debugfs: free debugfs_fsdata instances

2017-04-17 Thread Paul E. McKenney

On Sun, Apr 16, 2017 at 11:51:37AM +0200, Nicolai Stange wrote:
> Currently, a dentry's debugfs_fsdata instance is allocated from
> debugfs_file_get() at first usage, i.e. at first file opening.
> 
> It won't ever get freed though.
> 
> Ideally, these instances would get freed after the last open file handle
> gets closed again, that is from the fops' ->release().
> 
> Unfortunately, we can't hook easily into those ->release()ers of files
> created through debugfs_create_file_unsafe(), i.e. of those proxied through
> debugfs_open_proxy_file_operations rather than through the "full"
> debugfs_full_proxy_file_operations proxy.
> 
> Hence, free unreferenced debugfs_fsdata instances from debugfs_file_put(),
> with the drawback of a potential allocation + deallocation per
> debugfs_file_get() + debugfs_file_put() pair, that is per fops invocation.
> 
> In addition to its former role of tracking pending fops, use the
> ->active_users for reference counting on the debugfs_fsdata instance
> itself. In particular, don't keep a dummy reference to be dropped from
> __debugfs_remove_file(): a d_delete()ed dentry and thus, request for
> completion notification, is now signaled by the d_unlinked() dentry itself.
> 
> Once ->active_users drops to zero (and the dentry is still intact), free
> the debugfs_fsdata instance from debugfs_file_put(). RCU protects any
> concurrent debugfs_file_get() attempts to get a hold of the instance here.
> Likewise for full_proxy_release() which lacks a call to debugfs_file_get().
> 
> Note that due to non-atomic updates to the d_unlinked() + ->d_fsdata pair,
> care must be taken in order to avoid races between debugfs_file_put() and
> debugfs_file_get() as well as __debugfs_remove_file(). Rather than
> introducing a global lock, exploit the fact that there will ever be only a
> single !d_unlinked() -> d_unlinked() transition and add memory barriers
> where needed. Given the lack of proper benchmarking, that debugfs fops
> aren't performance critical and that we've already got a potential
> allocation/deallocation pair anyway, the added code complexity might be
> highly questionable though.
> 
> Signed-off-by: Nicolai Stange 

If you have not already done so, please run this with debug enabled,
especially CONFIG_PROVE_LOCKING=y (which implies CONFIG_PROVE_RCU=y).
This is important because there are configurations for which the deadlocks
you saw with SRCU turn into silent failure, including memory corruption.
CONFIG_PROVE_RCU=y will catch many of those situations.

(And yes, kfree_rcu() doesn't have that problem, but...)

Another issue called out inline.

Thanx, Paul

> ---
>  fs/debugfs/file.c | 102 
> --
>  fs/debugfs/inode.c|   8 +++-
>  fs/debugfs/internal.h |   1 +
>  3 files changed, 90 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
> index f4dfd7d0d625..b2cc25d44a39 100644
> --- a/fs/debugfs/file.c
> +++ b/fs/debugfs/file.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include "internal.h"
> @@ -78,10 +79,39 @@ int debugfs_file_get(struct dentry *dentry)
>   struct debugfs_fsdata *fsd;
>   void *d_fsd;
> 
> - d_fsd = READ_ONCE(dentry->d_fsdata);
> + rcu_read_lock();
> +retry:
> + d_fsd = rcu_dereference(dentry->d_fsdata);
>   if (!((unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)) {
> + /*
> +  * Paired with the control dependency in
> +  * debugfs_file_put(): if we saw the debugfs_fsdata
> +  * instance "restored" there but not the dead dentry,
> +  * we'd erroneously instantiate a fresh debugfs_fsdata
> +  * instance below.
> +  */
> + smp_rmb();
> + if (d_unlinked(dentry)) {
> + rcu_read_unlock();
> + return -EIO;
> + }
> +
>   fsd = d_fsd;
> + if (!refcount_inc_not_zero(>active_users)) {
> + /*
> +  * A concurrent debugfs_file_put() dropped the
> +  * count to zero and is about to free the
> +  * debugfs_fsdata. Help out resetting the
> +  * ->d_fsdata and retry.
> +  */
> + d_fsd = (void *)((unsigned long)fsd->real_fops |
> + DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
> + RCU_INIT_POINTER(dentry->d_fsdata, d_fsd);

This is an infrequent race, I hope?  If on the other hand there is
a possibility of this branch being taken a huge number of times in
one call, it would be good to exit the RCU read-side critical section
before retrying.

> + goto retry;
> + }
> + rcu_read_unlock();
>   } else {
> +

Re: [RFC PATCH 9/9] debugfs: free debugfs_fsdata instances

2017-04-17 Thread Paul E. McKenney

On Sun, Apr 16, 2017 at 11:51:37AM +0200, Nicolai Stange wrote:
> Currently, a dentry's debugfs_fsdata instance is allocated from
> debugfs_file_get() at first usage, i.e. at first file opening.
> 
> It won't ever get freed though.
> 
> Ideally, these instances would get freed after the last open file handle
> gets closed again, that is from the fops' ->release().
> 
> Unfortunately, we can't hook easily into those ->release()ers of files
> created through debugfs_create_file_unsafe(), i.e. of those proxied through
> debugfs_open_proxy_file_operations rather than through the "full"
> debugfs_full_proxy_file_operations proxy.
> 
> Hence, free unreferenced debugfs_fsdata instances from debugfs_file_put(),
> with the drawback of a potential allocation + deallocation per
> debugfs_file_get() + debugfs_file_put() pair, that is per fops invocation.
> 
> In addition to its former role of tracking pending fops, use the
> ->active_users for reference counting on the debugfs_fsdata instance
> itself. In particular, don't keep a dummy reference to be dropped from
> __debugfs_remove_file(): a d_delete()ed dentry and thus, request for
> completion notification, is now signaled by the d_unlinked() dentry itself.
> 
> Once ->active_users drops to zero (and the dentry is still intact), free
> the debugfs_fsdata instance from debugfs_file_put(). RCU protects any
> concurrent debugfs_file_get() attempts to get a hold of the instance here.
> Likewise for full_proxy_release() which lacks a call to debugfs_file_get().
> 
> Note that due to non-atomic updates to the d_unlinked() + ->d_fsdata pair,
> care must be taken in order to avoid races between debugfs_file_put() and
> debugfs_file_get() as well as __debugfs_remove_file(). Rather than
> introducing a global lock, exploit the fact that there will ever be only a
> single !d_unlinked() -> d_unlinked() transition and add memory barriers
> where needed. Given the lack of proper benchmarking, that debugfs fops
> aren't performance critical and that we've already got a potential
> allocation/deallocation pair anyway, the added code complexity might be
> highly questionable though.
> 
> Signed-off-by: Nicolai Stange 

If you have not already done so, please run this with debug enabled,
especially CONFIG_PROVE_LOCKING=y (which implies CONFIG_PROVE_RCU=y).
This is important because there are configurations for which the deadlocks
you saw with SRCU turn into silent failure, including memory corruption.
CONFIG_PROVE_RCU=y will catch many of those situations.

(And yes, kfree_rcu() doesn't have that problem, but...)

Another issue called out inline.

Thanx, Paul

> ---
>  fs/debugfs/file.c | 102 
> --
>  fs/debugfs/inode.c|   8 +++-
>  fs/debugfs/internal.h |   1 +
>  3 files changed, 90 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
> index f4dfd7d0d625..b2cc25d44a39 100644
> --- a/fs/debugfs/file.c
> +++ b/fs/debugfs/file.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include "internal.h"
> @@ -78,10 +79,39 @@ int debugfs_file_get(struct dentry *dentry)
>   struct debugfs_fsdata *fsd;
>   void *d_fsd;
> 
> - d_fsd = READ_ONCE(dentry->d_fsdata);
> + rcu_read_lock();
> +retry:
> + d_fsd = rcu_dereference(dentry->d_fsdata);
>   if (!((unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)) {
> + /*
> +  * Paired with the control dependency in
> +  * debugfs_file_put(): if we saw the debugfs_fsdata
> +  * instance "restored" there but not the dead dentry,
> +  * we'd erroneously instantiate a fresh debugfs_fsdata
> +  * instance below.
> +  */
> + smp_rmb();
> + if (d_unlinked(dentry)) {
> + rcu_read_unlock();
> + return -EIO;
> + }
> +
>   fsd = d_fsd;
> + if (!refcount_inc_not_zero(>active_users)) {
> + /*
> +  * A concurrent debugfs_file_put() dropped the
> +  * count to zero and is about to free the
> +  * debugfs_fsdata. Help out resetting the
> +  * ->d_fsdata and retry.
> +  */
> + d_fsd = (void *)((unsigned long)fsd->real_fops |
> + DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
> + RCU_INIT_POINTER(dentry->d_fsdata, d_fsd);

This is an infrequent race, I hope?  If on the other hand there is
a possibility of this branch being taken a huge number of times in
one call, it would be good to exit the RCU read-side critical section
before retrying.

> + goto retry;
> + }
> + rcu_read_unlock();
>   } else {
> + rcu_read_unlock();
>

Re: [PATCH 00/16] Intel FPGA Device Drivers

2017-04-17 Thread Jerome Glisse

On Mon, Apr 17, 2017 at 10:35:01AM -0500, Alan Tull wrote:
> On Fri, Apr 14, 2017 at 3:49 PM, Jerome Glisse  wrote:
> 
> Hi Jerome,
> 
> > On Fri, Apr 14, 2017 at 12:48:17PM -0700, Luebbers, Enno wrote:
> >> On Wed, Apr 12, 2017 at 11:37:49AM -0400, Jerome Glisse wrote:
> >> > On Wed, Apr 12, 2017 at 07:46:19AM -0700, Moritz Fischer wrote:
> >> > > On Wed, Apr 12, 2017 at 6:29 AM, Jerome Glisse  
> >> > > wrote:
> >> > >
> >> > > > It is like if on GPU we only had close source compiler for the GPU
> >> > > > instructions set. So FPGA is definitly following different rules than
> >> > > > open source upstream GPU kernel driver abides to.
> 
> Sorry, not a GPU guy, can you point me to something that documents
> this policy of 'only opensource compilers for GPU'?  I looked under
> linux/Documentation and didn't see anything.

https://lists.freedesktop.org/archives/dri-devel/2010-July/001828.html

There is no explicit mention about compiler but trust me it is included
in everyones mind. You can ask Dave i am sure he would reject a driver
with everything open except the shader compiler.

> The current patchset doesn't have anything to do with FPGA toolchains
> but you're using this patchset as a platform to talk about toolchain
> issues.

Well Intel inclusion of FPGA triggered my curiosity and when that patchset
came accross my inbox i did wonder where the open source userspace was and
went looking for it to no avail. So this isn't against a specific patchset
but more broadly against the whole drivers/fpga/ story. Sorry if this was
not clear.

> It sounds like you are opposed to any kernel support of loading images
> on FPGAs until all vendors have opensource toolchains.

Yes that is what i am saying. They are different standard in the kernel
and i would rather have one clear standard about driver needing proper
open source userspace to go along with any upstream driver.

Beside when it comes to FPGA i am still puzzle on why no one release info
on the bitstream. They all provide details documentation on the internal
(LUT, flip-flop, logic block layout and connection, memory block, ...).
So there is nothing hidden in the bitstream. I am guessing the only good
reason i can think of is to make it harder to map a bitstream back to
VHDL/Verilog/...

Cheers,
Jérôme

Re: [PATCH 00/16] Intel FPGA Device Drivers

2017-04-17 Thread Jerome Glisse

On Mon, Apr 17, 2017 at 10:35:01AM -0500, Alan Tull wrote:
> On Fri, Apr 14, 2017 at 3:49 PM, Jerome Glisse  wrote:
> 
> Hi Jerome,
> 
> > On Fri, Apr 14, 2017 at 12:48:17PM -0700, Luebbers, Enno wrote:
> >> On Wed, Apr 12, 2017 at 11:37:49AM -0400, Jerome Glisse wrote:
> >> > On Wed, Apr 12, 2017 at 07:46:19AM -0700, Moritz Fischer wrote:
> >> > > On Wed, Apr 12, 2017 at 6:29 AM, Jerome Glisse  
> >> > > wrote:
> >> > >
> >> > > > It is like if on GPU we only had close source compiler for the GPU
> >> > > > instructions set. So FPGA is definitly following different rules than
> >> > > > open source upstream GPU kernel driver abides to.
> 
> Sorry, not a GPU guy, can you point me to something that documents
> this policy of 'only opensource compilers for GPU'?  I looked under
> linux/Documentation and didn't see anything.

https://lists.freedesktop.org/archives/dri-devel/2010-July/001828.html

There is no explicit mention about compiler but trust me it is included
in everyones mind. You can ask Dave i am sure he would reject a driver
with everything open except the shader compiler.

> The current patchset doesn't have anything to do with FPGA toolchains
> but you're using this patchset as a platform to talk about toolchain
> issues.

Well Intel inclusion of FPGA triggered my curiosity and when that patchset
came accross my inbox i did wonder where the open source userspace was and
went looking for it to no avail. So this isn't against a specific patchset
but more broadly against the whole drivers/fpga/ story. Sorry if this was
not clear.

> It sounds like you are opposed to any kernel support of loading images
> on FPGAs until all vendors have opensource toolchains.

Yes that is what i am saying. They are different standard in the kernel
and i would rather have one clear standard about driver needing proper
open source userspace to go along with any upstream driver.

Beside when it comes to FPGA i am still puzzle on why no one release info
on the bitstream. They all provide details documentation on the internal
(LUT, flip-flop, logic block layout and connection, memory block, ...).
So there is nothing hidden in the bitstream. I am guessing the only good
reason i can think of is to make it harder to map a bitstream back to
VHDL/Verilog/...

Cheers,
Jérôme

Re: [PATCH] ACPICA: Export mutex functions

2017-04-17 Thread Guenter Roeck

Hi,

On Mon, Apr 17, 2017 at 09:39:35AM +, Zheng, Lv wrote:
> Hi,
> 
> > From: Guenter Roeck [mailto:li...@roeck-us.net]
> > Subject: Re: [PATCH] ACPICA: Export mutex functions
> > 
> > On Wed, Apr 12, 2017 at 03:29:55PM +, Moore, Robert wrote:
> > > The ACPICA mutex functions are based on the host OS functions, so they 
> > > don't really buy you anything.
> > You should just use the native Linux functions.
> > >
> > 
> > You mean they don't really acquire the requested ACPI mutex,
> > and the underlying DSDT which declares and uses the mutex
> > just ignores if the mutex was acquired by acpi_acquire_mutex() ?
> > 
> > To clarify: You are saying that code such as
> > 
> > acpi_status status;
> > 
> > status = acpi_acquire_mutex(NULL, "\\_SB.PCI0.SBRG.SIO1.MUT0", 0x10);
> > if (ACPI_FAILURE(status)) {
> > pr_err("Failed to acquire ACPI mutex\n");
> > return -EBUSY;
> > }
> 
> Why do you need to access \_SB.PCI0.SBRG.SIO1.MUT0?
> OSPM should only invoke entry methods predefined by ACPI spec or whatever 
> specs.
> There shouldn't be any needs that a driver acquires an arbitrary AML mutex.
> You do not seem to have justified the usage model, IMO.
> 

I am sorry, I have no idea how to do that. I can see that the resource in
question (IO address 0x2e/0x2f) is accessed from the DSDT, that the resource
is mutex protected, and that accesses to the same IO address from the Linux
kernel are unreliable unless I acquire the mutex in question. At the same time,
I can see that request_muxed_region() succeeds, so presumably ACPI does not
reserve the region for its exclusive use.

It may well be that the "official" response to this problem is "you must
not instantiate a watchdog, environmental monitor, or gpio driver (or anything
else provided by the Super-IO chip that requires access to those ports) on this
platform in Linux". Is that what you are suggesting ?

Thanks,
Guenter

Re: [PATCH] ACPICA: Export mutex functions

2017-04-17 Thread Guenter Roeck

Hi,

On Mon, Apr 17, 2017 at 09:39:35AM +, Zheng, Lv wrote:
> Hi,
> 
> > From: Guenter Roeck [mailto:li...@roeck-us.net]
> > Subject: Re: [PATCH] ACPICA: Export mutex functions
> > 
> > On Wed, Apr 12, 2017 at 03:29:55PM +, Moore, Robert wrote:
> > > The ACPICA mutex functions are based on the host OS functions, so they 
> > > don't really buy you anything.
> > You should just use the native Linux functions.
> > >
> > 
> > You mean they don't really acquire the requested ACPI mutex,
> > and the underlying DSDT which declares and uses the mutex
> > just ignores if the mutex was acquired by acpi_acquire_mutex() ?
> > 
> > To clarify: You are saying that code such as
> > 
> > acpi_status status;
> > 
> > status = acpi_acquire_mutex(NULL, "\\_SB.PCI0.SBRG.SIO1.MUT0", 0x10);
> > if (ACPI_FAILURE(status)) {
> > pr_err("Failed to acquire ACPI mutex\n");
> > return -EBUSY;
> > }
> 
> Why do you need to access \_SB.PCI0.SBRG.SIO1.MUT0?
> OSPM should only invoke entry methods predefined by ACPI spec or whatever 
> specs.
> There shouldn't be any needs that a driver acquires an arbitrary AML mutex.
> You do not seem to have justified the usage model, IMO.
> 

I am sorry, I have no idea how to do that. I can see that the resource in
question (IO address 0x2e/0x2f) is accessed from the DSDT, that the resource
is mutex protected, and that accesses to the same IO address from the Linux
kernel are unreliable unless I acquire the mutex in question. At the same time,
I can see that request_muxed_region() succeeds, so presumably ACPI does not
reserve the region for its exclusive use.

It may well be that the "official" response to this problem is "you must
not instantiate a watchdog, environmental monitor, or gpio driver (or anything
else provided by the Super-IO chip that requires access to those ports) on this
platform in Linux". Is that what you are suggesting ?

Thanks,
Guenter

[PATCH -mm] fault-inject: don't convert unsigned type value as signed int

2017-04-17 Thread Akinobu Mita

This fixes fault-inject-simplify-access-check-for-fail-nth.patch in -mm
tree which by mistake partially reverts the change by fault-inject-
parse-as-natural-1-based-value-for-fail-nth-write-interface.patch.

Cc: Dmitry Vyukov 
Reported-by: Dmitry Vyukov 
Signed-off-by: Akinobu Mita 
---
 fs/proc/base.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ea1039d..1f5139e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1366,7 +1366,7 @@ static ssize_t proc_fail_nth_write(struct file *file, 
const char __user *buf,
int err;
unsigned int n;
 
-   err = kstrtoint_from_user(buf, count, 0, );
+   err = kstrtouint_from_user(buf, count, 0, );
if (err)
return err;
 
-- 
2.7.4

[PATCH -mm] fault-inject: don't convert unsigned type value as signed int

2017-04-17 Thread Akinobu Mita

This fixes fault-inject-simplify-access-check-for-fail-nth.patch in -mm
tree which by mistake partially reverts the change by fault-inject-
parse-as-natural-1-based-value-for-fail-nth-write-interface.patch.

Cc: Dmitry Vyukov 
Reported-by: Dmitry Vyukov 
Signed-off-by: Akinobu Mita 
---
 fs/proc/base.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ea1039d..1f5139e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1366,7 +1366,7 @@ static ssize_t proc_fail_nth_write(struct file *file, 
const char __user *buf,
int err;
unsigned int n;
 
-   err = kstrtoint_from_user(buf, count, 0, );
+   err = kstrtouint_from_user(buf, count, 0, );
if (err)
return err;
 
-- 
2.7.4

Re: [PATCH] ubifs: Fix O_TMPFILE corner case in ubifs_link()

2017-04-17 Thread Richard Weinberger

Am 17.04.2017 um 17:27 schrieb Ralph Sennhauser:
> Hi Amir, Richard
> 
> Looks like the fix didn't make it into 4.11-rc7 either, isn't it time
> to just disable O_TMPFILE support in ubifs for now? Giving plenty
> time for the proper fix.

As I said to Amir, don't panic. I'm *very* busy right now with non-computer 
stuff.
My first fix is correct, I was wrong about the testing, it does the right thing,
although it was not so clear.
Except that my fix misses one error case which I wanted to push tomorrow since 
today
is a official holiday here in Austria.
For the next merge window I have a better approach which is better than 
(ab)using
ubifs_jnl_update() for O_TMPFILE.

Thanks,
//richard

Re: [PATCH] ubifs: Fix O_TMPFILE corner case in ubifs_link()

2017-04-17 Thread Richard Weinberger

Am 17.04.2017 um 17:27 schrieb Ralph Sennhauser:
> Hi Amir, Richard
> 
> Looks like the fix didn't make it into 4.11-rc7 either, isn't it time
> to just disable O_TMPFILE support in ubifs for now? Giving plenty
> time for the proper fix.

As I said to Amir, don't panic. I'm *very* busy right now with non-computer 
stuff.
My first fix is correct, I was wrong about the testing, it does the right thing,
although it was not so clear.
Except that my fix misses one error case which I wanted to push tomorrow since 
today
is a official holiday here in Austria.
For the next merge window I have a better approach which is better than 
(ab)using
ubifs_jnl_update() for O_TMPFILE.

Thanks,
//richard

Re: [PATCH] pci-sysfs: Make PCI bridge attribute visible in sysfs

2017-04-17 Thread Bjorn Helgaas

On Mon, Apr 17, 2017 at 12:50 AM, Wong Vee Khee  wrote:
> From: vwong 
>
> Export the PCIe link attributes of PCI bridges to sysfs.

This needs justification for *why* we should export these via sysfs.

Some of these things, e.g., secondary/subordinate bus numbers, are
already visible to non-privileged users via "lspci -v".

> Signed-off-by: Wong Vee Khee 
> Signed-off-by: Hui Chun Ong 
> ---
>  drivers/pci/pci-sysfs.c   | 197 
> +-
>  include/uapi/linux/pci_regs.h |   4 +
>  2 files changed, 197 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 25d010d..a218c43 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -154,6 +154,127 @@ static ssize_t resource_show(struct device *dev, struct 
> device_attribute *attr,
>  }
>  static DEVICE_ATTR_RO(resource);
>
> +static ssize_t max_link_speed_show(struct device *dev,
> +  struct device_attribute *attr, char *buf)
> +{
> +   struct pci_dev *pci_dev = to_pci_dev(dev);
> +   u32 linkcap;
> +   int err;
> +   const char *speed;
> +
> +   err = pcie_capability_read_dword(pci_dev, PCI_EXP_LNKCAP, );
> +
> +   if (!err && linkcap) {

  if (err)
return -EINVAL;

I don't think there's a reason to test "linkcap" here.  Per spec, zero
is not a valid value of LNKCAP, so if we got a zero, I think showing
"Unknown speed" would be fine.

> +   switch (linkcap & PCI_EXP_LNKCAP_MLS) {
> +   case PCI_EXP_LNKCAP_MLS_8_0GB:
> +   speed = "8 GT/s";
> +   break;
> +   case PCI_EXP_LNKCAP_MLS_5_0GB:
> +   speed = "5 GT/s";
> +   break;
> +   case PCI_EXP_LNKCAP_MLS_2_5GB:
> +   speed = "2.5 GT/s";
> +   break;
> +   default:
> +   speed = "Unknown speed";
> +   }
> +
> +   return sprintf(buf, "%s\n", speed);
> +   }
> +
> +   return -EINVAL;
> +}
> +static DEVICE_ATTR_RO(max_link_speed);
> +
> +static ssize_t max_link_width_show(struct device *dev,
> +  struct device_attribute *attr, char *buf)
> +{
> +   struct pci_dev *pci_dev = to_pci_dev(dev);
> +   u32 linkcap;
> +   int err;
> +
> +   err = pcie_capability_read_dword(pci_dev, PCI_EXP_LNKCAP, );
> +
> +   return err ? -EINVAL : sprintf(
> +   buf, "%u\n", (linkcap & PCI_EXP_LNKCAP_MLW) >> 4);

  if (err)
return -EINVAL;

  return sprintf(...);

> +}
> +static DEVICE_ATTR_RO(max_link_width);
> +
> +static ssize_t current_link_speed_show(struct device *dev,
> +  struct device_attribute *attr, char 
> *buf)
> +{
> +   struct pci_dev *pci_dev = to_pci_dev(dev);
> +   u16 linkstat;
> +   int err;
> +   const char *speed;
> +
> +   err = pcie_capability_read_word(pci_dev, PCI_EXP_LNKSTA, );
> +
> +   if (!err && linkstat) {

See max_link_speed above.

> +   switch (linkstat & PCI_EXP_LNKSTA_CLS) {
> +   case PCI_EXP_LNKSTA_CLS_8_0GB:
> +   speed = "8 GT/s";
> +   break;
> +   case PCI_EXP_LNKSTA_CLS_5_0GB:
> +   speed = "5 GT/s";
> +   break;
> +   case PCI_EXP_LNKSTA_CLS_2_5GB:
> +   speed = "2.5 GT/s";
> +   break;
> +   default:
> +   speed = "Unknown speed";
> +   }
> +
> +   return sprintf(buf, "%s\n", speed);
> +   }
> +
> +   return -EINVAL;
> +}
> +static DEVICE_ATTR_RO(current_link_speed);
> +
> +static ssize_t current_link_width_show(struct device *dev,
> +  struct device_attribute *attr, char 
> *buf)
> +{
> +   struct pci_dev *pci_dev = to_pci_dev(dev);
> +   u16 linkstat;
> +   int err;
> +
> +   err = pcie_capability_read_word(pci_dev, PCI_EXP_LNKSTA, );
> +
> +   return err ? -EINVAL : sprintf(
> +   buf, "%u\n",
> +   (linkstat & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT);
> +}
> +static DEVICE_ATTR_RO(current_link_width);
> +
> +static ssize_t secondary_bus_number_show(struct device *dev,
> +struct device_attribute *attr,
> +char *buf)
> +{
> +   struct pci_dev *pci_dev = to_pci_dev(dev);
> +   u8 sec_bus;
> +   int err;
> +
> +   err = pci_read_config_byte(pci_dev, PCI_SECONDARY_BUS, _bus);

There is already a /sys/devices/pci:BB/:BB:dd.f//
directory and a .../pci_bus/ directory that looks like it is the
secondary bus.  Is that enough?

If we also need this file, I would like it to do

Re: [PATCH] pci-sysfs: Make PCI bridge attribute visible in sysfs

2017-04-17 Thread Bjorn Helgaas

On Mon, Apr 17, 2017 at 12:50 AM, Wong Vee Khee  wrote:
> From: vwong 
>
> Export the PCIe link attributes of PCI bridges to sysfs.

This needs justification for *why* we should export these via sysfs.

Some of these things, e.g., secondary/subordinate bus numbers, are
already visible to non-privileged users via "lspci -v".

> Signed-off-by: Wong Vee Khee 
> Signed-off-by: Hui Chun Ong 
> ---
>  drivers/pci/pci-sysfs.c   | 197 
> +-
>  include/uapi/linux/pci_regs.h |   4 +
>  2 files changed, 197 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 25d010d..a218c43 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -154,6 +154,127 @@ static ssize_t resource_show(struct device *dev, struct 
> device_attribute *attr,
>  }
>  static DEVICE_ATTR_RO(resource);
>
> +static ssize_t max_link_speed_show(struct device *dev,
> +  struct device_attribute *attr, char *buf)
> +{
> +   struct pci_dev *pci_dev = to_pci_dev(dev);
> +   u32 linkcap;
> +   int err;
> +   const char *speed;
> +
> +   err = pcie_capability_read_dword(pci_dev, PCI_EXP_LNKCAP, );
> +
> +   if (!err && linkcap) {

  if (err)
return -EINVAL;

I don't think there's a reason to test "linkcap" here.  Per spec, zero
is not a valid value of LNKCAP, so if we got a zero, I think showing
"Unknown speed" would be fine.

> +   switch (linkcap & PCI_EXP_LNKCAP_MLS) {
> +   case PCI_EXP_LNKCAP_MLS_8_0GB:
> +   speed = "8 GT/s";
> +   break;
> +   case PCI_EXP_LNKCAP_MLS_5_0GB:
> +   speed = "5 GT/s";
> +   break;
> +   case PCI_EXP_LNKCAP_MLS_2_5GB:
> +   speed = "2.5 GT/s";
> +   break;
> +   default:
> +   speed = "Unknown speed";
> +   }
> +
> +   return sprintf(buf, "%s\n", speed);
> +   }
> +
> +   return -EINVAL;
> +}
> +static DEVICE_ATTR_RO(max_link_speed);
> +
> +static ssize_t max_link_width_show(struct device *dev,
> +  struct device_attribute *attr, char *buf)
> +{
> +   struct pci_dev *pci_dev = to_pci_dev(dev);
> +   u32 linkcap;
> +   int err;
> +
> +   err = pcie_capability_read_dword(pci_dev, PCI_EXP_LNKCAP, );
> +
> +   return err ? -EINVAL : sprintf(
> +   buf, "%u\n", (linkcap & PCI_EXP_LNKCAP_MLW) >> 4);

  if (err)
return -EINVAL;

  return sprintf(...);

> +}
> +static DEVICE_ATTR_RO(max_link_width);
> +
> +static ssize_t current_link_speed_show(struct device *dev,
> +  struct device_attribute *attr, char 
> *buf)
> +{
> +   struct pci_dev *pci_dev = to_pci_dev(dev);
> +   u16 linkstat;
> +   int err;
> +   const char *speed;
> +
> +   err = pcie_capability_read_word(pci_dev, PCI_EXP_LNKSTA, );
> +
> +   if (!err && linkstat) {

See max_link_speed above.

> +   switch (linkstat & PCI_EXP_LNKSTA_CLS) {
> +   case PCI_EXP_LNKSTA_CLS_8_0GB:
> +   speed = "8 GT/s";
> +   break;
> +   case PCI_EXP_LNKSTA_CLS_5_0GB:
> +   speed = "5 GT/s";
> +   break;
> +   case PCI_EXP_LNKSTA_CLS_2_5GB:
> +   speed = "2.5 GT/s";
> +   break;
> +   default:
> +   speed = "Unknown speed";
> +   }
> +
> +   return sprintf(buf, "%s\n", speed);
> +   }
> +
> +   return -EINVAL;
> +}
> +static DEVICE_ATTR_RO(current_link_speed);
> +
> +static ssize_t current_link_width_show(struct device *dev,
> +  struct device_attribute *attr, char 
> *buf)
> +{
> +   struct pci_dev *pci_dev = to_pci_dev(dev);
> +   u16 linkstat;
> +   int err;
> +
> +   err = pcie_capability_read_word(pci_dev, PCI_EXP_LNKSTA, );
> +
> +   return err ? -EINVAL : sprintf(
> +   buf, "%u\n",
> +   (linkstat & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT);
> +}
> +static DEVICE_ATTR_RO(current_link_width);
> +
> +static ssize_t secondary_bus_number_show(struct device *dev,
> +struct device_attribute *attr,
> +char *buf)
> +{
> +   struct pci_dev *pci_dev = to_pci_dev(dev);
> +   u8 sec_bus;
> +   int err;
> +
> +   err = pci_read_config_byte(pci_dev, PCI_SECONDARY_BUS, _bus);

There is already a /sys/devices/pci:BB/:BB:dd.f//
directory and a .../pci_bus/ directory that looks like it is the
secondary bus.  Is that enough?

If we also need this file, I would like it to do something sensible
when there is no secondary bus.  Maybe that is exposing the bus

Re: [PATCH] x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space

2017-04-17 Thread Dave Hansen

Hi Joerg,

> When this function fails it just sends a SIGSEGV signal to
> user-space using force_sig(). This signal is missing
> essential information about the cause, e.g. the trap_nr or
> an error code.
>
> Fix this by propagating the error to the only caller of
> mpx_handle_bd_fault(), do_bounds(), which sends the correct
> SIGSEGV signal to the process.

Just to be clear, the thing you're calling "correct" is this do_trap(),
right?

do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);

> Fixes: fe3d197f84319 ('x86, mpx: On-demand kernel allocation of bounds
tables')
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/mm/mpx.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
>
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae7..1c34b76 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -526,15 +526,7 @@ int mpx_handle_bd_fault(void)
>   if (!kernel_managing_mpx_tables(current->mm))
>   return -EINVAL;
>
> - if (do_mpx_bt_fault()) {
> - force_sig(SIGSEGV, current);
> - /*
> -  * The force_sig() is essentially "handling" this
> -  * exception, so we do not pass up the error
> -  * from do_mpx_bt_fault().
> -  */
> - }
> - return 0;
> + return do_mpx_bt_fault();
>  }

do_mpx_bt_fault() can fail for a bunch of reasons:
 * unexpected or invalid value in BNDCSR
 * out of memory (physical or virtual)
 * unresolvable fault walking/filling bounds tables
 * !valid and non-empty bad entry in the bounds tables

This will end up sending a signal that *looks* like a X86_TRAP_BR for
all of those, including those that are not really bounds-related, like
unresolvable faults.  We also don't populate enough information in the
siginfo that gets delivered for userspace to resolve the fault.

I'm not sure this patch is the right thing.

Re: [PATCH] x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space

2017-04-17 Thread Dave Hansen

Hi Joerg,

> When this function fails it just sends a SIGSEGV signal to
> user-space using force_sig(). This signal is missing
> essential information about the cause, e.g. the trap_nr or
> an error code.
>
> Fix this by propagating the error to the only caller of
> mpx_handle_bd_fault(), do_bounds(), which sends the correct
> SIGSEGV signal to the process.

Just to be clear, the thing you're calling "correct" is this do_trap(),
right?

do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);

> Fixes: fe3d197f84319 ('x86, mpx: On-demand kernel allocation of bounds
tables')
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/mm/mpx.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
>
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index cd44ae7..1c34b76 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -526,15 +526,7 @@ int mpx_handle_bd_fault(void)
>   if (!kernel_managing_mpx_tables(current->mm))
>   return -EINVAL;
>
> - if (do_mpx_bt_fault()) {
> - force_sig(SIGSEGV, current);
> - /*
> -  * The force_sig() is essentially "handling" this
> -  * exception, so we do not pass up the error
> -  * from do_mpx_bt_fault().
> -  */
> - }
> - return 0;
> + return do_mpx_bt_fault();
>  }

do_mpx_bt_fault() can fail for a bunch of reasons:
 * unexpected or invalid value in BNDCSR
 * out of memory (physical or virtual)
 * unresolvable fault walking/filling bounds tables
 * !valid and non-empty bad entry in the bounds tables

This will end up sending a signal that *looks* like a X86_TRAP_BR for
all of those, including those that are not really bounds-related, like
unresolvable faults.  We also don't populate enough information in the
siginfo that gets delivered for userspace to resolve the fault.

I'm not sure this patch is the right thing.

Re: [PATCH] backlight: pwm_bl: Fix condition to set enable gpio as output

2017-04-17 Thread Paul Kocialkowski

Hi,

Le dimanche 16 avril 2017 à 22:55 +0200, Geert Uytterhoeven a écrit :
> On Sun, Apr 16, 2017 at 12:35 PM, Paul Kocialkowski  wrote:
> > The move to a dedicated pwm_backlight_initial_power_state function in
> > commit 7613c922315e308a6486d802abed2eb74443dffd modified the condition
> > to set the enable gpio as output. This breaks specific use cases using
> > that GPIO, such as tegra124-based nyan Chromebooks where backlight
> > stopped working.
> > 
> > This puts the condition back to the way it was before the move.
> > 
> > Signed-off-by: Paul Kocialkowski 
> 
> Does "[PATCH v2] backlight: pwm_bl: Fix GPIO out for unimplemented
> .get_direction()"
> (https://lkml.org/lkml/2017/4/4/225) fix your issue?

This definitely solves the issue!

Since this fixes broken backlight (that, in turn, makes systems unusable),
perhaps this should be included in the next rc cycle?

Cheers!

-- 
Paul Kocialkowski, developer of free digital technology at the lower levels

Website: https://www.paulk.fr/
Coding blog: https://code.paulk.fr/
Git repositories: https://git.paulk.fr/ https://git.code.paulk.fr/

signature.asc
Description: This is a digitally signed message part

Re: [PATCH] backlight: pwm_bl: Fix condition to set enable gpio as output

2017-04-17 Thread Paul Kocialkowski

Hi,

Le dimanche 16 avril 2017 à 22:55 +0200, Geert Uytterhoeven a écrit :
> On Sun, Apr 16, 2017 at 12:35 PM, Paul Kocialkowski  wrote:
> > The move to a dedicated pwm_backlight_initial_power_state function in
> > commit 7613c922315e308a6486d802abed2eb74443dffd modified the condition
> > to set the enable gpio as output. This breaks specific use cases using
> > that GPIO, such as tegra124-based nyan Chromebooks where backlight
> > stopped working.
> > 
> > This puts the condition back to the way it was before the move.
> > 
> > Signed-off-by: Paul Kocialkowski 
> 
> Does "[PATCH v2] backlight: pwm_bl: Fix GPIO out for unimplemented
> .get_direction()"
> (https://lkml.org/lkml/2017/4/4/225) fix your issue?

This definitely solves the issue!

Since this fixes broken backlight (that, in turn, makes systems unusable),
perhaps this should be included in the next rc cycle?

Cheers!

-- 
Paul Kocialkowski, developer of free digital technology at the lower levels

Website: https://www.paulk.fr/
Coding blog: https://code.paulk.fr/
Git repositories: https://git.paulk.fr/ https://git.code.paulk.fr/

signature.asc
Description: This is a digitally signed message part

Re: [PATCH 00/16] Intel FPGA Device Drivers

2017-04-17 Thread Alan Tull

On Fri, Apr 14, 2017 at 3:49 PM, Jerome Glisse  wrote:

Hi Jerome,

> On Fri, Apr 14, 2017 at 12:48:17PM -0700, Luebbers, Enno wrote:
>> On Wed, Apr 12, 2017 at 11:37:49AM -0400, Jerome Glisse wrote:
>> > On Wed, Apr 12, 2017 at 07:46:19AM -0700, Moritz Fischer wrote:
>> > > On Wed, Apr 12, 2017 at 6:29 AM, Jerome Glisse  
>> > > wrote:
>> > >
>> > > > It is like if on GPU we only had close source compiler for the GPU
>> > > > instructions set. So FPGA is definitly following different rules than
>> > > > open source upstream GPU kernel driver abides to.

Sorry, not a GPU guy, can you point me to something that documents
this policy of 'only opensource compilers for GPU'?  I looked under
linux/Documentation and didn't see anything.

The current patchset doesn't have anything to do with FPGA toolchains
but you're using this patchset as a platform to talk about toolchain
issues.

It sounds like you are opposed to any kernel support of loading images
on FPGAs until all vendors have opensource toolchains.

Alan

Re: [PATCH 00/16] Intel FPGA Device Drivers

2017-04-17 Thread Alan Tull

On Fri, Apr 14, 2017 at 3:49 PM, Jerome Glisse  wrote:

Hi Jerome,

> On Fri, Apr 14, 2017 at 12:48:17PM -0700, Luebbers, Enno wrote:
>> On Wed, Apr 12, 2017 at 11:37:49AM -0400, Jerome Glisse wrote:
>> > On Wed, Apr 12, 2017 at 07:46:19AM -0700, Moritz Fischer wrote:
>> > > On Wed, Apr 12, 2017 at 6:29 AM, Jerome Glisse  
>> > > wrote:
>> > >
>> > > > It is like if on GPU we only had close source compiler for the GPU
>> > > > instructions set. So FPGA is definitly following different rules than
>> > > > open source upstream GPU kernel driver abides to.

Sorry, not a GPU guy, can you point me to something that documents
this policy of 'only opensource compilers for GPU'?  I looked under
linux/Documentation and didn't see anything.

The current patchset doesn't have anything to do with FPGA toolchains
but you're using this patchset as a platform to talk about toolchain
issues.

It sounds like you are opposed to any kernel support of loading images
on FPGAs until all vendors have opensource toolchains.

Alan

[PATCH 25/25] sky2: Use seq_puts() in sky2_debug_show()

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 16:15:12 +0200

A string which did not contain a data format specification should be put
into a sequence. Thus use the corresponding function "seq_puts".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/sky2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/sky2.c 
b/drivers/net/ethernet/marvell/sky2.c
index 2b2cc3f3ca10..1145cde2274a 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -4544,7 +4544,7 @@ static int sky2_debug_show(struct seq_file *seq, void *v)
   sky2_read32(hw, B0_Y2_SP_ICR));
 
if (!netif_running(dev)) {
-   seq_printf(seq, "network not running\n");
+   seq_puts(seq, "network not running\n");
return 0;
}
 
-- 
2.12.2

[PATCH 25/25] sky2: Use seq_puts() in sky2_debug_show()

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 16:15:12 +0200

A string which did not contain a data format specification should be put
into a sequence. Thus use the corresponding function "seq_puts".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/sky2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/sky2.c 
b/drivers/net/ethernet/marvell/sky2.c
index 2b2cc3f3ca10..1145cde2274a 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -4544,7 +4544,7 @@ static int sky2_debug_show(struct seq_file *seq, void *v)
   sky2_read32(hw, B0_Y2_SP_ICR));
 
if (!netif_running(dev)) {
-   seq_printf(seq, "network not running\n");
+   seq_puts(seq, "network not running\n");
return 0;
}
 
-- 
2.12.2

[PATCH 24/25] skge: Adjust a null pointer check in skge_down()

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 16:08:39 +0200

The script "checkpatch.pl" pointed information out like the following.

Comparison to NULL could be written "!skge->mem".

Thus fix the affected source code place.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/skge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/skge.c 
b/drivers/net/ethernet/marvell/skge.c
index 90bfdbcfd910..5d7d94de4e00 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -2657,7 +2657,7 @@ static int skge_down(struct net_device *dev)
struct skge_hw *hw = skge->hw;
int port = skge->port;
 
-   if (skge->mem == NULL)
+   if (!skge->mem)
return 0;
 
netif_info(skge, ifdown, skge->netdev, "disabling interface\n");
-- 
2.12.2

Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu

2017-04-17 Thread Lauro Venancio

On 04/14/2017 01:49 PM, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:09AM -0300, Lauro Ramos Venancio wrote:
>> Currently, the group balance cpu is the groups's first CPU. But with
>> overlapping groups, two different groups can have the same first CPU.
>>
>> This patch uses the group mask to mark all the CPUs that have a
>> particular group as its main sched group. The group balance cpu is the
>> first group CPU that is also in the mask.
> Please give a NUMA configuration and CPU number where this goes wrong.
On a 4 nodes with ring topology, the groups (0-1,3 [cpu 0]),  (0-2 [cpu
1]) and (0,2-3 [cpu 3]) share the same sched_group_capacity instance
when the first groups cpu is used to select the sgc.
>
> Because only the first group of a domain matters, and with the other
> thing fixed, I'm not immediately seeing where we go wobbly.

Before patch 2, the group balance cpu was implicitly used to select the
sched_group_capacity instance. When two different groups had the same
balance cpu, they shared the same sched_group_capacity instance.

After patch 2, one different sched_group_capacity instance is assigned
to each group instance.

This patch ensures tree things:

1) different instances of the same group share the same
sched_group_capacity instance.

2) instances of different groups don't share the same
sched_group_capacity instance.

3) the group balance cpu must be one of the cpus where the group is
installed.

I am rebasing this patch on top of your patches.

[PATCH 24/25] skge: Adjust a null pointer check in skge_down()

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 16:08:39 +0200

The script "checkpatch.pl" pointed information out like the following.

Comparison to NULL could be written "!skge->mem".

Thus fix the affected source code place.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/skge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/skge.c 
b/drivers/net/ethernet/marvell/skge.c
index 90bfdbcfd910..5d7d94de4e00 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -2657,7 +2657,7 @@ static int skge_down(struct net_device *dev)
struct skge_hw *hw = skge->hw;
int port = skge->port;
 
-   if (skge->mem == NULL)
+   if (!skge->mem)
return 0;
 
netif_info(skge, ifdown, skge->netdev, "disabling interface\n");
-- 
2.12.2

Re: [RFC 3/3] sched/topology: Different sched groups must not have the same balance cpu

2017-04-17 Thread Lauro Venancio

On 04/14/2017 01:49 PM, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:09AM -0300, Lauro Ramos Venancio wrote:
>> Currently, the group balance cpu is the groups's first CPU. But with
>> overlapping groups, two different groups can have the same first CPU.
>>
>> This patch uses the group mask to mark all the CPUs that have a
>> particular group as its main sched group. The group balance cpu is the
>> first group CPU that is also in the mask.
> Please give a NUMA configuration and CPU number where this goes wrong.
On a 4 nodes with ring topology, the groups (0-1,3 [cpu 0]),  (0-2 [cpu
1]) and (0,2-3 [cpu 3]) share the same sched_group_capacity instance
when the first groups cpu is used to select the sgc.
>
> Because only the first group of a domain matters, and with the other
> thing fixed, I'm not immediately seeing where we go wobbly.

Before patch 2, the group balance cpu was implicitly used to select the
sched_group_capacity instance. When two different groups had the same
balance cpu, they shared the same sched_group_capacity instance.

After patch 2, one different sched_group_capacity instance is assigned
to each group instance.

This patch ensures tree things:

1) different instances of the same group share the same
sched_group_capacity instance.

2) instances of different groups don't share the same
sched_group_capacity instance.

3) the group balance cpu must be one of the cpus where the group is
installed.

I am rebasing this patch on top of your patches.

[PATCH 23/25] skge: Use seq_puts() in skge_debug_show()

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 15:43:08 +0200

A string which did not contain a data format specification should be put
into a sequence. Thus use the corresponding function "seq_puts".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/skge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/skge.c 
b/drivers/net/ethernet/marvell/skge.c
index edb95271a4f2..90bfdbcfd910 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -3718,7 +3718,7 @@ static int skge_debug_show(struct seq_file *seq, void *v)
   t->csum_offs, t->csum_write, t->csum_start);
}
 
-   seq_printf(seq, "\nRx Ring:\n");
+   seq_puts(seq, "\nRx Ring:\n");
for (e = skge->rx_ring.to_clean; ; e = e->next) {
const struct skge_rx_desc *r = e->desc;
 
-- 
2.12.2

[PATCH 23/25] skge: Use seq_puts() in skge_debug_show()

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 15:43:08 +0200

A string which did not contain a data format specification should be put
into a sequence. Thus use the corresponding function "seq_puts".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/skge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/skge.c 
b/drivers/net/ethernet/marvell/skge.c
index edb95271a4f2..90bfdbcfd910 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -3718,7 +3718,7 @@ static int skge_debug_show(struct seq_file *seq, void *v)
   t->csum_offs, t->csum_write, t->csum_start);
}
 
-   seq_printf(seq, "\nRx Ring:\n");
+   seq_puts(seq, "\nRx Ring:\n");
for (e = skge->rx_ring.to_clean; ; e = e->next) {
const struct skge_rx_desc *r = e->desc;
 
-- 
2.12.2

[PATCH 22/25] net: pxa168_eth: Adjust four checks for null pointers

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 15:23:45 +0200
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The script “checkpatch.pl” pointed information out like the following.

Comparison to NULL could be written …

Thus fix the affected source code places.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/pxa168_eth.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/pxa168_eth.c 
b/drivers/net/ethernet/marvell/pxa168_eth.c
index 2404eac51c63..993724959a7c 100644
--- a/drivers/net/ethernet/marvell/pxa168_eth.c
+++ b/drivers/net/ethernet/marvell/pxa168_eth.c
@@ -556,11 +556,11 @@ static int init_hash_table(struct pxa168_eth_private *pep)
 * function.Driver can dynamically switch to them if the 1/2kB hash
 * table is full.
 */
-   if (pep->htpr == NULL) {
+   if (!pep->htpr) {
pep->htpr = dma_zalloc_coherent(pep->dev->dev.parent,
HASH_ADDR_TABLE_SIZE,
>htpr_dma, GFP_KERNEL);
-   if (pep->htpr == NULL)
+   if (!pep->htpr)
return -ENOMEM;
} else {
memset(pep->htpr, 0, HASH_ADDR_TABLE_SIZE);
@@ -1356,7 +1356,7 @@ static int pxa168_smi_write(struct mii_bus *bus, int 
phy_addr, int regnum,
 static int pxa168_eth_do_ioctl(struct net_device *dev, struct ifreq *ifr,
   int cmd)
 {
-   if (dev->phydev != NULL)
+   if (dev->phydev)
return phy_mii_ioctl(dev->phydev, ifr, cmd);
 
return -EOPNOTSUPP;
@@ -1501,7 +1501,7 @@ static int pxa168_eth_probe(struct platform_device *pdev)
pep->timeout.data = (unsigned long)pep;
 
pep->smi_bus = mdiobus_alloc();
-   if (pep->smi_bus == NULL) {
+   if (!pep->smi_bus) {
err = -ENOMEM;
goto err_netdev;
}
-- 
2.12.2

[PATCH 22/25] net: pxa168_eth: Adjust four checks for null pointers

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 15:23:45 +0200
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The script “checkpatch.pl” pointed information out like the following.

Comparison to NULL could be written …

Thus fix the affected source code places.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/pxa168_eth.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/pxa168_eth.c 
b/drivers/net/ethernet/marvell/pxa168_eth.c
index 2404eac51c63..993724959a7c 100644
--- a/drivers/net/ethernet/marvell/pxa168_eth.c
+++ b/drivers/net/ethernet/marvell/pxa168_eth.c
@@ -556,11 +556,11 @@ static int init_hash_table(struct pxa168_eth_private *pep)
 * function.Driver can dynamically switch to them if the 1/2kB hash
 * table is full.
 */
-   if (pep->htpr == NULL) {
+   if (!pep->htpr) {
pep->htpr = dma_zalloc_coherent(pep->dev->dev.parent,
HASH_ADDR_TABLE_SIZE,
>htpr_dma, GFP_KERNEL);
-   if (pep->htpr == NULL)
+   if (!pep->htpr)
return -ENOMEM;
} else {
memset(pep->htpr, 0, HASH_ADDR_TABLE_SIZE);
@@ -1356,7 +1356,7 @@ static int pxa168_smi_write(struct mii_bus *bus, int 
phy_addr, int regnum,
 static int pxa168_eth_do_ioctl(struct net_device *dev, struct ifreq *ifr,
   int cmd)
 {
-   if (dev->phydev != NULL)
+   if (dev->phydev)
return phy_mii_ioctl(dev->phydev, ifr, cmd);
 
return -EOPNOTSUPP;
@@ -1501,7 +1501,7 @@ static int pxa168_eth_probe(struct platform_device *pdev)
pep->timeout.data = (unsigned long)pep;
 
pep->smi_bus = mdiobus_alloc();
-   if (pep->smi_bus == NULL) {
+   if (!pep->smi_bus) {
err = -ENOMEM;
goto err_netdev;
}
-- 
2.12.2

[PATCH 21/25] net: pxa168_eth: Use kcalloc() in two functions

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 14:32:14 +0200

Multiplications for the size determination of memory allocations
indicated that array data structures should be processed.
Thus use the corresponding function "kcalloc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/pxa168_eth.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/pxa168_eth.c 
b/drivers/net/ethernet/marvell/pxa168_eth.c
index 28cb36d9e50a..2404eac51c63 100644
--- a/drivers/net/ethernet/marvell/pxa168_eth.c
+++ b/drivers/net/ethernet/marvell/pxa168_eth.c
@@ -1036,8 +1036,7 @@ static int rxq_init(struct net_device *dev)
int rx_desc_num = pep->rx_ring_size;
 
/* Allocate RX skb rings */
-   pep->rx_skb = kzalloc(sizeof(*pep->rx_skb) * pep->rx_ring_size,
-GFP_KERNEL);
+   pep->rx_skb = kcalloc(rx_desc_num, sizeof(*pep->rx_skb), GFP_KERNEL);
if (!pep->rx_skb)
return -ENOMEM;
 
@@ -1096,8 +1095,7 @@ static int txq_init(struct net_device *dev)
int size = 0, i = 0;
int tx_desc_num = pep->tx_ring_size;
 
-   pep->tx_skb = kzalloc(sizeof(*pep->tx_skb) * pep->tx_ring_size,
-GFP_KERNEL);
+   pep->tx_skb = kcalloc(tx_desc_num, sizeof(*pep->tx_skb), GFP_KERNEL);
if (!pep->tx_skb)
return -ENOMEM;
 
-- 
2.12.2

[PATCH 21/25] net: pxa168_eth: Use kcalloc() in two functions

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 14:32:14 +0200

Multiplications for the size determination of memory allocations
indicated that array data structures should be processed.
Thus use the corresponding function "kcalloc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/pxa168_eth.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/pxa168_eth.c 
b/drivers/net/ethernet/marvell/pxa168_eth.c
index 28cb36d9e50a..2404eac51c63 100644
--- a/drivers/net/ethernet/marvell/pxa168_eth.c
+++ b/drivers/net/ethernet/marvell/pxa168_eth.c
@@ -1036,8 +1036,7 @@ static int rxq_init(struct net_device *dev)
int rx_desc_num = pep->rx_ring_size;
 
/* Allocate RX skb rings */
-   pep->rx_skb = kzalloc(sizeof(*pep->rx_skb) * pep->rx_ring_size,
-GFP_KERNEL);
+   pep->rx_skb = kcalloc(rx_desc_num, sizeof(*pep->rx_skb), GFP_KERNEL);
if (!pep->rx_skb)
return -ENOMEM;
 
@@ -1096,8 +1095,7 @@ static int txq_init(struct net_device *dev)
int size = 0, i = 0;
int tx_desc_num = pep->tx_ring_size;
 
-   pep->tx_skb = kzalloc(sizeof(*pep->tx_skb) * pep->tx_ring_size,
-GFP_KERNEL);
+   pep->tx_skb = kcalloc(tx_desc_num, sizeof(*pep->tx_skb), GFP_KERNEL);
if (!pep->tx_skb)
return -ENOMEM;
 
-- 
2.12.2

[PATCH 20/25] net: mvpp2: Adjust a null pointer check in mvpp2_egress_enable()

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 14:07:52 +0200

The script "checkpatch.pl" pointed information out like the following.

Comparison to NULL could be written "txq->descs".

Thus fix the affected source code place.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/mvpp2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index 3bdd3f1fe34e..9b875d776b29 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -4415,7 +4415,7 @@ static void mvpp2_egress_enable(struct mvpp2_port *port)
for (queue = 0; queue < txq_number; queue++) {
struct mvpp2_tx_queue *txq = port->txqs[queue];
 
-   if (txq->descs != NULL)
+   if (txq->descs)
qmap |= (1 << queue);
}
 
-- 
2.12.2

[PATCH 20/25] net: mvpp2: Adjust a null pointer check in mvpp2_egress_enable()

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 14:07:52 +0200

The script "checkpatch.pl" pointed information out like the following.

Comparison to NULL could be written "txq->descs".

Thus fix the affected source code place.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/mvpp2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index 3bdd3f1fe34e..9b875d776b29 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -4415,7 +4415,7 @@ static void mvpp2_egress_enable(struct mvpp2_port *port)
for (queue = 0; queue < txq_number; queue++) {
struct mvpp2_tx_queue *txq = port->txqs[queue];
 
-   if (txq->descs != NULL)
+   if (txq->descs)
qmap |= (1 << queue);
}
 
-- 
2.12.2

[PATCH 19/25] net: mvpp2: Rename a jump label in mvpp2_prs_vlan_add()

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 13:50:35 +0200

Adjust jump labels according to the Linux coding style convention.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/mvpp2.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index 0c190bd003b1..3bdd3f1fe34e 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -2021,7 +2021,7 @@ static int mvpp2_prs_vlan_add(struct mvpp2 *priv, 
unsigned short tpid, int ai,
 
if (tid <= tid_aux) {
ret = -EINVAL;
-   goto error;
+   goto free_pe;
}
 
memset(pe, 0, sizeof(*pe));
@@ -2053,8 +2053,7 @@ static int mvpp2_prs_vlan_add(struct mvpp2 *priv, 
unsigned short tpid, int ai,
mvpp2_prs_tcam_port_map_set(pe, port_map);
 
mvpp2_prs_hw_write(priv, pe);
-
-error:
+free_pe:
kfree(pe);
 
return ret;
-- 
2.12.2

[PATCH 19/25] net: mvpp2: Rename a jump label in mvpp2_prs_vlan_add()

2017-04-17 Thread SF Markus Elfring

From: Markus Elfring 
Date: Mon, 17 Apr 2017 13:50:35 +0200

Adjust jump labels according to the Linux coding style convention.

Signed-off-by: Markus Elfring 
---
 drivers/net/ethernet/marvell/mvpp2.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index 0c190bd003b1..3bdd3f1fe34e 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -2021,7 +2021,7 @@ static int mvpp2_prs_vlan_add(struct mvpp2 *priv, 
unsigned short tpid, int ai,
 
if (tid <= tid_aux) {
ret = -EINVAL;
-   goto error;
+   goto free_pe;
}
 
memset(pe, 0, sizeof(*pe));
@@ -2053,8 +2053,7 @@ static int mvpp2_prs_vlan_add(struct mvpp2 *priv, 
unsigned short tpid, int ai,
mvpp2_prs_tcam_port_map_set(pe, port_map);
 
mvpp2_prs_hw_write(priv, pe);
-
-error:
+free_pe:
kfree(pe);
 
return ret;
-- 
2.12.2

< 2 3 4 5 6 7 8 9 10 11 >

601 - 700 of 1040 matches

Mail list logo