Re: [PATCH 1/2] mm: use is_migrate_highatomic() to simplify the code

2017-03-06 Thread Michal Hocko
On Fri 03-03-17 15:06:19, Andrew Morton wrote:
> On Fri, 3 Mar 2017 14:18:08 +0100 Michal Hocko  wrote:
> 
> > On Fri 03-03-17 19:10:13, Xishi Qiu wrote:
> > > Introduce two helpers, is_migrate_highatomic() and 
> > > is_migrate_highatomic_page().
> > > Simplify the code, no functional changes.
> > 
> > static inline helpers would be nicer than macros
> 
> Always.
> 
> We made a big dependency mess in mmzone.h.  internal.h works.

Just too bad we have three different header files for
is_migrate_isolate{_page} - include/linux/page-isolation.h
is_migrate_cma{_page} - include/linux/mmzone.h
is_migrate_highatomic{_page} - mm/internal.h

I guess we want all of them in internal.h?
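
For reference, a minimal sketch of what the static inline variants could
look like (exact form and final header placement per the discussion above):

static inline bool is_migrate_highatomic(enum migratetype migratetype)
{
        return migratetype == MIGRATE_HIGHATOMIC;
}

static inline bool is_migrate_highatomic_page(struct page *page)
{
        return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
}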

-- 
Michal Hocko
SUSE Labs


Re: [Patch v2 03/11] s5p-mfc: Use min scratch buffer size as provided by F/W

2017-03-06 Thread Andrzej Hajda
On 03.03.2017 10:07, Smitha T Murthy wrote:
> Since MFC v8.0, the MFC firmware lets the driver know how much scratch
> buffer size is required for the decoder. If the firmware supports
> E_MIN_SCRATCH_BUFFER_SIZE, the driver can also learn how much scratch
> buffer size is required for the encoder.
>
> Signed-off-by: Smitha T Murthy 
Reviewed-by: Andrzej Hajda 
--
Regards
Andrzej



[v2 PATCH 3/3] mmc: sdhci-cadence: Update PHY delay configuration

2017-03-06 Thread Piotr Sroka
PHY settings can differ between platforms and SoCs. The fixed PHY input
delays are replaced with SoC-specific compatible data, and DTS properties
are used to configure the new PHY DLL delays.

Signed-off-by: Piotr Sroka 
---
Changes for v2:
- dts part was removed from this patch
- most delays were moved from the dts file
  to data associated with an SoC-specific compatible (sketched below)
- removed unrelated changes
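
A sketch of how the SoC-specific compatible data could be wired up, reusing
the previously fixed values from sdhci_cdns_phy_init() (the compatible
string and exact table layout here are illustrative assumptions; the real
table is in the part of the diff not shown):

static const struct sdhci_cdns_config sdhci_cdns_default_config = {
        /* values previously hard-coded in sdhci_cdns_phy_init() */
        .phy_dly_sd_highspeed = 4,
        .phy_dly_sd_legacy = 4,
        .phy_dly_emmc_legacy = 9,
        .phy_dly_emmc_sdr = 2,
        .phy_dly_emmc_ddr = 3,
};

static const struct of_device_id sdhci_cdns_match[] = {
        { .compatible = "cdns,sd4hc", .data = &sdhci_cdns_default_config },
        { /* sentinel */ }
};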
---

 drivers/mmc/host/sdhci-cadence.c | 124 ---
 1 file changed, 116 insertions(+), 8 deletions(-)

diff --git a/drivers/mmc/host/sdhci-cadence.c b/drivers/mmc/host/sdhci-cadence.c
index b2334ec..29b5d11 100644
--- a/drivers/mmc/host/sdhci-cadence.c
+++ b/drivers/mmc/host/sdhci-cadence.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "sdhci-pltfm.h"
 
@@ -54,6 +55,9 @@
 #define SDHCI_CDNS_PHY_DLY_EMMC_LEGACY 0x06
 #define SDHCI_CDNS_PHY_DLY_EMMC_SDR0x07
 #define SDHCI_CDNS_PHY_DLY_EMMC_DDR0x08
+#define SDHCI_CDNS_PHY_DLY_SDCLK   0x0b
+#define SDHCI_CDNS_PHY_DLY_HSMMC   0x0c
+#define SDHCI_CDNS_PHY_DLY_STROBE  0x0d
 
 /*
  * The tuned val register is 6 bit-wide, but not the whole of the range is
@@ -62,10 +66,24 @@
  */
 #define SDHCI_CDNS_MAX_TUNING_LOOP 40
 
+static const struct of_device_id sdhci_cdns_match[];
+
 struct sdhci_cdns_priv {
void __iomem *hrs_addr;
 };
 
+struct sdhci_cdns_config {
+   u8 phy_dly_sd_highspeed;
+   u8 phy_dly_sd_legacy;
+   u8 phy_dly_sd_uhs_sdr12;
+   u8 phy_dly_sd_uhs_sdr25;
+   u8 phy_dly_sd_uhs_sdr50;
+   u8 phy_dly_sd_uhs_ddr50;
+   u8 phy_dly_emmc_legacy;
+   u8 phy_dly_emmc_sdr;
+   u8 phy_dly_emmc_ddr;
+};
+
 static int sdhci_cdns_write_phy_reg(struct sdhci_cdns_priv *priv,
u8 addr, u8 data)
 {
@@ -90,13 +108,77 @@ static int sdhci_cdns_write_phy_reg(struct sdhci_cdns_priv 
*priv,
return 0;
 }
 
-static void sdhci_cdns_phy_init(struct sdhci_cdns_priv *priv)
+static int sdhci_cdns_phy_in_delay_init(struct sdhci_cdns_priv *priv,
+   const struct sdhci_cdns_config *config)
+{
+   int ret = 0;
+
+   ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_SD_HS,
+  config->phy_dly_sd_highspeed);
+   if (ret)
+   return ret;
+   ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_SD_DEFAULT,
+  config->phy_dly_sd_legacy);
+   if (ret)
+   return ret;
+   ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_UHS_SDR12,
+  config->phy_dly_sd_uhs_sdr12);
+   if (ret)
+   return ret;
+   ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_UHS_SDR25,
+  config->phy_dly_sd_uhs_sdr25);
+   if (ret)
+   return ret;
+   ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_UHS_SDR50,
+  config->phy_dly_sd_uhs_sdr50);
+   if (ret)
+   return ret;
+   ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_UHS_DDR50,
+  config->phy_dly_sd_uhs_ddr50);
+   if (ret)
+   return ret;
+   ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_EMMC_LEGACY,
+  config->phy_dly_emmc_legacy);
+   if (ret)
+   return ret;
+   ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_EMMC_SDR,
+  config->phy_dly_emmc_sdr);
+   if (ret)
+   return ret;
+   ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_EMMC_DDR,
+  config->phy_dly_emmc_ddr);
+   if (ret)
+   return ret;
+   return 0;
+}
+
+static int sdhci_cdns_phy_dll_delay_parse_dt(struct device_node *np,
+struct sdhci_cdns_priv *priv)
 {
-   sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_SD_HS, 4);
-   sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_SD_DEFAULT, 4);
-   sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_EMMC_LEGACY, 9);
-   sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_EMMC_SDR, 2);
-   sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_EMMC_DDR, 3);
+   u32 tmp;
+   int ret;
+
+   if (!of_property_read_u32(np, "phy-dll-delay-sdclk", &tmp)) {
+   ret = sdhci_cdns_write_phy_reg(priv,
+  SDHCI_CDNS_PHY_DLY_SDCLK, tmp);
+
+   if (ret)
+   return ret;
+   }
+   if (!of_property_read_u32(np, "phy-dll-delay-sdclk-hsmmc", &tmp)) {
+   ret = sdhci_cdns_write_phy_reg(priv,
+  SDHCI_CDNS_PHY_DLY_HSMMC, tmp);
+   if (ret)
+   return ret;
+   }
+   

[PATCH v2 1/4] mmc: core: Add post_ios_power_on callback for power sequences

2017-03-06 Thread Romain Perier
Currently, the ->pre_power_on() callback is called at the beginning of the
mmc_power_up() function, before the MMC_POWER_UP and MMC_POWER_ON sequences,
and the ->post_power_on() callback is called at the end of the
mmc_power_up() function. Some SDIO chipsets require the clock to be gated
after the vqmmc supply is powered on, and the reset line to be toggled only
then. Currently, there is no way of doing this.

This commit introduces a new callback, ->post_ios_power_on(), that is
called at the end of the mmc_power_up() function, after the mmc_set_ios()
operation. This way, the entire power sequence can be done from this
function after the enablement of the power supply.
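
A simplified sketch of the resulting ordering in mmc_power_up() (steps
abbreviated; only the last call is new):

        mmc_pwrseq_pre_power_on(host);
        /* ... MMC_POWER_UP sequence: mmc_set_ios(), mmc_delay() ... */
        mmc_pwrseq_post_power_on(host);
        /* ... MMC_POWER_ON: enable the clock, mmc_set_ios() ... */
        mmc_delay(10);                          /* wait for a stable voltage */
        mmc_pwrseq_post_ios_power_on(host);     /* new: full sequence may run here */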

Signed-off-by: Romain Perier 
---

Changes in v2:
 - Added missing declaration for mmc_pwrseq_post_ios_power_on when
   CONFIG_OF is disabled.

 drivers/mmc/core/core.c   | 1 +
 drivers/mmc/core/pwrseq.c | 8 
 drivers/mmc/core/pwrseq.h | 3 +++
 3 files changed, 12 insertions(+)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 1076b9d..d73a050 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -1831,6 +1831,7 @@ void mmc_power_up(struct mmc_host *host, u32 ocr)
 * time required to reach a stable voltage.
 */
mmc_delay(10);
+   mmc_pwrseq_post_ios_power_on(host);
 }
 
 void mmc_power_off(struct mmc_host *host)
diff --git a/drivers/mmc/core/pwrseq.c b/drivers/mmc/core/pwrseq.c
index 9386c47..98f50b7 100644
--- a/drivers/mmc/core/pwrseq.c
+++ b/drivers/mmc/core/pwrseq.c
@@ -68,6 +68,14 @@ void mmc_pwrseq_post_power_on(struct mmc_host *host)
pwrseq->ops->post_power_on(host);
 }
 
+void mmc_pwrseq_post_ios_power_on(struct mmc_host *host)
+{
+   struct mmc_pwrseq *pwrseq = host->pwrseq;
+
+   if (pwrseq && pwrseq->ops->post_ios_power_on)
+   pwrseq->ops->post_ios_power_on(host);
+}
+
 void mmc_pwrseq_power_off(struct mmc_host *host)
 {
struct mmc_pwrseq *pwrseq = host->pwrseq;
diff --git a/drivers/mmc/core/pwrseq.h b/drivers/mmc/core/pwrseq.h
index d69e751..ad6e3af 100644
--- a/drivers/mmc/core/pwrseq.h
+++ b/drivers/mmc/core/pwrseq.h
@@ -13,6 +13,7 @@
 struct mmc_pwrseq_ops {
void (*pre_power_on)(struct mmc_host *host);
void (*post_power_on)(struct mmc_host *host);
+   void (*post_ios_power_on)(struct mmc_host *host);
void (*power_off)(struct mmc_host *host);
 };
 
@@ -31,6 +32,7 @@ void mmc_pwrseq_unregister(struct mmc_pwrseq *pwrseq);
 int mmc_pwrseq_alloc(struct mmc_host *host);
 void mmc_pwrseq_pre_power_on(struct mmc_host *host);
 void mmc_pwrseq_post_power_on(struct mmc_host *host);
+void mmc_pwrseq_post_ios_power_on(struct mmc_host *host);
 void mmc_pwrseq_power_off(struct mmc_host *host);
 void mmc_pwrseq_free(struct mmc_host *host);
 
@@ -44,6 +46,7 @@ static inline void mmc_pwrseq_unregister(struct mmc_pwrseq 
*pwrseq) {}
 static inline int mmc_pwrseq_alloc(struct mmc_host *host) { return 0; }
 static inline void mmc_pwrseq_pre_power_on(struct mmc_host *host) {}
 static inline void mmc_pwrseq_post_power_on(struct mmc_host *host) {}
+static inline void mmc_pwrseq_post_ios_power_on(struct mmc_host *host) {}
 static inline void mmc_pwrseq_power_off(struct mmc_host *host) {}
 static inline void mmc_pwrseq_free(struct mmc_host *host) {}
 
-- 
2.9.3



Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR

2017-03-06 Thread Kirill A. Shutemov
On Mon, Mar 06, 2017 at 05:00:28PM +0300, Dmitry Safonov wrote:
> 2017-02-21 15:42 GMT+03:00 Kirill A. Shutemov :
> > On Tue, Feb 21, 2017 at 02:54:20PM +0300, Dmitry Safonov wrote:
> >> 2017-02-17 19:50 GMT+03:00 Andy Lutomirski :
> >> > On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov
> >> >  wrote:
> >> >> This patch introduces two new prctl(2) handles to manage maximum virtual
> >> >> address available to userspace to map.
> >> ...
> >> > Anyway, can you and Dmitry try to reconcile your patches?
> >>
> >> So, how can I help that?
> >> Is there the patch's version, on which I could rebase?
> >> Here are BTW the last patches, which I will resend with trivial ifdef-fixup
> >> after the merge window:
> >> http://marc.info/?i=20170214183621.2537-1-dsafonov%20()%20virtuozzo%20!%20com
> >
> > Could you check if this patch collides with anything you do:
> >
> > http://lkml.kernel.org/r/20170220131515.ga9...@node.shutemov.name
> 
> Ok, sorry for the late reply - it was the merge window anyway and I've got
> urgent work to do.
> 
> Let's see:
> 
> I'll need minor merge fixup here:
> >-#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
> >+#define TASK_UNMAPPED_BASE (PAGE_ALIGN(DEFAULT_MAP_WINDOW / 3))
> while in my patches:
> >+#define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> >+#define TASK_UNMAPPED_BASE		__TASK_UNMAPPED_BASE(TASK_SIZE)
> 
> This should be just fine with my changes:
> >- info.high_limit = end;
> >+ info.high_limit = min(end, DEFAULT_MAP_WINDOW);
> 
> This will need another minor fixup:
> >-#define MAX_GAP (TASK_SIZE/6*5)
> >+#define MAX_GAP (DEFAULT_MAP_WINDOW/6*5)
> I've moved it from macro to mmap_base() as local var,
> which depends on task_size parameter.
> 
> That's all, as far as I can see at this moment.
> It does not seem hard to fix. So I suggest sending the patch sets
> in parallel; the one accepted second will rebase their set.
> Is it convenient for you?

Works for me.

In fact, I've just sent v4 of the patchset.

-- 
 Kirill A. Shutemov


[PATCH v2 3/4] mmc: pwrseq_simple: Add an optional pre-power-on-delay

2017-03-06 Thread Romain Perier
Some devices need a delay between the enablement of their clock and the
moment the reset line is asserted. As this interval falls between the
pre_power_on and post_power_on callbacks, an msleep is needed at the end
of the pre_power_on callback.

This commit adds an optional DT property for such devices.

Signed-off-by: Romain Perier 
---
 Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt | 2 ++
 drivers/mmc/core/pwrseq_simple.c| 6 ++
 2 files changed, 8 insertions(+)

diff --git a/Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt 
b/Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt
index e254368..821feaaf 100644
--- a/Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt
+++ b/Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt
@@ -18,6 +18,8 @@ Optional properties:
   "ext_clock" (External clock provided to the card).
 - post-power-on-delay-ms : Delay in ms after powering the card and
de-asserting the reset-gpios (if any)
+- pre-power-on-delay-ms : Delay in ms before powering the card and
+   asserting the reset-gpios (if any)
 
 Example:
 
diff --git a/drivers/mmc/core/pwrseq_simple.c b/drivers/mmc/core/pwrseq_simple.c
index e27019f..d8d7166 100644
--- a/drivers/mmc/core/pwrseq_simple.c
+++ b/drivers/mmc/core/pwrseq_simple.c
@@ -27,6 +27,7 @@ struct mmc_pwrseq_simple {
struct mmc_pwrseq pwrseq;
bool clk_enabled;
u32 post_power_on_delay_ms;
+   u32 pre_power_on_delay_ms;
struct clk *ext_clk;
struct gpio_descs *reset_gpios;
 };
@@ -60,6 +61,9 @@ static void mmc_pwrseq_simple_pre_power_on(struct mmc_host 
*host)
}
 
mmc_pwrseq_simple_set_gpios_value(pwrseq, 1);
+
+   if (pwrseq->pre_power_on_delay_ms)
+   msleep(pwrseq->pre_power_on_delay_ms);
 }
 
 static void mmc_pwrseq_simple_post_power_on(struct mmc_host *host)
@@ -130,6 +134,8 @@ static int mmc_pwrseq_simple_probe(struct platform_device 
*pdev)
 
device_property_read_u32(dev, "post-power-on-delay-ms",
 &pwrseq->post_power_on_delay_ms);
+   device_property_read_u32(dev, "pre-power-on-delay-ms",
+&pwrseq->pre_power_on_delay_ms);
 
pwrseq->pwrseq.dev = dev;
if (device_property_read_bool(dev, "post-ios-power-on"))
-- 
2.9.3



[PATCH v2 2/4] mmc: pwrseq-simple: Add optional op. for post_ios_power_on callback

2017-03-06 Thread Romain Perier
Some devices require their entire power sequence to be done after the
MMC power supply has been powered on. This can be achieved by implementing
only the optional post_ios_power_on() callback, which relies on the
pre_power_on/post_power_on functions, the other callbacks being NULL.
We introduce a new DT property, "post-ios-power-on": when this property
is set, the driver uses its post_ios operations; otherwise it falls back
to the default operations with pre_power_on/post_power_on.

Signed-off-by: Romain Perier 
---

Changes in v2:
 - Added missing power_off function in mmc_pwrseq_post_ios_ops

 drivers/mmc/core/pwrseq_simple.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/core/pwrseq_simple.c b/drivers/mmc/core/pwrseq_simple.c
index 1304160..e27019f 100644
--- a/drivers/mmc/core/pwrseq_simple.c
+++ b/drivers/mmc/core/pwrseq_simple.c
@@ -84,12 +84,23 @@ static void mmc_pwrseq_simple_power_off(struct mmc_host 
*host)
}
 }
 
+static void mmc_pwrseq_simple_post_ios_power_on(struct mmc_host *host)
+{
+   mmc_pwrseq_simple_pre_power_on(host);
+   mmc_pwrseq_simple_post_power_on(host);
+}
+
 static const struct mmc_pwrseq_ops mmc_pwrseq_simple_ops = {
.pre_power_on = mmc_pwrseq_simple_pre_power_on,
.post_power_on = mmc_pwrseq_simple_post_power_on,
.power_off = mmc_pwrseq_simple_power_off,
 };
 
+static const struct mmc_pwrseq_ops mmc_pwrseq_post_ios_ops = {
+   .post_ios_power_on = mmc_pwrseq_simple_post_ios_power_on,
+   .power_off = mmc_pwrseq_simple_power_off,
+};
+
 static const struct of_device_id mmc_pwrseq_simple_of_match[] = {
{ .compatible = "mmc-pwrseq-simple",},
{/* sentinel */},
@@ -121,7 +132,10 @@ static int mmc_pwrseq_simple_probe(struct platform_device 
*pdev)
 &pwrseq->post_power_on_delay_ms);
 
pwrseq->pwrseq.dev = dev;
-   pwrseq->pwrseq.ops = &mmc_pwrseq_simple_ops;
+   if (device_property_read_bool(dev, "post-ios-power-on"))
+   pwrseq->pwrseq.ops = &mmc_pwrseq_post_ios_ops;
+   else
+   pwrseq->pwrseq.ops = &mmc_pwrseq_simple_ops;
pwrseq->pwrseq.owner = THIS_MODULE;
platform_set_drvdata(pdev, pwrseq);
 
-- 
2.9.3



[PATCH v2 0/4] mmc: pwrseq: post_ios power sequence

2017-03-06 Thread Romain Perier
Some devices, like the AP6335 WiFi chipset, require a specific power-up
sequence ordering before being used: you must enable the vqmmc power supply
and wait until it reaches its minimum voltage, gate the clock and wait at
least two cycles, and then assert the reset line.

See the datasheet at 1/

Currently, there is no generic way of doing this with pwrseq_simple.
This set of patches proposes an approach to support this use case.

It is related to the old patch at 2/

1. 
http://www.t-firefly.com/download/firefly-rk3288/hardware/AP6335%20datasheet_V1.3_02102014.pdf
2. http://lists.infradead.org/pipermail/linux-arm-kernel/2017-March/490681.html

Changes in v2:
- Added missing power_off function in operations for post_ios
- Fixed warning found by 0day-ci about missing
  mmc_pwrseq_post_ios_power_on when CONFIG_OF is disabled.

Romain Perier (4):
  mmc: core: Add post_ios_power_on callback for power sequences
  mmc: pwrseq-simple: Add optional op. for post_ios_power_on callback
  mmc: pwrseq_simple: Add an optional pre-power-on-delay
  arm: dts: rockchip: Enable post_ios_power_on and pre-power-on-delay-ms

 .../devicetree/bindings/mmc/mmc-pwrseq-simple.txt  |  2 ++
 arch/arm/boot/dts/rk3288-rock2-square.dts  |  2 ++
 drivers/mmc/core/core.c|  1 +
 drivers/mmc/core/pwrseq.c  |  8 
 drivers/mmc/core/pwrseq.h  |  3 +++
 drivers/mmc/core/pwrseq_simple.c   | 22 +-
 6 files changed, 37 insertions(+), 1 deletion(-)

-- 
2.9.3



Re: [PATCH 01/10] x86: assembly, ENTRY for fn, GLOBAL for data

2017-03-06 Thread Jiri Slaby
On 03/03/2017, 07:20 PM, h...@zytor.com wrote:
> On March 1, 2017 2:27:54 AM PST, Ingo Molnar  wrote:
>>
>> * Thomas Gleixner  wrote:
>>
>>> On Wed, 1 Mar 2017, Ingo Molnar wrote:

 * Jiri Slaby  wrote:

> This is a start of series to unify use of ENTRY, ENDPROC, GLOBAL,
>> END,
> and other macros across x86. When we have all this sorted out,
>> this will
> help to inject DWARF unwinding info by objtool later.
>
> So, let us use the macros this way:
> * ENTRY -- start of a global function
> * ENDPROC -- end of a local/global function
> * GLOBAL -- start of a globally visible data symbol
> * END -- end of local/global data symbol

 So how about using macro names that actually show the purpose,
>> instead of 
 importing all the crappy, historic, essentially randomly chosen
>> debug symbol macro 
 names from the binutils and older kernels?

 Something sane, like:

SYM__FUNCTION_START
>>>
>>> Sane would be:
>>>
>>> SYM_FUNCTION_START
>>>
>>> The double underscore is just not giving any value.
>>
>> So the double underscore (at least in my view) has two advantages:
>>
>> 1) it helps separate the prefix from the postfix.
>>
>> I.e. it's a 'symbols' namespace, and a 'function start', not the
>> 'start' of a 
>> 'symbol function'.
>>
>> 2) It also helps easy greppability.
>>
>> Try this in latest -tip:
>>
>>  git grep e820__
>>
>> To see all the E820 API calls - with no false positives!
>>
>> 'git grep e820_' on the other hand is a lot less reliable...
>>
>> But no strong feelings either way, I just try to sneak in these small
>> namespace 
>> structure tricks when nobody's looking! ;-)
>>
>> Thanks,
>>
>>  Ingo
> 
> This seems needlessly verbose to me and clutters the code.
> 
> How about:
> 
> PROC..ENDPROC, LOCALPROC..ENDPROC and DATA..ENDDATA.  Clear, unambiguous and 
> balanced.

I tried this, but:
arch/x86/kernel/relocate_kernel_64.S:27:0: warning: "DATA" redefined
 #define DATA(offset)  (KEXEC_CONTROL_CODE_MAX_SIZE+(offset))


I am not saying that I cannot fix it up. I just want to say that these
names might be too generic, especially "PROC" and "DATA". So should I
really stick to these?

thanks,
-- 
js
suse labs


[PATCH 3/3] dt-bindings: mtd: Add Octal SPI support to Cadence QSPI.

2017-03-06 Thread Artur Jedrysek
This patch updates the Cadence QSPI Device Tree documentation to include
information about a new property used to indicate whether or not
Octal SPI transfers are supported by the device.

Signed-off-by: Artur Jedrysek 
---
 Documentation/devicetree/bindings/mtd/cadence-quadspi.txt | 4 
 1 file changed, 4 insertions(+)

diff --git a/Documentation/devicetree/bindings/mtd/cadence-quadspi.txt 
b/Documentation/devicetree/bindings/mtd/cadence-quadspi.txt
index f248056..8438184 100644
--- a/Documentation/devicetree/bindings/mtd/cadence-quadspi.txt
+++ b/Documentation/devicetree/bindings/mtd/cadence-quadspi.txt
@@ -14,6 +14,9 @@ Required properties:
 
 Optional properties:
 - cdns,is-decoded-cs : Flag to indicate whether decoder is used or not.
+- cdns,octal-controller : Flag to indicate that the controller supports Octal
+  SPI transfer mode. May be intentionally omitted to
+  switch it back to Quad SPI mode.
 
 Optional subnodes:
 Subnodes of the Cadence Quad SPI controller are spi slave nodes with additional
@@ -44,6 +47,7 @@ Example:
cdns,fifo-depth = <128>;
cdns,fifo-width = <4>;
	cdns,trigger-address = <0x00000000>;
+	cdns,octal-controller;
 
flash0: n25q00@0 {
...
-- 
2.2.2



Re: [PATCH 1/3] futex: remove duplicated code

2017-03-06 Thread Geert Uytterhoeven
Hi Jiri,

On Mon, Mar 6, 2017 at 9:46 AM, Jiri Slaby  wrote:
> futex: make the encoded_op decoding readable
>
> Decoding of encoded_op is a bit unreadable. It contains shifts to the
> left and to the right by some constants. Make it clearly visible what
> part of the bit mask is taken and shift the values only to the right
> appropriately. And make sure sign extension takes place using
> sign_extend32.
>
> Signed-off-by: Jiri Slaby 
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 0ead0756a593..f90314bd42cb 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1461,10 +1461,10 @@ futex_wake(u32 __user *uaddr, unsigned int
> flags, int nr_wake, u32 bitset)
>
>  static int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
>  {
> -   int op = (encoded_op >> 28) & 7;
> -   int cmp = (encoded_op >> 24) & 15;

At least for the two above (modulo 7 vs 15?), the old decoding code matched
the flow of operation in FUTEX_OP().

> -   int oparg = (encoded_op << 8) >> 20;
> -   int cmparg = (encoded_op << 20) >> 20;
> +   int op =  (encoded_op & 0x70000000) >> 28;
> +   int cmp = (encoded_op & 0x0f000000) >> 24;
> +   int oparg = sign_extend32((encoded_op & 0x00fff000) >> 12, 12);
> +   int cmparg = sign_extend32(encoded_op & 0x00000fff, 12);
> int oldval, ret;
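
For reference, the encoding side in include/uapi/linux/futex.h, which both
the old and the new decoding mirror:

#define FUTEX_OP(op, oparg, cmp, cmparg) \
  (((op & 0xf) << 28) | ((cmp & 0xf) << 24)             \
   | ((oparg & 0xfff) << 12) | (cmparg & 0xfff))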

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[PATCH v4 3/7] perf/sdt: Allow recording of existing events

2017-03-06 Thread Ravi Bangoria
Add functionality to fetch matching events from uprobe_events. If no
events are found there, fetch matching events from the probe-cache and
add them to uprobe_events. If all events are already present in
uprobe_events, reuse them. If only some of them are present, add entries
for the missing events and record them. At the end of the record session,
delete the newly added entries. Below is a detailed algorithm that
describes the implementation of this patch:

arr1 = fetch all sdt events from uprobe_events

if (event with exact name in arr1)
add that in sdt_event_list
return

if (user has used pattern)
if (pattern matching entries found from arr1)
add those events in sdt_event_list
return

arr2 = lookup probe-cache
if (arr2 empty)
return

ctr = 0
foreach (compare entries of arr1 and arr2 using filepath+address)
if (match)
add event from arr1 to sdt_event_list
ctr++
if (!pattern used)
remove entry from arr2

if (!pattern used || ctr == 0)
add all entries of arr2 in sdt_event_list
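
A hedged sketch of the arr1/arr2 comparison step above (the struct and
helper names are hypothetical, for illustration only; the real code lives
in probe-file.c):

/* Hypothetical shape of one fetched event, for illustration. */
struct sdt_event_ent {
        const char *filepath;   /* backing ELF file */
        u64 addr;               /* probe address within the file */
};

/* Match a probe-cache entry against an existing uprobe_events entry
 * by filepath + address, as in the algorithm above. */
static bool sdt_event_match(const struct sdt_event_ent *cached,
                            const struct sdt_event_ent *existing)
{
        return !strcmp(cached->filepath, existing->filepath) &&
               cached->addr == existing->addr;
}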


Example: Consider sdt event sdt_libpthread:mutex_release found in
/usr/lib64/libpthread-2.24.so.

  $ readelf -n /usr/lib64/libpthread-2.24.so | grep -A2 Provider
  Provider: libpthread
  Name: mutex_release
  Location: 0x000000000000b126, Base: 0x00000000000139cc, Semaphore: 0x0000000000000000
--
  Provider: libpthread
  Name: mutex_release
  Location: 0x000000000000b2f6, Base: 0x00000000000139cc, Semaphore: 0x0000000000000000
--
  Provider: libpthread
  Name: mutex_release
  Location: 0x000000000000b498, Base: 0x00000000000139cc, Semaphore: 0x0000000000000000
--
  Provider: libpthread
  Name: mutex_release
  Location: 0x000000000000b596, Base: 0x00000000000139cc, Semaphore: 0x0000000000000000

When no probepoint exists,

  $ sudo ./perf record -a -e sdt_libpthread:mutex_*
Warning: Recording on 15 occurrences of sdt_libpthread:mutex_*

  $ sudo ./perf record -a -e sdt_libpthread:mutex_release
Warning: Recording on 4 occurrences of sdt_libpthread:mutex_release
  $ sudo ./perf evlist
sdt_libpthread:mutex_release_3
sdt_libpthread:mutex_release_2
sdt_libpthread:mutex_release_1
sdt_libpthread:mutex_release

When probepoints already exist for all matching events,

  $ sudo ./perf probe sdt_libpthread:mutex_release
Added new events:
  sdt_libpthread:mutex_release (on %mutex_release in /usr/lib64/libpthread-2.24.so)
  sdt_libpthread:mutex_release_1 (on %mutex_release in /usr/lib64/libpthread-2.24.so)
  sdt_libpthread:mutex_release_2 (on %mutex_release in /usr/lib64/libpthread-2.24.so)
  sdt_libpthread:mutex_release_3 (on %mutex_release in /usr/lib64/libpthread-2.24.so)

  $ sudo ./perf record -a -e sdt_libpthread:mutex_release_1
  $ sudo ./perf evlist
sdt_libpthread:mutex_release_1

  $ sudo ./perf record -a -e sdt_libpthread:mutex_release
  $ sudo ./perf evlist
sdt_libpthread:mutex_release

  $ sudo ./perf record -a -e sdt_libpthread:mutex_*
Warning: Recording on 4 occurrences of sdt_libpthread:mutex_*
  $ sudo ./perf evlist
sdt_libpthread:mutex_release_3
sdt_libpthread:mutex_release_2
sdt_libpthread:mutex_release_1
sdt_libpthread:mutex_release

  $ sudo ./perf record -a -e sdt_libpthread:mutex_release_*
Warning: Recording on 3 occurrences of sdt_libpthread:mutex_release_*

When probepoints partially exist,

  $ sudo ./perf probe -d sdt_libpthread:mutex_release
  $ sudo ./perf probe -d sdt_libpthread:mutex_release_2

  $ sudo ./perf record -a -e sdt_libpthread:mutex_release
Warning: Recording on 4 occurrences of sdt_libpthread:mutex_release
  $ sudo ./perf evlist
sdt_libpthread:mutex_release
sdt_libpthread:mutex_release_3
sdt_libpthread:mutex_release_2
sdt_libpthread:mutex_release_1

  $ sudo ./perf record -a -e sdt_libpthread:mutex_release*
Warning: Recording on 2 occurrences of sdt_libpthread:mutex_release*
  $ sudo ./perf evlist
sdt_libpthread:mutex_release_3
sdt_libpthread:mutex_release_1

  $ sudo ./perf record -a -e sdt_libpthread:*
Warning: Recording on 2 occurrences of sdt_libpthread:*
  $ sudo ./perf evlist
sdt_libpthread:mutex_release_3
sdt_libpthread:mutex_release_1

Signed-off-by: Ravi Bangoria 
---
 tools/perf/util/probe-event.c |  58 +-
 tools/perf/util/probe-event.h |   5 ++
 tools/perf/util/probe-file.c  | 173 +-
 tools/perf/util/probe-file.h  |   3 +-
 4 files changed, 215 insertions(+), 24 deletions(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index b879076..947b2ec 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -231,7 +231,7 @@ static void clear_perf_probe_point(struct perf_probe_point 
*pp)
free(pp->lazy_line);
 }
 
-static void clear_probe_trace_events(struct probe_trac

[PATCHv4 19/33] x86: convert the rest of the code to support p4d_t

2017-03-06 Thread Kirill A. Shutemov
This patch converts x86 to use proper folding of the new page table level
with <asm-generic/pgtable-nop4d.h>.

That's a bit of a kitchen sink, but I don't see how to split it further.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/paravirt.h   |  33 +-
 arch/x86/include/asm/paravirt_types.h |  12 ++-
 arch/x86/include/asm/pgalloc.h|  35 ++-
 arch/x86/include/asm/pgtable.h|  59 ++-
 arch/x86/include/asm/pgtable_64.h |  12 ++-
 arch/x86/include/asm/pgtable_types.h  |  10 +-
 arch/x86/include/asm/xen/page.h   |   8 +-
 arch/x86/kernel/paravirt.c|  10 +-
 arch/x86/mm/init_64.c | 183 +++---
 arch/x86/xen/mmu.c| 152 
 include/trace/events/xen.h|  28 +++---
 11 files changed, 401 insertions(+), 141 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 0489884fdc44..158d877ce9e9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -536,7 +536,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
PVOP_VCALL2(pv_mmu_ops.set_pud, pudp,
val);
 }
-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
 static inline pud_t __pud(pudval_t val)
 {
pudval_t ret;
@@ -565,6 +565,32 @@ static inline pudval_t pud_val(pud_t pud)
return ret;
 }
 
+static inline void pud_clear(pud_t *pudp)
+{
+   set_pud(pudp, __pud(0));
+}
+
+static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
+{
+   p4dval_t val = native_p4d_val(p4d);
+
+   if (sizeof(p4dval_t) > sizeof(long))
+   PVOP_VCALL3(pv_mmu_ops.set_p4d, p4dp,
+   val, (u64)val >> 32);
+   else
+   PVOP_VCALL2(pv_mmu_ops.set_p4d, p4dp,
+   val);
+}
+
+static inline void p4d_clear(p4d_t *p4dp)
+{
+   set_p4d(p4dp, __p4d(0));
+}
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+
+#error FIXME
+
 static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
pgdval_t val = native_pgd_val(pgd);
@@ -582,10 +608,7 @@ static inline void pgd_clear(pgd_t *pgdp)
set_pgd(pgdp, __pgd(0));
 }
 
-static inline void pud_clear(pud_t *pudp)
-{
-   set_pud(pudp, __pud(0));
-}
+#endif  /* CONFIG_PGTABLE_LEVELS == 5 */
 
 #endif /* CONFIG_PGTABLE_LEVELS == 4 */
 
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index b060f962d581..93c49cf09b63 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -279,12 +279,18 @@ struct pv_mmu_ops {
struct paravirt_callee_save pmd_val;
struct paravirt_callee_save make_pmd;
 
-#if CONFIG_PGTABLE_LEVELS == 4
+#if CONFIG_PGTABLE_LEVELS >= 4
struct paravirt_callee_save pud_val;
struct paravirt_callee_save make_pud;
 
-   void (*set_pgd)(pgd_t *pudp, pgd_t pgdval);
-#endif /* CONFIG_PGTABLE_LEVELS == 4 */
+   void (*set_p4d)(p4d_t *p4dp, p4d_t p4dval);
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+#error FIXME
+#endif /* CONFIG_PGTABLE_LEVELS >= 5 */
+
+#endif /* CONFIG_PGTABLE_LEVELS >= 4 */
+
 #endif /* CONFIG_PGTABLE_LEVELS >= 3 */
 
struct pv_lazy_ops lazy_mode;
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index b6d425999f99..2f585054c63c 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -121,10 +121,10 @@ static inline void pud_populate(struct mm_struct *mm, 
pud_t *pud, pmd_t *pmd)
 #endif /* CONFIG_X86_PAE */
 
 #if CONFIG_PGTABLE_LEVELS > 3
-static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
 {
paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT);
-   set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud)));
+   set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
@@ -150,6 +150,37 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, 
pud_t *pud,
___pud_free_tlb(tlb, pud);
 }
 
+#if CONFIG_PGTABLE_LEVELS > 4
+static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d)
+{
+   paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT);
+   set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
+}
+
+static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+   gfp_t gfp = GFP_KERNEL_ACCOUNT;
+
+   if (mm == &init_mm)
+   gfp &= ~__GFP_ACCOUNT;
+   return (p4d_t *)get_zeroed_page(gfp);
+}
+
+static inline void p4d_free(struct mm_struct *mm, p4d_t *p4d)
+{
+   BUG_ON((unsigned long)p4d & (PAGE_SIZE-1));
+   free_page((unsigned long)p4d);
+}
+
+extern void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d);
+
+static inline void __p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d,
+ unsigned long address)
+{
+   ___p4d_free_tlb(tlb,

[PATCHv4 02/33] asm-generic: introduce 5level-fixup.h

2017-03-06 Thread Kirill A. Shutemov
We are going to switch core MM to a 5-level paging abstraction.

This is a preparation step which adds <asm-generic/5level-fixup.h>.
As with 4level-fixup.h, the new header allows us to quickly make all
architectures compatible with 5-level paging in core MM.

In the long run we would like to switch architectures to a properly folded
p4d level by using <asm-generic/pgtable-nop4d.h>, but that requires more
changes to arch-specific code.
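
With the fixup header in place, core MM walks page tables with the extra
p4d step between pgd and pud; under the hack the step compiles down to the
pgd itself. A simplified sketch of the allocation walk as core MM performs
it (see e.g. __handle_mm_fault()):

        pgd_t *pgd = pgd_offset(mm, address);
        p4d_t *p4d = p4d_alloc(mm, pgd, address); /* just (pgd) under the hack */
        pud_t *pud = pud_alloc(mm, p4d, address);
        pmd_t *pmd = pmd_alloc(mm, pud, address);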

Signed-off-by: Kirill A. Shutemov 
---
 include/asm-generic/4level-fixup.h |  3 ++-
 include/asm-generic/5level-fixup.h | 41 ++
 include/linux/mm.h |  3 +++
 3 files changed, 46 insertions(+), 1 deletion(-)
 create mode 100644 include/asm-generic/5level-fixup.h

diff --git a/include/asm-generic/4level-fixup.h 
b/include/asm-generic/4level-fixup.h
index 5bdab6bffd23..928fd66b1271 100644
--- a/include/asm-generic/4level-fixup.h
+++ b/include/asm-generic/4level-fixup.h
@@ -15,7 +15,6 @@
((unlikely(pgd_none(*(pud))) && __pmd_alloc(mm, pud, address))? \
NULL: pmd_offset(pud, address))
 
-#define pud_alloc(mm, pgd, address)	(pgd)
 #define pud_offset(pgd, start) (pgd)
 #define pud_none(pud)  0
 #define pud_bad(pud)   0
@@ -35,4 +34,6 @@
 #undef  pud_addr_end
 #define pud_addr_end(addr, end)	(end)
 
+#include <asm-generic/5level-fixup.h>
+
 #endif
diff --git a/include/asm-generic/5level-fixup.h 
b/include/asm-generic/5level-fixup.h
new file mode 100644
index ..b5ca82dc4175
--- /dev/null
+++ b/include/asm-generic/5level-fixup.h
@@ -0,0 +1,41 @@
+#ifndef _5LEVEL_FIXUP_H
+#define _5LEVEL_FIXUP_H
+
+#define __ARCH_HAS_5LEVEL_HACK
+#define __PAGETABLE_P4D_FOLDED
+
+#define P4D_SHIFT  PGDIR_SHIFT
+#define P4D_SIZE   PGDIR_SIZE
+#define P4D_MASK   PGDIR_MASK
+#define PTRS_PER_P4D   1
+
+#define p4d_t  pgd_t
+
+#define pud_alloc(mm, p4d, address) \
+   ((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \
+   NULL : pud_offset(p4d, address))
+
+#define p4d_alloc(mm, pgd, address)	(pgd)
+#define p4d_offset(pgd, start)		(pgd)
+#define p4d_none(p4d)			0
+#define p4d_bad(p4d)			0
+#define p4d_present(p4d)		1
+#define p4d_ERROR(p4d)			do { } while (0)
+#define p4d_clear(p4d)			pgd_clear(p4d)
+#define p4d_val(p4d)			pgd_val(p4d)
+#define p4d_populate(mm, p4d, pud)	pgd_populate(mm, p4d, pud)
+#define p4d_page(p4d)			pgd_page(p4d)
+#define p4d_page_vaddr(p4d)		pgd_page_vaddr(p4d)
+
+#define __p4d(x)			__pgd(x)
+#define set_p4d(p4dp, p4d)		set_pgd(p4dp, p4d)
+
+#undef p4d_free_tlb
+#define p4d_free_tlb(tlb, x, addr)	do { } while (0)
+#define p4d_free(mm, x)			do { } while (0)
+#define __p4d_free_tlb(tlb, x, addr)	do { } while (0)
+
+#undef  p4d_addr_end
+#define p4d_addr_end(addr, end)		(end)
+
+#endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0d65dd72c0f4..be1fe264eb37 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1619,11 +1619,14 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long 
address);
  * Remove it when 4level-fixup.h has been removed.
  */
 #if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK)
+
+#ifndef __ARCH_HAS_5LEVEL_HACK
 static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long 
address)
 {
return (unlikely(pgd_none(*pgd)) && __pud_alloc(mm, pgd, address))?
NULL: pud_offset(pgd, address);
 }
+#endif /* !__ARCH_HAS_5LEVEL_HACK */
 
 static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long 
address)
 {
-- 
2.11.0



[PATCHv4 28/33] x86/mm: add support of additional page table level during early boot

2017-03-06 Thread Kirill A. Shutemov
This patch adds support for 5-level paging during early boot.
It generalizes boot for 4- and 5-level paging on 64-bit systems with a
compile-time switch between them.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/boot/compressed/head_64.S  | 23 +--
 arch/x86/include/asm/pgtable.h  |  2 +-
 arch/x86/include/asm/pgtable_64.h   |  6 ++-
 arch/x86/include/uapi/asm/processor-flags.h |  2 +
 arch/x86/kernel/espfix_64.c |  2 +-
 arch/x86/kernel/head64.c| 40 +-
 arch/x86/kernel/head_64.S   | 63 +
 arch/x86/kernel/machine_kexec_64.c  |  2 +-
 arch/x86/mm/dump_pagetables.c   |  2 +-
 arch/x86/mm/kasan_init_64.c | 12 +++---
 arch/x86/realmode/init.c|  2 +-
 arch/x86/xen/mmu.c  | 38 ++---
 12 files changed, 135 insertions(+), 59 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S 
b/arch/x86/boot/compressed/head_64.S
index d2ae1f821e0c..3ed26769810b 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -122,9 +122,12 @@ ENTRY(startup_32)
 	addl	%ebp, gdt+2(%ebp)
 	lgdt	gdt(%ebp)
 
-	/* Enable PAE mode */
+	/* Enable PAE and LA57 mode */
 	movl	%cr4, %eax
 	orl	$X86_CR4_PAE, %eax
+#ifdef CONFIG_X86_5LEVEL
+	orl	$X86_CR4_LA57, %eax
+#endif
 	movl	%eax, %cr4
 
 	/*
@@ -136,13 +139,24 @@ ENTRY(startup_32)
 	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
 	rep	stosl
 
+	xorl	%edx, %edx
+
+	/* Build Top Level */
+	leal	pgtable(%ebx,%edx,1), %edi
+	leal	0x1007 (%edi), %eax
+	movl	%eax, 0(%edi)
+
+#ifdef CONFIG_X86_5LEVEL
 	/* Build Level 4 */
-	leal	pgtable + 0(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007 (%edi), %eax
 	movl	%eax, 0(%edi)
+#endif
 
 	/* Build Level 3 */
-	leal	pgtable + 0x1000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	leal	0x1007(%edi), %eax
 	movl	$4, %ecx
 1:	movl	%eax, 0x00(%edi)
@@ -152,7 +166,8 @@ ENTRY(startup_32)
 	jnz	1b
 
 	/* Build Level 2 */
-	leal	pgtable + 0x2000(%ebx), %edi
+	addl	$0x1000, %edx
+	leal	pgtable(%ebx,%edx), %edi
 	movl	$0x0183, %eax
 	movl	$2048, %ecx
 1:	movl	%eax, 0(%edi)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 90f32116acd8..6cefd861ac65 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -917,7 +917,7 @@ extern pgd_t trampoline_pgd_entry;
 static inline void __meminit init_trampoline_default(void)
 {
/* Default trampoline pgd value */
-   trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)];
+   trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
 }
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
diff --git a/arch/x86/include/asm/pgtable_64.h 
b/arch/x86/include/asm/pgtable_64.h
index 9991224f6238..c9e41f1599dd 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -14,15 +14,17 @@
 #include 
 #include 
 
+extern p4d_t level4_kernel_pgt[512];
+extern p4d_t level4_ident_pgt[512];
 extern pud_t level3_kernel_pgt[512];
 extern pud_t level3_ident_pgt[512];
 extern pmd_t level2_kernel_pgt[512];
 extern pmd_t level2_fixmap_pgt[512];
 extern pmd_t level2_ident_pgt[512];
 extern pte_t level1_fixmap_pgt[512];
-extern pgd_t init_level4_pgt[];
+extern pgd_t init_top_pgt[];
 
-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir init_top_pgt
 
 extern void paging_init(void);
 
diff --git a/arch/x86/include/uapi/asm/processor-flags.h 
b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50a4c2a..185f3d10c194 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
 #define X86_CR4_OSFXSR _BITUL(X86_CR4_OSFXSR_BIT)
 #define X86_CR4_OSXMMEXCPT_BIT 10 /* enable unmasked SSE exceptions */
 #define X86_CR4_OSXMMEXCPT _BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_LA57_BIT   12 /* enable 5-level page tables */
+#define X86_CR4_LA57   _BITUL(X86_CR4_LA57_BIT)
 #define X86_CR4_VMXE_BIT   13 /* enable VMX virtualization */
 #define X86_CR4_VMXE   _BITUL(X86_CR4_VMXE_BIT)
 #define X86_CR4_SMXE_BIT   14 /* enable safer mode (TXT) */
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8e598a1ad986..6b91e2eb8d3f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -125,7 +125,7 @@ void __init init_espfix_bsp(void)
p4d_t *p4d;
 
/* Install the espfix pud into the kernel page directory */
-   pgd = &init_level4_pgt[pgd_index(

Re: Build regressions/improvements in v4.11-rc1

2017-03-06 Thread Geert Uytterhoeven
On Mon, Mar 6, 2017 at 2:59 PM, Geert Uytterhoeven  wrote:
> Below is the list of build error/warning regressions/improvements in
> v4.11-rc1[1] compared to v4.10[2].
>
> Summarized:
>   - build errors: +19/-1

> [1] 
> http://kisskb.ellerman.id.au/kisskb/head/c1ae3cfa0e89fa1a7ecc4c99031f5e9ae99d9201/
>  (all 266 configs)

> 19 error regressions:
>   + /home/kisskb/slave/src/arch/avr32/oprofile/backtrace.c: error: 
> dereferencing pointer to incomplete type:  => 58
>   + /home/kisskb/slave/src/arch/avr32/oprofile/backtrace.c: error: implicit 
> declaration of function 'user_mode':  => 60
>   + /home/kisskb/slave/src/arch/mips/cavium-octeon/cpu.c: error: implicit 
> declaration of function 'task_stack_page' 
> [-Werror=implicit-function-declaration]:  => 31:3
>   + /home/kisskb/slave/src/arch/mips/cavium-octeon/cpu.c: error: invalid 
> application of 'sizeof' to incomplete type 'struct pt_regs' :  => 31:3
>   + /home/kisskb/slave/src/arch/mips/cavium-octeon/crypto/octeon-crypto.c: 
> error: implicit declaration of function 'task_stack_page' 
> [-Werror=implicit-function-declaration]:  => 35:6
>   + /home/kisskb/slave/src/arch/mips/cavium-octeon/smp.c: error: implicit 
> declaration of function 'task_stack_page' 
> [-Werror=implicit-function-declaration]:  => 214:2
>   + /home/kisskb/slave/src/arch/mips/include/asm/fpu.h: error: invalid 
> application of 'sizeof' to incomplete type 'struct pt_regs' :  => 140:3, 
> 188:2, 138:3, 136:2
>   + /home/kisskb/slave/src/arch/mips/include/asm/processor.h: error: invalid 
> application of 'sizeof' to incomplete type 'struct pt_regs':  => 385:31
>   + /home/kisskb/slave/src/arch/mips/kernel/smp-mt.c: error: implicit 
> declaration of function 'task_stack_page' 
> [-Werror=implicit-function-declaration]:  => 215:2
>   + /home/kisskb/slave/src/arch/mips/sgi-ip27/ip27-berr.c: error: 
> dereferencing pointer to incomplete type:  => 59:17, 66:13
>   + /home/kisskb/slave/src/arch/mips/sgi-ip27/ip27-berr.c: error: implicit 
> declaration of function 'force_sig' [-Werror=implicit-function-declaration]:  
> => 75:2
>   + /home/kisskb/slave/src/arch/mips/sgi-ip32/ip32-berr.c: error: implicit 
> declaration of function 'force_sig' [-Werror=implicit-function-declaration]:  
> => 31:2
>   + /home/kisskb/slave/src/arch/openrisc/include/asm/atomic.h: Error: unknown 
> opcode2 `l.lwa'.:  => 70, 107, 69
>   + /home/kisskb/slave/src/arch/openrisc/include/asm/atomic.h: Error: unknown 
> opcode2 `l.swa'.:  => 72, 71, 111
>   + /home/kisskb/slave/src/arch/openrisc/include/asm/bitops/atomic.h: Error: 
> unknown opcode2 `l.lwa'.:  => 18, 35, 70, 90
>   + /home/kisskb/slave/src/arch/openrisc/include/asm/bitops/atomic.h: Error: 
> unknown opcode2 `l.swa'.:  => 20, 37, 92, 72
>   + /home/kisskb/slave/src/arch/openrisc/include/asm/cmpxchg.h: Error: 
> unknown opcode2 `l.lwa'.:  => 68, 30
>   + /home/kisskb/slave/src/arch/openrisc/include/asm/cmpxchg.h: Error: 
> unknown opcode2 `l.swa'.:  => 34, 69
>   + /home/kisskb/slave/src/drivers/char/nwbutton.c: error: implicit 
> declaration of function 'kill_cad_pid' 
> [-Werror=implicit-function-declaration]:  => 134:3

CC mingo ;-)

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[PATCHv4 22/33] x86/mm: define virtual memory map for 5-level paging

2017-03-06 Thread Kirill A. Shutemov
The first part of the memory map (up to the %esp fixup stacks) simply scales
the existing 4-level map by 9 bits -- the number of bits addressed by the
additional page table level.

The rest of the map is unchanged.
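
(Concretely, user space grows from 2^47 bytes = 128 TB with 4-level paging
to 2^56 bytes = 64 PB with 5-level paging, as the updated map below shows.)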

Signed-off-by: Kirill A. Shutemov 
---
 Documentation/x86/x86_64/mm.txt | 33 ++---
 arch/x86/Kconfig|  1 +
 arch/x86/include/asm/kasan.h|  9 ++---
 arch/x86/include/asm/page_64_types.h| 10 ++
 arch/x86/include/asm/pgtable_64_types.h |  6 ++
 arch/x86/include/asm/sparsemem.h|  9 +++--
 6 files changed, 60 insertions(+), 8 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 5724092db811..0303a47b82f8 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -4,7 +4,7 @@
 Virtual memory map with 4 level page tables:
 
 0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
-hole caused by [48:63] sign extension
+hole caused by [47:63] sign extension
 ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor
 ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
 ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
@@ -23,12 +23,39 @@ ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
 ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
 ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
 
+Virtual memory map with 5 level page tables:
+
+0000000000000000 - 00ffffffffffffff (=56 bits) user space, different per mm
+hole caused by [56:63] sign extension
+ff00000000000000 - ff0fffffffffffff (=52 bits) guard hole, reserved for hypervisor
+ff10000000000000 - ff8fffffffffffff (=55 bits) direct mapping of all phys. memory
+ff90000000000000 - ff91ffffffffffff (=49 bits) hole
+ff92000000000000 - ffd1ffffffffffff (=54 bits) vmalloc/ioremap space
+ffd2000000000000 - ffd3ffffffffffff (=49 bits) hole
+ffd4000000000000 - ffd5ffffffffffff (=49 bits) virtual memory map (512TB)
+... unused hole ...
+ffd8000000000000 - fff7ffffffffffff (=53 bits) kasan shadow memory (8PB)
+... unused hole ...
+fffffe0000000000 - fffffe7fffffffff (=39 bits) %esp fixup stacks
+... unused hole ...
+ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
+... unused hole ...
+ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
+ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
+ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
+
+The architecture defines a 64-bit virtual address. Implementations can
+support less. Currently supported are 48- and 57-bit virtual addresses.
+Bits 63 through to the most-significant implemented bit are set to either
+all ones or all zero. This causes a hole between user space and kernel
+addresses.
+
 The direct mapping covers all memory in the system up to the highest
 memory address (this means in some cases it can also include PCI memory
 holes).
 
-vmalloc space is lazily synchronized into the different PML4 pages of
-the processes using the page fault handler, with init_level4_pgt as
+vmalloc space is lazily synchronized into the different PML4/PML5 pages of
+the processes using the page fault handler, with init_top_pgt as
 reference.
 
 Current X86-64 implementations support up to 46 bits of address space (64 TB),
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..747f06f00a22 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -290,6 +290,7 @@ config ARCH_SUPPORTS_DEBUG_PAGEALLOC
 config KASAN_SHADOW_OFFSET
hex
depends on KASAN
+	default 0xdff8000000000000 if X86_5LEVEL
 	default 0xdffffc0000000000
 
 config HAVE_INTEL_TXT
diff --git a/arch/x86/include/asm/kasan.h b/arch/x86/include/asm/kasan.h
index 1410b567ecde..f527b02a0ee3 100644
--- a/arch/x86/include/asm/kasan.h
+++ b/arch/x86/include/asm/kasan.h
@@ -11,9 +11,12 @@
  * 'kernel address space start' >> KASAN_SHADOW_SCALE_SHIFT
  */
 #define KASAN_SHADOW_START  (KASAN_SHADOW_OFFSET + \
-					(0xffff800000000000ULL >> 3))
-/* 47 bits for kernel address -> (47 - 3) bits for shadow */
-#define KASAN_SHADOW_END	(KASAN_SHADOW_START + (1ULL << (47 - 3)))
+   ((-1UL << __VIRTUAL_MASK_SHIFT) >> 3))
+/*
+ * 47 bits for kernel address -> (47 - 3) bits for shadow
+ * 56 bits for kernel address -> (56 - 3) bits for shadow
+ */
+#define KASAN_SHADOW_END	(KASAN_SHADOW_START + (1ULL << (__VIRTUAL_MASK_SHIFT - 3)))
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/x86/include/asm/page_64_types.h 
b/arch/x86/include/asm/page_64_types.h
index 9215e0527647..3f5f08b010d0 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -36,7 +36,12 @@
  * hypervisor to fit.  Choosing 16 slots here is arbitrary, but it's
  * what Xen re

[PATCHv4 23/33] x86/paravirt: make paravirt code support 5-level paging

2017-03-06 Thread Kirill A. Shutemov
Add operations to allocate/release p4ds.

TODO: cover XEN.

Not-yet-Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/paravirt.h   | 44 +++
 arch/x86/include/asm/paravirt_types.h |  7 +-
 arch/x86/include/asm/pgalloc.h|  2 ++
 arch/x86/kernel/paravirt.c|  9 +--
 4 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 158d877ce9e9..677edf3b6421 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -357,6 +357,16 @@ static inline void paravirt_release_pud(unsigned long pfn)
PVOP_VCALL1(pv_mmu_ops.release_pud, pfn);
 }
 
+static inline void paravirt_alloc_p4d(struct mm_struct *mm, unsigned long pfn)
+{
+   PVOP_VCALL2(pv_mmu_ops.alloc_p4d, mm, pfn);
+}
+
+static inline void paravirt_release_p4d(unsigned long pfn)
+{
+   PVOP_VCALL1(pv_mmu_ops.release_p4d, pfn);
+}
+
 static inline void pte_update(struct mm_struct *mm, unsigned long addr,
  pte_t *ptep)
 {
@@ -582,14 +592,35 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
val);
 }
 
-static inline void p4d_clear(p4d_t *p4dp)
+#if CONFIG_PGTABLE_LEVELS >= 5
+
+static inline p4d_t __p4d(p4dval_t val)
 {
-   set_p4d(p4dp, __p4d(0));
+   p4dval_t ret;
+
+   if (sizeof(p4dval_t) > sizeof(long))
+   ret = PVOP_CALLEE2(p4dval_t, pv_mmu_ops.make_p4d,
+  val, (u64)val >> 32);
+   else
+   ret = PVOP_CALLEE1(p4dval_t, pv_mmu_ops.make_p4d,
+  val);
+
+   return (p4d_t) { ret };
 }
 
-#if CONFIG_PGTABLE_LEVELS >= 5
+static inline p4dval_t p4d_val(p4d_t p4d)
+{
+   p4dval_t ret;
+
+   if (sizeof(p4dval_t) > sizeof(long))
+   ret =  PVOP_CALLEE2(p4dval_t, pv_mmu_ops.p4d_val,
+   p4d.p4d, (u64)p4d.p4d >> 32);
+   else
+   ret =  PVOP_CALLEE1(p4dval_t, pv_mmu_ops.p4d_val,
+   p4d.p4d);
 
-#error FIXME
+   return ret;
+}
 
 static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
@@ -610,6 +641,11 @@ static inline void pgd_clear(pgd_t *pgdp)
 
 #endif  /* CONFIG_PGTABLE_LEVELS == 5 */
 
+static inline void p4d_clear(p4d_t *p4dp)
+{
+   set_p4d(p4dp, __p4d(0));
+}
+
 #endif /* CONFIG_PGTABLE_LEVELS == 4 */
 
 #endif /* CONFIG_PGTABLE_LEVELS >= 3 */
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 93c49cf09b63..7465d6fe336f 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -238,9 +238,11 @@ struct pv_mmu_ops {
void (*alloc_pte)(struct mm_struct *mm, unsigned long pfn);
void (*alloc_pmd)(struct mm_struct *mm, unsigned long pfn);
void (*alloc_pud)(struct mm_struct *mm, unsigned long pfn);
+   void (*alloc_p4d)(struct mm_struct *mm, unsigned long pfn);
void (*release_pte)(unsigned long pfn);
void (*release_pmd)(unsigned long pfn);
void (*release_pud)(unsigned long pfn);
+   void (*release_p4d)(unsigned long pfn);
 
/* Pagetable manipulation functions */
void (*set_pte)(pte_t *ptep, pte_t pteval);
@@ -286,7 +288,10 @@ struct pv_mmu_ops {
void (*set_p4d)(p4d_t *p4dp, p4d_t p4dval);
 
 #if CONFIG_PGTABLE_LEVELS >= 5
-#error FIXME
+   struct paravirt_callee_save p4d_val;
+   struct paravirt_callee_save make_p4d;
+
+   void (*set_pgd)(pgd_t *pgdp, pgd_t pgdval);
 #endif /* CONFIG_PGTABLE_LEVELS >= 5 */
 
 #endif /* CONFIG_PGTABLE_LEVELS >= 4 */
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index 2f585054c63c..b2d0cd8288aa 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -17,9 +17,11 @@ static inline void paravirt_alloc_pmd(struct mm_struct *mm, 
unsigned long pfn)   {
 static inline void paravirt_alloc_pmd_clone(unsigned long pfn, unsigned long 
clonepfn,
unsigned long start, unsigned long 
count) {}
 static inline void paravirt_alloc_pud(struct mm_struct *mm, unsigned long pfn) 
{}
+static inline void paravirt_alloc_p4d(struct mm_struct *mm, unsigned long pfn) 
{}
 static inline void paravirt_release_pte(unsigned long pfn) {}
 static inline void paravirt_release_pmd(unsigned long pfn) {}
 static inline void paravirt_release_pud(unsigned long pfn) {}
+static inline void paravirt_release_p4d(unsigned long pfn) {}
 #endif
 
 /*
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 110daf22f5c7..3586996fc50d 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -405,9 +405,11 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
.alloc_pte = paravirt_nop,
.alloc_pmd = paravirt_nop,
.alloc_pud = paravirt_nop,
+   .alloc_p4d = paravirt_nop,
.release_

[PATCHv4 01/33] x86/cpufeature: Add 5-level paging detection

2017-03-06 Thread Kirill A. Shutemov
Look for 'la57' in /proc/cpuinfo to see if your machine supports 5-level
paging.
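
For example, on a machine with LA57 support one would see (sample output,
for illustration):

  $ grep -o la57 /proc/cpuinfo | head -1
  la57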

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/cpufeatures.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index 4e7772387c6e..b04bb6dfed7f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -289,7 +289,8 @@
 #define X86_FEATURE_PKU(16*32+ 3) /* Protection Keys for 
Userspace */
 #define X86_FEATURE_OSPKE  (16*32+ 4) /* OS Protection Keys Enable */
 #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* POPCNT for vectors of DW/QW 
*/
-#define X86_FEATURE_RDPID  (16*32+ 22) /* RDPID instruction */
+#define X86_FEATURE_LA57   (16*32+16) /* 5-level page tables */
+#define X86_FEATURE_RDPID  (16*32+22) /* RDPID instruction */
 
 /* AMD-defined CPU features, CPUID level 0x8007 (ebx), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV (17*32+0) /* MCA overflow recovery support 
*/
-- 
2.11.0



[PATCHv4 30/33] x86/mm: make kernel_physical_mapping_init() support 5-level paging

2017-03-06 Thread Kirill A. Shutemov
Properly populate the additional page table level if CONFIG_X86_5LEVEL is
enabled.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/mm/init_64.c | 71 ---
 1 file changed, 62 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5ba99090dc3c..ef117a69f74e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -622,6 +622,58 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, 
unsigned long paddr_end,
return paddr_last;
 }
 
+static unsigned long __meminit
+phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
+ unsigned long page_size_mask)
+{
+   unsigned long paddr_next, paddr_last = paddr_end;
+   unsigned long vaddr = (unsigned long)__va(paddr);
+   int i = p4d_index(vaddr);
+
+   if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+   return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, 
page_size_mask);
+
+   for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+   p4d_t *p4d;
+   pud_t *pud;
+
+   vaddr = (unsigned long)__va(paddr);
+   p4d = p4d_page + p4d_index(vaddr);
+   paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+   if (paddr >= paddr_end) {
+   if (!after_bootmem &&
+   !e820_any_mapped(paddr & P4D_MASK, paddr_next,
+E820_RAM) &&
+   !e820_any_mapped(paddr & P4D_MASK, paddr_next,
+E820_RESERVED_KERN)) {
+   set_p4d(p4d, __p4d(0));
+   }
+   continue;
+   }
+
+   if (!p4d_none(*p4d)) {
+   pud = pud_offset(p4d, 0);
+   paddr_last = phys_pud_init(pud, paddr,
+   paddr_end,
+   page_size_mask);
+   __flush_tlb_all();
+   continue;
+   }
+
+   pud = alloc_low_page();
+   paddr_last = phys_pud_init(pud, paddr, paddr_end,
+  page_size_mask);
+
+   spin_lock(&init_mm.page_table_lock);
+   p4d_populate(&init_mm, p4d, pud);
+   spin_unlock(&init_mm.page_table_lock);
+   }
+   __flush_tlb_all();
+
+   return paddr_last;
+}
+
 /*
  * Create page table mapping for the physical memory for specific physical
  * addresses. The virtual and physical addresses have to be aligned on PMD 
level
@@ -643,26 +695,27 @@ kernel_physical_mapping_init(unsigned long paddr_start,
for (; vaddr < vaddr_end; vaddr = vaddr_next) {
pgd_t *pgd = pgd_offset_k(vaddr);
p4d_t *p4d;
-   pud_t *pud;
 
vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE;
 
-   BUILD_BUG_ON(pgd_none(*pgd));
-   p4d = p4d_offset(pgd, vaddr);
-   if (p4d_val(*p4d)) {
-   pud = (pud_t *)p4d_page_vaddr(*p4d);
-   paddr_last = phys_pud_init(pud, __pa(vaddr),
+   if (pgd_val(*pgd)) {
+   p4d = (p4d_t *)pgd_page_vaddr(*pgd);
+   paddr_last = phys_p4d_init(p4d, __pa(vaddr),
   __pa(vaddr_end),
   page_size_mask);
continue;
}
 
-   pud = alloc_low_page();
-   paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
+   p4d = alloc_low_page();
+   paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
   page_size_mask);
 
spin_lock(&init_mm.page_table_lock);
-   p4d_populate(&init_mm, p4d, pud);
+   if (IS_ENABLED(CONFIG_X86_5LEVEL))
+   pgd_populate(&init_mm, pgd, p4d);
+   else
+   p4d_populate(&init_mm, p4d_offset(pgd, vaddr),
+   (pud_t *) p4d);
spin_unlock(&init_mm.page_table_lock);
pgd_changed = true;
}
-- 
2.11.0



[PATCHv4 05/33] asm-generic: introduce <asm-generic/pgtable-nop4d.h>

2017-03-06 Thread Kirill A. Shutemov
Like pgtable-nopud.h for 4-level paging, this new header is the base for
converting architectures to a properly folded p4d_t level.

Signed-off-by: Kirill A. Shutemov 
---
 include/asm-generic/pgtable-nop4d.h | 56 +
 include/asm-generic/pgtable-nopud.h | 43 ++--
 include/asm-generic/tlb.h   | 14 --
 3 files changed, 89 insertions(+), 24 deletions(-)
 create mode 100644 include/asm-generic/pgtable-nop4d.h

diff --git a/include/asm-generic/pgtable-nop4d.h 
b/include/asm-generic/pgtable-nop4d.h
new file mode 100644
index ..de364ecb8df6
--- /dev/null
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -0,0 +1,56 @@
+#ifndef _PGTABLE_NOP4D_H
+#define _PGTABLE_NOP4D_H
+
+#ifndef __ASSEMBLY__
+
+#define __PAGETABLE_P4D_FOLDED
+
+typedef struct { pgd_t pgd; } p4d_t;
+
+#define P4D_SHIFT  PGDIR_SHIFT
+#define PTRS_PER_P4D   1
+#define P4D_SIZE   (1UL << P4D_SHIFT)
+#define P4D_MASK   (~(P4D_SIZE-1))
+
+/*
+ * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * setup: the p4d is never bad, and a p4d always exists (as it's folded
+ * into the pgd entry)
+ */
+static inline int pgd_none(pgd_t pgd)  { return 0; }
+static inline int pgd_bad(pgd_t pgd)   { return 0; }
+static inline int pgd_present(pgd_t pgd)   { return 1; }
+static inline void pgd_clear(pgd_t *pgd)   { }
+#define p4d_ERROR(p4d) (pgd_ERROR((p4d).pgd))
+
+#define pgd_populate(mm, pgd, p4d) do { } while (0)
+/*
+ * (p4ds are folded into pgds so this doesn't get actually called,
+ * but the define is needed for a generic inline function.)
+ */
+#define set_pgd(pgdptr, pgdval)	set_p4d((p4d_t *)(pgdptr), (p4d_t) { pgdval })
+
+static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
+{
+   return (p4d_t *)pgd;
+}
+
+#define p4d_val(x) (pgd_val((x).pgd))
+#define __p4d(x)   ((p4d_t) { __pgd(x) })
+
+#define pgd_page(pgd)  (p4d_page((p4d_t){ pgd }))
+#define pgd_page_vaddr(pgd)(p4d_page_vaddr((p4d_t){ pgd }))
+
+/*
+ * allocating and freeing a p4d is trivial: the 1-entry p4d is
+ * inside the pgd, so has no extra memory associated with it.
+ */
+#define p4d_alloc_one(mm, address) NULL
+#define p4d_free(mm, x)do { } while (0)
+#define __p4d_free_tlb(tlb, x, a)  do { } while (0)
+
+#undef  p4d_addr_end
+#define p4d_addr_end(addr, end)(end)
+
+#endif /* __ASSEMBLY__ */
+#endif /* _PGTABLE_NOP4D_H */
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index 5e49430a30a4..c2b9b96d6268 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -6,53 +6,54 @@
 #ifdef __ARCH_USE_5LEVEL_HACK
 #include 
 #else
+#include 
 
 #define __PAGETABLE_PUD_FOLDED
 
 /*
- * Having the pud type consist of a pgd gets the size right, and allows
- * us to conceptually access the pgd entry that this pud is folded into
+ * Having the pud type consist of a p4d gets the size right, and allows
+ * us to conceptually access the p4d entry that this pud is folded into
  * without casting.
  */
-typedef struct { pgd_t pgd; } pud_t;
+typedef struct { p4d_t p4d; } pud_t;
 
-#define PUD_SHIFT  PGDIR_SHIFT
+#define PUD_SHIFT  P4D_SHIFT
 #define PTRS_PER_PUD   1
 #define PUD_SIZE   (1UL << PUD_SHIFT)
 #define PUD_MASK   (~(PUD_SIZE-1))
 
 /*
- * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * The "p4d_xxx()" functions here are trivial for a folded two-level
  * setup: the pud is never bad, and a pud always exists (as it's folded
- * into the pgd entry)
+ * into the p4d entry)
  */
-static inline int pgd_none(pgd_t pgd)  { return 0; }
-static inline int pgd_bad(pgd_t pgd)   { return 0; }
-static inline int pgd_present(pgd_t pgd)   { return 1; }
-static inline void pgd_clear(pgd_t *pgd)   { }
-#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
+static inline int p4d_none(p4d_t p4d)  { return 0; }
+static inline int p4d_bad(p4d_t p4d)   { return 0; }
+static inline int p4d_present(p4d_t p4d)   { return 1; }
+static inline void p4d_clear(p4d_t *p4d)   { }
+#define pud_ERROR(pud) (p4d_ERROR((pud).p4d))
 
-#define pgd_populate(mm, pgd, pud) do { } while (0)
+#define p4d_populate(mm, p4d, pud) do { } while (0)
 /*
- * (puds are folded into pgds so this doesn't get actually called,
+ * (puds are folded into p4ds so this doesn't get actually called,
  * but the define is needed for a generic inline function.)
  */
-#define set_pgd(pgdptr, pgdval)	set_pud((pud_t *)(pgdptr), (pud_t) { pgdval })
+#define set_p4d(p4dptr, p4dval)	set_pud((pud_t *)(p4dptr), (pud_t) { p4dval })
 
-static inline 

[PATCHv4 27/33] x86/espfix: support 5-level paging

2017-03-06 Thread Kirill A. Shutemov
We don't need extra virtual address space for ESPFIX, so it stays within
one PUD page table for both 4- and 5-level paging.
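
(For reference, the arithmetic: P4D_SHIFT is 39 under both 4-level
paging, where the p4d level folds into the pgd, and 5-level paging, so
ESPFIX_PAGE_SPACE below is 1 << (39 - 12 - 16) = 2048 pages; assuming
the usual 64-byte ESPFIX_STACK_SIZE, that is 64 stacks per page, or
64 * 2048 = 131072 CPUs in total.)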

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/kernel/espfix_64.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 04f89caef9c4..8e598a1ad986 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -50,11 +50,11 @@
 #define ESPFIX_STACKS_PER_PAGE (PAGE_SIZE/ESPFIX_STACK_SIZE)
 
 /* There is address space for how many espfix pages? */
-#define ESPFIX_PAGE_SPACE  (1UL << (PGDIR_SHIFT-PAGE_SHIFT-16))
+#define ESPFIX_PAGE_SPACE  (1UL << (P4D_SHIFT-PAGE_SHIFT-16))
 
 #define ESPFIX_MAX_CPUS		(ESPFIX_STACKS_PER_PAGE * ESPFIX_PAGE_SPACE)
 #if CONFIG_NR_CPUS > ESPFIX_MAX_CPUS
-# error "Need more than one PGD for the ESPFIX hack"
+# error "Need more virtual address space for the ESPFIX hack"
 #endif
 
 #define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
@@ -121,11 +121,13 @@ static void init_espfix_random(void)
 
 void __init init_espfix_bsp(void)
 {
-   pgd_t *pgd_p;
+   pgd_t *pgd;
+   p4d_t *p4d;
 
/* Install the espfix pud into the kernel page directory */
-   pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
-   pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);
+   pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+   p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
+   p4d_populate(&init_mm, p4d, espfix_pud_page);
 
/* Randomize the locations */
init_espfix_random();
-- 
2.11.0



[PATCHv4 21/33] x86/asm: remove __VIRTUAL_MASK_SHIFT==47 assert

2017-03-06 Thread Kirill A. Shutemov
We don't need it anymore. 17be0aec74fb ("x86/asm/entry/64: Implement
better check for canonical addresses") made the canonical address check
generic with respect to the address width.
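
For reference, a minimal C model of the check the shl/sar pair in the
diff below performs (a sketch; vaddr_bits stands in for
__VIRTUAL_MASK_SHIFT + 1):

static inline bool is_canonical(unsigned long addr, int vaddr_bits)
{
	/* replicate the top implemented bit into the upper bits */
	long sext = (long)(addr << (64 - vaddr_bits)) >> (64 - vaddr_bits);

	return (unsigned long)sext == addr;
}

SYSRET is only taken when the return address survives this
sign-extension unchanged.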

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/entry/entry_64.S | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 044d18ebc43c..f07b4efb34d5 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -265,12 +265,9 @@ return_from_SYSCALL_64:
 *
 * If width of "canonical tail" ever becomes variable, this will need
 * to be updated to remain correct on both old and new CPUs.
+*
+* Change top 16 bits to be the sign-extension of 47th bit
 */
-   .ifne __VIRTUAL_MASK_SHIFT - 47
-   .error "virtual address width changed -- SYSRET checks need update"
-   .endif
-
-   /* Change top 16 bits to be the sign-extension of 47th bit */
shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 
-- 
2.11.0



[PATCHv4 14/33] x86/kexec: support p4d_t

2017-03-06 Thread Kirill A. Shutemov
Handle additional page table level in kexec code.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/kexec.h   |  1 +
 arch/x86/kernel/machine_kexec_32.c |  4 +++-
 arch/x86/kernel/machine_kexec_64.c | 14 --
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 282630e4c6ea..70ef205489f0 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -164,6 +164,7 @@ struct kimage_arch {
 };
 #else
 struct kimage_arch {
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
index 469b23d6acc2..5f43cec296c5 100644
--- a/arch/x86/kernel/machine_kexec_32.c
+++ b/arch/x86/kernel/machine_kexec_32.c
@@ -103,6 +103,7 @@ static void machine_kexec_page_table_set_one(
pgd_t *pgd, pmd_t *pmd, pte_t *pte,
unsigned long vaddr, unsigned long paddr)
 {
+   p4d_t *p4d;
pud_t *pud;
 
pgd += pgd_index(vaddr);
@@ -110,7 +111,8 @@ static void machine_kexec_page_table_set_one(
if (!(pgd_val(*pgd) & _PAGE_PRESENT))
set_pgd(pgd, __pgd(__pa(pmd) | _PAGE_PRESENT));
 #endif
-   pud = pud_offset(pgd, vaddr);
+   p4d = p4d_offset(pgd, vaddr);
+   pud = pud_offset(p4d, vaddr);
pmd = pmd_offset(pud, vaddr);
if (!(pmd_val(*pmd) & _PAGE_PRESENT))
set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 307b1f4543de..42eae96c8450 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -36,6 +36,7 @@ static struct kexec_file_ops *kexec_file_loaders[] = {
 
 static void free_transition_pgtable(struct kimage *image)
 {
+   free_page((unsigned long)image->arch.p4d);
free_page((unsigned long)image->arch.pud);
free_page((unsigned long)image->arch.pmd);
free_page((unsigned long)image->arch.pte);
@@ -43,6 +44,7 @@ static void free_transition_pgtable(struct kimage *image)
 
 static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 {
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -53,13 +55,21 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE);
pgd += pgd_index(vaddr);
if (!pgd_present(*pgd)) {
+   p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);
+   if (!p4d)
+   goto err;
+   image->arch.p4d = p4d;
+   set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+   }
+   p4d = p4d_offset(pgd, vaddr);
+   if (!p4d_present(*p4d)) {
pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
if (!pud)
goto err;
image->arch.pud = pud;
-   set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
+   set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
}
-   pud = pud_offset(pgd, vaddr);
+   pud = pud_offset(p4d, vaddr);
if (!pud_present(*pud)) {
pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
if (!pmd)
-- 
2.11.0



[PATCHv4 20/33] x86: detect 5-level paging support

2017-03-06 Thread Kirill A. Shutemov
5-level paging support is required from the hardware when the kernel is
compiled with CONFIG_X86_5LEVEL=y. We may implement runtime switching
support later.
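
For reference, LA57 is enumerated as bit 16 of ECX in CPUID leaf 7,
subleaf 0. A user-space sketch of the same check (assuming a compiler
that provides __get_cpuid_count() in <cpuid.h>):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid_count(0x7, 0, &eax, &ebx, &ecx, &edx))
		printf("la57: %s\n", (ecx & (1U << 16)) ? "yes" : "no");
	return 0;
}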

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/boot/cpucheck.c |  9 +
 arch/x86/boot/cpuflags.c | 12 ++--
 arch/x86/include/asm/disabled-features.h |  8 +++-
 arch/x86/include/asm/required-features.h |  8 +++-
 4 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/cpucheck.c b/arch/x86/boot/cpucheck.c
index 4ad7d70e8739..8f0c4c9fc904 100644
--- a/arch/x86/boot/cpucheck.c
+++ b/arch/x86/boot/cpucheck.c
@@ -44,6 +44,15 @@ static const u32 req_flags[NCAPINTS] =
0, /* REQUIRED_MASK5 not implemented in this file */
REQUIRED_MASK6,
0, /* REQUIRED_MASK7 not implemented in this file */
+   0, /* REQUIRED_MASK8 not implemented in this file */
+   0, /* REQUIRED_MASK9 not implemented in this file */
+   0, /* REQUIRED_MASK10 not implemented in this file */
+   0, /* REQUIRED_MASK11 not implemented in this file */
+   0, /* REQUIRED_MASK12 not implemented in this file */
+   0, /* REQUIRED_MASK13 not implemented in this file */
+   0, /* REQUIRED_MASK14 not implemented in this file */
+   0, /* REQUIRED_MASK15 not implemented in this file */
+   REQUIRED_MASK16,
 };
 
 #define A32(a, b, c, d) (((d) << 24)+((c) << 16)+((b) << 8)+(a))
diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c
index 6687ab953257..9e77c23c2422 100644
--- a/arch/x86/boot/cpuflags.c
+++ b/arch/x86/boot/cpuflags.c
@@ -70,16 +70,19 @@ int has_eflag(unsigned long mask)
 # define EBX_REG "=b"
 #endif
 
-static inline void cpuid(u32 id, u32 *a, u32 *b, u32 *c, u32 *d)
+static inline void cpuid_count(u32 id, u32 count,
+   u32 *a, u32 *b, u32 *c, u32 *d)
 {
asm volatile(".ifnc %%ebx,%3 ; movl  %%ebx,%3 ; .endif  \n\t"
 "cpuid \n\t"
 ".ifnc %%ebx,%3 ; xchgl %%ebx,%3 ; .endif  \n\t"
: "=a" (*a), "=c" (*c), "=d" (*d), EBX_REG (*b)
-   : "a" (id)
+   : "a" (id), "c" (count)
);
 }
 
+#define cpuid(id, a, b, c, d) cpuid_count(id, 0, a, b, c, d)
+
 void get_cpuflags(void)
 {
u32 max_intel_level, max_amd_level;
@@ -108,6 +111,11 @@ void get_cpuflags(void)
cpu.model += ((tfms >> 16) & 0xf) << 4;
}
 
+   if (max_intel_level >= 0x0007) {
+   cpuid_count(0x0007, 0, &ignored, &ignored,
+   &cpu.flags[16], &ignored);
+   }
+
cpuid(0x8000, &max_amd_level, &ignored, &ignored,
  &ignored);
 
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 85599ad4d024..fc0960236fc3 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -36,6 +36,12 @@
 # define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31))
 #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
+#ifdef CONFIG_X86_5LEVEL
+#define DISABLE_LA57   0
+#else
+#define DISABLE_LA57   (1<<(X86_FEATURE_LA57 & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -55,7 +61,7 @@
 #define DISABLED_MASK130
 #define DISABLED_MASK140
 #define DISABLED_MASK150
-#define DISABLED_MASK16(DISABLE_PKU|DISABLE_OSPKE)
+#define DISABLED_MASK16(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57)
 #define DISABLED_MASK170
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
 
diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h
index fac9a5c0abe9..d91ba04dd007 100644
--- a/arch/x86/include/asm/required-features.h
+++ b/arch/x86/include/asm/required-features.h
@@ -53,6 +53,12 @@
 # define NEED_MOVBE0
 #endif
 
+#ifdef CONFIG_X86_5LEVEL
+# define NEED_LA57 (1<<(X86_FEATURE_LA57 & 31))
+#else
+# define NEED_LA57 0
+#endif
+
 #ifdef CONFIG_X86_64
 #ifdef CONFIG_PARAVIRT
 /* Paravirtualized systems may not have PSE or PGE available */
@@ -98,7 +104,7 @@
 #define REQUIRED_MASK130
 #define REQUIRED_MASK140
 #define REQUIRED_MASK150
-#define REQUIRED_MASK160
+#define REQUIRED_MASK16(NEED_LA57)
 #define REQUIRED_MASK170
 #define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)
 
-- 
2.11.0



[PATCHv4 18/33] x86/xen: convert __xen_pgd_walk() and xen_cleanmfnmap() to support p4d

2017-03-06 Thread Kirill A. Shutemov
Split these helpers into a few per-level functions and add p4d support.

Signed-off-by: Xiong Zhang 
[kirill.shute...@linux.intel.com: split off into separate patch]
Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/xen/mmu.c | 243 -
 arch/x86/xen/mmu.h |   1 +
 2 files changed, 148 insertions(+), 96 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 37cb5aad71de..75af8da7b54f 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -593,6 +593,62 @@ static void xen_set_pgd(pgd_t *ptr, pgd_t val)
 }
 #endif /* CONFIG_PGTABLE_LEVELS == 4 */
 
+static int xen_pmd_walk(struct mm_struct *mm, pmd_t *pmd,
+   int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
+   bool last, unsigned long limit)
+{
+   int i, nr, flush = 0;
+
+   nr = last ? pmd_index(limit) + 1 : PTRS_PER_PMD;
+   for (i = 0; i < nr; i++) {
+   if (!pmd_none(pmd[i]))
+   flush |= (*func)(mm, pmd_page(pmd[i]), PT_PTE);
+   }
+   return flush;
+}
+
+static int xen_pud_walk(struct mm_struct *mm, pud_t *pud,
+   int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
+   bool last, unsigned long limit)
+{
+   int i, nr, flush = 0;
+
+   nr = last ? pud_index(limit) + 1 : PTRS_PER_PUD;
+   for (i = 0; i < nr; i++) {
+   pmd_t *pmd;
+
+   if (pud_none(pud[i]))
+   continue;
+
+   pmd = pmd_offset(&pud[i], 0);
+   if (PTRS_PER_PMD > 1)
+   flush |= (*func)(mm, virt_to_page(pmd), PT_PMD);
+   xen_pmd_walk(mm, pmd, func, last && i == nr - 1, limit);
+   }
+   return flush;
+}
+
+static int xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
+   int (*func)(struct mm_struct *mm, struct page *, enum pt_level),
+   bool last, unsigned long limit)
+{
+   int i, nr, flush = 0;
+
+   nr = last ? p4d_index(limit) + 1 : PTRS_PER_P4D;
+   for (i = 0; i < nr; i++) {
+   pud_t *pud;
+
+   if (p4d_none(p4d[i]))
+   continue;
+
+   pud = pud_offset(&p4d[i], 0);
+   if (PTRS_PER_PUD > 1)
+   flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
+   xen_pud_walk(mm, pud, func, last && i == nr - 1, limit);
+   }
+   return flush;
+}
+
 /*
  * (Yet another) pagetable walker.  This one is intended for pinning a
  * pagetable.  This means that it walks a pagetable and calls the
@@ -613,10 +669,8 @@ static int __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd,
  enum pt_level),
  unsigned long limit)
 {
-   int flush = 0;
+   int i, nr, flush = 0;
unsigned hole_low, hole_high;
-   unsigned pgdidx_limit, pudidx_limit, pmdidx_limit;
-   unsigned pgdidx, pudidx, pmdidx;
 
/* The limit is the last byte to be touched */
limit--;
@@ -633,65 +687,22 @@ static int __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd,
hole_low = pgd_index(USER_LIMIT);
hole_high = pgd_index(PAGE_OFFSET);
 
-   pgdidx_limit = pgd_index(limit);
-#if PTRS_PER_PUD > 1
-   pudidx_limit = pud_index(limit);
-#else
-   pudidx_limit = 0;
-#endif
-#if PTRS_PER_PMD > 1
-   pmdidx_limit = pmd_index(limit);
-#else
-   pmdidx_limit = 0;
-#endif
-
-   for (pgdidx = 0; pgdidx <= pgdidx_limit; pgdidx++) {
-   pud_t *pud;
+   nr = pgd_index(limit) + 1;
+   for (i = 0; i < nr; i++) {
+   p4d_t *p4d;
 
-   if (pgdidx >= hole_low && pgdidx < hole_high)
+   if (i >= hole_low && i < hole_high)
continue;
 
-   if (!pgd_val(pgd[pgdidx]))
+   if (pgd_none(pgd[i]))
continue;
 
-   pud = pud_offset(&pgd[pgdidx], 0);
-
-   if (PTRS_PER_PUD > 1) /* not folded */
-   flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
-
-   for (pudidx = 0; pudidx < PTRS_PER_PUD; pudidx++) {
-   pmd_t *pmd;
-
-   if (pgdidx == pgdidx_limit &&
-   pudidx > pudidx_limit)
-   goto out;
-
-   if (pud_none(pud[pudidx]))
-   continue;
-
-   pmd = pmd_offset(&pud[pudidx], 0);
-
-   if (PTRS_PER_PMD > 1) /* not folded */
-   flush |= (*func)(mm, virt_to_page(pmd), PT_PMD);
-
-   for (pmdidx = 0; pmdidx < PTRS_PER_PMD; pmdidx++) {
-   struct page *pte;
-
-   if (pgdidx == pgdidx_limit &&
-   pudidx == pudidx_limit &&
-   pmdidx > pmdidx_limit)
-

[PATCHv4 00/33] 5-level paging

2017-03-06 Thread Kirill A. Shutemov
Here is v4 of 5-level paging patchset. Please review and consider applying.

== Overview ==

x86-64 is currently limited to 256 TiB of virtual address space and 64 TiB
of physical address space. We are already bumping into this limit: some
vendors offer servers with 64 TiB of memory today.

To overcome the limitation upcoming hardware will introduce support for
5-level paging[1]. It is a straight-forward extension of the current page
table structure adding one more layer of translation.

It bumps the limits to 128 PiB of virtual address space and 4 PiB of
physical address space. This "ought to be enough for anybody" ©.
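
(The arithmetic: each page table level translates 9 bits of the
virtual address, so five levels plus the 12-bit page offset give
5 * 9 + 12 = 57 bits, i.e. 2^57 = 128 PiB of virtual address space;
LA57 also grows the physical address width to 52 bits, i.e.
2^52 = 4 PiB.)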

==  Patches ==

The patchset is built on top of v4.11-rc1.

Current QEMU upstream git supports 5-level paging. Use "-cpu qemu64,+la57"
to enable it.

Patch 1:
Detect la57 feature for /proc/cpuinfo.

Patches 2-7:
Bring 5-level paging to generic code and convert all
architectures to it using <asm-generic/5level-fixup.h>.

Patches 8-19:
Convert x86 to a properly folded p4d layer using
<asm-generic/pgtable-nop4d.h>.

Patches 20-32:
Enabling of real 5-level paging.

CONFIG_X86_5LEVEL=y will enable new paging mode.

Patch 33:
Allow userspace to opt in to mappings above 47 bits by passing an
mmap() hint address above the boundary; the PR_SET_MAX_VADDR and
PR_GET_MAX_VADDR prctl(2) handles from v3 were dropped (a sketch
follows below).

This aims to address the compatibility issue. Only x86 is supported for
now, but it should be useful for other architectures.
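
A minimal user-space sketch of the opt-in (an illustration, assuming
the kernel only hands out addresses above the 47-bit boundary when the
hint itself lies above it):

#include <stddef.h>
#include <sys/mman.h>

/* Ask for an anonymous mapping above the legacy 47-bit boundary. */
static void *map_high(size_t size)
{
	void *hint = (void *)(1UL << 47);

	return mmap(hint, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}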

Git:
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git la57/v4

== TODO ==

There is still work to do:

  - CONFIG_XEN is broken for 5-level paging.

Xen for 5-level paging requires more work to get functional.

Xen on 4-level paging works, so it's not a regression.

  - Boot-time switch between 4- and 5-level paging.

We assume that distributions will be keen to avoid returning to the
i386 days where we shipped one kernel binary for each page table
layout.

As the page table format is the same for 4- and 5-level paging, it
should be possible to have a single kernel binary and switch between
them at boot time without too much hassle.

For now I have only implemented the compile-time switch.

This will be implemented in a separate patchset.

== Changelog ==

  v4:
- Rebased to v4.11-rc1;
- Use mmap() hint address to allocate virtual address space above
  47 bits instead of prctl() handles.
  v3:
- Rebased to v4.10-rc5;
- prctl() handles for large address space opt-in;
- Xen works for 4-level paging;
- EFI boot fixed for both 4- and 5-level paging;
- Hibernation fixed for 4-level paging;
- kexec() fixed;
- Couple of build fixes;
  v2:
- Rebased to v4.10-rc1;
- RLIMIT_VADDR proposal;
- Fix virtual map and update documentation;
- Fix few build errors;
- Rework cpuid helpers in boot code;
- Fix espfix code to work with 5-level pages;

[1] 
https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf
Kirill A. Shutemov (33):
  x86/cpufeature: Add 5-level paging detection
  asm-generic: introduce 5level-fixup.h
  asm-generic: introduce __ARCH_USE_5LEVEL_HACK
  arch, mm: convert all architectures to use 5level-fixup.h
  asm-generic: introduce <asm-generic/pgtable-nop4d.h>
  mm: convert generic code to 5-level paging
  mm: introduce __p4d_alloc()
  x86: basic changes into headers for 5-level paging
  x86: trivial portion of 5-level paging conversion
  x86/gup: add 5-level paging support
  x86/ident_map: add 5-level paging support
  x86/mm: add support of p4d_t in vmalloc_fault()
  x86/power: support p4d_t in hibernate code
  x86/kexec: support p4d_t
  x86/efi: handle p4d in EFI pagetables
  x86/mm/pat: handle additional page table
  x86/kasan: prepare clear_pgds() to switch to <asm-generic/pgtable-nop4d.h>
  x86/xen: convert __xen_pgd_walk() and xen_cleanmfnmap() to support p4d
  x86: convert the rest of the code to support p4d_t
  x86: detect 5-level paging support
  x86/asm: remove __VIRTUAL_MASK_SHIFT==47 assert
  x86/mm: define virtual memory map for 5-level paging
  x86/paravirt: make paravirt code support 5-level paging
  x86/mm: basic defines/helpers for CONFIG_X86_5LEVEL
  x86/dump_pagetables: support 5-level paging
  x86/kasan: extend to support 5-level paging
  x86/espfix: support 5-level paging
  x86/mm: add support of additional page table level during early boot
  x86/mm: add sync_global_pgds() for configuration with 5-level paging
  x86/mm: make kernel_physical_mapping_init() support 5-level paging
  x86/mm: add support for 5-level paging for KASLR
  x86: enable 5-level paging support
  x86/mm: allow to have userspace mappings above 47-bits

 Documentation/x86/x86_64/mm.txt  |  33 +-
 arch/arc/include/asm/hugepage.h  |   1 +
 arch/arc/include/asm/pgtable.h   |   1 +
 arch/arm/include/asm/pgtable.h   |   1 +
 arch/arm64/include/asm/pgtable-types.h   |   4 +
 arch/avr32/include/asm/pgtable-2level.h  |   1 +
 arch/cris/include/asm/pgtable.h  |   1 +
 arch/frv/include/asm/pgtable.h   

[PATCHv4 09/33] x86: trivial portion of 5-level paging conversion

2017-03-06 Thread Kirill A. Shutemov
This patch covers simple cases only.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/kernel/tboot.c|  6 +-
 arch/x86/kernel/vm86_32.c  |  6 +-
 arch/x86/mm/fault.c| 39 +--
 arch/x86/mm/init_32.c  | 22 --
 arch/x86/mm/ioremap.c  |  3 ++-
 arch/x86/mm/pgtable.c  |  4 +++-
 arch/x86/mm/pgtable_32.c   |  8 +++-
 arch/x86/platform/efi/efi_64.c | 13 +
 arch/x86/power/hibernate_32.c  |  7 +--
 9 files changed, 85 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index b868fa1b812b..5db0f33cbf2c 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -118,12 +118,16 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
  pgprot_t prot)
 {
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
 
pgd = pgd_offset(&tboot_mm, vaddr);
-   pud = pud_alloc(&tboot_mm, pgd, vaddr);
+   p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
+   if (!p4d)
+   return -1;
+   pud = pud_alloc(&tboot_mm, p4d, vaddr);
if (!pud)
return -1;
pmd = pmd_alloc(&tboot_mm, pud, vaddr);
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 23ee89ce59a9..62597c300d94 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -164,6 +164,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
struct vm_area_struct *vma;
spinlock_t *ptl;
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -173,7 +174,10 @@ static void mark_screen_rdonly(struct mm_struct *mm)
pgd = pgd_offset(mm, 0xA);
if (pgd_none_or_clear_bad(pgd))
goto out;
-   pud = pud_offset(pgd, 0xA);
+   p4d = p4d_offset(pgd, 0xA);
+   if (p4d_none_or_clear_bad(p4d))
+   goto out;
+   pud = pud_offset(p4d, 0xA);
if (pud_none_or_clear_bad(pud))
goto out;
pmd = pmd_offset(pud, 0xA);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 428e31763cb9..605fd5e8e048 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -253,6 +253,7 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
 {
unsigned index = pgd_index(address);
pgd_t *pgd_k;
+   p4d_t *p4d, *p4d_k;
pud_t *pud, *pud_k;
pmd_t *pmd, *pmd_k;
 
@@ -265,10 +266,15 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
/*
 * set_pgd(pgd, *pgd_k); here would be useless on PAE
 * and redundant with the set_pmd() on non-PAE. As would
-* set_pud.
+* set_p4d/set_pud.
 */
-   pud = pud_offset(pgd, address);
-   pud_k = pud_offset(pgd_k, address);
+   p4d = p4d_offset(pgd, address);
+   p4d_k = p4d_offset(pgd_k, address);
+   if (!p4d_present(*p4d_k))
+   return NULL;
+
+   pud = pud_offset(p4d, address);
+   pud_k = pud_offset(p4d_k, address);
if (!pud_present(*pud_k))
return NULL;
 
@@ -384,6 +390,8 @@ static void dump_pagetable(unsigned long address)
 {
pgd_t *base = __va(read_cr3());
pgd_t *pgd = &base[pgd_index(address)];
+   p4d_t *p4d;
+   pud_t *pud;
pmd_t *pmd;
pte_t *pte;
 
@@ -392,7 +400,9 @@ static void dump_pagetable(unsigned long address)
if (!low_pfn(pgd_val(*pgd) >> PAGE_SHIFT) || !pgd_present(*pgd))
goto out;
 #endif
-   pmd = pmd_offset(pud_offset(pgd, address), address);
+   p4d = p4d_offset(pgd, address);
+   pud = pud_offset(p4d, address);
+   pmd = pmd_offset(pud, address);
printk(KERN_CONT "*pde = %0*Lx ", sizeof(*pmd) * 2, (u64)pmd_val(*pmd));
 
/*
@@ -526,6 +536,7 @@ static void dump_pagetable(unsigned long address)
 {
pgd_t *base = __va(read_cr3() & PHYSICAL_PAGE_MASK);
pgd_t *pgd = base + pgd_index(address);
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -538,7 +549,15 @@ static void dump_pagetable(unsigned long address)
if (!pgd_present(*pgd))
goto out;
 
-   pud = pud_offset(pgd, address);
+   p4d = p4d_offset(pgd, address);
+   if (bad_address(p4d))
+   goto bad;
+
+   printk("P4D %lx ", p4d_val(*p4d));
+   if (!p4d_present(*p4d) || p4d_large(*p4d))
+   goto out;
+
+   pud = pud_offset(p4d, address);
if (bad_address(pud))
goto bad;
 
@@ -1082,6 +1101,7 @@ static noinline int
 spurious_fault(unsigned long error_code, unsigned long address)
 {
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -1104,7 +1124,14 @@ spurious_fault(unsigned long error_code, unsigned long address)
  

[PATCHv4 08/33] x86: basic changes into headers for 5-level paging

2017-03-06 Thread Kirill A. Shutemov
This patch extends x86 headers to enable 5-level paging support.

It's still based on <asm-generic/5level-fixup.h>. We will get to the
point where we can have <asm-generic/pgtable-nop4d.h> later.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/pgtable-2level_types.h |  1 +
 arch/x86/include/asm/pgtable-3level_types.h |  1 +
 arch/x86/include/asm/pgtable.h  | 26 -
 arch/x86/include/asm/pgtable_64_types.h |  1 +
 arch/x86/include/asm/pgtable_types.h| 30 -
 5 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-2level_types.h b/arch/x86/include/asm/pgtable-2level_types.h
index 392576433e77..373ab1de909f 100644
--- a/arch/x86/include/asm/pgtable-2level_types.h
+++ b/arch/x86/include/asm/pgtable-2level_types.h
@@ -7,6 +7,7 @@
 typedef unsigned long  pteval_t;
 typedef unsigned long  pmdval_t;
 typedef unsigned long  pudval_t;
+typedef unsigned long  p4dval_t;
 typedef unsigned long  pgdval_t;
 typedef unsigned long  pgprotval_t;
 
diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h
index bcc89625ebe5..b8a4341faafa 100644
--- a/arch/x86/include/asm/pgtable-3level_types.h
+++ b/arch/x86/include/asm/pgtable-3level_types.h
@@ -7,6 +7,7 @@
 typedef u64pteval_t;
 typedef u64pmdval_t;
 typedef u64pudval_t;
+typedef u64p4dval_t;
 typedef u64pgdval_t;
 typedef u64pgprotval_t;
 
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1cfb36b8c024..6f6f351e0a81 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -179,6 +179,17 @@ static inline unsigned long pud_pfn(pud_t pud)
return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;
 }
 
+static inline unsigned long p4d_pfn(p4d_t p4d)
+{
+   return (p4d_val(p4d) & p4d_pfn_mask(p4d)) >> PAGE_SHIFT;
+}
+
+static inline int p4d_large(p4d_t p4d)
+{
+   /* No 512 GiB pages yet */
+   return 0;
+}
+
 #define pte_page(pte)  pfn_to_page(pte_pfn(pte))
 
 static inline int pmd_large(pmd_t pte)
@@ -770,6 +781,16 @@ static inline int pud_large(pud_t pud)
 }
 #endif /* CONFIG_PGTABLE_LEVELS > 2 */
 
+static inline unsigned long pud_index(unsigned long address)
+{
+   return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
+}
+
+static inline unsigned long p4d_index(unsigned long address)
+{
+   return (address >> P4D_SHIFT) & (PTRS_PER_P4D - 1);
+}
+
 #if CONFIG_PGTABLE_LEVELS > 3
 static inline int pgd_present(pgd_t pgd)
 {
@@ -788,11 +809,6 @@ static inline unsigned long pgd_page_vaddr(pgd_t pgd)
 #define pgd_page(pgd)  pfn_to_page(pgd_val(pgd) >> PAGE_SHIFT)
 
 /* to find an entry in a page-table-directory. */
-static inline unsigned long pud_index(unsigned long address)
-{
-   return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
-}
-
 static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
 {
return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address);
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 3a264200c62f..0b2797e5083c 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -13,6 +13,7 @@
 typedef unsigned long  pteval_t;
 typedef unsigned long  pmdval_t;
 typedef unsigned long  pudval_t;
+typedef unsigned long  p4dval_t;
 typedef unsigned long  pgdval_t;
 typedef unsigned long  pgprotval_t;
 
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 62484333673d..df08535f774a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -272,9 +272,20 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
return native_pgd_val(pgd) & PTE_FLAGS_MASK;
 }
 
-#if CONFIG_PGTABLE_LEVELS > 3
+#if CONFIG_PGTABLE_LEVELS > 4
+
+#error FIXME
+
+#else
 #include 
 
+static inline p4dval_t native_p4d_val(p4d_t p4d)
+{
+   return native_pgd_val(p4d);
+}
+#endif
+
+#if CONFIG_PGTABLE_LEVELS > 3
 typedef struct { pudval_t pud; } pud_t;
 
 static inline pud_t native_make_pud(pmdval_t val)
@@ -318,6 +329,22 @@ static inline pmdval_t native_pmd_val(pmd_t pmd)
 }
 #endif
 
+static inline p4dval_t p4d_pfn_mask(p4d_t p4d)
+{
+   /* No 512 GiB huge pages yet */
+   return PTE_PFN_MASK;
+}
+
+static inline p4dval_t p4d_flags_mask(p4d_t p4d)
+{
+   return ~p4d_pfn_mask(p4d);
+}
+
+static inline p4dval_t p4d_flags(p4d_t p4d)
+{
+   return native_p4d_val(p4d) & p4d_flags_mask(p4d);
+}
+
 static inline pudval_t pud_pfn_mask(pud_t pud)
 {
if (native_pud_val(pud) & _PAGE_PSE)
@@ -461,6 +488,7 @@ enum pg_level {
PG_LEVEL_4K,
PG_LEVEL_2M,
PG_LEVEL_1G,
+   PG_LEVEL_512G,
PG_LEVEL_NUM
 };
 
-- 
2.11.0



[PATCH] iommu/arm-smmu: Report smmu type in dmesg

2017-03-06 Thread Robert Richter
The ARM SMMU detection depends heavily on the system firmware. For
better diagnostics, log the detected type in dmesg.

The SMMU type's name is now stored in struct arm_smmu_type, and the ACPI
code is modified to use that struct too. Rename the ARM_SMMU_MATCH_DATA()
macro to ARM_SMMU_TYPE() for better readability.
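
The hunks below only show the plumbing that stores the name; the dmesg
line itself would look something like this (illustrative sketch only,
not the exact hunk):

	dev_notice(smmu->dev, "detected %s\n", smmu->name);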

Signed-off-by: Robert Richter 
---
 drivers/iommu/arm-smmu.c | 61 
 1 file changed, 30 insertions(+), 31 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index abf6496843a6..5c793b3d3173 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -366,6 +366,7 @@ struct arm_smmu_device {
u32 options;
enum arm_smmu_arch_version  version;
enum arm_smmu_implementationmodel;
+   const char  *name;
 
u32 num_context_banks;
u32 num_s2_context_banks;
@@ -1955,19 +1956,20 @@ static int arm_smmu_device_cfg_probe(struct 
arm_smmu_device *smmu)
return 0;
 }
 
-struct arm_smmu_match_data {
+struct arm_smmu_type {
enum arm_smmu_arch_version version;
enum arm_smmu_implementation model;
+   const char *name;
 };
 
-#define ARM_SMMU_MATCH_DATA(name, ver, imp)\
-static struct arm_smmu_match_data name = { .version = ver, .model = imp }
+#define ARM_SMMU_TYPE(var, ver, imp, _name)\
+static struct arm_smmu_type var = { .version = ver, .model = imp, .name = _name }
 
-ARM_SMMU_MATCH_DATA(smmu_generic_v1, ARM_SMMU_V1, GENERIC_SMMU);
-ARM_SMMU_MATCH_DATA(smmu_generic_v2, ARM_SMMU_V2, GENERIC_SMMU);
-ARM_SMMU_MATCH_DATA(arm_mmu401, ARM_SMMU_V1_64K, GENERIC_SMMU);
-ARM_SMMU_MATCH_DATA(arm_mmu500, ARM_SMMU_V2, ARM_MMU500);
-ARM_SMMU_MATCH_DATA(cavium_smmuv2, ARM_SMMU_V2, CAVIUM_SMMUV2);
+ARM_SMMU_TYPE(smmu_generic_v1, ARM_SMMU_V1, GENERIC_SMMU, "smmu-generic-v1");
+ARM_SMMU_TYPE(smmu_generic_v2, ARM_SMMU_V2, GENERIC_SMMU, "smmu-generic-v2");
+ARM_SMMU_TYPE(arm_mmu401, ARM_SMMU_V1_64K, GENERIC_SMMU, "arm-mmu401");
+ARM_SMMU_TYPE(arm_mmu500, ARM_SMMU_V2, ARM_MMU500, "arm-mmu500");
+ARM_SMMU_TYPE(cavium_smmuv2, ARM_SMMU_V2, CAVIUM_SMMUV2, "cavium-smmuv2");
 
 static const struct of_device_id arm_smmu_of_match[] = {
{ .compatible = "arm,smmu-v1", .data = &smmu_generic_v1 },
@@ -1981,29 +1983,19 @@ static const struct of_device_id arm_smmu_of_match[] = {
 MODULE_DEVICE_TABLE(of, arm_smmu_of_match);
 
 #ifdef CONFIG_ACPI
-static int acpi_smmu_get_data(u32 model, struct arm_smmu_device *smmu)
+static struct arm_smmu_type *acpi_smmu_get_type(u32 model)
 {
-   int ret = 0;
-
switch (model) {
case ACPI_IORT_SMMU_V1:
case ACPI_IORT_SMMU_CORELINK_MMU400:
-   smmu->version = ARM_SMMU_V1;
-   smmu->model = GENERIC_SMMU;
-   break;
+   return &smmu_generic_v1;
case ACPI_IORT_SMMU_V2:
-   smmu->version = ARM_SMMU_V2;
-   smmu->model = GENERIC_SMMU;
-   break;
+   return &smmu_generic_v2;
case ACPI_IORT_SMMU_CORELINK_MMU500:
-   smmu->version = ARM_SMMU_V2;
-   smmu->model = ARM_MMU500;
-   break;
-   default:
-   ret = -ENODEV;
+   return &arm_mmu500;
}
 
-   return ret;
+   return NULL;
 }
 
 static int arm_smmu_device_acpi_probe(struct platform_device *pdev,
@@ -2013,14 +2005,18 @@ static int arm_smmu_device_acpi_probe(struct platform_device *pdev,
struct acpi_iort_node *node =
*(struct acpi_iort_node **)dev_get_platdata(dev);
struct acpi_iort_smmu *iort_smmu;
-   int ret;
+   struct arm_smmu_type *type;
 
/* Retrieve SMMU1/2 specific data */
iort_smmu = (struct acpi_iort_smmu *)node->node_data;
 
-   ret = acpi_smmu_get_data(iort_smmu->model, smmu);
-   if (ret < 0)
-   return ret;
+   type = acpi_smmu_get_type(iort_smmu->model);
+   if (!type)
+   return -ENODEV;
+
+   smmu->version   = type->version;
+   smmu->model = type->model;
+   smmu->name  = type->name;
 
/* Ignore the configuration access interrupt */
smmu->num_global_irqs = 1;
@@ -2041,8 +2037,8 @@ static inline int arm_smmu_device_acpi_probe(struct platform_device *pdev,
 static int arm_smmu_device_dt_probe(struct platform_device *pdev,
struct arm_smmu_device *smmu)
 {
-   const struct arm_smmu_match_data *data;
struct device *dev = &pdev->dev;
+   const struct arm_smmu_type *type;
bool legacy_binding;
 
if (of_property_read_u32(dev->of_node, "#global-interrupts",
@@ -2051,9 +2047,10 @@ static int arm_smmu_device_dt_probe(struct platform_device *pdev,
return -ENODEV;
}
 
-   data = of_device_get_match_data(dev);
-   smmu->version = data->version;
- 

Re: [Patch v2 02/11] s5p-mfc: Adding initial support for MFC v10.10

2017-03-06 Thread Andrzej Hajda
On 03.03.2017 10:07, Smitha T Murthy wrote:
> Adding the support for MFC v10.10, with new register file and
> necessary hw control, decoder, encoder and structural changes.
>
> Signed-off-by: Smitha T Murthy 
Reviewed-by: Andrzej Hajda 

Few nitpicks below.

> CC: Rob Herring 
> CC: devicet...@vger.kernel.org
> ---
>  .../devicetree/bindings/media/s5p-mfc.txt  |1 +
>  drivers/media/platform/s5p-mfc/regs-mfc-v10.h  |   36 
>  drivers/media/platform/s5p-mfc/s5p_mfc.c   |   30 +
>  drivers/media/platform/s5p-mfc/s5p_mfc_common.h|4 +-
>  drivers/media/platform/s5p-mfc/s5p_mfc_ctrl.c  |4 ++
>  drivers/media/platform/s5p-mfc/s5p_mfc_dec.c   |   44 +++-
>  drivers/media/platform/s5p-mfc/s5p_mfc_enc.c   |   21 +
>  drivers/media/platform/s5p-mfc/s5p_mfc_opr_v6.c|9 +++-
>  drivers/media/platform/s5p-mfc/s5p_mfc_opr_v6.h|2 +
>  9 files changed, 118 insertions(+), 33 deletions(-)
>  create mode 100644 drivers/media/platform/s5p-mfc/regs-mfc-v10.h
>
> diff --git a/Documentation/devicetree/bindings/media/s5p-mfc.txt b/Documentation/devicetree/bindings/media/s5p-mfc.txt
> index 2c90128..b83727b 100644
> --- a/Documentation/devicetree/bindings/media/s5p-mfc.txt
> +++ b/Documentation/devicetree/bindings/media/s5p-mfc.txt
> @@ -13,6 +13,7 @@ Required properties:
>   (c) "samsung,mfc-v7" for MFC v7 present in Exynos5420 SoC
>   (d) "samsung,mfc-v8" for MFC v8 present in Exynos5800 SoC
>   (e) "samsung,exynos5433-mfc" for MFC v8 present in Exynos5433 SoC
> + (f) "samsung,mfc-v10" for MFC v10 present in Exynos7880 SoC
>  
>- reg : Physical base address of the IP registers and length of memory
> mapped region.
> diff --git a/drivers/media/platform/s5p-mfc/regs-mfc-v10.h b/drivers/media/platform/s5p-mfc/regs-mfc-v10.h
> new file mode 100644
> index 000..bd671a5
> --- /dev/null
> +++ b/drivers/media/platform/s5p-mfc/regs-mfc-v10.h
> @@ -0,0 +1,36 @@
> +/*
> + * Register definition file for Samsung MFC V10.x Interface (FIMV) driver
> + *
> + * Copyright (c) 2017 Samsung Electronics Co., Ltd.
> + * http://www.samsung.com/
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef _REGS_MFC_V10_H
> +#define _REGS_MFC_V10_H
> +
> +#include <linux/sizes.h>
> +#include "regs-mfc-v8.h"
> +
> +/* MFCv10 register definitions*/
> +#define S5P_FIMV_MFC_CLOCK_OFF_V10   0x7120
> +#define S5P_FIMV_MFC_STATE_V10   0x7124
> +
> +/* MFCv10 Context buffer sizes */
> +#define MFC_CTX_BUF_SIZE_V10 (30 * SZ_1K)/* 30KB */
> +#define MFC_H264_DEC_CTX_BUF_SIZE_V10(2 * SZ_1M) /* 2MB */
> +#define MFC_OTHER_DEC_CTX_BUF_SIZE_V10   (20 * SZ_1K)/* 20KB */
> +#define MFC_H264_ENC_CTX_BUF_SIZE_V10(100 * SZ_1K)   /* 100KB */
> +#define MFC_OTHER_ENC_CTX_BUF_SIZE_V10   (15 * SZ_1K)/* 15KB */
> +
> +/* MFCv10 variant defines */
> +#define MAX_FW_SIZE_V10  (SZ_1M) /* 1MB */
> +#define MAX_CPB_SIZE_V10 (3 * SZ_1M) /* 3MB */

These comments seem redundant; the definitions are clear enough. You
could remove them if there is a next iteration.

> +#define MFC_VERSION_V10  0xA0
> +#define MFC_NUM_PORTS_V101
> +
> +#endif /*_REGS_MFC_V10_H*/
> +
> diff --git a/drivers/media/platform/s5p-mfc/s5p_mfc.c b/drivers/media/platform/s5p-mfc/s5p_mfc.c
> index bb0a588..a043cce 100644
> --- a/drivers/media/platform/s5p-mfc/s5p_mfc.c
> +++ b/drivers/media/platform/s5p-mfc/s5p_mfc.c
> @@ -1542,6 +1542,33 @@ static int s5p_mfc_resume(struct device *dev)
>   .num_clocks = 3,
>  };
>  
> +static struct s5p_mfc_buf_size_v6 mfc_buf_size_v10 = {
> + .dev_ctx= MFC_CTX_BUF_SIZE_V10,
> + .h264_dec_ctx   = MFC_H264_DEC_CTX_BUF_SIZE_V10,
> + .other_dec_ctx  = MFC_OTHER_DEC_CTX_BUF_SIZE_V10,
> + .h264_enc_ctx   = MFC_H264_ENC_CTX_BUF_SIZE_V10,
> + .other_enc_ctx  = MFC_OTHER_ENC_CTX_BUF_SIZE_V10,
> +};
> +
> +static struct s5p_mfc_buf_size buf_size_v10 = {
> + .fw = MAX_FW_SIZE_V10,
> + .cpb= MAX_CPB_SIZE_V10,
> + .priv   = &mfc_buf_size_v10,
> +};
> +
> +static struct s5p_mfc_buf_align mfc_buf_align_v10 = {
> + .base = 0,
> +};
> +
> +static struct s5p_mfc_variant mfc_drvdata_v10 = {
> + .version= MFC_VERSION_V10,
> + .version_bit= MFC_V10_BIT,
> + .port_num   = MFC_NUM_PORTS_V10,
> + .buf_size   = &buf_size_v10,
> + .buf_align  = &mfc_buf_align_v10,
> + .fw_name[0] = "s5p-mfc-v10.fw",
> +};
> +
>  static const struct of_device_id exynos_mfc_match[] = {
>   {
>   .compatible = "samsung,mfc-v5",
> @@ -1558,6 +1585,9 @@ static int s5p_mfc_resume(struct device *dev)
>   }, {
>   .compatible = "samsung,exynos5433-mfc

[PATCHv4 15/33] x86/efi: handle p4d in EFI pagetables

2017-03-06 Thread Kirill A. Shutemov
Allocate an additional page table level and change
efi_sync_low_kernel_mappings() to make the syncing logic work with the
additional page table level.

Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Matt Fleming 
---
 arch/x86/platform/efi/efi_64.c | 33 +++--
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 8544dae3d1b4..34d019f75239 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -135,6 +135,7 @@ static pgd_t *efi_pgd;
 int __init efi_alloc_page_tables(void)
 {
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
gfp_t gfp_mask;
 
@@ -147,15 +148,20 @@ int __init efi_alloc_page_tables(void)
return -ENOMEM;
 
pgd = efi_pgd + pgd_index(EFI_VA_END);
+   p4d = p4d_alloc(&init_mm, pgd, EFI_VA_END);
+   if (!p4d) {
+   free_page((unsigned long)efi_pgd);
+   return -ENOMEM;
+   }
 
-   pud = pud_alloc_one(NULL, 0);
+   pud = pud_alloc(&init_mm, p4d, EFI_VA_END);
if (!pud) {
+   if (CONFIG_PGTABLE_LEVELS > 4)
+   free_page((unsigned long) pgd_page_vaddr(*pgd));
free_page((unsigned long)efi_pgd);
return -ENOMEM;
}
 
-   pgd_populate(NULL, pgd, pud);
-
return 0;
 }
 
@@ -190,6 +196,18 @@ void efi_sync_low_kernel_mappings(void)
num_entries = pgd_index(EFI_VA_END) - pgd_index(PAGE_OFFSET);
memcpy(pgd_efi, pgd_k, sizeof(pgd_t) * num_entries);
 
+   /* The same story as with PGD entries */
+   BUILD_BUG_ON(p4d_index(EFI_VA_END) != p4d_index(MODULES_END));
+   BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & P4D_MASK));
+
+   pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
+   pgd_k = pgd_offset_k(EFI_VA_END);
+   p4d_efi = p4d_offset(pgd_efi, 0);
+   p4d_k = p4d_offset(pgd_k, 0);
+
+   num_entries = p4d_index(EFI_VA_END);
+   memcpy(p4d_efi, p4d_k, sizeof(p4d_t) * num_entries);
+
/*
 * We share all the PUD entries apart from those that map the
 * EFI regions. Copy around them.
@@ -197,20 +215,15 @@ void efi_sync_low_kernel_mappings(void)
BUILD_BUG_ON((EFI_VA_START & ~PUD_MASK) != 0);
BUILD_BUG_ON((EFI_VA_END & ~PUD_MASK) != 0);
 
-   pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
-   p4d_efi = p4d_offset(pgd_efi, 0);
+   p4d_efi = p4d_offset(pgd_efi, EFI_VA_END);
+   p4d_k = p4d_offset(pgd_k, EFI_VA_END);
pud_efi = pud_offset(p4d_efi, 0);
-
-   pgd_k = pgd_offset_k(EFI_VA_END);
-   p4d_k = p4d_offset(pgd_k, 0);
pud_k = pud_offset(p4d_k, 0);
 
num_entries = pud_index(EFI_VA_END);
memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries);
 
-   p4d_efi = p4d_offset(pgd_efi, EFI_VA_START);
pud_efi = pud_offset(p4d_efi, EFI_VA_START);
-   p4d_k = p4d_offset(pgd_k, EFI_VA_START);
pud_k = pud_offset(p4d_k, EFI_VA_START);
 
num_entries = PTRS_PER_PUD - pud_index(EFI_VA_START);
-- 
2.11.0



Re: [PATCH] mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()

2017-03-06 Thread Michal Hocko
On Fri 03-03-17 18:53:56, Tahsin Erdogan wrote:
> mem_cgroup_free() indirectly calls wb_domain_exit() which is not
> prepared to deal with a struct wb_domain object that hasn't executed
> wb_domain_init(). For instance, the following warning message is
> printed by lockdep if alloc_percpu() fails in mem_cgroup_alloc():
> 
>   INFO: trying to register non-static key.
>   the code is fine but needs lockdep annotation.
>   turning off the locking correctness validator.
>   CPU: 1 PID: 1950 Comm: mkdir Not tainted 4.10.0+ #151
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>   Call Trace:
>dump_stack+0x67/0x99
>register_lock_class+0x36d/0x540
>__lock_acquire+0x7f/0x1a30
>? irq_work_queue+0x73/0x90
>? wake_up_klogd+0x36/0x40
>? console_unlock+0x45d/0x540
>? vprintk_emit+0x211/0x2e0
>lock_acquire+0xcc/0x200
>? try_to_del_timer_sync+0x60/0x60
>del_timer_sync+0x3c/0xc0
>? try_to_del_timer_sync+0x60/0x60
>wb_domain_exit+0x14/0x20
>mem_cgroup_free+0x14/0x40
>mem_cgroup_css_alloc+0x3f9/0x620
>cgroup_apply_control_enable+0x190/0x390
>cgroup_mkdir+0x290/0x3d0
>kernfs_iop_mkdir+0x58/0x80
>vfs_mkdir+0x10e/0x1a0
>SyS_mkdirat+0xa8/0xd0
>SyS_mkdir+0x14/0x20
>entry_SYSCALL_64_fastpath+0x18/0xad
> 
> Fix mem_cgroup_alloc() by doing more granular clean up in case of
> failures.
> 
> Fixes: 0b8f73e104285 ("mm: memcontrol: clean up alloc, online, offline, free 
> functions")
> Signed-off-by: Tahsin Erdogan 

Please do not duplicate mem_cgroup_free code and rather add
__mem_cgroup_free which does everything except for wb_domain_exit.
An alternative would be to teach memcg_wb_domain_exit to not call
wb_domain_exit if it hasn't been initialized yet. The first option seems
easier.
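
I.e., something along these lines (an untested sketch of the
suggestion):

static void __mem_cgroup_free(struct mem_cgroup *memcg)
{
	int node;

	for_each_node(node)
		free_mem_cgroup_per_node_info(memcg, node);
	free_percpu(memcg->stat);
	kfree(memcg);
}

static void mem_cgroup_free(struct mem_cgroup *memcg)
{
	memcg_wb_domain_exit(memcg);
	__mem_cgroup_free(memcg);
}

and then the fail: path does the idr_remove() followed by
__mem_cgroup_free(memcg).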

Thanks!

> ---
>  mm/memcontrol.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c52ec893e241..9a9d5630df91 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4194,9 +4194,12 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>   idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
>   return memcg;
>  fail:
> + for_each_node(node)
> + free_mem_cgroup_per_node_info(memcg, node);
> + free_percpu(memcg->stat);
>   if (memcg->id.id > 0)
>   idr_remove(&mem_cgroup_idr, memcg->id.id);
> - mem_cgroup_free(memcg);
> + kfree(memcg);
>   return NULL;
>  }
>  
> -- 
> 2.12.0.rc1.440.g5b76565f74-goog
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
Michal Hocko
SUSE Labs


[PATCHv4 12/33] x86/mm: add support of p4d_t in vmalloc_fault()

2017-03-06 Thread Kirill A. Shutemov
With 4-level paging, copying happens at the p4d level, as pgd_none() is
always false when p4d_t is folded.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/mm/fault.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 605fd5e8e048..fcc887f607c2 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -435,6 +435,7 @@ void vmalloc_sync_all(void)
 static noinline int vmalloc_fault(unsigned long address)
 {
pgd_t *pgd, *pgd_ref;
+   p4d_t *p4d, *p4d_ref;
pud_t *pud, *pud_ref;
pmd_t *pmd, *pmd_ref;
pte_t *pte, *pte_ref;
@@ -462,13 +463,26 @@ static noinline int vmalloc_fault(unsigned long address)
BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
}
 
+   /* With 4-level paging copying happens on p4d level. */
+   p4d = p4d_offset(pgd, address);
+   p4d_ref = p4d_offset(pgd_ref, address);
+   if (p4d_none(*p4d_ref))
+   return -1;
+
+   if (p4d_none(*p4d)) {
+   set_p4d(p4d, *p4d_ref);
+   arch_flush_lazy_mmu_mode();
+   } else {
+   BUG_ON(p4d_pfn(*p4d) != p4d_pfn(*p4d_ref));
+   }
+
/*
 * Below here mismatches are bugs because these lower tables
 * are shared:
 */
 
-   pud = pud_offset(pgd, address);
-   pud_ref = pud_offset(pgd_ref, address);
+   pud = pud_offset(p4d, address);
+   pud_ref = pud_offset(p4d_ref, address);
if (pud_none(*pud_ref))
return -1;
 
-- 
2.11.0



[v2 PATCH 1/3] mmc: sdhci-cadence: Fix writing PHY delay

2017-03-06 Thread Piotr Sroka
Add polling for ACK to make sure that the data has been written to the
PHY register.
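
(For reference: readl_poll_timeout(reg, tmp, cond, 0, 10), as used
below, re-reads the register with no sleep between reads and returns
-ETIMEDOUT if the ACK bit has not been observed within 10 us.)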

Signed-off-by: Piotr Sroka 
---
Changes for v2:
- fix indent
---
 drivers/mmc/host/sdhci-cadence.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/mmc/host/sdhci-cadence.c b/drivers/mmc/host/sdhci-cadence.c
index 316cfec..b2334ec 100644
--- a/drivers/mmc/host/sdhci-cadence.c
+++ b/drivers/mmc/host/sdhci-cadence.c
@@ -66,11 +66,12 @@ struct sdhci_cdns_priv {
void __iomem *hrs_addr;
 };
 
-static void sdhci_cdns_write_phy_reg(struct sdhci_cdns_priv *priv,
-u8 addr, u8 data)
+static int sdhci_cdns_write_phy_reg(struct sdhci_cdns_priv *priv,
+   u8 addr, u8 data)
 {
void __iomem *reg = priv->hrs_addr + SDHCI_CDNS_HRS04;
u32 tmp;
+   int ret;
 
tmp = (data << SDHCI_CDNS_HRS04_WDATA_SHIFT) |
  (addr << SDHCI_CDNS_HRS04_ADDR_SHIFT);
@@ -79,8 +80,14 @@ static void sdhci_cdns_write_phy_reg(struct sdhci_cdns_priv *priv,
tmp |= SDHCI_CDNS_HRS04_WR;
writel(tmp, reg);
 
+   ret = readl_poll_timeout(reg, tmp, tmp & SDHCI_CDNS_HRS04_ACK, 0, 10);
+   if (ret)
+   return ret;
+
tmp &= ~SDHCI_CDNS_HRS04_WR;
writel(tmp, reg);
+
+   return 0;
 }
 
 static void sdhci_cdns_phy_init(struct sdhci_cdns_priv *priv)
-- 
2.2.2



[PATCHv4 10/33] x86/gup: add 5-level paging support

2017-03-06 Thread Kirill A. Shutemov
It's simply an extension for one more page table level.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/mm/gup.c | 33 +++--
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 99c7805a9693..eb407cf0f6d3 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -76,9 +76,9 @@ static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages)
 }
 
 /*
- * 'pteval' can come from a pte, pmd or pud.  We only check
+ * 'pteval' can come from a pte, pmd, pud or p4d.  We only check
  * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the
- * same value on all 3 types.
+ * same value on all 4 types.
  */
 static inline int pte_allows_gup(unsigned long pteval, int write)
 {
@@ -290,13 +290,13 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
return 1;
 }
 
-static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
int write, struct page **pages, int *nr)
 {
unsigned long next;
pud_t *pudp;
 
-   pudp = pud_offset(&pgd, addr);
+   pudp = pud_offset(&p4d, addr);
do {
pud_t pud = *pudp;
 
@@ -315,6 +315,27 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
return 1;
 }
 
+static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
+   int write, struct page **pages, int *nr)
+{
+   unsigned long next;
+   p4d_t *p4dp;
+
+   p4dp = p4d_offset(&pgd, addr);
+   do {
+   p4d_t p4d = *p4dp;
+
+   next = p4d_addr_end(addr, end);
+   if (p4d_none(p4d))
+   return 0;
+   BUILD_BUG_ON(p4d_large(p4d));
+   if (!gup_pud_range(p4d, addr, next, write, pages, nr))
+   return 0;
+   } while (p4dp++, addr = next, addr != end);
+
+   return 1;
+}
+
 /*
  * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
  * back to the regular GUP.
@@ -363,7 +384,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
next = pgd_addr_end(addr, end);
if (pgd_none(pgd))
break;
-   if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+   if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
break;
} while (pgdp++, addr = next, addr != end);
local_irq_restore(flags);
@@ -435,7 +456,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
next = pgd_addr_end(addr, end);
if (pgd_none(pgd))
goto slow;
-   if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+   if (!gup_p4d_range(pgd, addr, next, write, pages, &nr))
goto slow;
} while (pgdp++, addr = next, addr != end);
local_irq_enable();
-- 
2.11.0



[PATCH v2 6/6] powerpc/perf: Add Power8 mem_access event to sysfs

2017-03-06 Thread Madhavan Srinivasan
This patch adds a "mem_access" event to sysfs. It is not a raw event
supported as-is by the Power8 PMU; instead, it is formed based on the
raw event encoding specified in isa207-common.h.

The primary PMU event used here is PM_MRK_INST_CMPL.
This event tracks only completed marked instructions.

Random sampling mode (MMCRA[SM]) with Random Instruction
Sampling (RIS) is enabled to mark the type of instructions.

With random sampling in RIS mode and the PM_MRK_INST_CMPL event,
the LDST/DATA_SRC fields in the SIER identify the memory
hierarchy level (e.g. L1, L2, etc.) that satisfied a data-cache
miss for a marked instruction.
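
Once exported, the event is usable by name from the perf tool, e.g.
(usage sketch):

	perf stat -e mem_access -- <workload>
	perf record -e mem_access -- <workload>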

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Sukadev Bhattiprolu 
Cc: Daniel Axtens 
Cc: Andrew Donnellan 
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/perf/power8-events-list.h | 6 ++
 arch/powerpc/perf/power8-pmu.c | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/arch/powerpc/perf/power8-events-list.h b/arch/powerpc/perf/power8-events-list.h
index 3a2e6e8ebb92..0f1d184627cc 100644
--- a/arch/powerpc/perf/power8-events-list.h
+++ b/arch/powerpc/perf/power8-events-list.h
@@ -89,3 +89,9 @@ EVENT(PM_MRK_FILT_MATCH,  0x2013c)
 EVENT(PM_MRK_FILT_MATCH_ALT,   0x3012e)
 /* Alternate event code for PM_LD_MISS_L1 */
 EVENT(PM_LD_MISS_L1_ALT,   0x400f0)
+/*
+ * Memory Access Event -- mem_access
+ * Primary PMU event used here is PM_MRK_INST_CMPL, along with
+ * Random Load/Store Facility Sampling (RIS) in Random sampling mode (MMCRA[SM]).
+ */
+EVENT(MEM_ACCESS,  0x10401e0)
diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c
index 932d7536f0eb..5463516e369b 100644
--- a/arch/powerpc/perf/power8-pmu.c
+++ b/arch/powerpc/perf/power8-pmu.c
@@ -90,6 +90,7 @@ GENERIC_EVENT_ATTR(branch-instructions,	PM_BRU_FIN);
 GENERIC_EVENT_ATTR(branch-misses,  PM_BR_MPRED_CMPL);
 GENERIC_EVENT_ATTR(cache-references,   PM_LD_REF_L1);
 GENERIC_EVENT_ATTR(cache-misses,   PM_LD_MISS_L1);
+GENERIC_EVENT_ATTR(mem_access, MEM_ACCESS);
 
 CACHE_EVENT_ATTR(L1-dcache-load-misses,PM_LD_MISS_L1);
 CACHE_EVENT_ATTR(L1-dcache-loads,  PM_LD_REF_L1);
@@ -120,6 +121,7 @@ static struct attribute *power8_events_attr[] = {
GENERIC_EVENT_PTR(PM_BR_MPRED_CMPL),
GENERIC_EVENT_PTR(PM_LD_REF_L1),
GENERIC_EVENT_PTR(PM_LD_MISS_L1),
+   GENERIC_EVENT_PTR(MEM_ACCESS),
 
CACHE_EVENT_PTR(PM_LD_MISS_L1),
CACHE_EVENT_PTR(PM_LD_REF_L1),
-- 
2.7.4



[PATCHv4 16/33] x86/mm/pat: handle additional page table

2017-03-06 Thread Kirill A. Shutemov
A straightforward extension of the existing code to support an
additional page table level.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/mm/pageattr.c | 56 --
 1 file changed, 41 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 28d42130243c..eb0ad12cdfde 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -346,6 +346,7 @@ static inline pgprot_t static_protections(pgprot_t prot, unsigned long address,
 pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
 unsigned int *level)
 {
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
 
@@ -354,7 +355,15 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
if (pgd_none(*pgd))
return NULL;
 
-   pud = pud_offset(pgd, address);
+   p4d = p4d_offset(pgd, address);
+   if (p4d_none(*p4d))
+   return NULL;
+
+   *level = PG_LEVEL_512G;
+   if (p4d_large(*p4d) || !p4d_present(*p4d))
+   return (pte_t *)p4d;
+
+   pud = pud_offset(p4d, address);
if (pud_none(*pud))
return NULL;
 
@@ -406,13 +415,18 @@ static pte_t *_lookup_address_cpa(struct cpa_data *cpa, unsigned long address,
 pmd_t *lookup_pmd_address(unsigned long address)
 {
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
 
pgd = pgd_offset_k(address);
if (pgd_none(*pgd))
return NULL;
 
-   pud = pud_offset(pgd, address);
+   p4d = p4d_offset(pgd, address);
+   if (p4d_none(*p4d) || p4d_large(*p4d) || !p4d_present(*p4d))
+   return NULL;
+
+   pud = pud_offset(p4d, address);
if (pud_none(*pud) || pud_large(*pud) || !pud_present(*pud))
return NULL;
 
@@ -477,11 +491,13 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long 
address, pte_t pte)
 
list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
 
pgd = (pgd_t *)page_address(page) + pgd_index(address);
-   pud = pud_offset(pgd, address);
+   p4d = p4d_offset(pgd, address);
+   pud = pud_offset(p4d, address);
pmd = pmd_offset(pud, address);
set_pte_atomic((pte_t *)pmd, pte);
}
@@ -836,9 +852,9 @@ static void unmap_pmd_range(pud_t *pud, unsigned long 
start, unsigned long end)
pud_clear(pud);
 }
 
-static void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end)
+static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
 {
-   pud_t *pud = pud_offset(pgd, start);
+   pud_t *pud = pud_offset(p4d, start);
 
/*
 * Not on a GB page boundary?
@@ -1004,8 +1020,8 @@ static long populate_pmd(struct cpa_data *cpa,
return num_pages;
 }
 
-static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
-pgprot_t pgprot)
+static int populate_pud(struct cpa_data *cpa, unsigned long start, p4d_t *p4d,
+   pgprot_t pgprot)
 {
pud_t *pud;
unsigned long end;
@@ -1026,7 +1042,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
cur_pages = (pre_end - start) >> PAGE_SHIFT;
cur_pages = min_t(int, (int)cpa->numpages, cur_pages);
 
-   pud = pud_offset(pgd, start);
+   pud = pud_offset(p4d, start);
 
/*
 * Need a PMD page?
@@ -1047,7 +1063,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
if (cpa->numpages == cur_pages)
return cur_pages;
 
-   pud = pud_offset(pgd, start);
+   pud = pud_offset(p4d, start);
pud_pgprot = pgprot_4k_2_large(pgprot);
 
/*
@@ -1067,7 +1083,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd,
if (start < end) {
long tmp;
 
-   pud = pud_offset(pgd, start);
+   pud = pud_offset(p4d, start);
if (pud_none(*pud))
if (alloc_pmd_page(pud))
return -1;
@@ -1090,33 +1106,43 @@ static int populate_pgd(struct cpa_data *cpa, unsigned 
long addr)
 {
pgprot_t pgprot = __pgprot(_KERNPG_TABLE);
pud_t *pud = NULL;  /* shut up gcc */
+   p4d_t *p4d;
pgd_t *pgd_entry;
long ret;
 
pgd_entry = cpa->pgd + pgd_index(addr);
 
+   if (pgd_none(*pgd_entry)) {
+   p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK);
+   if (!p4d)
+   return -1;
+
+   set_pgd(pgd_entry, __pgd(__pa(p4d) | _KERNPG_TABLE));

[RESEND PATCH v3 5/8] phy: phy-mt65xx-usb3: add support for new version phy

2017-03-06 Thread Chunfeng Yun
There are some variations from mt2701 to mt2712:
1. banks shared by multiple ports are put back into each port,
such as SPLLC and U2FREQ;
2. a new bank MISC is added for u2port, and CHIP for u3port;
3. the banks' offsets within each port are also rearranged.

Signed-off-by: Chunfeng Yun 
---
 drivers/phy/phy-mt65xx-usb3.c |  344 ++---
 1 file changed, 217 insertions(+), 127 deletions(-)

diff --git a/drivers/phy/phy-mt65xx-usb3.c b/drivers/phy/phy-mt65xx-usb3.c
index f4a3505..eb33499 100644
--- a/drivers/phy/phy-mt65xx-usb3.c
+++ b/drivers/phy/phy-mt65xx-usb3.c
@@ -23,46 +23,54 @@
 #include 
 #include 
 
-/*
- * for sifslv2 register, but exclude port's;
- * relative to USB3_SIF2_BASE base address
- */
-#define SSUSB_SIFSLV_SPLLC 0x
-#define SSUSB_SIFSLV_U2FREQ0x0100
-
-/* offsets of banks in each u2phy registers */
-#define SSUSB_SIFSLV_U2PHY_COM_BASE0x
-/* offsets of banks in each u3phy registers */
-#define SSUSB_SIFSLV_U3PHYD_BASE   0x
-#define SSUSB_SIFSLV_U3PHYA_BASE   0x0200
-
-#define U3P_USBPHYACR0 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x)
+/* version V1 sub-banks offset base address */
+/* banks shared by multiple phys */
+#define SSUSB_SIFSLV_V1_SPLLC  0x000   /* shared by u3 phys */
+#define SSUSB_SIFSLV_V1_U2FREQ 0x100   /* shared by u2 phys */
+/* u2 phy bank */
+#define SSUSB_SIFSLV_V1_U2PHY_COM  0x000
+/* u3 phy banks */
+#define SSUSB_SIFSLV_V1_U3PHYD 0x000
+#define SSUSB_SIFSLV_V1_U3PHYA 0x200
+
+/* version V2 sub-banks offset base address */
+/* u2 phy banks */
+#define SSUSB_SIFSLV_V2_MISC   0x000
+#define SSUSB_SIFSLV_V2_U2FREQ 0x100
+#define SSUSB_SIFSLV_V2_U2PHY_COM  0x300
+/* u3 phy banks */
+#define SSUSB_SIFSLV_V2_SPLLC  0x000
+#define SSUSB_SIFSLV_V2_CHIP   0x100
+#define SSUSB_SIFSLV_V2_U3PHYD 0x200
+#define SSUSB_SIFSLV_V2_U3PHYA 0x400
+
+#define U3P_USBPHYACR0 0x000
 #define PA0_RG_U2PLL_FORCE_ON  BIT(15)
 
-#define U3P_USBPHYACR2 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0008)
+#define U3P_USBPHYACR2 0x008
 #define PA2_RG_SIF_U2PLL_FORCE_EN  BIT(18)
 
-#define U3P_USBPHYACR5 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0014)
+#define U3P_USBPHYACR5 0x014
 #define PA5_RG_U2_HSTX_SRCAL_ENBIT(15)
 #define PA5_RG_U2_HSTX_SRCTRL  GENMASK(14, 12)
 #define PA5_RG_U2_HSTX_SRCTRL_VAL(x)   ((0x7 & (x)) << 12)
 #define PA5_RG_U2_HS_100U_U3_ENBIT(11)
 
-#define U3P_USBPHYACR6 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0018)
+#define U3P_USBPHYACR6 0x018
 #define PA6_RG_U2_BC11_SW_EN   BIT(23)
 #define PA6_RG_U2_OTG_VBUSCMP_EN   BIT(20)
 #define PA6_RG_U2_SQTH GENMASK(3, 0)
 #define PA6_RG_U2_SQTH_VAL(x)  (0xf & (x))
 
-#define U3P_U2PHYACR4  (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0020)
+#define U3P_U2PHYACR4  0x020
 #define P2C_RG_USB20_GPIO_CTL  BIT(9)
 #define P2C_USB20_GPIO_MODEBIT(8)
 #define P2C_U2_GPIO_CTR_MSK(P2C_RG_USB20_GPIO_CTL | P2C_USB20_GPIO_MODE)
 
-#define U3D_U2PHYDCR0  (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0060)
+#define U3D_U2PHYDCR0  0x060
 #define P2C_RG_SIF_U2PLL_FORCE_ON  BIT(24)
 
-#define U3P_U2PHYDTM0  (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0068)
+#define U3P_U2PHYDTM0  0x068
 #define P2C_FORCE_UART_EN  BIT(26)
 #define P2C_FORCE_DATAIN   BIT(23)
 #define P2C_FORCE_DM_PULLDOWN  BIT(21)
@@ -84,59 +92,56 @@
P2C_FORCE_TERMSEL | P2C_RG_DMPULLDOWN | \
P2C_RG_DPPULLDOWN | P2C_RG_TERMSEL)
 
-#define U3P_U2PHYDTM1  (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x006C)
+#define U3P_U2PHYDTM1  0x06C
 #define P2C_RG_UART_EN BIT(16)
 #define P2C_RG_VBUSVALID   BIT(5)
 #define P2C_RG_SESSEND BIT(4)
 #define P2C_RG_AVALID  BIT(2)
 
-#define U3P_U3_PHYA_REG0   (SSUSB_SIFSLV_U3PHYA_BASE + 0x)
-#define P3A_RG_U3_VUSB10_ONBIT(5)
-
-#define U3P_U3_PHYA_REG6   (SSUSB_SIFSLV_U3PHYA_BASE + 0x0018)
+#define U3P_U3_PHYA_REG6   0x018
 #define P3A_RG_TX_EIDLE_CM GENMASK(31, 28)
 #define P3A_RG_TX_EIDLE_CM_VAL(x)  ((0xf & (x)) << 28)
 
-#define U3P_U3_PHYA_REG9   (SSUSB_SIFSLV_U3PHYA_BASE + 0x0024)
+#define U3P_U3_PHYA_REG9   0x024
 #define P3A_RG_RX_DAC_MUX  GENMASK(5, 1)
 #define P3A_RG_RX_DAC_MUX_VAL(x)   ((0x1f & (x)) << 1)
 
-#define U3P_U3PHYA_DA_REG0 (SSUSB_SIFSLV_U3PHYA_BASE + 0x0100)
+#define U3P_U3_PHYA_DA_REG00x100
 #define P3A_RG_XTAL_EXT_EN_U3  GENMASK(11, 10)
 #define P3A_RG_XTAL_EXT_EN_U3_VAL(x)   ((0x3 & (x)) << 10)
 
-#define U3P_U3_PHYD_LFPS1  (SSUSB_SIFSLV_U3PHYD_BASE + 0x000c)
+#define U3P_U3_PHYD_LFPS1  0x00c
 #define P3D_RG_FWAKE_THGENMASK(21, 16)
 #define P3D_RG_FWAKE_TH_VAL(x) ((0x3f & (x)) << 16)
 
-#define U3P_PHYD_CDR1

[RESEND PATCH v3 3/8] phy: phy-mt65xx-usb3: split SuperSpeed port into two ones

2017-03-06 Thread Chunfeng Yun
Currently the usb3 port in fact includes two sub-ports, but this is not
flexible for some cases, such as the following one:
usb3 port0 includes u2port0 and u3port0;
usb2 port0 includes u2port1;
If we want to support only HS, we can use u2port0 or u2port1; when
u2port0 is selected, u3port0 is not needed.
If we want to support SS, we can combine u2port0 and u3port0, or u2port1
and u3port0; if the latter is selected, u2port0 is not needed.

So it's more flexible to split the usb3 port into two ones and also try
our best to save power by disabling unnecessary ports.

Signed-off-by: Chunfeng Yun 
---
 drivers/phy/phy-mt65xx-usb3.c |  149 +
 1 file changed, 75 insertions(+), 74 deletions(-)

diff --git a/drivers/phy/phy-mt65xx-usb3.c b/drivers/phy/phy-mt65xx-usb3.c
index 4fd47d0..7fff482 100644
--- a/drivers/phy/phy-mt65xx-usb3.c
+++ b/drivers/phy/phy-mt65xx-usb3.c
@@ -30,11 +30,11 @@
 #define SSUSB_SIFSLV_SPLLC 0x
 #define SSUSB_SIFSLV_U2FREQ0x0100
 
-/* offsets of sub-segment in each port registers */
+/* offsets of banks in each u2phy registers */
 #define SSUSB_SIFSLV_U2PHY_COM_BASE0x
-#define SSUSB_SIFSLV_U3PHYD_BASE   0x0100
-#define SSUSB_USB30_PHYA_SIV_B_BASE0x0300
-#define SSUSB_SIFSLV_U3PHYA_DA_BASE0x0400
+/* offsets of banks in each u3phy registers */
+#define SSUSB_SIFSLV_U3PHYD_BASE   0x
+#define SSUSB_SIFSLV_U3PHYA_BASE   0x0200
 
 #define U3P_USBPHYACR0 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x)
 #define PA0_RG_U2PLL_FORCE_ON  BIT(15)
@@ -49,7 +49,6 @@
 #define PA5_RG_U2_HS_100U_U3_ENBIT(11)
 
 #define U3P_USBPHYACR6 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0018)
-#define PA6_RG_U2_ISO_EN   BIT(31)
 #define PA6_RG_U2_BC11_SW_EN   BIT(23)
 #define PA6_RG_U2_OTG_VBUSCMP_EN   BIT(20)
 #define PA6_RG_U2_SQTH GENMASK(3, 0)
@@ -91,18 +90,18 @@
 #define P2C_RG_SESSEND BIT(4)
 #define P2C_RG_AVALID  BIT(2)
 
-#define U3P_U3_PHYA_REG0   (SSUSB_USB30_PHYA_SIV_B_BASE + 0x)
+#define U3P_U3_PHYA_REG0   (SSUSB_SIFSLV_U3PHYA_BASE + 0x)
 #define P3A_RG_U3_VUSB10_ONBIT(5)
 
-#define U3P_U3_PHYA_REG6   (SSUSB_USB30_PHYA_SIV_B_BASE + 0x0018)
+#define U3P_U3_PHYA_REG6   (SSUSB_SIFSLV_U3PHYA_BASE + 0x0018)
 #define P3A_RG_TX_EIDLE_CM GENMASK(31, 28)
 #define P3A_RG_TX_EIDLE_CM_VAL(x)  ((0xf & (x)) << 28)
 
-#define U3P_U3_PHYA_REG9   (SSUSB_USB30_PHYA_SIV_B_BASE + 0x0024)
+#define U3P_U3_PHYA_REG9   (SSUSB_SIFSLV_U3PHYA_BASE + 0x0024)
 #define P3A_RG_RX_DAC_MUX  GENMASK(5, 1)
 #define P3A_RG_RX_DAC_MUX_VAL(x)   ((0x1f & (x)) << 1)
 
-#define U3P_U3PHYA_DA_REG0 (SSUSB_SIFSLV_U3PHYA_DA_BASE + 0x)
+#define U3P_U3PHYA_DA_REG0 (SSUSB_SIFSLV_U3PHYA_BASE + 0x0100)
 #define P3A_RG_XTAL_EXT_EN_U3  GENMASK(11, 10)
 #define P3A_RG_XTAL_EXT_EN_U3_VAL(x)   ((0x3 & (x)) << 10)
 
@@ -160,7 +159,7 @@ struct mt65xx_phy_instance {
 
 struct mt65xx_u3phy {
struct device *dev;
-   void __iomem *sif_base; /* include sif2, but exclude port's */
+   void __iomem *sif_base; /* only shared sif */
struct clk *u3phya_ref; /* reference clock of usb3 anolog phy */
const struct mt65xx_phy_pdata *pdata;
struct mt65xx_phy_instance **phys;
@@ -190,7 +189,7 @@ static void hs_slew_rate_calibrate(struct mt65xx_u3phy 
*u3phy,
tmp = readl(sif_base + U3P_U2FREQ_FMCR0);
tmp &= ~(P2F_RG_CYCLECNT | P2F_RG_MONCLK_SEL);
tmp |= P2F_RG_CYCLECNT_VAL(U3P_FM_DET_CYCLE_CNT);
-   tmp |= P2F_RG_MONCLK_SEL_VAL(instance->index);
+   tmp |= P2F_RG_MONCLK_SEL_VAL(instance->index >> 1);
writel(tmp, sif_base + U3P_U2FREQ_FMCR0);
 
/* enable frequency meter */
@@ -238,6 +237,56 @@ static void hs_slew_rate_calibrate(struct mt65xx_u3phy 
*u3phy,
writel(tmp, instance->port_base + U3P_USBPHYACR5);
 }
 
+static void u3_phy_instance_init(struct mt65xx_u3phy *u3phy,
+   struct mt65xx_phy_instance *instance)
+{
+   void __iomem *port_base = instance->port_base;
+   u32 tmp;
+
+   /* gating PCIe Analog XTAL clock */
+   tmp = readl(u3phy->sif_base + U3P_XTALCTL3);
+   tmp |= XC3_RG_U3_XTAL_RX_PWD | XC3_RG_U3_FRC_XTAL_RX_PWD;
+   writel(tmp, u3phy->sif_base + U3P_XTALCTL3);
+
+   /* gating XSQ */
+   tmp = readl(port_base + U3P_U3PHYA_DA_REG0);
+   tmp &= ~P3A_RG_XTAL_EXT_EN_U3;
+   tmp |= P3A_RG_XTAL_EXT_EN_U3_VAL(2);
+   writel(tmp, port_base + U3P_U3PHYA_DA_REG0);
+
+   tmp = readl(port_base + U3P_U3_PHYA_REG9);
+   tmp &= ~P3A_RG_RX_DAC_MUX;
+   tmp |= P3A_RG_RX_DAC_MUX_VAL(4);
+   writel(tmp, port_base + U3P_U3_PHYA_REG9);
+
+   tmp = readl(port_base + U3P_U3_PHYA_REG6);
+   tmp &= ~P3A_RG_TX_EIDLE_CM;
+   tmp |= P3A_RG_TX_EIDLE_CM_VAL(0xe);
+   writel(tmp, port_base + U3P_U3_PHYA_REG6);
+
+   tmp = readl(port_base + U3P_PHYD_CDR1);
+   tmp &= ~(P3

Re: [PATCH V11 10/10] arm/arm64: KVM: add guest SEA support

2017-03-06 Thread Baicar, Tyler

Hello James,


On 3/6/2017 3:28 AM, James Morse wrote:

On 28/02/17 19:43, Baicar, Tyler wrote:

On 2/24/2017 3:42 AM, James Morse wrote:

On 21/02/17 21:22, Tyler Baicar wrote:

Currently external aborts are unsupported by the guest abort
handling. Add handling for SEAs so that the host kernel reports
SEAs which occur in the guest kernel.
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index b2d57fc..403277b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -602,6 +602,24 @@ static const char *fault_name(unsigned int esr)
   }

   /*
+ * Handle Synchronous External Aborts that occur in a guest kernel.
+ */
+int handle_guest_sea(unsigned long addr, unsigned int esr)
+{
+	if(IS_ENABLED(HAVE_ACPI_APEI_SEA)) {
+		nmi_enter();
+		ghes_notify_sea();
+		nmi_exit();

This nmi stuff was needed for synchronous aborts that may have interrupted
APEI's interrupts-masked code. We want to avoid trying to take the same set of
locks, hence taking the in_nmi() path through APEI. Here we know we interrupted
a guest, so there is no risk that we have interrupted APEI on the host.
ghes_notify_sea() can safely take the normal path.

Makes sense, I can remove the nmi_* calls here.

It just occurs to me: if we do this we need to add the rcu_read_lock() in
ghes_notify_sea() as it's not protected by the rcu/nmi weirdness.

True, would you suggest leaving these nmi_* calls or adding the rcu_* 
calls? And since that's only needed for this KVM case, shouldn't the 
rcu_* calls just replace the nmi_* calls here (outside of ghes_notify_sea)?


Thanks,
Tyler

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.
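
For readers following the thread, one possible shape of the change under
discussion, replacing the nmi_enter()/nmi_exit() pair with RCU protection,
might look as follows. This is only a sketch of the open question (including
where the rcu_read_lock() should live, which is exactly what is being debated
above), not a settled patch; the return-value convention and the CONFIG_
prefix on the IS_ENABLED() check are assumptions.

int handle_guest_sea(unsigned long addr, unsigned int esr)
{
	if (IS_ENABLED(CONFIG_HAVE_ACPI_APEI_SEA)) {
		/* We know we interrupted a guest, so there is no risk of
		 * having interrupted APEI on the host; the normal
		 * (non-NMI) path through APEI is safe here. */
		rcu_read_lock();
		ghes_notify_sea();
		rcu_read_unlock();
		return 0;
	}

	return -ENOENT;
}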



[RESEND PATCH v3 2/8] phy: phy-mt65xx-usb3: increase LFPS filter threshold

2017-03-06 Thread Chunfeng Yun
Increase the LFPS filter threshold to avoid spurious remote wakeup
signals, which cause the U3 link to fail and fall back to a U2-only
link at about 0.01% probability.

Signed-off-by: Chunfeng Yun 
---
 drivers/phy/phy-mt65xx-usb3.c |9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/phy/phy-mt65xx-usb3.c b/drivers/phy/phy-mt65xx-usb3.c
index fe2392a..4fd47d0 100644
--- a/drivers/phy/phy-mt65xx-usb3.c
+++ b/drivers/phy/phy-mt65xx-usb3.c
@@ -106,6 +106,10 @@
 #define P3A_RG_XTAL_EXT_EN_U3  GENMASK(11, 10)
 #define P3A_RG_XTAL_EXT_EN_U3_VAL(x)   ((0x3 & (x)) << 10)
 
+#define U3P_U3_PHYD_LFPS1  (SSUSB_SIFSLV_U3PHYD_BASE + 0x000c)
+#define P3D_RG_FWAKE_THGENMASK(21, 16)
+#define P3D_RG_FWAKE_TH_VAL(x) ((0x3f & (x)) << 16)
+
 #define U3P_PHYD_CDR1  (SSUSB_SIFSLV_U3PHYD_BASE + 0x005c)
 #define P3D_RG_CDR_BIR_LTD1GENMASK(28, 24)
 #define P3D_RG_CDR_BIR_LTD1_VAL(x) ((0x1f & (x)) << 24)
@@ -303,6 +307,11 @@ static void phy_instance_init(struct mt65xx_u3phy *u3phy,
tmp |= P3D_RG_CDR_BIR_LTD0_VAL(0xc) | P3D_RG_CDR_BIR_LTD1_VAL(0x3);
writel(tmp, port_base + U3P_PHYD_CDR1);
 
+   tmp = readl(port_base + U3P_U3_PHYD_LFPS1);
+   tmp &= ~P3D_RG_FWAKE_TH;
+   tmp |= P3D_RG_FWAKE_TH_VAL(0x34);
+   writel(tmp, port_base + U3P_U3_PHYD_LFPS1);
+
tmp = readl(port_base + U3P_U3_PHYD_RXDET1);
tmp &= ~P3D_RG_RXDET_STB2_SET;
tmp |= P3D_RG_RXDET_STB2_SET_VAL(0x10);
-- 
1.7.9.5



[RESEND PATCH v3 6/8] arm64: dts: mt8173: split usb SuperSpeed port into two ports

2017-03-06 Thread Chunfeng Yun
Split the old SuperSpeed port node into a HighSpeed one and a new
SuperSpeed one.

Signed-off-by: Chunfeng Yun 
---
 arch/arm64/boot/dts/mediatek/mt8173.dtsi |   19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/boot/dts/mediatek/mt8173.dtsi 
b/arch/arm64/boot/dts/mediatek/mt8173.dtsi
index 6922252..1dc4629 100644
--- a/arch/arm64/boot/dts/mediatek/mt8173.dtsi
+++ b/arch/arm64/boot/dts/mediatek/mt8173.dtsi
@@ -731,8 +731,9 @@
  <0 0x11280700 0 0x0100>;
reg-names = "mac", "ippc";
interrupts = ;
-   phys = <&phy_port0 PHY_TYPE_USB3>,
-  <&phy_port1 PHY_TYPE_USB2>;
+   phys = <&u2port0 PHY_TYPE_USB2>,
+  <&u3port0 PHY_TYPE_USB3>,
+  <&u2port1 PHY_TYPE_USB2>;
power-domains = <&scpsys MT8173_POWER_DOMAIN_USB>;
clocks = <&topckgen CLK_TOP_USB30_SEL>,
 <&clk26m>,
@@ -770,14 +771,20 @@
ranges;
status = "okay";
 
-   phy_port0: port@11290800 {
-   reg = <0 0x11290800 0 0x800>;
+   u2port0: usb-phy@11290800 {
+   reg = <0 0x11290800 0 0x100>;
#phy-cells = <1>;
status = "okay";
};
 
-   phy_port1: port@11291000 {
-   reg = <0 0x11291000 0 0x800>;
+   u3port0: usb-phy@11290900 {
+   reg = <0 0x11290900 0 0x700>;
+   #phy-cells = <1>;
+   status = "okay";
+   };
+
+   u2port1: usb-phy@11291000 {
+   reg = <0 0x11291000 0 0x100>;
#phy-cells = <1>;
status = "okay";
};
-- 
1.7.9.5



[RESEND PATCH v3 1/8] phy: phy-mt65xx-usb3: improve RX detection stable time

2017-03-06 Thread Chunfeng Yun
The default value of the RX detection stable time is 10us, and this
margin is too big for some critical cases, causing the U3 link to fail
and fall back to U2 (probability is about 1%). So change it to 5us.

Signed-off-by: Chunfeng Yun 
---
 drivers/phy/phy-mt65xx-usb3.c |   18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/phy/phy-mt65xx-usb3.c b/drivers/phy/phy-mt65xx-usb3.c
index d972067..fe2392a 100644
--- a/drivers/phy/phy-mt65xx-usb3.c
+++ b/drivers/phy/phy-mt65xx-usb3.c
@@ -112,6 +112,14 @@
 #define P3D_RG_CDR_BIR_LTD0GENMASK(12, 8)
 #define P3D_RG_CDR_BIR_LTD0_VAL(x) ((0x1f & (x)) << 8)
 
+#define U3P_U3_PHYD_RXDET1 (SSUSB_SIFSLV_U3PHYD_BASE + 0x128)
+#define P3D_RG_RXDET_STB2_SET  GENMASK(17, 9)
+#define P3D_RG_RXDET_STB2_SET_VAL(x)   ((0x1ff & (x)) << 9)
+
+#define U3P_U3_PHYD_RXDET2 (SSUSB_SIFSLV_U3PHYD_BASE + 0x12c)
+#define P3D_RG_RXDET_STB2_SET_P3   GENMASK(8, 0)
+#define P3D_RG_RXDET_STB2_SET_P3_VAL(x)(0x1ff & (x))
+
 #define U3P_XTALCTL3   (SSUSB_SIFSLV_SPLLC + 0x0018)
 #define XC3_RG_U3_XTAL_RX_PWD  BIT(9)
 #define XC3_RG_U3_FRC_XTAL_RX_PWD  BIT(8)
@@ -295,6 +303,16 @@ static void phy_instance_init(struct mt65xx_u3phy *u3phy,
tmp |= P3D_RG_CDR_BIR_LTD0_VAL(0xc) | P3D_RG_CDR_BIR_LTD1_VAL(0x3);
writel(tmp, port_base + U3P_PHYD_CDR1);
 
+   tmp = readl(port_base + U3P_U3_PHYD_RXDET1);
+   tmp &= ~P3D_RG_RXDET_STB2_SET;
+   tmp |= P3D_RG_RXDET_STB2_SET_VAL(0x10);
+   writel(tmp, port_base + U3P_U3_PHYD_RXDET1);
+
+   tmp = readl(port_base + U3P_U3_PHYD_RXDET2);
+   tmp &= ~P3D_RG_RXDET_STB2_SET_P3;
+   tmp |= P3D_RG_RXDET_STB2_SET_P3_VAL(0x10);
+   writel(tmp, port_base + U3P_U3_PHYD_RXDET2);
+
dev_dbg(u3phy->dev, "%s(%d)\n", __func__, index);
 }
 
-- 
1.7.9.5
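
The read-modify-write sequence above recurs throughout this driver; a
hypothetical helper expressing the same pattern (not part of the posted
series, the name is illustrative) would be:

static inline void u3p_update_bits(void __iomem *reg, u32 mask, u32 val)
{
	u32 tmp = readl(reg);

	tmp &= ~mask;
	tmp |= val & mask;
	writel(tmp, reg);
}

With it, the RXDET1 hunk above would collapse to a single call:
u3p_update_bits(port_base + U3P_U3_PHYD_RXDET1, P3D_RG_RXDET_STB2_SET,
P3D_RG_RXDET_STB2_SET_VAL(0x10));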



[RESEND PATCH v3 7/8] arm64: dts: mt8173: move clock from phy node into port nodes

2017-03-06 Thread Chunfeng Yun
There is a reference clock for each port: the HighSpeed port uses 48MHz
and the SuperSpeed port uses 26MHz, which usually comes directly from
the 26MHz oscillator, but on some SoCs it does not. It is more flexible
to move the clock into the port nodes.

Signed-off-by: Chunfeng Yun 
---
 arch/arm64/boot/dts/mediatek/mt8173.dtsi |8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/boot/dts/mediatek/mt8173.dtsi 
b/arch/arm64/boot/dts/mediatek/mt8173.dtsi
index 1dc4629..1c9e0d5 100644
--- a/arch/arm64/boot/dts/mediatek/mt8173.dtsi
+++ b/arch/arm64/boot/dts/mediatek/mt8173.dtsi
@@ -764,8 +764,6 @@
u3phy: usb-phy@1129 {
compatible = "mediatek,mt8173-u3phy";
reg = <0 0x1129 0 0x800>;
-   clocks = <&apmixedsys CLK_APMIXED_REF2USB_TX>;
-   clock-names = "u3phya_ref";
#address-cells = <2>;
#size-cells = <2>;
ranges;
@@ -773,18 +771,24 @@
 
u2port0: usb-phy@11290800 {
reg = <0 0x11290800 0 0x100>;
+   clocks = <&apmixedsys CLK_APMIXED_REF2USB_TX>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
 
u3port0: usb-phy@11290900 {
reg = <0 0x11290900 0 0x700>;
+   clocks = <&clk26m>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
 
u2port1: usb-phy@11291000 {
reg = <0 0x11291000 0 0x100>;
+   clocks = <&apmixedsys CLK_APMIXED_REF2USB_TX>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
-- 
1.7.9.5



[PATCHv4 33/33] x86/mm: allow to have userspace mappings above 47-bits

2017-03-06 Thread Kirill A. Shutemov
On x86, 5-level paging enables 56-bit userspace virtual address space.
Not all user space is ready to handle wide addresses. It's known that
at least some JIT compilers use higher bits in pointers to encode their
information. It collides with valid pointers with 5-level paging and
leads to crashes.

To mitigate this, we are not going to allocate virtual address space
above 47-bit by default.

But userspace can ask for allocation from the full address space by
specifying a hint address (with or without MAP_FIXED) above 47 bits.

If the hint address is above 47 bits but MAP_FIXED is not specified, we
try to look for an unmapped area at the specified address. If it's
already occupied, we look for an unmapped area in the *full* address
space, rather than in the 47-bit window.

This approach makes it easy for an application's memory allocator to
become aware of the large address space without manually tracking
allocated virtual address space.
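
A minimal userspace sketch of the opt-in described above, assuming a kernel
with this series applied (the hint value is arbitrary as long as it is above
the 47-bit boundary):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* Any hint above the 47-bit boundary opts this allocation into
	 * the full 56-bit address space; without a hint, mmap() stays
	 * below DEFAULT_MAP_WINDOW. */
	void *hint = (void *)(1UL << 48);
	void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	printf("mapped at %p\n", p);
	return 0;
}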

One important case we need to handle here is interaction with MPX.
MPX (without the MAWA extension) cannot handle addresses above 47-bit,
so we need to make sure that MPX cannot be enabled if we already have a
VMA above the boundary, and forbid creating such VMAs once MPX is enabled.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/elf.h   |  2 +-
 arch/x86/include/asm/mpx.h   |  9 +
 arch/x86/include/asm/processor.h |  9 ++---
 arch/x86/kernel/sys_x86_64.c | 28 +++-
 arch/x86/mm/hugetlbpage.c| 31 +++
 arch/x86/mm/mmap.c   |  4 ++--
 arch/x86/mm/mpx.c| 33 -
 7 files changed, 104 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 9d49c18b5ea9..265625b0d6cb 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -250,7 +250,7 @@ extern int force_personality32;
the loader.  We need to make sure that it is out of the way of the program
that it will "exec", and that there is sufficient room for the brk.  */
 
-#define ELF_ET_DYN_BASE(TASK_SIZE / 3 * 2)
+#define ELF_ET_DYN_BASE(DEFAULT_MAP_WINDOW / 3 * 2)
 
 /* This yields a mask that user programs can use to figure out what
instruction set this CPU supports.  This could be done in user space,
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index a0d662be4c5b..7d7404756bb4 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm)
 }
 void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
  unsigned long start, unsigned long end);
+
+unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
+   unsigned long flags);
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs)
 {
@@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm,
unsigned long start, unsigned long end)
 {
 }
+
+static inline unsigned long mpx_unmapped_area_check(unsigned long addr,
+   unsigned long len, unsigned long flags)
+{
+   return addr;
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f385eca5407a..da8ab4f2d0c7 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -799,6 +799,7 @@ static inline void spin_lock_prefetch(const void *x)
  */
 #define TASK_SIZE  PAGE_OFFSET
 #define TASK_SIZE_MAX  TASK_SIZE
+#define DEFAULT_MAP_WINDOW TASK_SIZE
 #define STACK_TOP  TASK_SIZE
 #define STACK_TOP_MAX  STACK_TOP
 
@@ -838,7 +839,9 @@ static inline void spin_lock_prefetch(const void *x)
  * particular problem by preventing anything from being mapped
  * at the maximum canonical address.
  */
-#define TASK_SIZE_MAX  ((1UL << 47) - PAGE_SIZE)
+#define TASK_SIZE_MAX  ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
@@ -851,7 +854,7 @@ static inline void spin_lock_prefetch(const void *x)
 #define TASK_SIZE_OF(child)((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
IA32_PAGE_OFFSET : TASK_SIZE_MAX)
 
-#define STACK_TOP  TASK_SIZE
+#define STACK_TOP  DEFAULT_MAP_WINDOW
 #define STACK_TOP_MAX  TASK_SIZE_MAX
 
 #define INIT_THREAD  { \
@@ -873,7 +876,7 @@ extern void start_thread(struct pt_regs *regs, unsigned 
long new_ip,
  * This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
  */
-#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_UNMAPPED_BASE (PAGE_ALIGN(DEFAULT_MAP_WINDOW / 

Re: [PATCH net] team: use ETH_MAX_MTU as max mtu

2017-03-06 Thread Jiri Pirko
Mon, Mar 06, 2017 at 02:48:58PM CET, ja...@redhat.com wrote:
>This restores the ability to set a team device's mtu to anything higher
>than 1500. Similar to the reported issue with bonding, the team driver
>calls ether_setup(), which sets an initial max_mtu of 1500, while the
>underlying hardware can handle something much larger. Just set it to
>ETH_MAX_MTU to support all possible values, and the limitations of the
>underlying devices will prevent setting anything too large.
>
>Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra")
>CC: Cong Wang 
>CC: Jiri Pirko 
>CC: net...@vger.kernel.org
>Signed-off-by: Jarod Wilson 

Acked-by: Jiri Pirko 


Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR

2017-03-06 Thread Dmitry Safonov
2017-02-21 15:42 GMT+03:00 Kirill A. Shutemov :
> On Tue, Feb 21, 2017 at 02:54:20PM +0300, Dmitry Safonov wrote:
>> 2017-02-17 19:50 GMT+03:00 Andy Lutomirski :
>> > On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov
>> >  wrote:
>> >> This patch introduces two new prctl(2) handles to manage maximum virtual
>> >> address available to userspace to map.
>> ...
>> > Anyway, can you and Dmitry try to reconcile your patches?
>>
>> So, how can I help that?
>> Is there the patch's version, on which I could rebase?
>> Here are BTW the last patches, which I will resend with trivial ifdef-fixup
>> after the merge window:
>> http://marc.info/?i=20170214183621.2537-1-dsafonov%20()%20virtuozzo%20!%20com
>
> Could you check if this patch collides with anything you do:
>
> http://lkml.kernel.org/r/20170220131515.ga9...@node.shutemov.name

Ok, sorry for the late reply - it was the merge window anyway and I've got
urgent work to do.

Let's see:

I'll need minor merge fixup here:
>-#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
>+#define TASK_UNMAPPED_BASE (PAGE_ALIGN(DEFAULT_MAP_WINDOW / 3))
while in my patches:
>+#define __TASK_UNMAPPED_BASE(task_size)(PAGE_ALIGN(task_size / 3))
>+#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE)

This should be just fine with my changes:
>- info.high_limit = end;
>+ info.high_limit = min(end, DEFAULT_MAP_WINDOW);

This will need another minor fixup:
>-#define MAX_GAP (TASK_SIZE/6*5)
>+#define MAX_GAP (DEFAULT_MAP_WINDOW/6*5)
I've moved it from macro to mmap_base() as local var,
which depends on task_size parameter.

That's all, as far as I can see at this moment.
Does not seems hard to fix. So I suggest sending patches sets
in parallel, the second accepted will rebase the set.
Is it convenient for you?
If you have/will have some questions about my patches, I'll be
open to answer.

-- 
 Dmitry


[PATCHv4 32/33] x86: enable 5-level paging support

2017-03-06 Thread Kirill A. Shutemov
Most things are in place and we can enable support of 5-level paging.

Enabling XEN with 5-level paging requires more work. The patch makes XEN
dependent on !X86_5LEVEL.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/Kconfig | 5 +
 arch/x86/xen/Kconfig | 1 +
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 747f06f00a22..43b3343402f5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -317,6 +317,7 @@ config FIX_EARLYCON_MEM
 
 config PGTABLE_LEVELS
int
+   default 5 if X86_5LEVEL
default 4 if X86_64
default 3 if X86_PAE
default 2
@@ -1381,6 +1382,10 @@ config X86_PAE
  has the cost of more pagetable lookup overhead, and also
  consumes more pagetable space per process.
 
+config X86_5LEVEL
+   bool "Enable 5-level page tables support"
+   depends on X86_64
+
 config ARCH_PHYS_ADDR_T_64BIT
def_bool y
depends on X86_64 || X86_PAE
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 76b6dbd627df..b90d481ce5a1 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -5,6 +5,7 @@
 config XEN
bool "Xen guest support"
depends on PARAVIRT
+   depends on !X86_5LEVEL
select PARAVIRT_CLOCK
select XEN_HAVE_PVMMU
select XEN_HAVE_VPMU
-- 
2.11.0



Re: Question Regarding ERMS memcpy

2017-03-06 Thread Borislav Petkov
On Mon, Mar 06, 2017 at 05:41:22AM -0800, h...@zytor.com wrote:
> It isn't really that straightforward IMO.
>
> For UC memory transaction size really needs to be specified explicitly
> at all times and should be part of the API, rather than implicit.
>
> For WC/WT/WB device memory, the ordinary memcpy is valid and
> preferred.

I'm practically partially reverting

6175ddf06b61 ("x86: Clean up mem*io functions.")

Are you saying, this was wrong before too?

Maybe it was wrong, strictly speaking, but maybe that was good enough
for our purposes...

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 
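
To make the point about explicit transaction sizes concrete: for UC mappings
a driver would spell out the access width with the io accessors rather than
rely on whatever width memcpy happens to emit. A generic sketch (standard
kernel API, not taken from any patch in this thread; the function name is
illustrative):

/* Copy from UC MMIO with an explicit 32-bit transaction size, instead
 * of letting memcpy pick string ops of unknown width. */
static void copy_from_uc_mmio(u32 *dst, const void __iomem *src,
			      size_t words)
{
	size_t i;

	for (i = 0; i < words; i++)
		dst[i] = readl(src + i * 4);
}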


[PATCHv4 11/33] x86/ident_map: add 5-level paging support

2017-03-06 Thread Kirill A. Shutemov
Nothing special: just handle one more level.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/mm/ident_map.c | 47 ---
 1 file changed, 40 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index 4473cb4f8b90..2c9a62282fb1 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -45,6 +45,34 @@ static int ident_pud_init(struct x86_mapping_info *info, 
pud_t *pud_page,
return 0;
 }
 
+static int ident_p4d_init(struct x86_mapping_info *info, p4d_t *p4d_page,
+ unsigned long addr, unsigned long end)
+{
+   unsigned long next;
+
+   for (; addr < end; addr = next) {
+   p4d_t *p4d = p4d_page + p4d_index(addr);
+   pud_t *pud;
+
+   next = (addr & P4D_MASK) + P4D_SIZE;
+   if (next > end)
+   next = end;
+
+   if (p4d_present(*p4d)) {
+   pud = pud_offset(p4d, 0);
+   ident_pud_init(info, pud, addr, next);
+   continue;
+   }
+   pud = (pud_t *)info->alloc_pgt_page(info->context);
+   if (!pud)
+   return -ENOMEM;
+   ident_pud_init(info, pud, addr, next);
+   set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
+   }
+
+   return 0;
+}
+
 int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
  unsigned long pstart, unsigned long pend)
 {
@@ -55,27 +83,32 @@ int kernel_ident_mapping_init(struct x86_mapping_info 
*info, pgd_t *pgd_page,
 
for (; addr < end; addr = next) {
pgd_t *pgd = pgd_page + pgd_index(addr);
-   pud_t *pud;
+   p4d_t *p4d;
 
next = (addr & PGDIR_MASK) + PGDIR_SIZE;
if (next > end)
next = end;
 
if (pgd_present(*pgd)) {
-   pud = pud_offset(pgd, 0);
-   result = ident_pud_init(info, pud, addr, next);
+   p4d = p4d_offset(pgd, 0);
+   result = ident_p4d_init(info, p4d, addr, next);
if (result)
return result;
continue;
}
 
-   pud = (pud_t *)info->alloc_pgt_page(info->context);
-   if (!pud)
+   p4d = (p4d_t *)info->alloc_pgt_page(info->context);
+   if (!p4d)
return -ENOMEM;
-   result = ident_pud_init(info, pud, addr, next);
+   result = ident_p4d_init(info, p4d, addr, next);
if (result)
return result;
-   set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
+   if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+   set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+   } else {
+   pud_t *pud = pud_offset(p4d, 0);
+   set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
+   }
}
 
return 0;
-- 
2.11.0



[PATCHv4 07/33] mm: introduce __p4d_alloc()

2017-03-06 Thread Kirill A. Shutemov
For full 5-level paging we need a helper to allocate p4d page table.

Signed-off-by: Kirill A. Shutemov 
---
 mm/memory.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 7f1c2163b3ce..235ba51b2fbf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3906,6 +3906,29 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned 
long address,
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
 
+#ifndef __PAGETABLE_P4D_FOLDED
+/*
+ * Allocate p4d page table.
+ * We've already handled the fast-path in-line.
+ */
+int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+{
+   p4d_t *new = p4d_alloc_one(mm, address);
+   if (!new)
+   return -ENOMEM;
+
+   smp_wmb(); /* See comment in __pte_alloc */
+
+   spin_lock(&mm->page_table_lock);
+   if (pgd_present(*pgd))  /* Another has populated it */
+   p4d_free(mm, new);
+   else
+   pgd_populate(mm, pgd, new);
+   spin_unlock(&mm->page_table_lock);
+   return 0;
+}
+#endif /* __PAGETABLE_P4D_FOLDED */
+
 #ifndef __PAGETABLE_PUD_FOLDED
 /*
  * Allocate page upper directory.
-- 
2.11.0
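
For context, the slow path above is presumably paired with an inline fast
path mirroring the existing pud_alloc()/pmd_alloc() pattern; a sketch of
what that pairing usually looks like (not quoted from this patch):

static inline p4d_t *p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
			       unsigned long address)
{
	/* Fast path: only fall into __p4d_alloc() when the pgd entry
	 * is still empty; otherwise just compute the p4d pointer. */
	return (unlikely(pgd_none(*pgd)) && __p4d_alloc(mm, pgd, address)) ?
		NULL : p4d_offset(pgd, address);
}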



[PATCHv4 13/33] x86/power: support p4d_t in hibernate code

2017-03-06 Thread Kirill A. Shutemov
set_up_temporary_text_mapping() and relocate_restore_code() require
trivial adjustments to handle additional page table level.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/power/hibernate_64.c | 49 ++-
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
index ded2e8272382..9ec941638932 100644
--- a/arch/x86/power/hibernate_64.c
+++ b/arch/x86/power/hibernate_64.c
@@ -49,6 +49,7 @@ static int set_up_temporary_text_mapping(pgd_t *pgd)
 {
pmd_t *pmd;
pud_t *pud;
+   p4d_t *p4d;
 
/*
 * The new mapping only has to cover the page containing the image
@@ -63,6 +64,13 @@ static int set_up_temporary_text_mapping(pgd_t *pgd)
 * the virtual address space after switching over to the original page
 * tables used by the image kernel.
 */
+
+   if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+   p4d = (p4d_t *)get_safe_page(GFP_ATOMIC);
+   if (!p4d)
+   return -ENOMEM;
+   }
+
pud = (pud_t *)get_safe_page(GFP_ATOMIC);
if (!pud)
return -ENOMEM;
@@ -75,8 +83,15 @@ static int set_up_temporary_text_mapping(pgd_t *pgd)
__pmd((jump_address_phys & PMD_MASK) | 
__PAGE_KERNEL_LARGE_EXEC));
set_pud(pud + pud_index(restore_jump_address),
__pud(__pa(pmd) | _KERNPG_TABLE));
-   set_pgd(pgd + pgd_index(restore_jump_address),
-   __pgd(__pa(pud) | _KERNPG_TABLE));
+   if (IS_ENABLED(CONFIG_X86_5LEVEL)) {
+   set_p4d(p4d + p4d_index(restore_jump_address),
+   __p4d(__pa(pud) | _KERNPG_TABLE));
+   set_pgd(pgd + pgd_index(restore_jump_address),
+   __pgd(__pa(p4d) | _KERNPG_TABLE));
+   } else {
+   set_pgd(pgd + pgd_index(restore_jump_address),
+   __pgd(__pa(pud) | _KERNPG_TABLE));
+   }
 
return 0;
 }
@@ -124,7 +139,10 @@ static int set_up_temporary_mappings(void)
 static int relocate_restore_code(void)
 {
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
 
relocated_restore_code = get_safe_page(GFP_ATOMIC);
if (!relocated_restore_code)
@@ -134,22 +152,25 @@ static int relocate_restore_code(void)
 
/* Make the page containing the relocated code executable */
pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
-   pud = pud_offset(pgd, relocated_restore_code);
+   p4d = p4d_offset(pgd, relocated_restore_code);
+   if (p4d_large(*p4d)) {
+   set_p4d(p4d, __p4d(p4d_val(*p4d) & ~_PAGE_NX));
+   goto out;
+   }
+   pud = pud_offset(p4d, relocated_restore_code);
if (pud_large(*pud)) {
set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
-   } else {
-   pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
-
-   if (pmd_large(*pmd)) {
-   set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
-   } else {
-   pte_t *pte = pte_offset_kernel(pmd, 
relocated_restore_code);
-
-   set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
-   }
+   goto out;
+   }
+   pmd = pmd_offset(pud, relocated_restore_code);
+   if (pmd_large(*pmd)) {
+   set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+   goto out;
}
+   pte = pte_offset_kernel(pmd, relocated_restore_code);
+   set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+out:
__flush_tlb_all();
-
return 0;
 }
 
-- 
2.11.0



[PATCH net] team: use ETH_MAX_MTU as max mtu

2017-03-06 Thread Jarod Wilson
This restores the ability to set a team device's mtu to anything higher
than 1500. Similar to the reported issue with bonding, the team driver
calls ether_setup(), which sets an initial max_mtu of 1500, while the
underlying hardware can handle something much larger. Just set it to
ETH_MAX_MTU to support all possible values, and the limitations of the
underlying devices will prevent setting anything too large.
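
For context, the range check this interacts with in the core (paraphrased
from net/core/dev.c's dev_set_mtu(); the helper name is illustrative, and
logging is omitted) is roughly:

static int mtu_range_check(const struct net_device *dev, int new_mtu)
{
	if (new_mtu < dev->min_mtu)
		return -EINVAL;
	if (dev->max_mtu > 0 && new_mtu > dev->max_mtu)
		return -EINVAL;
	return 0;
}

which is why raising max_mtu on the team device is sufficient to restore
large MTUs.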

Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra")
CC: Cong Wang 
CC: Jiri Pirko 
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/net/team/team.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 4a24b5d15f5a..1b52520715ae 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -2072,6 +2072,7 @@ static int team_dev_type_check_change(struct net_device 
*dev,
 static void team_setup(struct net_device *dev)
 {
ether_setup(dev);
+   dev->max_mtu = ETH_MAX_MTU;
 
dev->netdev_ops = &team_netdev_ops;
dev->ethtool_ops = &team_ethtool_ops;
-- 
2.11.0



Build regressions/improvements in v4.11-rc1

2017-03-06 Thread Geert Uytterhoeven
Below is the list of build error/warning regressions/improvements in
v4.11-rc1[1] compared to v4.10[2].

Summarized:
  - build errors: +19/-1
  - build warnings: +1108/-835

Happy fixing! ;-)

Thanks to the linux-next team for providing the build service.

[1] 
http://kisskb.ellerman.id.au/kisskb/head/c1ae3cfa0e89fa1a7ecc4c99031f5e9ae99d9201/
 (all 266 configs)
[2] 
http://kisskb.ellerman.id.au/kisskb/head/c470abd4fde40ea6a0846a2beab642a578c0b8cd/
 (all 266 configs)


*** ERRORS ***

19 error regressions:
  + /home/kisskb/slave/src/arch/avr32/oprofile/backtrace.c: error: 
dereferencing pointer to incomplete type:  => 58
  + /home/kisskb/slave/src/arch/avr32/oprofile/backtrace.c: error: implicit 
declaration of function 'user_mode':  => 60
  + /home/kisskb/slave/src/arch/mips/cavium-octeon/cpu.c: error: implicit 
declaration of function 'task_stack_page' 
[-Werror=implicit-function-declaration]:  => 31:3
  + /home/kisskb/slave/src/arch/mips/cavium-octeon/cpu.c: error: invalid 
application of 'sizeof' to incomplete type 'struct pt_regs' :  => 31:3
  + /home/kisskb/slave/src/arch/mips/cavium-octeon/crypto/octeon-crypto.c: 
error: implicit declaration of function 'task_stack_page' 
[-Werror=implicit-function-declaration]:  => 35:6
  + /home/kisskb/slave/src/arch/mips/cavium-octeon/smp.c: error: implicit 
declaration of function 'task_stack_page' 
[-Werror=implicit-function-declaration]:  => 214:2
  + /home/kisskb/slave/src/arch/mips/include/asm/fpu.h: error: invalid 
application of 'sizeof' to incomplete type 'struct pt_regs' :  => 140:3, 188:2, 
138:3, 136:2
  + /home/kisskb/slave/src/arch/mips/include/asm/processor.h: error: invalid 
application of 'sizeof' to incomplete type 'struct pt_regs':  => 385:31
  + /home/kisskb/slave/src/arch/mips/kernel/smp-mt.c: error: implicit 
declaration of function 'task_stack_page' 
[-Werror=implicit-function-declaration]:  => 215:2
  + /home/kisskb/slave/src/arch/mips/sgi-ip27/ip27-berr.c: error: dereferencing 
pointer to incomplete type:  => 59:17, 66:13
  + /home/kisskb/slave/src/arch/mips/sgi-ip27/ip27-berr.c: error: implicit 
declaration of function 'force_sig' [-Werror=implicit-function-declaration]:  
=> 75:2
  + /home/kisskb/slave/src/arch/mips/sgi-ip32/ip32-berr.c: error: implicit 
declaration of function 'force_sig' [-Werror=implicit-function-declaration]:  
=> 31:2
  + /home/kisskb/slave/src/arch/openrisc/include/asm/atomic.h: Error: unknown 
opcode2 `l.lwa'.:  => 70, 107, 69
  + /home/kisskb/slave/src/arch/openrisc/include/asm/atomic.h: Error: unknown 
opcode2 `l.swa'.:  => 72, 71, 111
  + /home/kisskb/slave/src/arch/openrisc/include/asm/bitops/atomic.h: Error: 
unknown opcode2 `l.lwa'.:  => 18, 35, 70, 90
  + /home/kisskb/slave/src/arch/openrisc/include/asm/bitops/atomic.h: Error: 
unknown opcode2 `l.swa'.:  => 20, 37, 92, 72
  + /home/kisskb/slave/src/arch/openrisc/include/asm/cmpxchg.h: Error: unknown 
opcode2 `l.lwa'.:  => 68, 30
  + /home/kisskb/slave/src/arch/openrisc/include/asm/cmpxchg.h: Error: unknown 
opcode2 `l.swa'.:  => 34, 69
  + /home/kisskb/slave/src/drivers/char/nwbutton.c: error: implicit declaration 
of function 'kill_cad_pid' [-Werror=implicit-function-declaration]:  => 134:3

1 error improvements:
  - error: rtnetlink.c: relocation truncated to fit: R_AVR32_11H_PCREL against 
`.text'+217dc: (.text+0x21bec) => 


*** WARNINGS ***

1108 warning regressions:

[Deleted 1030 lines about "warning: -ffunction-sections disabled; it makes 
profiling impossible [enabled by default]" on parisc-allmodconfig]

  + /home/kisskb/slave/src/arch/arc/include/asm/kprobes.h: warning: 
'trap_is_kprobe' defined but not used [-Wunused-function]:  => 57:13
  + /home/kisskb/slave/src/arch/mips/include/asm/sibyte/bcm1480_scd.h: warning: 
"M_SPC_CFG_CLEAR" redefined:  => 274:0
  + /home/kisskb/slave/src/arch/mips/include/asm/sibyte/bcm1480_scd.h: warning: 
"M_SPC_CFG_ENABLE" redefined:  => 275:0
  + /home/kisskb/slave/src/arch/x86/hyperv/hv_init.c: warning: label 
'register_msr_cs' defined but not used [-Wunused-label]:  => 167:1
  + /home/kisskb/slave/src/arch/x86/kernel/e820.c: warning: 'gapstart' may be 
used uninitialized in this function [-Wuninitialized]:  => 643:16, 645:8
  + /home/kisskb/slave/src/crypto/ccm.c: warning: 'crypto_ccm_auth' uses 
dynamic stack allocation [enabled by default]:  => 235:1
  + /home/kisskb/slave/src/drivers/crypto/chelsio/chcr_algo.c: warning: 
'chcr_copy_assoc.isra.20' uses dynamic stack allocation [enabled by default]:  
=> 1336:1
  + /home/kisskb/slave/src/drivers/crypto/mediatek/mtk-sha.c: warning: 
'mtk_sha_finish_hmac' uses dynamic stack allocation [enabled by default]:  => 
371:1
  + /home/kisskb/slave/src/drivers/crypto/mediatek/mtk-sha.c: warning: 
'mtk_sha_setkey' uses dynamic stack allocation [enabled by default]:  => 880:1
  + 
/home/kisskb/slave/src/drivers/gpu/drm/nouveau/nvkm/subdev/secboot/acr_r352.c: 
warning: 'acr_r352_load' uses dynamic stack allocation [enabled by default]:  
=> 736:1
  + 
/home/kisskb/slave/src/d

[PATCHv4 26/33] x86/kasan: extend to support 5-level paging

2017-03-06 Thread Kirill A. Shutemov
This patch brings support for a non-folded additional page table level.

Signed-off-by: Kirill A. Shutemov 
Cc: Dmitry Vyukov 
= 5 && i < PTRS_PER_P4D; i++)
+   kasan_zero_p4d[i] = __p4d(p4d_val);
+
kasan_map_early_shadow(early_level4_pgt);
kasan_map_early_shadow(init_level4_pgt);
 }
-- 
2.11.0



[PATCHv4 03/33] asm-generic: introduce __ARCH_USE_5LEVEL_HACK

2017-03-06 Thread Kirill A. Shutemov
We are going to introduce <asm-generic/pgtable-nop4d.h> to provide an
abstraction for a properly (in contrast to the 5level-fixup.h hack) folded
p4d level. The new header will be included from pgtable-nopud.h.

If an architecture uses <asm-generic/pgtable-nopud.h>, we cannot use
5level-fixup.h directly to quickly convert the architecture to 5-level
paging as it would conflict with pgtable-nop4d.h.

With this patch an architecture can define __ARCH_USE_5LEVEL_HACK before
inclusion of <asm-generic/pgtable-nopud.h> to use 5level-fixup.h.

Signed-off-by: Kirill A. Shutemov 
---
 include/asm-generic/pgtable-nop4d-hack.h | 62 
 include/asm-generic/pgtable-nopud.h  |  5 +++
 2 files changed, 67 insertions(+)
 create mode 100644 include/asm-generic/pgtable-nop4d-hack.h

diff --git a/include/asm-generic/pgtable-nop4d-hack.h 
b/include/asm-generic/pgtable-nop4d-hack.h
new file mode 100644
index ..752fb7511750
--- /dev/null
+++ b/include/asm-generic/pgtable-nop4d-hack.h
@@ -0,0 +1,62 @@
+#ifndef _PGTABLE_NOP4D_HACK_H
+#define _PGTABLE_NOP4D_HACK_H
+
+#ifndef __ASSEMBLY__
+#include <asm-generic/5level-fixup.h>
+
+#define __PAGETABLE_PUD_FOLDED
+
+/*
+ * Having the pud type consist of a pgd gets the size right, and allows
+ * us to conceptually access the pgd entry that this pud is folded into
+ * without casting.
+ */
+typedef struct { pgd_t pgd; } pud_t;
+
+#define PUD_SHIFT  PGDIR_SHIFT
+#define PTRS_PER_PUD   1
+#define PUD_SIZE   (1UL << PUD_SHIFT)
+#define PUD_MASK   (~(PUD_SIZE-1))
+
+/*
+ * The "pgd_xxx()" functions here are trivial for a folded two-level
+ * setup: the pud is never bad, and a pud always exists (as it's folded
+ * into the pgd entry)
+ */
+static inline int pgd_none(pgd_t pgd)  { return 0; }
+static inline int pgd_bad(pgd_t pgd)   { return 0; }
+static inline int pgd_present(pgd_t pgd)   { return 1; }
+static inline void pgd_clear(pgd_t *pgd)   { }
+#define pud_ERROR(pud) (pgd_ERROR((pud).pgd))
+
+#define pgd_populate(mm, pgd, pud) do { } while (0)
+/*
+ * (puds are folded into pgds so this doesn't get actually called,
+ * but the define is needed for a generic inline function.)
+ */
+#define set_pgd(pgdptr, pgdval)set_pud((pud_t *)(pgdptr), (pud_t) { 
pgdval })
+
+static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
+{
+   return (pud_t *)pgd;
+}
+
+#define pud_val(x) (pgd_val((x).pgd))
+#define __pud(x)   ((pud_t) { __pgd(x) })
+
+#define pgd_page(pgd)  (pud_page((pud_t){ pgd }))
+#define pgd_page_vaddr(pgd)(pud_page_vaddr((pud_t){ pgd }))
+
+/*
+ * allocating and freeing a pud is trivial: the 1-entry pud is
+ * inside the pgd, so has no extra memory associated with it.
+ */
+#define pud_alloc_one(mm, address) NULL
+#define pud_free(mm, x)do { } while (0)
+#define __pud_free_tlb(tlb, x, a)  do { } while (0)
+
+#undef  pud_addr_end
+#define pud_addr_end(addr, end)(end)
+
+#endif /* __ASSEMBLY__ */
+#endif /* _PGTABLE_NOP4D_HACK_H */
diff --git a/include/asm-generic/pgtable-nopud.h 
b/include/asm-generic/pgtable-nopud.h
index 810431d8351b..5e49430a30a4 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -3,6 +3,10 @@
 
 #ifndef __ASSEMBLY__
 
+#ifdef __ARCH_USE_5LEVEL_HACK
+#include <asm-generic/pgtable-nop4d-hack.h>
+#else
+
 #define __PAGETABLE_PUD_FOLDED
 
 /*
@@ -58,4 +62,5 @@ static inline pud_t * pud_offset(pgd_t * pgd, unsigned long 
address)
 #define pud_addr_end(addr, end)(end)
 
 #endif /* __ASSEMBLY__ */
+#endif /* !__ARCH_USE_5LEVEL_HACK */
 #endif /* _PGTABLE_NOPUD_H */
-- 
2.11.0
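
From an architecture's side, the opt-in described in the commit message
looks like this (sketch; the file path is illustrative):

/* arch/foo/include/asm/pgtable.h -- illustrative only */
#define __ARCH_USE_5LEVEL_HACK
#include <asm-generic/pgtable-nopud.h>

The define routes pgtable-nopud.h to pgtable-nop4d-hack.h, which keeps the
architecture building with the p4d level folded away via 5level-fixup.h.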



[PATCHv4 24/33] x86/mm: basic defines/helpers for CONFIG_X86_5LEVEL

2017-03-06 Thread Kirill A. Shutemov
Extend the pagetable headers to support the new paging mode.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/include/asm/pgtable_64.h   | 11 +++
 arch/x86/include/asm/pgtable_64_types.h | 20 +++
 arch/x86/include/asm/pgtable_types.h| 10 +-
 arch/x86/mm/pgtable.c   | 34 -
 4 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h 
b/arch/x86/include/asm/pgtable_64.h
index 79396bfdc791..9991224f6238 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -35,6 +35,13 @@ extern void paging_init(void);
 #define pud_ERROR(e)   \
pr_err("%s:%d: bad pud %p(%016lx)\n",   \
   __FILE__, __LINE__, &(e), pud_val(e))
+
+#if CONFIG_PGTABLE_LEVELS >= 5
+#define p4d_ERROR(e)   \
+   pr_err("%s:%d: bad p4d %p(%016lx)\n",   \
+  __FILE__, __LINE__, &(e), p4d_val(e))
+#endif
+
 #define pgd_ERROR(e)   \
pr_err("%s:%d: bad pgd %p(%016lx)\n",   \
   __FILE__, __LINE__, &(e), pgd_val(e))
@@ -128,7 +135,11 @@ static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 
 static inline void native_p4d_clear(p4d_t *p4d)
 {
+#ifdef CONFIG_X86_5LEVEL
+   native_set_p4d(p4d, native_make_p4d(0));
+#else
native_set_p4d(p4d, (p4d_t) { .pgd = native_make_pgd(0)});
+#endif
 }
 
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
diff --git a/arch/x86/include/asm/pgtable_64_types.h 
b/arch/x86/include/asm/pgtable_64_types.h
index 00dc0c2b456e..7ae641fdbd07 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -23,12 +23,32 @@ typedef struct { pteval_t pte; } pte_t;
 
 #define SHARED_KERNEL_PMD  0
 
+#ifdef CONFIG_X86_5LEVEL
+
+/*
+ * PGDIR_SHIFT determines what a top-level page table entry can map
+ */
+#define PGDIR_SHIFT48
+#define PTRS_PER_PGD   512
+
+/*
+ * 4rd level page in 5-level paging case
+ */
+#define P4D_SHIFT  39
+#define PTRS_PER_P4D   512
+#define P4D_SIZE   (_AC(1, UL) << P4D_SHIFT)
+#define P4D_MASK   (~(P4D_SIZE - 1))
+
+#else  /* CONFIG_X86_5LEVEL */
+
 /*
  * PGDIR_SHIFT determines what a top-level page table entry can map
  */
 #define PGDIR_SHIFT39
 #define PTRS_PER_PGD   512
 
+#endif  /* CONFIG_X86_5LEVEL */
+
 /*
  * 3rd level page
  */
diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index 4930afe9df0a..bf9638e1ee42 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -273,9 +273,17 @@ static inline pgdval_t pgd_flags(pgd_t pgd)
 }
 
 #if CONFIG_PGTABLE_LEVELS > 4
+typedef struct { p4dval_t p4d; } p4d_t;
 
-#error FIXME
+static inline p4d_t native_make_p4d(pudval_t val)
+{
+   return (p4d_t) { val };
+}
 
+static inline p4dval_t native_p4d_val(p4d_t p4d)
+{
+   return p4d.p4d;
+}
 #else
 #include 
 
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 38b6daf72deb..d26b066944a5 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -81,6 +81,14 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
tlb_remove_page(tlb, virt_to_page(pud));
 }
+
+#if CONFIG_PGTABLE_LEVELS > 4
+void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
+{
+   paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
+   tlb_remove_page(tlb, virt_to_page(p4d));
+}
+#endif /* CONFIG_PGTABLE_LEVELS > 4 */
 #endif /* CONFIG_PGTABLE_LEVELS > 3 */
 #endif /* CONFIG_PGTABLE_LEVELS > 2 */
 
@@ -120,7 +128,7 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
   references from swapper_pg_dir. */
if (CONFIG_PGTABLE_LEVELS == 2 ||
(CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
-   CONFIG_PGTABLE_LEVELS == 4) {
+   CONFIG_PGTABLE_LEVELS >= 4) {
clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
@@ -582,6 +590,30 @@ void native_set_fixmap(enum fixed_addresses idx, 
phys_addr_t phys,
 }
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+#ifdef CONFIG_X86_5LEVEL
+/**
+ * p4d_set_huge - setup kernel P4D mapping
+ *
+ * No 512GB pages yet -- always return 0
+ *
+ * Returns 1 on success and 0 on failure.
+ */
+int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
+{
+   return 0;
+}
+
+/**
+ * p4d_clear_huge - clear kernel P4D mapping when it is set
+ *
+ * No 512GB pages yet -- always return 0
+ */
+int p4d_clear_huge(p4d_t *p4d)
+{
+   return 0;
+}
+#endif
+
 /**
  * pud_set_huge - setup kernel PUD mapping
  *
-- 
2.11.0
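
To make the new constants concrete, here is a small standalone userspace
program (an illustration, not part of the patch; it assumes 9-bit indexes at
every level, as the 512-entry tables above imply) that decomposes a virtual
address with the shifts from this patch:

#include <stdio.h>
#include <stdint.h>

/* Shifts as defined for CONFIG_X86_5LEVEL in this patch. */
#define PGDIR_SHIFT	48
#define P4D_SHIFT	39
#define PUD_SHIFT	30
#define PMD_SHIFT	21
#define PAGE_SHIFT	12
#define IDX(va, shift)	(((va) >> (shift)) & 0x1ff)	/* 512 entries */

int main(void)
{
	uint64_t va = 0x00ff123456789abcULL;	/* arbitrary example */

	printf("pgd=%llu p4d=%llu pud=%llu pmd=%llu pte=%llu off=%llu\n",
	       (unsigned long long)IDX(va, PGDIR_SHIFT),
	       (unsigned long long)IDX(va, P4D_SHIFT),
	       (unsigned long long)IDX(va, PUD_SHIFT),
	       (unsigned long long)IDX(va, PMD_SHIFT),
	       (unsigned long long)IDX(va, PAGE_SHIFT),
	       (unsigned long long)(va & 0xfff));
	return 0;
}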



[PATCHv4 06/33] mm: convert generic code to 5-level paging

2017-03-06 Thread Kirill A. Shutemov
Convert all non-architecture-specific code to 5-level paging.

It's mostly mechanical: add handling of one more page table level in
places where we deal with pud_t (the resulting walk pattern is sketched
below).
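
A sketch of the converted walk pattern, mirroring the userfaultfd_must_wait()
hunk in this patch (the function name here is illustrative):

static pte_t *walk_to_pte(struct mm_struct *mm, unsigned long address)
{
	pgd_t *pgd = pgd_offset(mm, address);
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;

	if (!pgd_present(*pgd))
		return NULL;
	p4d = p4d_offset(pgd, address);	/* the new level */
	if (!p4d_present(*p4d))
		return NULL;
	pud = pud_offset(p4d, address);	/* now takes a p4d, not a pgd */
	if (!pud_present(*pud))
		return NULL;
	pmd = pmd_offset(pud, address);
	if (!pmd_present(*pmd))
		return NULL;
	return pte_offset_map(pmd, address);
}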

Signed-off-by: Kirill A. Shutemov 
---
 drivers/misc/sgi-gru/grufault.c |   9 +-
 fs/userfaultfd.c|   6 +-
 include/asm-generic/pgtable.h   |  48 +-
 include/linux/hugetlb.h |   5 +-
 include/linux/kasan.h   |   1 +
 include/linux/mm.h  |  31 --
 lib/ioremap.c   |  39 +++-
 mm/gup.c|  46 +++--
 mm/huge_memory.c|   7 +-
 mm/hugetlb.c|  29 +++---
 mm/kasan/kasan_init.c   |  35 ++-
 mm/memory.c | 207 +---
 mm/mlock.c  |   1 +
 mm/mprotect.c   |  26 -
 mm/mremap.c |  13 ++-
 mm/page_vma_mapped.c|   6 +-
 mm/pagewalk.c   |  32 ++-
 mm/pgtable-generic.c|   6 ++
 mm/rmap.c   |   7 +-
 mm/sparse-vmemmap.c |  22 -
 mm/swapfile.c   |  26 -
 mm/userfaultfd.c|  23 +++--
 mm/vmalloc.c|  81 
 23 files changed, 586 insertions(+), 120 deletions(-)

diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index 6fb773dbcd0c..93be82fc338a 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -219,15 +219,20 @@ static int atomic_pte_lookup(struct vm_area_struct *vma, 
unsigned long vaddr,
int write, unsigned long *paddr, int *pageshift)
 {
pgd_t *pgdp;
-   pmd_t *pmdp;
+   p4d_t *p4dp;
pud_t *pudp;
+   pmd_t *pmdp;
pte_t pte;
 
pgdp = pgd_offset(vma->vm_mm, vaddr);
if (unlikely(pgd_none(*pgdp)))
goto err;
 
-   pudp = pud_offset(pgdp, vaddr);
+   p4dp = p4d_offset(pgdp, vaddr);
+   if (unlikely(p4d_none(*p4dp)))
+   goto err;
+
+   pudp = pud_offset(p4dp, vaddr);
if (unlikely(pud_none(*pudp)))
goto err;
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 973607df579d..02ce3944d0f5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -267,6 +267,7 @@ static inline bool userfaultfd_must_wait(struct 
userfaultfd_ctx *ctx,
 {
struct mm_struct *mm = ctx->mm;
pgd_t *pgd;
+   p4d_t *p4d;
pud_t *pud;
pmd_t *pmd, _pmd;
pte_t *pte;
@@ -277,7 +278,10 @@ static inline bool userfaultfd_must_wait(struct 
userfaultfd_ctx *ctx,
pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
goto out;
-   pud = pud_offset(pgd, address);
+   p4d = p4d_offset(pgd, address);
+   if (!p4d_present(*p4d))
+   goto out;
+   pud = pud_offset(p4d, address);
if (!pud_present(*pud))
goto out;
pmd = pmd_offset(pud, address);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f4ca23b158b3..1fad160f35de 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -10,9 +10,9 @@
 #include 
 #include 
 
-#if 4 - defined(__PAGETABLE_PUD_FOLDED) - defined(__PAGETABLE_PMD_FOLDED) != \
-   CONFIG_PGTABLE_LEVELS
-#error CONFIG_PGTABLE_LEVELS is not consistent with 
__PAGETABLE_{PUD,PMD}_FOLDED
+#if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
+   defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
+#error CONFIG_PGTABLE_LEVELS is not consistent with 
__PAGETABLE_{P4D,PUD,PMD}_FOLDED
 #endif
 
 /*
@@ -424,6 +424,13 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, 
pgprot_t newprot)
(__boundary - 1 < (end) - 1)? __boundary: (end);\
 })
 
+#ifndef p4d_addr_end
+#define p4d_addr_end(addr, end)
\
+({ unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;  \
+   (__boundary - 1 < (end) - 1)? __boundary: (end);\
+})
+#endif
+
 #ifndef pud_addr_end
 #define pud_addr_end(addr, end)
\
 ({ unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;  \
@@ -444,6 +451,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, 
pgprot_t newprot)
  * Do the tests inline, but report and clear the bad entry in mm/memory.c.
  */
 void pgd_clear_bad(pgd_t *);
+void p4d_clear_bad(p4d_t *);
 void pud_clear_bad(pud_t *);
 void pmd_clear_bad(pmd_t *);
 
@@ -458,6 +466,17 @@ static inline int pgd_none_or_clear_bad(pgd_t *pgd)
return 0;
 }
 
+static inline int p4d_none_or_clear_bad(p4d_t *p4d)
+{
+   if (p4d_none(*p4d))
+   return 1;
+   if (unlikely(p4d_bad(*p4d))) {
+   p4d_clear_bad(p4d);
+   return 1;
+   }
+   return 0;
+}
+
 static inline int pud_none_or_clear_bad(pud_t *pud)
 {
 

[PATCHv4 31/33] x86/mm: add support for 5-level paging for KASLR

2017-03-06 Thread Kirill A. Shutemov
With 5-level paging, randomization happens at the P4D level instead of PUD.

The maximum amount of physical memory is also bumped to 52 bits for
5-level paging.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/mm/kaslr.c | 82 -
 1 file changed, 63 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c
index 887e57182716..662e5c4b21c8 100644
--- a/arch/x86/mm/kaslr.c
+++ b/arch/x86/mm/kaslr.c
@@ -6,12 +6,12 @@
  *
  * Entropy is generated using the KASLR early boot functions now shared in
  * the lib directory (originally written by Kees Cook). Randomization is
- * done on PGD & PUD page table levels to increase possible addresses. The
- * physical memory mapping code was adapted to support PUD level virtual
- * addresses. This implementation on the best configuration provides 30,000
- * possible virtual addresses in average for each memory region. An additional
- * low memory page is used to ensure each CPU can start with a PGD aligned
- * virtual address (for realmode).
+ * done on PGD & P4D/PUD page table levels to increase possible addresses.
+ * The physical memory mapping code was adapted to support P4D/PUD level
+ * virtual addresses. This implementation on the best configuration provides
+ * 30,000 possible virtual addresses in average for each memory region.
+ * An additional low memory page is used to ensure each CPU can start with
+ * a PGD aligned virtual address (for realmode).
  *
  * The order of each memory region is not changed. The feature looks at
  * the available space for the regions based on different configuration
@@ -70,7 +70,8 @@ static __initdata struct kaslr_memory_region {
unsigned long *base;
unsigned long size_tb;
 } kaslr_regions[] = {
-   { &page_offset_base, 64/* Maximum */ },
+   { &page_offset_base,
+   1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ },
{ &vmalloc_base, VMALLOC_SIZE_TB },
{ &vmemmap_base, 1 },
 };
@@ -142,7 +143,10 @@ void __init kernel_randomize_memory(void)
 */
entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);
prandom_bytes_state(&rand_state, &rand, sizeof(rand));
-   entropy = (rand % (entropy + 1)) & PUD_MASK;
+   if (IS_ENABLED(CONFIG_X86_5LEVEL))
+   entropy = (rand % (entropy + 1)) & P4D_MASK;
+   else
+   entropy = (rand % (entropy + 1)) & PUD_MASK;
vaddr += entropy;
*kaslr_regions[i].base = vaddr;
 
@@ -151,27 +155,21 @@ void __init kernel_randomize_memory(void)
 * randomization alignment.
 */
vaddr += get_padding(&kaslr_regions[i]);
-   vaddr = round_up(vaddr + 1, PUD_SIZE);
+   if (IS_ENABLED(CONFIG_X86_5LEVEL))
+   vaddr = round_up(vaddr + 1, P4D_SIZE);
+   else
+   vaddr = round_up(vaddr + 1, PUD_SIZE);
remain_entropy -= entropy;
}
 }
 
-/*
- * Create PGD aligned trampoline table to allow real mode initialization
- * of additional CPUs. Consume only 1 low memory page.
- */
-void __meminit init_trampoline(void)
+static void __meminit init_trampoline_pud(void)
 {
unsigned long paddr, paddr_next;
pgd_t *pgd;
pud_t *pud_page, *pud_page_tramp;
int i;
 
-   if (!kaslr_memory_enabled()) {
-   init_trampoline_default();
-   return;
-   }
-
pud_page_tramp = alloc_low_page();
 
paddr = 0;
@@ -192,3 +190,49 @@ void __meminit init_trampoline(void)
set_pgd(&trampoline_pgd_entry,
__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
 }
+
+static void __meminit init_trampoline_p4d(void)
+{
+   unsigned long paddr, paddr_next;
+   pgd_t *pgd;
+   p4d_t *p4d_page, *p4d_page_tramp;
+   int i;
+
+   p4d_page_tramp = alloc_low_page();
+
+   paddr = 0;
+   pgd = pgd_offset_k((unsigned long)__va(paddr));
+   p4d_page = (p4d_t *) pgd_page_vaddr(*pgd);
+
+   for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) {
+   p4d_t *p4d, *p4d_tramp;
+   unsigned long vaddr = (unsigned long)__va(paddr);
+
+   p4d_tramp = p4d_page_tramp + p4d_index(paddr);
+   p4d = p4d_page + p4d_index(vaddr);
+   paddr_next = (paddr & P4D_MASK) + P4D_SIZE;
+
+   *p4d_tramp = *p4d;
+   }
+
+   set_pgd(&trampoline_pgd_entry,
+   __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp)));
+}
+
+/*
+ * Create PGD aligned trampoline table to allow real mode initialization
+ * of additional CPUs. Consume only 1 low memory page.
+ */
+void __meminit init_trampoline(void)
+{
+
+   if (!kaslr_memory_enabled()) {
+   init_trampoline_default();
+   return;
+   }
+
+   if (IS_ENABLED(CONFIG_X86_5LEVEL))
+   init_trampoline_p4d();
+   else
+   init_trampoline_pud();
+}

[PATCHv4 25/33] x86/dump_pagetables: support 5-level paging

2017-03-06 Thread Kirill A. Shutemov
Simple extension to support one more page table level.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/mm/dump_pagetables.c | 49 ---
 1 file changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 58b5bee7ea27..0effac6989cd 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -110,7 +110,8 @@ static struct addr_marker address_markers[] = {
 #define PTE_LEVEL_MULT (PAGE_SIZE)
 #define PMD_LEVEL_MULT (PTRS_PER_PTE * PTE_LEVEL_MULT)
 #define PUD_LEVEL_MULT (PTRS_PER_PMD * PMD_LEVEL_MULT)
-#define PGD_LEVEL_MULT (PTRS_PER_PUD * PUD_LEVEL_MULT)
+#define P4D_LEVEL_MULT (PTRS_PER_PUD * PUD_LEVEL_MULT)
+#define PGD_LEVEL_MULT (PTRS_PER_P4D * P4D_LEVEL_MULT)
 
 #define pt_dump_seq_printf(m, to_dmesg, fmt, args...)  \
 ({ \
@@ -347,7 +348,7 @@ static bool pud_already_checked(pud_t *prev_pud, pud_t 
*pud, bool checkwx)
return checkwx && prev_pud && (pud_val(*prev_pud) == pud_val(*pud));
 }
 
-static void walk_pud_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
+static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
unsigned long P)
 {
int i;
@@ -355,7 +356,7 @@ static void walk_pud_level(struct seq_file *m, struct 
pg_state *st, pgd_t addr,
pgprotval_t prot;
pud_t *prev_pud = NULL;
 
-   start = (pud_t *) pgd_page_vaddr(addr);
+   start = (pud_t *) p4d_page_vaddr(addr);
 
for (i = 0; i < PTRS_PER_PUD; i++) {
st->current_address = normalize_addr(P + i * PUD_LEVEL_MULT);
@@ -377,9 +378,43 @@ static void walk_pud_level(struct seq_file *m, struct 
pg_state *st, pgd_t addr,
 }
 
 #else
-#define walk_pud_level(m,s,a,p) walk_pmd_level(m,s,__pud(pgd_val(a)),p)
-#define pgd_large(a) pud_large(__pud(pgd_val(a)))
-#define pgd_none(a)  pud_none(__pud(pgd_val(a)))
+#define walk_pud_level(m,s,a,p) walk_pmd_level(m,s,__pud(p4d_val(a)),p)
+#define p4d_large(a) pud_large(__pud(p4d_val(a)))
+#define p4d_none(a)  pud_none(__pud(p4d_val(a)))
+#endif
+
+#if PTRS_PER_P4D > 1
+
+static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
+   unsigned long P)
+{
+   int i;
+   p4d_t *start;
+   pgprotval_t prot;
+
+   start = (p4d_t *) pgd_page_vaddr(addr);
+
+   for (i = 0; i < PTRS_PER_P4D; i++) {
+   st->current_address = normalize_addr(P + i * P4D_LEVEL_MULT);
+   if (!p4d_none(*start)) {
+   if (p4d_large(*start) || !p4d_present(*start)) {
+   prot = p4d_flags(*start);
+   note_page(m, st, __pgprot(prot), 2);
+   } else {
+   walk_pud_level(m, st, *start,
+  P + i * P4D_LEVEL_MULT);
+   }
+   } else
+   note_page(m, st, __pgprot(0), 2);
+
+   start++;
+   }
+}
+
+#else
+#define walk_p4d_level(m,s,a,p) walk_pud_level(m,s,__p4d(pgd_val(a)),p)
+#define pgd_large(a) p4d_large(__p4d(pgd_val(a)))
+#define pgd_none(a)  p4d_none(__p4d(pgd_val(a)))
 #endif
 
 static inline bool is_hypervisor_range(int idx)
@@ -424,7 +459,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, 
pgd_t *pgd,
prot = pgd_flags(*start);
note_page(m, &st, __pgprot(prot), 1);
} else {
-   walk_pud_level(m, &st, *start,
+   walk_p4d_level(m, &st, *start,
   i * PGD_LEVEL_MULT);
}
} else
-- 
2.11.0



[PATCHv4 04/33] arch, mm: convert all architectures to use 5level-fixup.h

2017-03-06 Thread Kirill A. Shutemov
If an architecture uses 4level-fixup.h we don't need to do anything as
it includes 5level-fixup.h.

If an architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK
before including the header. This makes the asm-generic code use
5level-fixup.h.

If an architecture has 4-level paging or folds levels on its own,
include 5level-fixup.h directly.
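
To make the second case concrete, the conversion for a pgtable-nop*d.h
user is just the following (a sketch using nopud as the example; the
actual header and file differ per architecture):

	/* arch/foo/include/asm/pgtable.h */
	#define __ARCH_USE_5LEVEL_HACK
	#include <asm-generic/pgtable-nopud.h>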

Signed-off-by: Kirill A. Shutemov 
---
 arch/arc/include/asm/hugepage.h  | 1 +
 arch/arc/include/asm/pgtable.h   | 1 +
 arch/arm/include/asm/pgtable.h   | 1 +
 arch/arm64/include/asm/pgtable-types.h   | 4 
 arch/avr32/include/asm/pgtable-2level.h  | 1 +
 arch/cris/include/asm/pgtable.h  | 1 +
 arch/frv/include/asm/pgtable.h   | 1 +
 arch/h8300/include/asm/pgtable.h | 1 +
 arch/hexagon/include/asm/pgtable.h   | 1 +
 arch/ia64/include/asm/pgtable.h  | 2 ++
 arch/metag/include/asm/pgtable.h | 1 +
 arch/mips/include/asm/pgtable-32.h   | 1 +
 arch/mips/include/asm/pgtable-64.h   | 1 +
 arch/mn10300/include/asm/page.h  | 1 +
 arch/nios2/include/asm/pgtable.h | 1 +
 arch/openrisc/include/asm/pgtable.h  | 1 +
 arch/powerpc/include/asm/book3s/32/pgtable.h | 1 +
 arch/powerpc/include/asm/book3s/64/pgtable.h | 3 +++
 arch/powerpc/include/asm/nohash/32/pgtable.h | 1 +
 arch/powerpc/include/asm/nohash/64/pgtable-4k.h  | 3 +++
 arch/powerpc/include/asm/nohash/64/pgtable-64k.h | 1 +
 arch/s390/include/asm/pgtable.h  | 1 +
 arch/score/include/asm/pgtable.h | 1 +
 arch/sh/include/asm/pgtable-2level.h | 1 +
 arch/sh/include/asm/pgtable-3level.h | 1 +
 arch/sparc/include/asm/pgtable_64.h  | 1 +
 arch/tile/include/asm/pgtable_32.h   | 1 +
 arch/tile/include/asm/pgtable_64.h   | 1 +
 arch/um/include/asm/pgtable-2level.h | 1 +
 arch/um/include/asm/pgtable-3level.h | 1 +
 arch/unicore32/include/asm/pgtable.h | 1 +
 arch/x86/include/asm/pgtable_types.h | 4 
 arch/xtensa/include/asm/pgtable.h| 1 +
 33 files changed, 44 insertions(+)

diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepage.h
index 317ff773e1ca..b18fcb606908 100644
--- a/arch/arc/include/asm/hugepage.h
+++ b/arch/arc/include/asm/hugepage.h
@@ -11,6 +11,7 @@
 #define _ASM_ARC_HUGEPAGE_H
 
 #include 
+#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 static inline pte_t pmd_pte(pmd_t pmd)
diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h
index e94ca72b974e..ee22d40afef4 100644
--- a/arch/arc/include/asm/pgtable.h
+++ b/arch/arc/include/asm/pgtable.h
@@ -37,6 +37,7 @@
 
 #include 
 #include 
+#define __ARCH_USE_5LEVEL_HACK
 #include 
 #include 
 
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index a8d656d9aec7..1c462381c225 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -20,6 +20,7 @@
 
 #else
 
+#define __ARCH_USE_5LEVEL_HACK
 #include 
 #include 
 #include 
diff --git a/arch/arm64/include/asm/pgtable-types.h 
b/arch/arm64/include/asm/pgtable-types.h
index 69b2fd41503c..345a072b5856 100644
--- a/arch/arm64/include/asm/pgtable-types.h
+++ b/arch/arm64/include/asm/pgtable-types.h
@@ -55,9 +55,13 @@ typedef struct { pteval_t pgprot; } pgprot_t;
 #define __pgprot(x)((pgprot_t) { (x) } )
 
 #if CONFIG_PGTABLE_LEVELS == 2
+#define __ARCH_USE_5LEVEL_HACK
 #include 
 #elif CONFIG_PGTABLE_LEVELS == 3
+#define __ARCH_USE_5LEVEL_HACK
 #include 
+#elif CONFIG_PGTABLE_LEVELS == 4
+#include 
 #endif
 
 #endif /* __ASM_PGTABLE_TYPES_H */
diff --git a/arch/avr32/include/asm/pgtable-2level.h 
b/arch/avr32/include/asm/pgtable-2level.h
index 425dd567b5b9..d5b1c63993ec 100644
--- a/arch/avr32/include/asm/pgtable-2level.h
+++ b/arch/avr32/include/asm/pgtable-2level.h
@@ -8,6 +8,7 @@
 #ifndef __ASM_AVR32_PGTABLE_2LEVEL_H
 #define __ASM_AVR32_PGTABLE_2LEVEL_H
 
+#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 /*
diff --git a/arch/cris/include/asm/pgtable.h b/arch/cris/include/asm/pgtable.h
index 2a3210ba4c72..fa3a73004cc5 100644
--- a/arch/cris/include/asm/pgtable.h
+++ b/arch/cris/include/asm/pgtable.h
@@ -6,6 +6,7 @@
 #define _CRIS_PGTABLE_H
 
 #include 
+#define __ARCH_USE_5LEVEL_HACK
 #include 
 
 #ifndef __ASSEMBLY__
diff --git a/arch/frv/include/asm/pgtable.h b/arch/frv/include/asm/pgtable.h
index a0513d463a1f..ab6e7e961b54 100644
--- a/arch/frv/include/asm/pgtable.h
+++ b/arch/frv/include/asm/pgtable.h
@@ -16,6 +16,7 @@
 #ifndef _ASM_PGTABLE_H
 #define _ASM_PGTABLE_H
 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/h8300/include/asm/pgtable.h b/arch/h8300/include/asm/pgtable.h
index 8341db67821d..7d265d28ba5e 100644
--- a/arch/h8300/include/asm/pgtable.h
+++ b/arch/h8300/include/asm/pgtable.h
@@ -1,5 +1,6 @@
 #ifndef _H8

[PATCHv4 29/33] x86/mm: add sync_global_pgds() for configuration with 5-level paging

2017-03-06 Thread Kirill A. Shutemov
This basically restores a slightly modified version of the original
sync_global_pgds() which we had before the folded p4d was introduced.

The only modification is protection against 'address' overflow.

Signed-off-by: Kirill A. Shutemov 
---
 arch/x86/mm/init_64.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7bdda6f1d135..5ba99090dc3c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -92,6 +92,42 @@ __setup("noexec32=", nonx32_setup);
  * When memory was added make sure all the processes MM have
  * suitable PGD entries in the local PGD level page.
  */
+#ifdef CONFIG_X86_5LEVEL
+void sync_global_pgds(unsigned long start, unsigned long end)
+{
+   unsigned long address;
+
+   for (address = start; address <= end && address >= start;
+   address += PGDIR_SIZE) {
+   const pgd_t *pgd_ref = pgd_offset_k(address);
+   struct page *page;
+
+   if (pgd_none(*pgd_ref))
+   continue;
+
+   spin_lock(&pgd_lock);
+   list_for_each_entry(page, &pgd_list, lru) {
+   pgd_t *pgd;
+   spinlock_t *pgt_lock;
+
+   pgd = (pgd_t *)page_address(page) + pgd_index(address);
+   /* the pgt_lock only for Xen */
+   pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+   spin_lock(pgt_lock);
+
+   if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+   BUG_ON(pgd_page_vaddr(*pgd)
+   != pgd_page_vaddr(*pgd_ref));
+
+   if (pgd_none(*pgd))
+   set_pgd(pgd, *pgd_ref);
+
+   spin_unlock(pgt_lock);
+   }
+   spin_unlock(&pgd_lock);
+   }
+}
+#else
 void sync_global_pgds(unsigned long start, unsigned long end)
 {
unsigned long address;
@@ -135,6 +171,7 @@ void sync_global_pgds(unsigned long start, unsigned long 
end)
spin_unlock(&pgd_lock);
}
 }
+#endif
 
 /*
  * NOTE: This function is marked __ref because it calls __init function
-- 
2.11.0



[PATCHv4 17/33] x86/kasan: prepare clear_pgds() to switch to

2017-03-06 Thread Kirill A. Shutemov
With a folded p4d, pgd_clear() is a nop. Change clear_pgds() to use
p4d_clear() instead.

Signed-off-by: Kirill A. Shutemov 
Cc: Dmitry Vyukov 
---
 arch/x86/mm/kasan_init_64.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 8d63d7a104c3..733f8ba6a01f 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -32,8 +32,15 @@ static int __init map_range(struct range *range)
 static void __init clear_pgds(unsigned long start,
unsigned long end)
 {
-   for (; start < end; start += PGDIR_SIZE)
-   pgd_clear(pgd_offset_k(start));
+   pgd_t *pgd;
+
+   for (; start < end; start += PGDIR_SIZE) {
+   pgd = pgd_offset_k(start);
+   if (CONFIG_PGTABLE_LEVELS < 5)
+   p4d_clear(p4d_offset(pgd, start));
+   else
+   pgd_clear(pgd);
+   }
 }
 
 static void __init kasan_map_early_shadow(pgd_t *pgd)
-- 
2.11.0



cfq-iosched: two questions about the hrtimer version of CFQ

2017-03-06 Thread Hou Tao
Hi Jan and list,

When testing the hrtimer version of CFQ, we found a performance degradation
problem which seems to be caused by commit 0b31c10 ("cfq-iosched: Charge at
least 1 jiffie instead of 1 ns").

The following is the test process:

* filesystem and block device
* XFS + /dev/sda mounted on /tmp/sda
* CFQ configuration
* default configurations
* fio job configuration
[global]
bs=4k
ioengine=psync
iodepth=1
direct=1
rw=randwrite
time_based
runtime=15
cgroup_nodelete=1
group_reporting=1

[cfq_a]
filename=/tmp/sda/cfq_a.dat
size=2G
cgroup_weight=500
cgroup=cfq_a
thread=1
numjobs=2

[cfq_b]
new_group
filename=/tmp/sda/cfq_b.dat
size=2G
rate=4m
cgroup_weight=500
cgroup=cfq_b
thread=1
numjobs=2


The following is the test result:
* with 0b31c10:
* fio report
cfq_a: bw=5312.6KB/s, iops=1328
cfq_b: bw=8192.6KB/s, iops=2048

* blkcg debug files
./cfq_a/blkio.group_wait_time:8:0 12062571233
./cfq_b/blkio.group_wait_time:8:0 155841600
./cfq_a/blkio.io_serviced:Total 19922
./cfq_b/blkio.io_serviced:Total 30722
./cfq_a/blkio.time:8:0 19406083246
./cfq_b/blkio.time:8:0 19417146869

* without 0b31c10:
* fio report
cfq_a: bw=21670KB/s, iops=5417
cfq_b: bw=8191.2KB/s, iops=2047

* blkcg debug files
./cfq_a/blkio.group_wait_time:8:0 5798452504
./cfq_b/blkio.group_wait_time:8:0 5131844007
./cfq_a/blkio.io_serviced:8:0 Write 81261
./cfq_b/blkio.io_serviced:8:0 Write 30722
./cfq_a/blkio.time:8:0 5642608173
./cfq_b/blkio.time:8:0 5849949812

We want to know why you reverted the minimal used slice to 1 jiffy
when the slice has not been allocated. Did that change fix a performance
regression or something similar? If not, I think we could revert the minimal
slice to 1 ns again.

Another question is about the time comparisons in the CFQ code. The non-hrtimer
version of CFQ uses time_after or time_before where possible; why doesn't the
hrtimer version use the equivalent time_after64/time_before64? Can ktime_get_ns()
guarantee there will be no wrapping problem?
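
(For reference, a wrap-safe check in the hrtimer code paths would look
roughly like the sketch below; this is illustrative, not the current
cfq-iosched code.)

	/* Sketch: time_after64() compares via a signed 64-bit difference,
	 * so it stays correct even across a counter wrap. */
	static inline bool cfq_slice_used_sketch(u64 slice_end_ns)
	{
		return time_after64(ktime_get_ns(), slice_end_ns);
	}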

Thanks very much.

Regards,

Tao




[RESEND PATCH v3 4/8] phy: phy-mt65xx-usb3: move clock from phy node into port nodes

2017-03-06 Thread Chunfeng Yun
The reference clock of the HighSpeed port is 48MHz and comes from a PLL;
the reference clock of the SuperSpeed port is 26MHz and usually comes
directly from the 26MHz oscillator, but on some SoCs it does not. Add the
clock for compatibility, and put it into the port nodes for flexibility.

Signed-off-by: Chunfeng Yun 
---
 drivers/phy/phy-mt65xx-usb3.c |   21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/phy/phy-mt65xx-usb3.c b/drivers/phy/phy-mt65xx-usb3.c
index 7fff482..f4a3505 100644
--- a/drivers/phy/phy-mt65xx-usb3.c
+++ b/drivers/phy/phy-mt65xx-usb3.c
@@ -153,6 +153,7 @@ struct mt65xx_phy_pdata {
 struct mt65xx_phy_instance {
struct phy *phy;
void __iomem *port_base;
+   struct clk *ref_clk; /* reference clock of analog phy */
u32 index;
u8 type;
 };
@@ -160,7 +161,6 @@ struct mt65xx_phy_instance {
 struct mt65xx_u3phy {
struct device *dev;
void __iomem *sif_base; /* only shared sif */
-   struct clk *u3phya_ref; /* reference clock of usb3 anolog phy */
const struct mt65xx_phy_pdata *pdata;
struct mt65xx_phy_instance **phys;
int nphys;
@@ -449,9 +449,9 @@ static int mt65xx_phy_init(struct phy *phy)
struct mt65xx_u3phy *u3phy = dev_get_drvdata(phy->dev.parent);
int ret;
 
-   ret = clk_prepare_enable(u3phy->u3phya_ref);
+   ret = clk_prepare_enable(instance->ref_clk);
if (ret) {
-   dev_err(u3phy->dev, "failed to enable u3phya_ref\n");
+   dev_err(u3phy->dev, "failed to enable ref_clk\n");
return ret;
}
 
@@ -494,7 +494,7 @@ static int mt65xx_phy_exit(struct phy *phy)
if (instance->type == PHY_TYPE_USB2)
phy_instance_exit(u3phy, instance);
 
-   clk_disable_unprepare(u3phy->u3phya_ref);
+   clk_disable_unprepare(instance->ref_clk);
return 0;
 }
 
@@ -594,12 +594,6 @@ static int mt65xx_u3phy_probe(struct platform_device *pdev)
return PTR_ERR(u3phy->sif_base);
}
 
-   u3phy->u3phya_ref = devm_clk_get(dev, "u3phya_ref");
-   if (IS_ERR(u3phy->u3phya_ref)) {
-   dev_err(dev, "error to get u3phya_ref\n");
-   return PTR_ERR(u3phy->u3phya_ref);
-   }
-
port = 0;
for_each_child_of_node(np, child_np) {
struct mt65xx_phy_instance *instance;
@@ -634,6 +628,13 @@ static int mt65xx_u3phy_probe(struct platform_device *pdev)
goto put_child;
}
 
+   instance->ref_clk = devm_clk_get(&phy->dev, "ref");
+   if (IS_ERR(instance->ref_clk)) {
+   dev_err(dev, "failed to get ref_clk(id-%d)\n", port);
+   retval = PTR_ERR(instance->ref_clk);
+   goto put_child;
+   }
+
instance->phy = phy;
instance->index = port;
phy_set_drvdata(phy, instance);
-- 
1.7.9.5



[RESEND PATCH v3 8/8] dt-bindings: phy-mt65xx-usb: add support for new version phy

2017-03-06 Thread Chunfeng Yun
add a new compatible string for "mt2712", and move reference clock
into each port node;

Signed-off-by: Chunfeng Yun 
Acked-by: Rob Herring 
---
 .../devicetree/bindings/phy/phy-mt65xx-usb.txt |   93 +---
 1 file changed, 80 insertions(+), 13 deletions(-)

diff --git a/Documentation/devicetree/bindings/phy/phy-mt65xx-usb.txt 
b/Documentation/devicetree/bindings/phy/phy-mt65xx-usb.txt
index 33a2b1e..0acc5a9 100644
--- a/Documentation/devicetree/bindings/phy/phy-mt65xx-usb.txt
+++ b/Documentation/devicetree/bindings/phy/phy-mt65xx-usb.txt
@@ -6,12 +6,11 @@ This binding describes a usb3.0 phy for mt65xx platforms of 
Medaitek SoC.
 Required properties (controller (parent) node):
  - compatible  : should be one of
  "mediatek,mt2701-u3phy"
+ "mediatek,mt2712-u3phy"
  "mediatek,mt8173-u3phy"
- - reg : offset and length of register for phy, exclude port's
- register.
- - clocks  : a list of phandle + clock-specifier pairs, one for each
- entry in clock-names
- - clock-names : must contain
+ - clocks  : (deprecated, use port's clocks instead) a list of phandle +
+ clock-specifier pairs, one for each entry in clock-names
+ - clock-names : (deprecated, use port's one instead) must contain
  "u3phya_ref": for reference clock of usb3.0 analog phy.
 
 Required nodes : a sub-node is required for each port the controller
@@ -19,8 +18,19 @@ Required nodes   : a sub-node is required for each port 
the controller
  'reg' property is used inside these nodes to describe
  the controller's topology.
 
+Optional properties (controller (parent) node):
+ - reg : offset and length of register shared by multiple ports,
+ exclude port's private register. It is needed on mt2701
+ and mt8173, but not on mt2712.
+
 Required properties (port (child) node):
 - reg  : address and length of the register set for the port.
+- clocks   : a list of phandle + clock-specifier pairs, one for each
+ entry in clock-names
+- clock-names  : must contain
+ "ref": 48M reference clock for HighSpeed analog phy; and 26M
+   reference clock for SuperSpeed analog phy, sometimes it is
+   24M, 25M or 27M, depending on the platform.
 - #phy-cells   : should be 1 (See second example)
  cell after port phandle is phy type from:
- PHY_TYPE_USB2
@@ -31,21 +41,31 @@ Example:
 u3phy: usb-phy@1129 {
compatible = "mediatek,mt8173-u3phy";
reg = <0 0x1129 0 0x800>;
-   clocks = <&apmixedsys CLK_APMIXED_REF2USB_TX>;
-   clock-names = "u3phya_ref";
#address-cells = <2>;
#size-cells = <2>;
ranges;
status = "okay";
 
-   phy_port0: port@11290800 {
-   reg = <0 0x11290800 0 0x800>;
+   u2port0: usb-phy@11290800 {
+   reg = <0 0x11290800 0 0x100>;
+   clocks = <&apmixedsys CLK_APMIXED_REF2USB_TX>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
 
-   phy_port1: port@11291000 {
-   reg = <0 0x11291000 0 0x800>;
+   u3port0: usb-phy@11290900 {
+   reg = <0 0x11290800 0 0x700>;
+   clocks = <&clk26m>;
+   clock-names = "ref";
+   #phy-cells = <1>;
+   status = "okay";
+   };
+
+   u2port1: usb-phy@11291000 {
+   reg = <0 0x11291000 0 0x100>;
+   clocks = <&apmixedsys CLK_APMIXED_REF2USB_TX>;
+   clock-names = "ref";
#phy-cells = <1>;
status = "okay";
};
@@ -64,7 +84,54 @@ Example:
 
 usb30: usb@1127 {
...
-   phys = <&phy_port0 PHY_TYPE_USB3>;
-   phy-names = "usb3-0";
+   phys = <&u2port0 PHY_TYPE_USB2>, <&u3port0 PHY_TYPE_USB3>;
+   phy-names = "usb2-0", "usb3-0";
...
 };
+
+
+Layout differences of banks between mt8173/mt2701 and mt2712
+-------------------------------------------------------------
+mt8173 and mt2701:
+port        offset    bank
+shared      0x0000    SPLLC
+            0x0100    FMREG
+u2 port0    0x0800    U2PHY_COM
+u3 port0    0x0900    U3PHYD
+            0x0a00    U3PHYD_BANK2
+            0x0b00    U3PHYA
+            0x0c00    U3PHYA_DA
+u2 port1    0x1000    U2PHY_COM
+u3 port1    0x1100    U3PHYD
+            0x1200    U3PHYD_BANK2
+            0x1300    U3PHYA
+            0x1400    U3PHYA_DA
+u2 port2    0x1800    U2PHY_COM
+...
+
+mt2712:
+port        offset    bank
+u2 port0    0x0000    MISC
+            0x0100    FMREG
+            0x0300    U2PHY_COM
+u3 port0    0x0700    SPLLC
+            0x0800    CHIP
+            0x0900    U3PHYD
+            0x0a00    U3PHYD_BANK2
+            0x0b00    U3PHYA
+            0x0c00    U3PHYA_DA

Re: [PATCH] HID: get rid of HID_QUIRK_NO_INIT_REPORTS

2017-03-06 Thread Benjamin Tissoires
On Mar 06 2017 or thereabouts, Jiri Kosina wrote:
> On Thu, 5 Jan 2017, Benjamin Tissoires wrote:
> 
> > For case 1, the hiddev documentation provides an ioctl to do the
> > init manually. A solution could be to retrieve the requested report
> > when EVIOCGUSAGE is called, in the same way hidraw does. I would be
> > tempted to not change the behavior and hope that we won't break any
> > userspace tool.
> 
> I'd like to be applying the HID_QUIRK_NO_INIT_REPORTS removal as soon as 
> possible so that it gets exposure in linux-next over the whole development 
> cycle.
> 
> I am however too conservative to ignore the potential hiddev breakage, I 
> am afraid. This has a real potential of breaking systems, and 
> administrators having hard time figuring out of happened; essentialy, this 
> is userspace-visible behavior change (regression) for which we haven't 
> done any long-term depreciation (such as printing a warning "please talk 
> to your hiddev driver vendor" in case the driver seems to assume 
> initialized reports) at least for a few years.
> 
> I think that either doing it at a connect time, or during first 
> EVIOCGUSAGE ioctl() call is a must.

Yes, that's what I was thinking of doing too. Also, I think we need to keep
the list of currently "quirked" devices around for hiddev to work
properly. I am still wondering whether we should simply keep the list of
quirked devices in hid-core but disable its effects, or move the full
list of quirked devices into hiddev.

Initially I thought it was better to remove the quirk from core and move
the list into hiddev, but on the other hand, that means we would
lose the ability to set it from the kernel boot command line, so
maybe keeping the list where it is is better, and only having the
effects in hiddev. Am I clear enough? :)

> 
> Otherwise, I'd be super-happy to finally get rid of this giant PITA.
> 

Me too!

Cheers,
Benjamin

> Thanks!
> 
> -- 
> Jiri Kosina
> SUSE Labs
> 


Re: [PATCH v2 6/9] kasan: improve slab object description

2017-03-06 Thread Andrey Konovalov
On Fri, Mar 3, 2017 at 3:39 PM, Andrey Ryabinin  wrote:
>
>
> On 03/03/2017 04:52 PM, Alexander Potapenko wrote:
>> On Fri, Mar 3, 2017 at 2:31 PM, Andrey Ryabinin  
>> wrote:
>>> On 03/02/2017 04:48 PM, Andrey Konovalov wrote:
 Changes slab object description from:

 Object at 880068388540, in cache kmalloc-128 size: 128

 to:

 The buggy address belongs to the object at 880068388540
  which belongs to the cache kmalloc-128 of size 128
 The buggy address is located 123 bytes inside of
  128-byte region [880068388540, 8800683885c0)

 Makes it more explanatory and adds information about relative offset
 of the accessed address to the start of the object.

>>>
>>> I don't think that this is an improvement. You replaced one simple line 
>>> with a huge
>>> and hard to parse text without giving any new/useful information.
>>> Except maybe offset, it useful sometimes, so wouldn't mind adding it to 
>>> description.
>> Agreed.
>> How about:
>> ===
>> Access 123 bytes inside of 128-byte region [880068388540, 
>> 8800683885c0)
>> Object at 880068388540 belongs to the cache kmalloc-128
>> ===
>> ?
>>
>
> I would just add the offset in the end:
> Object at 880068388540, in cache kmalloc-128 size: 128 accessed 
> at offset y

Access can be inside or outside the object, so it's better to
specifically say that.

I think we can do (basically what Alexander suggested):

Object at 880068388540 belongs to the cache kmalloc-128 of size 128
Access 123 bytes inside of 128-byte region [880068388540, 8800683885c0)
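
For concreteness, that could be emitted roughly as below (a sketch only;
the function name and arguments are invented, not the actual mm/kasan
code):

	static void describe_object_sketch(const void *obj, const void *addr,
					   const char *cache, size_t size)
	{
		pr_err("Object at %p belongs to the cache %s of size %zu\n",
		       obj, cache, size);
		pr_err("Access %td bytes inside of %zu-byte region [%p, %p)\n",
		       (const char *)addr - (const char *)obj, size,
		       obj, (const char *)obj + size);
	}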

What do you think?


Re: Question Regarding ERMS memcpy

2017-03-06 Thread hpa
On March 6, 2017 5:33:28 AM PST, Borislav Petkov  wrote:
>On Mon, Mar 06, 2017 at 12:01:10AM -0700, Logan Gunthorpe wrote:
>> Well honestly my issue was solved by fixing my kernel config. I have
>no
>> idea why I had optimize for size in there in the first place.
>
>I still think that we should address the iomem memcpy Linus mentioned.
>So how about this partial revert. I've made 32-bit use the same special
>__memcpy() version.
>
>Hmmm?
>
>---
>diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
>index 7afb0e2f07f4..9e378a10796d 100644
>--- a/arch/x86/include/asm/io.h
>+++ b/arch/x86/include/asm/io.h
>@@ -201,6 +201,7 @@ extern void set_iounmap_nonlazy(void);
> #ifdef __KERNEL__
> 
> #include 
>+#include 
> 
> /*
>  * Convert a virtual cached pointer to an uncached pointer
>@@ -227,12 +228,13 @@ memset_io(volatile void __iomem *addr, unsigned
>char val, size_t count)
>  * @src:  The (I/O memory) source for the data
>  * @count:The number of bytes to copy
>  *
>- * Copy a block of data from I/O memory.
>+ * Copy a block of data from I/O memory. IO memory is different from
>+ * cached memory so we use special memcpy version.
>  */
> static inline void
>memcpy_fromio(void *dst, const volatile void __iomem *src, size_t
>count)
> {
>-  memcpy(dst, (const void __force *)src, count);
>+  __inline_memcpy(dst, (const void __force *)src, count);
> }
> 
> /**
>@@ -241,12 +243,13 @@ memcpy_fromio(void *dst, const volatile void
>__iomem *src, size_t count)
>  * @src:  The (RAM) source for the data
>  * @count:The number of bytes to copy
>  *
>- * Copy a block of data to I/O memory.
>+ * Copy a block of data to I/O memory. IO memory is different from
>+ * cached memory so we use special memcpy version.
>  */
> static inline void
> memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
> {
>-  memcpy((void __force *)dst, src, count);
>+  __inline_memcpy((void __force *)dst, src, count);
> }
> 
> /*
>diff --git a/arch/x86/include/asm/string_32.h
>b/arch/x86/include/asm/string_32.h
>index 3d3e8353ee5c..556fa4a975ff 100644
>--- a/arch/x86/include/asm/string_32.h
>+++ b/arch/x86/include/asm/string_32.h
>@@ -29,6 +29,7 @@ extern char *strchr(const char *s, int c);
> #define __HAVE_ARCH_STRLEN
> extern size_t strlen(const char *s);
> 
>+#define __inline_memcpy __memcpy
>static __always_inline void *__memcpy(void *to, const void *from,
>size_t n)
> {
>   int d0, d1, d2;

It isn't really that straightforward IMO.

For UC memory, the transaction size really needs to be specified explicitly at
all times and should be part of the API, rather than implicit.

For WC/WT/WB device memory, the ordinary memcpy is valid and preferred.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v5 1/2] perf sdt: add scanning of sdt probes arguments

2017-03-06 Thread Masami Hiramatsu
On Wed, 14 Dec 2016 01:07:31 +0100
Alexis Berlemont  wrote:

> During a "perf buildid-cache --add" command, the section
> ".note.stapsdt" of the "added" binary is scanned in order to list the
> SDT markers available in the binary. The parts containing the
> probe arguments were left unscanned.
> 
> The whole section is now parsed; the probe arguments are extracted for
> later use.
> 

Looks good to me.

Acked-by: Masami Hiramatsu 

Thanks!

> Signed-off-by: Alexis Berlemont 
> ---
>  tools/perf/util/symbol-elf.c | 25 +++--
>  tools/perf/util/symbol.h |  1 +
>  2 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/perf/util/symbol-elf.c b/tools/perf/util/symbol-elf.c
> index 99400b0..7725c3f 100644
> --- a/tools/perf/util/symbol-elf.c
> +++ b/tools/perf/util/symbol-elf.c
> @@ -1822,7 +1822,7 @@ void kcore_extract__delete(struct kcore_extract *kce)
>  static int populate_sdt_note(Elf **elf, const char *data, size_t len,
>struct list_head *sdt_notes)
>  {
> - const char *provider, *name;
> + const char *provider, *name, *args;
>   struct sdt_note *tmp = NULL;
>   GElf_Ehdr ehdr;
>   GElf_Addr base_off = 0;
> @@ -1881,6 +1881,25 @@ static int populate_sdt_note(Elf **elf, const char 
> *data, size_t len,
>   goto out_free_prov;
>   }
>  
> + args = memchr(name, '\0', data + len - name);
> +
> + /*
> +  * There is no argument if:
> +  * - We reached the end of the note;
> +  * - There is not enough room to hold a potential string;
> +  * - The argument string is empty or just contains ':'.
> +  */
> + if (args == NULL || data + len - args < 2 ||
> + args[1] == ':' || args[1] == '\0')
> + tmp->args = NULL;
> + else {
> + tmp->args = strdup(++args);
> + if (!tmp->args) {
> + ret = -ENOMEM;
> + goto out_free_name;
> + }
> + }
> +
>   if (gelf_getclass(*elf) == ELFCLASS32) {
>   memcpy(&tmp->addr, &buf, 3 * sizeof(Elf32_Addr));
>   tmp->bit32 = true;
> @@ -1892,7 +1911,7 @@ static int populate_sdt_note(Elf **elf, const char 
> *data, size_t len,
>   if (!gelf_getehdr(*elf, &ehdr)) {
>   pr_debug("%s : cannot get elf header.\n", __func__);
>   ret = -EBADF;
> - goto out_free_name;
> + goto out_free_args;
>   }
>  
>   /* Adjust the prelink effect :
> @@ -1917,6 +1936,8 @@ static int populate_sdt_note(Elf **elf, const char 
> *data, size_t len,
>   list_add_tail(&tmp->note_list, sdt_notes);
>   return 0;
>  
> +out_free_args:
> + free(tmp->args);
>  out_free_name:
>   free(tmp->name);
>  out_free_prov:
> diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
> index 6c358b7..9222c7e 100644
> --- a/tools/perf/util/symbol.h
> +++ b/tools/perf/util/symbol.h
> @@ -351,6 +351,7 @@ int arch__choose_best_symbol(struct symbol *syma, struct 
> symbol *symb);
>  struct sdt_note {
>   char *name; /* name of the note*/
>   char *provider; /* provider name */
> + char *args;
>   bool bit32; /* whether the location is 32 bits? */
>   union { /* location, base and semaphore addrs */
>   Elf64_Addr a64[3];
> -- 
> 2.10.2
> 


-- 
Masami Hiramatsu 


[PATCH 7/7] jbd2: make the whole kjournald2 kthread NOFS safe

2017-03-06 Thread Michal Hocko
From: Michal Hocko 

kjournald2 is central to the transaction commit processing. As such, any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS by using memalloc_nofs_save.

Suggested-by: Jan Kara 
Reviewed-by: Jan Kara 
Signed-off-by: Michal Hocko 
---
 fs/jbd2/journal.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index a1a359bfcc9c..78433ce1db40 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -206,6 +207,13 @@ static int kjournald2(void *arg)
wake_up(&journal->j_wait_done_commit);
 
/*
+* Make sure that no allocations from this kernel thread will ever
+* recurse to the fs layer because we are responsible for the
+* transaction commit and any fs involvement might get stuck waiting
+* for the transaction commit.
+*/
+   memalloc_nofs_save();
+
+   /*
 * And now, wait forever for commit wakeup events.
 */
write_lock(&journal->j_state_lock);
-- 
2.11.0



[PATCH 3/7] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

2017-03-06 Thread Michal Hocko
From: Michal Hocko 

xfs defined PF_FSTRANS to declare a scoped GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name, PF_MEMALLOC_NOFS, which is in line with the existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Acked-by: Vlastimil Babka 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Brian Foster 
Signed-off-by: Michal Hocko 
---
 fs/xfs/kmem.c |  4 ++--
 fs/xfs/kmem.h |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c |  6 +++---
 fs/xfs/xfs_trans.c| 12 ++--
 include/linux/sched.h |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 2dfdc62f795e..e14da724a0b5 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -81,13 +81,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 * the filesystem here and potentially deadlocking.
 */
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
noio_flag = memalloc_noio_save();
 
lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
memalloc_noio_restore(noio_flag);
 
return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d973dbfc2bfa 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
-   if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
lflags &= ~__GFP_FS;
}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index c3decedc9455..3059a3ec7ecb 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2886,7 +2886,7 @@ xfs_btree_split_worker(
struct xfs_btree_split_args *args = container_of(work,
struct xfs_btree_split_args, 
work);
unsigned long   pflags;
-   unsigned long   new_pflags = PF_FSTRANS;
+   unsigned long   new_pflags = PF_MEMALLOC_NOFS;
 
/*
 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index bf65a9ea8642..330c6019120e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
 * We hand off the transaction to the completion thread now, so
 * clear the flag here.
 */
-   current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+   current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
return 0;
 }
 
@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
 * thus we need to mark ourselves as being in a transaction manually.
 * Similarly for freeze protection.
 */
-   current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+   current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
/* we abort the update if there was an IO error */
@@ -1021,7 +1021,7 @@ xfs_do_writepage(
 * Given that we do not allow direct reclaim to call us, we should
 * never be called while in a filesystem transaction.
 */
-   if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+   if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
goto redirty;
 
/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
boolrsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
/* Mark this thread as being in a transaction */
-   current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+   current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
/*
 * Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserve(
if (blocks > 0) {
error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), 
rsvd);
if (error != 0) {
-   current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+   

Re: [PATCH v2] arm64: kvm: Use has_vhe() instead of hyp_alternate_select()

2017-03-06 Thread Shanker Donthineni

Hi Marc,


On 03/06/2017 02:34 AM, Marc Zyngier wrote:

Hi Shanker,

On Mon, Mar 06 2017 at  2:33:18 am GMT, Shanker Donthineni 
 wrote:

Now all the cpu_hwcaps features have their own static keys. We don't
need a separate function hyp_alternate_select() to patch the vhe/nvhe
code. We can achieve the same functionality by using has_vhe(). It
improves code readability, uses jump label instructions, and the
compiler also generates better code with fewer instructions.

How do you define "better"? Which compiler? Do you have any benchmarking data?
I'm using gcc version 5.2.0. With has_vhe() the code size is smaller, as
shown below. I tried to benchmark the code changes using Christoffer's
microbench tool, but did not see a noticeable difference on the QDF2400
platform.


hyp_alternate_select() uses BR/BLR instructions to patch the vhe/nvhe code,
which is not good for branch prediction. The compiler treats the patched
code as a function call, so the contents of registers x0-x18 are not
reusable after the vhe/nvhe call.


Current code:
arch/arm64/kvm/hyp/switch.o: file format elf64-littleaarch64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000000  0000000000000000  0000000000000000  00000040  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  0000000000000000  0000000000000000  00000040  2**0
                  CONTENTS, ALLOC, LOAD, DATA
  2 .bss          00000000  0000000000000000  0000000000000000  00000040  2**0
                  ALLOC
  3 .hyp.text     00000550  0000000000000000  0000000000000000  00000040  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE

New code:
arch/arm64/kvm/hyp/switch.o: file format elf64-littleaarch64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000000  0000000000000000  0000000000000000  00000040  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  0000000000000000  0000000000000000  00000040  2**0
                  CONTENTS, ALLOC, LOAD, DATA
  2 .bss          00000000  0000000000000000  0000000000000000  00000040  2**0
                  ALLOC
  3 .hyp.text     00000488  0000000000000000  0000000000000000  00000040  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE



Signed-off-by: Shanker Donthineni 
---
v2: removed 'Change-Id: Ia8084189833f2081ff13c392deb5070c46a64038' from commit

  arch/arm64/kvm/hyp/debug-sr.c  | 12 ++
  arch/arm64/kvm/hyp/switch.c| 50 +++---
  arch/arm64/kvm/hyp/sysreg-sr.c | 23 +--
  3 files changed, 43 insertions(+), 42 deletions(-)

diff --git a/arch/arm64/kvm/hyp/debug-sr.c b/arch/arm64/kvm/hyp/debug-sr.c
index f5154ed..e5642c2 100644
--- a/arch/arm64/kvm/hyp/debug-sr.c
+++ b/arch/arm64/kvm/hyp/debug-sr.c
@@ -109,9 +109,13 @@ static void __hyp_text __debug_save_spe_nvhe(u64 
*pmscr_el1)
dsb(nsh);
  }
  
-static hyp_alternate_select(__debug_save_spe,

-   __debug_save_spe_nvhe, __debug_save_spe_vhe,
-   ARM64_HAS_VIRT_HOST_EXTN);
+static void __hyp_text __debug_save_spe(u64 *pmscr_el1)
+{
+   if (has_vhe())
+   __debug_save_spe_vhe(pmscr_el1);
+   else
+   __debug_save_spe_nvhe(pmscr_el1);
+}

I have two worries about this kind of thing:
- Not all compilers do support jump labels, leading to a memory access
on each static key (GCC 4.8, for example). This would immediately
introduce a pretty big regression
- The hyp_alternate_select() method doesn't introduce a fast/slow path
duality. Each path has the exact same cost. I'm not keen on choosing
what is supposed to be the fast path, really.
Yes, it'll require a runtime check if the compiler doesn't support asm
goto labels.
Agreed, hyp_alternate_select() has a constant branch overhead, but it
might cause a branch prediction penalty.



Thanks,

M.


--
Shanker Donthineni
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.



[PATCH v4 7/7] perf/sdt: Remove stale warning

2017-03-06 Thread Ravi Bangoria
Perf was showing a warning if the user tried to record an SDT event without
creating a probe point. Now that we allow direct record on SDT
events, remove this stale warning/hint.

Signed-off-by: Ravi Bangoria 
---
 tools/lib/api/fs/tracing_path.c | 17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/tools/lib/api/fs/tracing_path.c b/tools/lib/api/fs/tracing_path.c
index 3e606b9..fa52e67 100644
--- a/tools/lib/api/fs/tracing_path.c
+++ b/tools/lib/api/fs/tracing_path.c
@@ -103,19 +103,10 @@ int tracing_path__strerror_open_tp(int err, char *buf, 
size_t size,
 * - jirka
 */
if (debugfs__configured() || tracefs__configured()) {
-   /* sdt markers */
-   if (!strncmp(filename, "sdt_", 4)) {
-   snprintf(buf, size,
-   "Error:\tFile %s/%s not found.\n"
-   "Hint:\tSDT event cannot be directly 
recorded on.\n"
-   "\tPlease first use 'perf probe %s:%s' 
before recording it.\n",
-   tracing_events_path, filename, sys, 
name);
-   } else {
-   snprintf(buf, size,
-"Error:\tFile %s/%s not found.\n"
-"Hint:\tPerhaps this kernel misses 
some CONFIG_ setting to enable this feature?.\n",
-tracing_events_path, filename);
-   }
+   snprintf(buf, size,
+"Error:\tFile %s/%s not found.\n"
+"Hint:\tPerhaps this kernel misses some 
CONFIG_ setting to enable this feature?.\n",
+tracing_events_path, filename);
break;
}
snprintf(buf, size, "%s",
-- 
2.9.3



[PATCH 6/7] jbd2: mark the transaction context with the scope GFP_NOFS context

2017-03-06 Thread Michal Hocko
From: Michal Hocko 

Now that we have the memalloc_nofs_{save,restore} API we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS, and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) calls are also safe now, because even the hardcoded
GFP_KERNEL allocations deep inside vmalloc will be NOFS now.

Reviewed-by: Jan Kara 
Signed-off-by: Michal Hocko 
---
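
For reference, the scoped API this builds on pairs a save with a restore
around any region that must not recurse into the fs (a minimal usage
sketch, not part of this patch):

	unsigned int flags;

	flags = memalloc_nofs_save();
	/* any allocation here implicitly behaves as GFP_NOFS */
	memalloc_nofs_restore(flags);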
 fs/jbd2/transaction.c | 12 
 include/linux/jbd2.h  |  2 ++
 2 files changed, 14 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 5e659ee08d6a..d8f09f34285f 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -388,6 +389,11 @@ static int start_this_handle(journal_t *journal, handle_t 
*handle,
 
rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
jbd2_journal_free_transaction(new_transaction);
+   /*
+* Make sure that no allocations done while the transaction is
+* open are going to recurse back to the fs layer.
+*/
+   handle->saved_alloc_context = memalloc_nofs_save();
return 0;
 }
 
@@ -466,6 +472,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int 
nblocks, int rsv_blocks,
trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
handle->h_transaction->t_tid, type,
line_no, nblocks);
+
return handle;
 }
 EXPORT_SYMBOL(jbd2__journal_start);
@@ -1760,6 +1767,11 @@ int jbd2_journal_stop(handle_t *handle)
if (handle->h_rsv_handle)
jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
+   /*
+* The scope of the GFP_NOFS context ends here and so we can
+* restore the original alloc context.
+*/
+   memalloc_nofs_restore(handle->saved_alloc_context);
jbd2_free_handle(handle);
return err;
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dfaa1f4dcb0c..606b6bce3a5b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -491,6 +491,8 @@ struct jbd2_journal_handle
 
unsigned long   h_start_jiffies;
unsigned inth_requested_credits;
+
+   unsigned intsaved_alloc_context;
 };
 
 
-- 
2.11.0



[PATCH 1/7] lockdep: teach lockdep about memalloc_noio_save

2017-03-06 Thread Michal Hocko
From: Nikolay Borisov 

Commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O
during memory allocation") added the memalloc_noio_(save|restore) functions
to enable people to modify the MM behavior by disabling I/O during memory
allocation. This was further extended in Fixes: 934f3072c17c ("mm: clear
__GFP_FS when PF_MEMALLOC_NOIO is set"). memalloc_noio_* functions prevent
allocation paths recursing back into the filesystem without explicitly
changing the flags for every allocation site. However, lockdep hasn't been
keeping up with the changes and it entirely misses handling the memalloc_noio
adjustments. Instead, it is left to the callers of __lockdep_trace_alloc to
call the function after they have masked off the respective GFP flags, which
can lead to false positives:

[  644.173373] =
[  644.174012] [ INFO: inconsistent lock state ]
[  644.174012] 4.10.0-nbor #134 Not tainted
[  644.174012] -
[  644.174012] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[  644.174012] fsstress/3365 [HC0[0]:SC0[0]:HE1:SE1] takes:
[  644.174012]  (&xfs_nondir_ilock_class){?.}, at: [] 
xfs_ilock+0x141/0x230
[  644.174012] {IN-RECLAIM_FS-W} state was registered at:
[  644.174012]   __lock_acquire+0x62a/0x17c0
[  644.174012]   lock_acquire+0xc5/0x220
[  644.174012]   down_write_nested+0x4f/0x90
[  644.174012]   xfs_ilock+0x141/0x230
[  644.174012]   xfs_reclaim_inode+0x12a/0x320
[  644.174012]   xfs_reclaim_inodes_ag+0x2c8/0x4e0
[  644.174012]   xfs_reclaim_inodes_nr+0x33/0x40
[  644.174012]   xfs_fs_free_cached_objects+0x19/0x20
[  644.174012]   super_cache_scan+0x191/0x1a0
[  644.174012]   shrink_slab+0x26f/0x5f0
[  644.174012]   shrink_node+0xf9/0x2f0
[  644.174012]   kswapd+0x356/0x920
[  644.174012]   kthread+0x10c/0x140
[  644.174012]   ret_from_fork+0x31/0x40
[  644.174012] irq event stamp: 173777
[  644.174012] hardirqs last  enabled at (173777): [] 
__local_bh_enable_ip+0x70/0xc0
[  644.174012] hardirqs last disabled at (173775): [] 
__local_bh_enable_ip+0x37/0xc0
[  644.174012] softirqs last  enabled at (173776): [] 
_xfs_buf_find+0x67a/0xb70
[  644.174012] softirqs last disabled at (173774): [] 
_xfs_buf_find+0x5db/0xb70
[  644.174012]
[  644.174012] other info that might help us debug this:
[  644.174012]  Possible unsafe locking scenario:
[  644.174012]
[  644.174012]CPU0
[  644.174012]
[  644.174012]   lock(&xfs_nondir_ilock_class);
[  644.174012]   
[  644.174012] lock(&xfs_nondir_ilock_class);
[  644.174012]
[  644.174012]  *** DEADLOCK ***
[  644.174012]
[  644.174012] 4 locks held by fsstress/3365:
[  644.174012]  #0:  (sb_writers#10){++}, at: [] 
mnt_want_write+0x24/0x50
[  644.174012]  #1:  (&sb->s_type->i_mutex_key#12){++}, at: 
[] vfs_setxattr+0x6f/0xb0
[  644.174012]  #2:  (sb_internal#2){++}, at: [] 
xfs_trans_alloc+0xfc/0x140
[  644.174012]  #3:  (&xfs_nondir_ilock_class){?.}, at: 
[] xfs_ilock+0x141/0x230
[  644.174012]
[  644.174012] stack backtrace:
[  644.174012] CPU: 0 PID: 3365 Comm: fsstress Not tainted 4.10.0-nbor #134
[  644.174012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  644.174012] Call Trace:
[  644.174012]  dump_stack+0x85/0xc9
[  644.174012]  print_usage_bug.part.37+0x284/0x293
[  644.174012]  ? print_shortest_lock_dependencies+0x1b0/0x1b0
[  644.174012]  mark_lock+0x27e/0x660
[  644.174012]  mark_held_locks+0x66/0x90
[  644.174012]  lockdep_trace_alloc+0x6f/0xd0
[  644.174012]  kmem_cache_alloc_node_trace+0x3a/0x2c0
[  644.174012]  ? vm_map_ram+0x2a1/0x510
[  644.174012]  vm_map_ram+0x2a1/0x510
[  644.174012]  ? vm_map_ram+0x46/0x510
[  644.174012]  _xfs_buf_map_pages+0x77/0x140
[  644.174012]  xfs_buf_get_map+0x185/0x2a0
[  644.174012]  xfs_attr_rmtval_set+0x233/0x430
[  644.174012]  xfs_attr_leaf_addname+0x2d2/0x500
[  644.174012]  xfs_attr_set+0x214/0x420
[  644.174012]  xfs_xattr_set+0x59/0xb0
[  644.174012]  __vfs_setxattr+0x76/0xa0
[  644.174012]  __vfs_setxattr_noperm+0x5e/0xf0
[  644.174012]  vfs_setxattr+0xae/0xb0
[  644.174012]  ? __might_fault+0x43/0xa0
[  644.174012]  setxattr+0x15e/0x1a0
[  644.174012]  ? __lock_is_held+0x53/0x90
[  644.174012]  ? rcu_read_lock_sched_held+0x93/0xa0
[  644.174012]  ? rcu_sync_lockdep_assert+0x2f/0x60
[  644.174012]  ? __sb_start_write+0x130/0x1d0
[  644.174012]  ? mnt_want_write+0x24/0x50
[  644.174012]  path_setxattr+0x8f/0xc0
[  644.174012]  SyS_lsetxattr+0x11/0x20
[  644.174012]  entry_SYSCALL_64_fastpath+0x23/0xc6

Let's fix this by making lockdep itself explicitly mask off the respective
GFP flags.
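
Conceptually the change boils down to something like this inside
__lockdep_trace_alloc() (a sketch of the idea, not the exact hunk):

	if (current->flags & PF_MEMALLOC_NOIO)
		gfp_mask &= ~(__GFP_IO | __GFP_FS);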

Fixes: 934f3072c17c ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set")
Acked-by: Michal Hocko 
Acked-by: Peter Zijlstra (Intel) 
Signed-off-by: Nikolay Borisov 
Signed-off-by: Michal Hocko 
---
 kernel/locking/lockdep.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 12e38

[PATCH 2/7] lockdep: allow to disable reclaim lockup detection

2017-03-06 Thread Michal Hocko
From: Michal Hocko 

The current implementation of the reclaim lockup detection can lead to
false positives, and those do happen in practice; the usual response is to
tweak the code to silence lockdep by using GFP_NOFS even though the context
can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G   O
-
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){-+}, at: [] 
xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [] mark_held_locks+0x79/0xa0
  [] lockdep_trace_alloc+0xb3/0x100
  [] kmem_cache_alloc+0x33/0x230
  [] kmem_zone_alloc+0x81/0x120 [xfs]
  [] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [] xfs_getbmap+0x608/0x8c0 [xfs]
  [] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [] do_vfs_ioctl+0x498/0x670
  [] SyS_ioctl+0x79/0x90
  [] entry_SYSCALL_64_fastpath+0x12/0x6f

   CPU0
   
  lock(&xfs_nondir_ilock_class);
  
lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G   O4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 82a34f10 88003aa078d0 813a14f9 88003d8551c0
 88003aa07920 8110ec65  0001
 8801 000b 0008 88003d855aa0
Call Trace:
 [] dump_stack+0x4b/0x72
 [] print_usage_bug+0x215/0x240
 [] mark_lock+0x1f5/0x660
 [] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [] __lock_acquire+0xa80/0x1e50
 [] ? kmem_cache_alloc+0x15e/0x230
 [] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [] lock_acquire+0xd8/0x1e0
 [] ? xfs_ilock+0x177/0x200 [xfs]
 [] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [] down_write_nested+0x5e/0xc0
 [] ? xfs_ilock+0x177/0x200 [xfs]
 [] xfs_ilock+0x177/0x200 [xfs]
 [] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [] evict+0xc5/0x190
 [] dispose_list+0x39/0x60
 [] prune_icache_sb+0x4b/0x60
 [] super_cache_scan+0x14f/0x1a0
 [] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [] shrink_zone+0x15e/0x170
 [] kswapd+0x4f1/0xa80
 [] ? zone_reclaim+0x230/0x230
 [] kthread+0xf2/0x110
 [] ? kthread_create_on_node+0x220/0x220
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special use case IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide a way to add an "I know what I am doing" flag
and mark the problematic places. This would prevent abuse of the GFP_NOFS
flag, which has a runtime effect even on configurations that have
lockdep disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it, also make sure that the radix tree doesn't
accidentally override tags stored in the upper part of the gfp_mask.
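
For illustration, an annotated call site would then look like this
(hypothetical caller, not part of this patch):

	/* known to be reclaim-safe; skip lockdep's GFP context tracking */
	ptr = kmalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP);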

Suggested-by: Peter Zijlstra 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Vlastimil Babka 
Signed-off-by: Michal Hocko 
---
 include/linux/gfp.h  | 10 +-
 kernel/locking/lockdep.c |  4 
 lib/radix-tree.c |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index db373b9d3223..978232a3b4ae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -40,6 +40,11 @@ struct vm_area_struct;
 #define ___GFP_DIRECT_RECLAIM  0x40u
 #define ___GFP_WRITE   0x80u
 #define ___GFP_KSWAPD_RECLAIM  0x100u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP   0x400u
+#else
+#define ___GFP_NOLOCKDEP   0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -179,8 +184,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK  ((__force gfp_t)___GFP_NOTRACK)
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
+/* Disable lockdep for GFP context tracking */
+#define __GFP_NOLOCKDEP ((__force gf

Re: [PATCH] x86, kasan: add KASAN checks to atomic operations

2017-03-06 Thread Peter Zijlstra
On Mon, Mar 06, 2017 at 01:58:51PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 06, 2017 at 01:50:47PM +0100, Dmitry Vyukov wrote:
> > On Mon, Mar 6, 2017 at 1:42 PM, Dmitry Vyukov  wrote:
> > > KASAN uses compiler instrumentation to intercept all memory accesses.
> > > But it does not see memory accesses done in assembly code.
> > > One notable user of assembly code is atomic operations. Frequently,
> > > for example, an atomic reference decrement is the last access to an
> > > object and a good candidate for a racy use-after-free.
> > >
> > > Add manual KASAN checks to atomic operations.
> > > Note: we need checks only before asm blocks and don't need them
> > > in atomic functions composed of other atomic functions
> > > (e.g. load-cmpxchg loops).
> > 
> > Peter, also pointed me at arch/x86/include/asm/bitops.h. Will add them in 
> > v2.
> > 
> 
> > >  static __always_inline void atomic_add(int i, atomic_t *v)
> > >  {
> > > +   kasan_check_write(v, sizeof(*v));
> > > asm volatile(LOCK_PREFIX "addl %1,%0"
> > >  : "+m" (v->counter)
> > >  : "ir" (i));
> 
> 
> So the problem is doing load/stores from asm bits, and GCC
> (traditionally) doesn't try and interpret APP asm bits.
> 
> However, could we not write a GCC plugin that does exactly that?
> Something that interprets the APP asm bits and generates these KASAN
> bits that go with it?

Another suspect is the per-cpu stuff, that's all asm foo as well.
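
(The same manual-annotation idea would presumably apply there as well,
e.g. something like the sketch below; the helper is invented for
illustration.)

	static __always_inline void instrumented_asm_add(int i, int *v)
	{
		kasan_check_write(v, sizeof(*v));	/* make the asm access visible */
		asm volatile("addl %1,%0" : "+m" (*v) : "ir" (i));
	}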


Re: [PATCH v5 02/11] phy: exynos-ufs: add UFS PHY driver for EXYNOS SoC

2017-03-06 Thread Kishon Vijay Abraham I
Hi,

On Monday 06 March 2017 05:12 PM, Alim Akhtar wrote:
> Hi Kishon
> 
> On 03/01/2017 10:07 AM, Kishon Vijay Abraham I wrote:
>> Hi,
>>
>> On Tuesday 28 February 2017 01:51 PM, Alim Akhtar wrote:
>>> Hi Kishon,
>>>
>>> On 02/28/2017 09:04 AM, Kishon Vijay Abraham I wrote:
 Hi,

 On Monday 27 February 2017 07:40 PM, Alim Akhtar wrote:
> Hi Kishon,
>
> On 02/27/2017 10:56 AM, Kishon Vijay Abraham I wrote:
>> Hi,
>>
>> On Thursday 23 February 2017 12:20 AM, Alim Akhtar wrote:
>>> On Fri, Feb 3, 2017 at 2:49 PM, Alim Akhtar  
>>> wrote:
 Hi Kishon,


 On 11/19/2015 07:09 PM, Kishon Vijay Abraham I wrote:
>
> Hi,
>
> On Tuesday 17 November 2015 01:41 PM, Alim Akhtar wrote:
>>
>> Hi
>> Thanks again for looking into this.
>>
>> On 11/17/2015 11:46 AM, Kishon Vijay Abraham I wrote:
>>>
>>> Hi,
>>>
>>> On Monday 09 November 2015 10:56 AM, Alim Akhtar wrote:

 From: Seungwon Jeon 

 This patch introduces Exynos UFS PHY driver. This driver
 supports to deal with phy calibration and power control
 according to UFS host driver's behavior.

 Signed-off-by: Seungwon Jeon 
 Signed-off-by: Alim Akhtar 
 Cc: Kishon Vijay Abraham I 
 ---
   drivers/phy/Kconfig|7 ++
   drivers/phy/Makefile   |1 +
   drivers/phy/phy-exynos-ufs.c   |  241
 
   drivers/phy/phy-exynos-ufs.h   |   85 +
   drivers/phy/phy-exynos7-ufs.h  |   89 +
   include/linux/phy/phy-exynos-ufs.h |   85 +
   6 files changed, 508 insertions(+)
   create mode 100644 drivers/phy/phy-exynos-ufs.c
   create mode 100644 drivers/phy/phy-exynos-ufs.h
   create mode 100644 drivers/phy/phy-exynos7-ufs.h
   create mode 100644 include/linux/phy/phy-exynos-ufs.h

 diff --git a/drivers/phy/Kconfig b/drivers/phy/Kconfig
 index 7eb5859dd035..7d38a92e0297 100644
 --- a/drivers/phy/Kconfig
 +++ b/drivers/phy/Kconfig
 @@ -389,4 +389,11 @@ config PHY_CYGNUS_PCIE
 Enable this to support the Broadcom Cygnus PCIe PHY.
 If unsure, say N.

 +config PHY_EXYNOS_UFS
 +tristate "EXYNOS SoC series UFS PHY driver"
 +depends on OF && ARCH_EXYNOS || COMPILE_TEST
 +select GENERIC_PHY
 +help
 +  Support for UFS PHY on Samsung EXYNOS chipsets.
 +
   endmenu
 diff --git a/drivers/phy/Makefile b/drivers/phy/Makefile
 index 075db1a81aa5..9bec4d1a89e1 100644
 --- a/drivers/phy/Makefile
 +++ b/drivers/phy/Makefile
 @@ -10,6 +10,7 @@ obj-$(CONFIG_ARMADA375_USBCLUSTER_PHY)+=
 phy-armada375-usb2.o
   obj-$(CONFIG_BCM_KONA_USB2_PHY)+= phy-bcm-kona-usb2.o
   obj-$(CONFIG_PHY_EXYNOS_DP_VIDEO)+= phy-exynos-dp-video.o
   obj-$(CONFIG_PHY_EXYNOS_MIPI_VIDEO)+= phy-exynos-mipi-video.o
 +obj-$(CONFIG_PHY_EXYNOS_UFS)+= phy-exynos-ufs.o
   obj-$(CONFIG_PHY_LPC18XX_USB_OTG)+= phy-lpc18xx-usb-otg.o
   obj-$(CONFIG_PHY_PXA_28NM_USB2)+= phy-pxa-28nm-usb2.o
   obj-$(CONFIG_PHY_PXA_28NM_HSIC)+= phy-pxa-28nm-hsic.o
 diff --git a/drivers/phy/phy-exynos-ufs.c
 b/drivers/phy/phy-exynos-ufs.c
 new file mode 100644
 index ..cb1aeaa3d4eb
 --- /dev/null
 +++ b/drivers/phy/phy-exynos-ufs.c
 @@ -0,0 +1,241 @@
 +/*
 + * UFS PHY driver for Samsung EXYNOS SoC
 + *
 + * Copyright (C) 2015 Samsung Electronics Co., Ltd.
 + * Author: Seungwon Jeon 
 + *
 + * This program is free software; you can redistribute it and/or
 modify
 + * it under the terms of the GNU General Public License as 
 published
 by
 + * the Free Software Foundation; either version 2 of the License, 
 or
 + * (at your option) any later version.
 + */
 +#include 
 +#include 
 +#include 
 +#include 
 +#include 
 +#include 
 +#include 
 +#include 
 +#include 
 +#include 
 +#include 
 +#include 
 +
 +#include "phy-exynos-ufs.h"
>>

Re: [RFC PATCH 10/12] staging: android: ion: Use CMA APIs directly

2017-03-06 Thread Laurent Pinchart
Hi Daniel,

On Monday 06 Mar 2017 11:32:04 Daniel Vetter wrote:
> On Fri, Mar 03, 2017 at 10:50:20AM -0800, Laura Abbott wrote:
> > On 03/03/2017 08:41 AM, Laurent Pinchart wrote:
> >> On Thursday 02 Mar 2017 13:44:42 Laura Abbott wrote:
> >>> When CMA was first introduced, its primary use was for DMA allocation
> >>> and the only way to get CMA memory was to call dma_alloc_coherent. This
> >>> put Ion in an awkward position since there was no device structure
> >>> readily available and setting one up messed up the coherency model.
> >>> These days, CMA can be allocated directly from the APIs. Switch to
> >>> using this model to avoid needing a dummy device. This also avoids
> >>> awkward caching questions.
> >> 
> >> If the DMA mapping API isn't suitable for today's requirements anymore,
> >> I believe that's what needs to be fixed, instead of working around the
> >> problem by introducing another use-case-specific API.
> > 
> > I don't think this is a usecase specific API. CMA has been decoupled from
> > DMA already because it's used in other places. Trying to go through
> > DMA was just another layer of abstraction, especially since there isn't
> > a device available for allocation.
> 
> Also, we've had separation of allocation and dma-mapping since forever,
> that's how it works almost everywhere. Not exactly sure why/how arm-soc
> ecosystem ended up focused so much on dma_alloc_coherent.

I believe because that was the easy way to specify memory constraints. The API 
receives a device pointer and will allocate memory suitable for DMA for that 
device. The fact that it maps it to the device is a side-effect in my opinion.

> I think separating allocation from dma mapping/coherency is perfectly
> fine, and the way to go.

Especially given that in many cases we'll want to share buffers between 
multiple devices, so we'll need to map them multiple times.

My point still stands though, if we want to move towards a model where 
allocation and mapping are decoupled, we need an allocation function that 
takes constraints (possibly implemented with two layers, a constraint 
resolution layer on top of a pool/heap/type/foo-based allocator), and a 
mapping API. IOMMU handling being integrated in the DMA mapping API we're 
currently stuck with it, which might call for brushing up that API.
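
For concreteness, the decoupled CMA allocation mentioned above would
look roughly like this - a minimal sketch, assuming the cma_alloc() and
cma_release() signatures from include/linux/cma.h of this era; the
helper names are made up:

#include <linux/cma.h>
#include <linux/mm.h>

/* hypothetical heap helper: allocate 'len' bytes from a CMA area
 * without a dummy struct device or dma_alloc_coherent() */
static struct page *heap_cma_alloc(struct cma *cma, size_t len)
{
	unsigned long nr_pages = PAGE_ALIGN(len) >> PAGE_SHIFT;

	return cma_alloc(cma, nr_pages, get_order(len));
}

static void heap_cma_free(struct cma *cma, struct page *pages, size_t len)
{
	cma_release(cma, pages, PAGE_ALIGN(len) >> PAGE_SHIFT);
}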

-- 
Regards,

Laurent Pinchart



[PATCH] LOCAL / input: touchscreen: fix semicolon.cocci warnings

2017-03-06 Thread Julia Lawall
Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

CC: Beomho Seo 
Signed-off-by: Julia Lawall 
Signed-off-by: Fengguang Wu 
---

I also received the following warning from kbuild, without any other
information:

drivers/input/touchscreen/fts_ts.c:750:1-6: WARNING: invalid free of devm_
allocated data

tree:
https://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos.git
exynos-drm-next-tm2
head:   41f00580dc0f947b7788a1b5f57f793dea49ee9a
commit: 15a1244b5349543dfc629b1eda799f0008dbd8bd [7/38] LOCAL / input:
touchscreen: Add FTS_TS touchsreen driver


 fts_ts.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/drivers/input/touchscreen/fts_ts.c
+++ b/drivers/input/touchscreen/fts_ts.c
@@ -558,12 +558,12 @@ static struct fts_i2c_platform_data *fts
if (of_property_read_u32(np, "x-size", &pdata->max_x)) {
dev_err(dev, "failed to get x-size property\n");
return NULL;
-   };
+   }

if (of_property_read_u32(np, "y-size", &pdata->max_y)) {
dev_err(dev, "failed to get y-size property\n");
return NULL;
-   };
+   }

pdata->keys_en = of_property_read_bool(np, "touch-key-connected");



Re: [PATCH] HID: usbhid: Use pr_ and remove unnecessary OOM messages

2017-03-06 Thread Jiri Kosina
On Wed, 1 Mar 2017, Joe Perches wrote:

> Use a more common logging style and remove the unnecessary
> OOM messages as there is default dump_stack when OOM.
> 
> Miscellanea:
> 
> o Hoist an assignment in an if
> o Realign arguments
> o Realign a deeply indented if descendent above a printk
> 
> Signed-off-by: Joe Perches 

Applied to for-4.12/upstream. Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH 0/5] perf/sdt: Argument support for x86 and powepc

2017-03-06 Thread Masami Hiramatsu
On Mon, 6 Mar 2017 13:23:30 +0530
Ravi Bangoria  wrote:

> 
> 
> On Tuesday 07 February 2017 08:25 AM, Masami Hiramatsu wrote:
> > On Thu,  2 Feb 2017 16:41:38 +0530
> > Ravi Bangoria  wrote:
> >
> >> The v5 patchset for sdt marker argument support for x86 [1] has a
> >> couple of issues. For example, it still has x86-specific code in
> >> generic code. It lacks support for rNN registers (with size postfix
> >> b/w/d), %rsp, %esp, %sil etc., and such sdt markers fail at
> >> 'perf probe'. It also fails to convert arguments that have no offset
> >> but still have the register surrounded by parentheses; for example,
> >> 8@(%rdi) is converted to +(%di):u64, which is rejected by
> >> uprobe_events. This causes failures at 'perf probe' for all SDT
> >> events on all archs except x86. With this patchset, I've solved
> >> these issues. (patches 2,3)
> >>
> >> Also, existing perf shows a misleading message when the user tries
> >> to record an sdt event without probing it. I've prepared a patch for
> >> that as well. (patch 1)
> >>
> >> Apart from that, I've also added logic to support arguments with
> >> sdt marker on powerpc. (patch 4)
> >>
> >> There are cases where the uprobe definition of an sdt event goes
> >> beyond the current limit MAX_CMDLEN (256), and in such cases perf
> >> fails with a seg fault. I've solved this issue. (patch 5)
> >>
> >> Note: This patchset is prepared on top of Alexis' v5 series.[1]
> >>
> >> [1] 
> >> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1292251.html
> > Hmm, I must have missed it. I'll check it...
> >
> 
> Hi Masami,
> 
> Can you please review this.

Thanks for kicking me :)
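
(For readers following the thread: operand descriptors like 8@(%rdi)
come from userspace markers of this kind - a hedged sketch using
systemtap's <sys/sdt.h>; the exact operand strings depend entirely on
the compiler's register allocation.)

#include <sys/sdt.h>

void serve_request(long fd, long len)
{
	/* records argument descriptors such as "8@%rdi 8@%rsi", or a
	 * memory operand like "8@(%rsp)", in the ELF SDT note section */
	DTRACE_PROBE2(server, request__start, fd, len);
}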


-- 
Masami Hiramatsu 


[PATCH 0/7 v5] scope GFP_NOFS api

2017-03-06 Thread Michal Hocko
Hi,
I have posted the previous version here [1]. There are no real changes
in the implementation since then. I've just added "lockdep: teach
lockdep about memalloc_noio_save" from Nikolay, a lockdep bugfix
developed independently; "mm: introduce memalloc_nofs_{save,restore}
API" depends on it, so I added it here. Then I've rebased the series on
top of 4.11-rc1, which contains the sched.h split-up and therefore
required adding a sched/mm.h include.

There didn't seem to be any real objections and so I think we should go
and finally merge this - ideally in this release cycle as it doesn't
really introduce any functional changes. Those were separated out and
will be posted later. The risk of regressions should really be small
because we do not remove any real GFP_NOFS users yet.

Diffstat says
 fs/jbd2/journal.c |  8 
 fs/jbd2/transaction.c | 12 
 fs/xfs/kmem.c | 12 ++--
 fs/xfs/kmem.h |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c |  6 +++---
 fs/xfs/xfs_buf.c  |  8 
 fs/xfs/xfs_trans.c| 12 ++--
 include/linux/gfp.h   | 18 +-
 include/linux/jbd2.h  |  2 ++
 include/linux/sched.h |  6 +++---
 include/linux/sched/mm.h  | 26 +++---
 kernel/locking/lockdep.c  | 11 +--
 lib/radix-tree.c  |  2 ++
 mm/page_alloc.c   | 10 ++
 mm/vmscan.c   |  6 +++---
 16 files changed, 106 insertions(+), 37 deletions(-)

Shortlog:
Michal Hocko (6):
  lockdep: allow to disable reclaim lockup detection
  xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  mm: introduce memalloc_nofs_{save,restore} API
  xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  jbd2: mark the transaction context with the scope GFP_NOFS context
  jbd2: make the whole kjournald2 kthread NOFS safe

Nikolay Borisov (1):
  lockdep: teach lockdep about memalloc_noio_save


[1] http://lkml.kernel.org/r/20170206140718.16222-1-mho...@kernel.org
[2] http://lkml.kernel.org/r/20170117030118.727jqyamjhojz...@thunk.org


[PATCH] irqchip: crossbar: Fix incorrect type of register size

2017-03-06 Thread Franck Demathieu
The 'size' variable is unsigned according to the dt-bindings.
As this variable is used as an integer in other places, create a new
variable that allows us to fix the following sparse issue (-Wtypesign):

  drivers/irqchip/irq-crossbar.c:279:52: warning: incorrect type in argument 3 (different signedness)
  drivers/irqchip/irq-crossbar.c:279:52:    expected unsigned int [usertype] *out_value
  drivers/irqchip/irq-crossbar.c:279:52:    got int *
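
For reference, this is the prototype sparse checks against (from
include/linux/of.h) - it expects a u32 pointer, hence the warning when
an int is passed:

static inline int of_property_read_u32(const struct device_node *np,
				       const char *propname,
				       u32 *out_value)
{
	return of_property_read_u32_array(np, propname, out_value, 1);
}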

Signed-off-by: Franck Demathieu 
---
 drivers/irqchip/irq-crossbar.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/irqchip/irq-crossbar.c b/drivers/irqchip/irq-crossbar.c
index 05bbf17..1070b7b 100644
--- a/drivers/irqchip/irq-crossbar.c
+++ b/drivers/irqchip/irq-crossbar.c
@@ -199,7 +199,7 @@ static const struct irq_domain_ops crossbar_domain_ops = {
 static int __init crossbar_of_init(struct device_node *node)
 {
int i, size, reserved = 0;
-   u32 max = 0, entry;
+   u32 max = 0, entry, reg_size;
const __be32 *irqsr;
int ret = -ENOMEM;
 
@@ -276,9 +276,9 @@ static int __init crossbar_of_init(struct device_node *node)
if (!cb->register_offsets)
goto err_irq_map;
 
-   of_property_read_u32(node, "ti,reg-size", &size);
+   of_property_read_u32(node, "ti,reg-size", ®_size);
 
-   switch (size) {
+   switch (reg_size) {
case 1:
cb->write = crossbar_writeb;
break;
@@ -304,7 +304,7 @@ static int __init crossbar_of_init(struct device_node *node)
continue;
 
cb->register_offsets[i] = reserved;
-   reserved += size;
+   reserved += reg_size;
}
 
of_property_read_u32(node, "ti,irqs-safe-map", &cb->safe_map);
-- 
2.10.1



Re: [PATCH] HID: i2c-hid: Fix error handling

2017-03-06 Thread Jiri Kosina
On Sun, 19 Feb 2017, Christophe JAILLET wrote:

> According to error handling in this function, it is likely that some
> resources should be freed before returning.
> Replace 'return ret' with 'goto err'.
> 
> While at it, remove some spaces at the beginning of the lines to be more
> consistent.
> 
> 
> Fixes: ead0687fe304a ("HID: i2c-hid: support regulator power on/off")
> 
> Signed-off-by: Christophe JAILLET 
> ---
>  drivers/hid/i2c-hid/i2c-hid.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/hid/i2c-hid/i2c-hid.c b/drivers/hid/i2c-hid/i2c-hid.c
> index d5288f3fb5ee..1a57ac2d8524 100644
> --- a/drivers/hid/i2c-hid/i2c-hid.c
> +++ b/drivers/hid/i2c-hid/i2c-hid.c
> @@ -1058,13 +1058,13 @@ static int i2c_hid_probe(struct i2c_client *client,
>   }
>  
>   ihid->pdata.supply = devm_regulator_get(&client->dev, "vdd");
> - if (IS_ERR(ihid->pdata.supply)) {
> - ret = PTR_ERR(ihid->pdata.supply);
> - if (ret != -EPROBE_DEFER)
> - dev_err(&client->dev, "Failed to get regulator: %d\n",
> - ret);
> - return ret;
> - }
> + if (IS_ERR(ihid->pdata.supply)) {
> + ret = PTR_ERR(ihid->pdata.supply);
> + if (ret != -EPROBE_DEFER)
> + dev_err(&client->dev, "Failed to get regulator: %d\n",
> + ret);
> + goto err;
> + }

I don't see any spaces at the beginning of lines in the version that's in 
my tree ... o_O?

Therefore I've converted this patch into a simple 'return ret -> goto err'
transformation and applied it on top of for-4.12/i2c-hid.

Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH v17 2/3] usb: USB Type-C connector class

2017-03-06 Thread Heikki Krogerus
Hi Mats,

On Fri, Mar 03, 2017 at 08:27:08PM +0100, Mats Karrman wrote:
> On 2017-03-03 13:59, Heikki Krogerus wrote:
> 
> > On Fri, Mar 03, 2017 at 08:29:18AM +0100, Mats Karrman wrote:
> > 
> 
> > How would something like that sound to you guys?
> 
> Complicated... Need to marinate on that for a while ;)

Sorry about the bad explanation :-). Let me try again. I'm simply
looking for a method that is as scalable as possible for handling the
alternate modes, basically how to couple the different components
involved. A bus feels like the best approach at the moment.

> > > My system is a bit different. It's an i.MX6 SoC with the typec phy and DP 
> > > controller connected
> > > directly to the SoC and it's using DTB/OF.
> > Is this "DP controller" a controller that is capable of taking care of
> > the USB Power Delivery communication with the partner regarding
> > DisplayPort alternate mode?
> 
> No, the "DP controller" just talks DP and knows nothing about Type-C
> or USB PD. It takes a video stream from the SoC and turns it into a DP
> link, set up and orchestrated by the corresponding driver. And all the
> driver needs from Type-C is the plugged-in / interrupt / plugged-out
> events.

Got it.

> The analog switching between USB / safe / DP signal levels in the
> Type-C connector is, I think, best handled by the software doing the
> USB PD negotiation / Altmode handling (using some GPIOs).
> 
> > > Do we need to further standardize attributes under (each) specific
> > > alternate mode to include things such as HPD for the DP mode?
> > I'm not completely sure what kind of system you have, but I would
> > imagine that if we had the bus, your DP controller driver would be the
> > port (and partner) alternate mode driver. The bus would bind you to
> > the typec phy.
> 
> So, both the DP controller and the USB PD phy are I2C devices, and now
> I have to make them both attach to the AM bus as well?

The DP controller would provide the driver, and the USB PD phy
(actually, the typec class) would provide the device.

Would it be a problem to register these I2C devices with some other
subsystem, be it extcon or something like the AM bus? It really would
not be that uncommon. Or have I misunderstood your question?


Thanks,

-- 
heikki


[PATCH v2 2/8] irqchip/gic-v3-its: Initialize MSIs with subsys_initcalls

2017-03-06 Thread Robert Richter
This allows us to use kernel core functionality (e.g. cma) for ITS
initialization. MSIs must be up before the device_initcalls (pci and
platform device probe) and after arch_initcalls (dma init), so
subsys_initcall is fine.
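
For reference, the ordering relied upon here, as laid out in
include/linux/init.h (early_initcall runs before all of the numbered
levels):

/*
 * pure_initcall(fn)      - level 0
 * core_initcall(fn)      - level 1
 * postcore_initcall(fn)  - level 2
 * arch_initcall(fn)      - level 3  (e.g. DMA init)
 * subsys_initcall(fn)    - level 4  (ITS MSI init moves here)
 * fs_initcall(fn)        - level 5
 * device_initcall(fn)    - level 6  (PCI/platform device probe)
 * late_initcall(fn)      - level 7
 */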

Signed-off-by: Robert Richter 
---
 drivers/irqchip/irq-gic-v3-its-pci-msi.c  | 2 +-
 drivers/irqchip/irq-gic-v3-its-platform-msi.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its-pci-msi.c 
b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
index aee1c60d7ab5..dace9bc4ef8d 100644
--- a/drivers/irqchip/irq-gic-v3-its-pci-msi.c
+++ b/drivers/irqchip/irq-gic-v3-its-pci-msi.c
@@ -194,4 +194,4 @@ static int __init its_pci_msi_init(void)
 
return 0;
 }
-early_initcall(its_pci_msi_init);
+subsys_initcall(its_pci_msi_init);
diff --git a/drivers/irqchip/irq-gic-v3-its-platform-msi.c 
b/drivers/irqchip/irq-gic-v3-its-platform-msi.c
index 470b4aa7d62c..7d8c19973766 100644
--- a/drivers/irqchip/irq-gic-v3-its-platform-msi.c
+++ b/drivers/irqchip/irq-gic-v3-its-platform-msi.c
@@ -103,4 +103,4 @@ static int __init its_pmsi_init(void)
 
return 0;
 }
-early_initcall(its_pmsi_init);
+subsys_initcall(its_pmsi_init);
-- 
2.11.0



Re: perf: use-after-free in perf_release

2017-03-06 Thread Dmitry Vyukov
On Mon, Mar 6, 2017 at 2:14 PM, Peter Zijlstra  wrote:
> On Mon, Mar 06, 2017 at 10:57:07AM +0100, Dmitry Vyukov wrote:
>
>> ==
>> BUG: KASAN: use-after-free in atomic_dec_and_test
>> arch/x86/include/asm/atomic.h:123 [inline] at addr 880079c30158
>> BUG: KASAN: use-after-free in put_task_struct
>> include/linux/sched/task.h:93 [inline] at addr 880079c30158
>> BUG: KASAN: use-after-free in put_ctx+0xcf/0x110
>
> FWIW, this output is very confusing, is this a result of your
> post-processing replicating the line for every 'inlined' part?


Yes.
We probably should not do this inlining in the header line. But the
problem is that, in general, it is very difficult to tell that a given
line is the header line.


>> kernel/events/core.c:1131 at addr 880079c30158
>> Write of size 4 by task syz-executor6/25698
>
>>  atomic_dec_and_test arch/x86/include/asm/atomic.h:123 [inline]
>>  put_task_struct include/linux/sched/task.h:93 [inline]
>>  put_ctx+0xcf/0x110 kernel/events/core.c:1131
>>  perf_event_release_kernel+0x3ad/0xc90 kernel/events/core.c:4322
>>  perf_release+0x37/0x50 kernel/events/core.c:4338
>>  __fput+0x332/0x800 fs/file_table.c:209
>>  fput+0x15/0x20 fs/file_table.c:245
>>  task_work_run+0x197/0x260 kernel/task_work.c:116
>>  exit_task_work include/linux/task_work.h:21 [inline]
>>  do_exit+0xb38/0x29c0 kernel/exit.c:880
>>  do_group_exit+0x149/0x420 kernel/exit.c:984
>>  get_signal+0x7e0/0x1820 kernel/signal.c:2318
>>  do_signal+0xd2/0x2190 arch/x86/kernel/signal.c:808
>>  exit_to_usermode_loop+0x200/0x2a0 arch/x86/entry/common.c:157
>>  syscall_return_slowpath arch/x86/entry/common.c:191 [inline]
>>  do_syscall_64+0x6fc/0x930 arch/x86/entry/common.c:286
>>  entry_SYSCALL64_slow_path+0x25/0x25
>
> So this is fput()..
>
>
>> Freed:
>> PID = 25681
>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>>  set_track mm/kasan/kasan.c:525 [inline]
>>  kasan_slab_free+0x6f/0xb0 mm/kasan/kasan.c:589
>>  __cache_free mm/slab.c:3514 [inline]
>>  kmem_cache_free+0x71/0x240 mm/slab.c:3774
>>  free_task_struct kernel/fork.c:158 [inline]
>>  free_task+0x151/0x1d0 kernel/fork.c:370
>>  copy_process.part.38+0x18e5/0x4aa0 kernel/fork.c:1931
>>  copy_process kernel/fork.c:1531 [inline]
>>  _do_fork+0x200/0x1010 kernel/fork.c:1994
>>  SYSC_clone kernel/fork.c:2104 [inline]
>>  SyS_clone+0x37/0x50 kernel/fork.c:2098
>>  do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281
>>  return_from_SYSCALL_64+0x0/0x7a
>
> and this is a failed fork().
>
>
> However, inherited events don't have a filedesc to fput(), and
> similarly, a task that fails fork() has never been visible to attach a perf
> event to because it never hits the pid-hash.
>
> Or so it is assumed.
>
> I'm forever getting lost in the PID code. Oleg, is there any way
> find_task_by_vpid() can return a task that can still fail fork() ?


FWIW here are 2 syzkaller programs that triggered the bug:
https://gist.githubusercontent.com/dvyukov/d67f980050589775237a7fbdff226bec/raw/4bca72861cb2ede64059b6dad403e19f425a361f/gistfile1.txt
They look very similar, so most likely they are a mutation of the same
program. Which may suggest that there is something in that program
that provokes the bug. Note that the calls in these programs are
executed potentially in multiple threads. But at least it can give
some idea wrt e.g. flags passed to perf_event_open.


Re: [RFC PATCH 00/12] Ion cleanup in preparation for moving out of staging

2017-03-06 Thread Michal Hocko
On Mon 06-03-17 11:40:41, Daniel Vetter wrote:
> On Mon, Mar 06, 2017 at 08:42:59AM +0100, Michal Hocko wrote:
> > On Fri 03-03-17 09:37:55, Laura Abbott wrote:
> > > On 03/03/2017 05:29 AM, Michal Hocko wrote:
> > > > On Thu 02-03-17 13:44:32, Laura Abbott wrote:
> > > >> Hi,
> > > >>
> > > >> There's been some recent discussions[1] about Ion-like frameworks. 
> > > >> There's
> > > >> apparently interest in just keeping Ion since it works reasonablly 
> > > >> well.
> > > >> This series does what should be the final clean ups for it to possibly 
> > > >> be
> > > >> moved out of staging.
> > > >>
> > > >> This includes the following:
> > > >> - Some general clean up and removal of features that never got a lot 
> > > >> of use
> > > >>   as far as I can tell.
> > > >> - Fixing up the caching. This is the series I proposed back in 
> > > >> December[2]
> > > >>   but never heard any feedback on. It will certainly break existing
> > > >>   applications that rely on the implicit caching. I'd rather make an 
> > > >> effort
> > > >>   to move to a model that isn't going directly against the 
> > > >> establishement
> > > >>   though.
> > > >> - Fixing up the platform support. The devicetree approach was never 
> > > >> well
> > > >>   recieved by DT maintainers. The proposal here is to think of Ion 
> > > >> less as
> > > >>   specifying requirements and more of a framework for exposing memory 
> > > >> to
> > > >>   userspace.
> > > >> - CMA allocations now happen without the need of a dummy device 
> > > >> structure.
> > > >>   This fixes a bunch of the reasons why I attempted to add devicetree
> > > >>   support before.
> > > >>
> > > >> I've had problems getting feedback in the past so if I don't hear any 
> > > >> major
> > > >> objections I'm going to send out with the RFC dropped to be picked up.
> > > >> The only reason there isn't a patch to come out of staging is to 
> > > >> discuss any
> > > >> other changes to the ABI people might want. Once this comes out of 
> > > >> staging,
> > > >> I really don't want to mess with the ABI.
> > > > 
> > > > Could you recapitulate concerns preventing the code being merged
> > > > normally rather than through the staging tree and how they were
> > > > addressed?
> > > > 
> > > 
> > > Sorry, I'm really not understanding your question here, can you
> > > clarify?
> > 
> > There must have been a reason why this code ended up in the staging
> > tree, right? So my question is what those reasons were and how they were
> > handled in order to move the code from the staging subtree.
> 
> No one gave a thing about android in upstream, so Greg KH just dumped it
> all into staging/android/. We've discussed ION a bunch of times, recorded
> anything we'd like to fix in staging/android/TODO, and Laura's patch
> series here addresses a big chunk of that.

Thanks for the TODO reference. I was looking exactly at something like
that in drivers/staging/android/ion/. Too bad I didn't look one directory
up.

Thanks for the clarification!

-- 
Michal Hocko
SUSE Labs


Re: Question Regarding ERMS memcpy

2017-03-06 Thread Borislav Petkov
On Mon, Mar 06, 2017 at 12:01:10AM -0700, Logan Gunthorpe wrote:
> Well honestly my issue was solved by fixing my kernel config. I have no
> idea why I had optimize for size in there in the first place.

I still think that we should address the iomem memcpy Linus mentioned.
So how about this partial revert? I've made 32-bit use the same special
__memcpy() version.

Hmmm?
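
For illustration, this is the kind of caller that needs the special
version - a hedged sketch with made-up names; the point is that
uncached MMIO must not be touched with the ERMS rep movsb memcpy:

/* hypothetical driver helper: snapshot device registers over MMIO */
static int snap_regs(struct resource *res, u32 *snap, size_t len)
{
	void __iomem *regs = ioremap(res->start, resource_size(res));

	if (!regs)
		return -ENOMEM;

	/* goes through __inline_memcpy() after this change */
	memcpy_fromio(snap, regs, len);

	iounmap(regs);
	return 0;
}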

---
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 7afb0e2f07f4..9e378a10796d 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -201,6 +201,7 @@ extern void set_iounmap_nonlazy(void);
 #ifdef __KERNEL__
 
 #include 
+#include 
 
 /*
  * Convert a virtual cached pointer to an uncached pointer
@@ -227,12 +228,13 @@ memset_io(volatile void __iomem *addr, unsigned char val, 
size_t count)
  * @src:   The (I/O memory) source for the data
  * @count: The number of bytes to copy
  *
- * Copy a block of data from I/O memory.
+ * Copy a block of data from I/O memory. IO memory is different from
+ * cached memory so we use special memcpy version.
  */
 static inline void
 memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count)
 {
-   memcpy(dst, (const void __force *)src, count);
+   __inline_memcpy(dst, (const void __force *)src, count);
 }
 
 /**
@@ -241,12 +243,13 @@ memcpy_fromio(void *dst, const volatile void __iomem 
*src, size_t count)
  * @src:   The (RAM) source for the data
  * @count: The number of bytes to copy
  *
- * Copy a block of data to I/O memory.
+ * Copy a block of data to I/O memory. IO memory is different from
+ * cached memory so we use special memcpy version.
  */
 static inline void
 memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
 {
-   memcpy((void __force *)dst, src, count);
+   __inline_memcpy((void __force *)dst, src, count);
 }
 
 /*
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e8353ee5c..556fa4a975ff 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -29,6 +29,7 @@ extern char *strchr(const char *s, int c);
 #define __HAVE_ARCH_STRLEN
 extern size_t strlen(const char *s);
 
+#define __inline_memcpy __memcpy
 static __always_inline void *__memcpy(void *to, const void *from, size_t n)
 {
int d0, d1, d2;

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 


Re: perf: use-after-free in perf_release

2017-03-06 Thread Peter Zijlstra
On Mon, Mar 06, 2017 at 10:57:07AM +0100, Dmitry Vyukov wrote:

> ==
> BUG: KASAN: use-after-free in atomic_dec_and_test
> arch/x86/include/asm/atomic.h:123 [inline] at addr 880079c30158
> BUG: KASAN: use-after-free in put_task_struct
> include/linux/sched/task.h:93 [inline] at addr 880079c30158
> BUG: KASAN: use-after-free in put_ctx+0xcf/0x110

FWIW, this output is very confusing, is this a result of your
post-processing replicating the line for every 'inlined' part?

> kernel/events/core.c:1131 at addr 880079c30158
> Write of size 4 by task syz-executor6/25698

>  atomic_dec_and_test arch/x86/include/asm/atomic.h:123 [inline]
>  put_task_struct include/linux/sched/task.h:93 [inline]
>  put_ctx+0xcf/0x110 kernel/events/core.c:1131
>  perf_event_release_kernel+0x3ad/0xc90 kernel/events/core.c:4322
>  perf_release+0x37/0x50 kernel/events/core.c:4338
>  __fput+0x332/0x800 fs/file_table.c:209
>  fput+0x15/0x20 fs/file_table.c:245
>  task_work_run+0x197/0x260 kernel/task_work.c:116
>  exit_task_work include/linux/task_work.h:21 [inline]
>  do_exit+0xb38/0x29c0 kernel/exit.c:880
>  do_group_exit+0x149/0x420 kernel/exit.c:984
>  get_signal+0x7e0/0x1820 kernel/signal.c:2318
>  do_signal+0xd2/0x2190 arch/x86/kernel/signal.c:808
>  exit_to_usermode_loop+0x200/0x2a0 arch/x86/entry/common.c:157
>  syscall_return_slowpath arch/x86/entry/common.c:191 [inline]
>  do_syscall_64+0x6fc/0x930 arch/x86/entry/common.c:286
>  entry_SYSCALL64_slow_path+0x25/0x25

So this is fput()..


> Freed:
> PID = 25681
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_slab_free+0x6f/0xb0 mm/kasan/kasan.c:589
>  __cache_free mm/slab.c:3514 [inline]
>  kmem_cache_free+0x71/0x240 mm/slab.c:3774
>  free_task_struct kernel/fork.c:158 [inline]
>  free_task+0x151/0x1d0 kernel/fork.c:370
>  copy_process.part.38+0x18e5/0x4aa0 kernel/fork.c:1931
>  copy_process kernel/fork.c:1531 [inline]
>  _do_fork+0x200/0x1010 kernel/fork.c:1994
>  SYSC_clone kernel/fork.c:2104 [inline]
>  SyS_clone+0x37/0x50 kernel/fork.c:2098
>  do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281
>  return_from_SYSCALL_64+0x0/0x7a

and this is a failed fork().


However, inherited events don't have a filedesc to fput(), and
similarly, a task that fails fork() has never been visible to attach a perf
event to because it never hits the pid-hash.

Or so it is assumed.

I'm forever getting lost in the PID code. Oleg, is there any way
find_task_by_vpid() can return a task that can still fail fork() ?
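
To restate the suspected window as a sketch (an assumption to be
confirmed, per the question above):

/*
 * clone()                              perf_event_open()
 * -------                              -----------------
 * copy_process()
 *   task becomes reachable (?)         find_task_by_vpid() -> task
 *                                      event ctx takes a task reference
 *   late failure path:
 *   free_task()  <- frees task_struct
 *                                      ...exit path...
 *                                      put_ctx()
 *                                        put_task_struct() <- UAF write
 */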



Re: [PATCH v17 2/3] usb: USB Type-C connector class

2017-03-06 Thread Heikki Krogerus
Hi Peter,

On Mon, Mar 06, 2017 at 09:15:51AM +0800, Peter Chen wrote:
> > > What interface you use when you receive this event to handle
> > > dual-role switch? I am wonder if a common dual-role class is
> > > needed, then we can have a common user utility.
> > > 
> > > Eg, if "data_role" has changed, the udev can echo "data_role" to
> > > /sys/class/usb-dual-role/role
> > 
> > No. If the partner executes successfully for example DR_Swap message,
> > the kernel has to take care everything that is needed for the role to
> > be what ever was negotiated on its own. User space can't be involved
> > with that.
> > 
> 
> Would you give me an example of how the kernel handles this? How does a
> type-C event trigger a role switch?

On our boards, the firmware or EC (or ACPI) configures the hardware as
needed and also notifies the components using ACPI if needed. It's
often not even possible to directly configure the components/hardware
for a particular role.

I'm not commenting on Roger's dual role patch series, but I don't
really think it should be mixed with Type-C. USB Type-C and USB Power
Delivery define their own ways of handling the roles, and they are not
limited to the data role only. Things like OTG, for example, will not,
and actually can not, be supported. With Type-C we will have competing
state machines compared to OTG. The dual-role framework may be useful
on systems that provide more traditional connectors, which possibly
have the ID-pin like micro-AB, and possibly also support OTG. It can
also be something that exist in parallel with the Type-C class, but
there just can not be any dependencies between the two.


Thanks,

-- 
heikki


[PATCH v2] f2fs: combine nat_bits and free_nid_bitmap cache

2017-03-06 Thread Chao Yu
Both the nat_bits cache and the free_nid_bitmap cache provide the same
functionality as an intermediate cache between the free nid cache and
disk, but with different granularity of indicating free nid ranges, and
different persistence policies. The nat_bits cache provides better
persistence ability, and free_nid_bitmap provides better granularity.

In this patch we combine the advantages of both caches, so the final
policy of the intermediate cache is:
- init: load free nid status from nat_bits into free_nid_bitmap
- lookup: scan free_nid_bitmap before loading NAT blocks
- update: update free_nid_bitmap in real-time
- persistence: update and persist nat_bits in checkpoint

Signed-off-by: Chao Yu 
---
 fs/f2fs/node.c | 105 +++--
 1 file changed, 35 insertions(+), 70 deletions(-)

diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 1a759d45b7e4..625b46bc55ad 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -338,9 +338,6 @@ static void set_node_addr(struct f2fs_sb_info *sbi, struct 
node_info *ni,
set_nat_flag(e, IS_CHECKPOINTED, false);
__set_nat_cache_dirty(nm_i, e);
 
-   if (enabled_nat_bits(sbi, NULL) && new_blkaddr == NEW_ADDR)
-   clear_bit_le(NAT_BLOCK_OFFSET(ni->nid), nm_i->empty_nat_bits);
-
/* update fsync_mark if its inode nat entry is still alive */
if (ni->nid != ni->ino)
e = __lookup_nat_cache(nm_i, ni->ino);
@@ -1920,58 +1917,6 @@ static void scan_free_nid_bits(struct f2fs_sb_info *sbi)
up_read(&nm_i->nat_tree_lock);
 }
 
-static int scan_nat_bits(struct f2fs_sb_info *sbi)
-{
-   struct f2fs_nm_info *nm_i = NM_I(sbi);
-   struct page *page;
-   unsigned int i = 0;
-   nid_t nid;
-
-   if (!enabled_nat_bits(sbi, NULL))
-   return -EAGAIN;
-
-   down_read(&nm_i->nat_tree_lock);
-check_empty:
-   i = find_next_bit_le(nm_i->empty_nat_bits, nm_i->nat_blocks, i);
-   if (i >= nm_i->nat_blocks) {
-   i = 0;
-   goto check_partial;
-   }
-
-   for (nid = i * NAT_ENTRY_PER_BLOCK; nid < (i + 1) * NAT_ENTRY_PER_BLOCK;
-   nid++) {
-   if (unlikely(nid >= nm_i->max_nid))
-   break;
-   add_free_nid(sbi, nid, true);
-   }
-
-   if (nm_i->nid_cnt[FREE_NID_LIST] >= MAX_FREE_NIDS)
-   goto out;
-   i++;
-   goto check_empty;
-
-check_partial:
-   i = find_next_zero_bit_le(nm_i->full_nat_bits, nm_i->nat_blocks, i);
-   if (i >= nm_i->nat_blocks) {
-   disable_nat_bits(sbi, true);
-   up_read(&nm_i->nat_tree_lock);
-   return -EINVAL;
-   }
-
-   nid = i * NAT_ENTRY_PER_BLOCK;
-   page = get_current_nat_page(sbi, nid);
-   scan_nat_page(sbi, page, nid);
-   f2fs_put_page(page, 1);
-
-   if (nm_i->nid_cnt[FREE_NID_LIST] < MAX_FREE_NIDS) {
-   i++;
-   goto check_partial;
-   }
-out:
-   up_read(&nm_i->nat_tree_lock);
-   return 0;
-}
-
 static void __build_free_nids(struct f2fs_sb_info *sbi, bool sync, bool mount)
 {
struct f2fs_nm_info *nm_i = NM_I(sbi);
@@ -1993,21 +1938,6 @@ static void __build_free_nids(struct f2fs_sb_info *sbi, 
bool sync, bool mount)
 
if (nm_i->nid_cnt[FREE_NID_LIST])
return;
-
-   /* try to find free nids with nat_bits */
-   if (!scan_nat_bits(sbi) && nm_i->nid_cnt[FREE_NID_LIST])
-   return;
-   }
-
-   /* find next valid candidate */
-   if (enabled_nat_bits(sbi, NULL)) {
-   int idx = find_next_zero_bit_le(nm_i->full_nat_bits,
-   nm_i->nat_blocks, 0);
-
-   if (idx >= nm_i->nat_blocks)
-   set_sbi_flag(sbi, SBI_NEED_FSCK);
-   else
-   nid = idx * NAT_ENTRY_PER_BLOCK;
}
 
/* readahead nat pages to be scanned */
@@ -2590,6 +2520,38 @@ static int __get_nat_bitmaps(struct f2fs_sb_info *sbi)
return 0;
 }
 
+inline void load_free_nid_bitmap(struct f2fs_sb_info *sbi)
+{
+   struct f2fs_nm_info *nm_i = NM_I(sbi);
+   unsigned int i = 0;
+   nid_t nid, last_nid;
+
+   if (!enabled_nat_bits(sbi, NULL))
+   return;
+
+   for (i = 0; i < nm_i->nat_blocks; i++) {
+   i = find_next_bit_le(nm_i->empty_nat_bits, nm_i->nat_blocks, i);
+   if (i >= nm_i->nat_blocks)
+   break;
+
+   __set_bit_le(i, nm_i->nat_block_bitmap);
+
+   nid = i * NAT_ENTRY_PER_BLOCK;
+   last_nid = (i + 1) * NAT_ENTRY_PER_BLOCK;
+
+   for (; nid < last_nid; nid++)
+   update_free_nid_bitmap(sbi, nid, true, true);
+   }
+
+   for (i = 0; i < nm_i->nat_blocks; i++) {
+   i = find_next_bit_le(nm_i->full_nat_bits,

[PATCH 4/7] mm: introduce memalloc_nofs_{save,restore} API

2017-03-06 Thread Michal Hocko
From: Michal Hocko 

GFP_NOFS context is currently used for the following 5 reasons:
- to prevent deadlocks when the lock held by the allocation
  context would be needed during the memory reclaim
- to prevent stack overflows during the reclaim because
  the allocation is performed from a deep context already
- to prevent lockups when the allocation context depends on
  other reclaimers to make forward progress indirectly
- just in case, because this would be safe from the fs POV
- to silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only is this easier to understand and maintain (because there are
far fewer problematic contexts than specific allocation requests), it
also helps code paths where the FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope
of GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.
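
A minimal usage sketch of the new API, mirroring the established
memalloc_noio pattern ('size' below is just a placeholder):

	unsigned int nofs_flag;
	void *ptr;

	/* enter a scope where no allocation may recurse into the FS */
	nofs_flag = memalloc_nofs_save();

	/* a plain GFP_KERNEL request implicitly behaves as GFP_NOFS here */
	ptr = kmalloc(size, GFP_KERNEL);

	memalloc_nofs_restore(nofs_flag);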

Acked-by: Vlastimil Babka 
Signed-off-by: Michal Hocko 
---
 fs/xfs/kmem.h|  2 +-
 include/linux/gfp.h  |  8 
 include/linux/sched.h|  8 +++-
 include/linux/sched/mm.h | 26 +++---
 kernel/locking/lockdep.c |  6 +++---
 mm/page_alloc.c  | 10 ++
 mm/vmscan.c  |  6 +++---
 7 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+   if (flags & KM_NOFS)
lflags &= ~__GFP_FS;
}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 978232a3b4ae..2bfcfd33e476 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which 
cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4528f7c9789f..9c3ee2281a56 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1211,9 +1211,9 @@ extern struct pid *cad_pid;
 #define PF_USED_ASYNC  0x4000  /* Used async_schedule*(), used 
by module init */
 #define PF_NOFREEZE0x8000  /* This thread should not be 
f

Re: [PATCH] pinctrl: samsung: fix segfault when using external interrupts on s3c24xx

2017-03-06 Thread Sergio Prado
Hi Krzysztof,

> > This is a regression from commit 8b1bd11c1f8f529057369c5b3702d13fd24e2765.
> 
> Checkpatch should complain here about commit format.
> 
> > 
> > Tested on FriendlyARM mini2440.
> > 
> 
> Please add:
>   Fixes: 8b1bd11c1f8f ("pinctrl: samsung: Add the support the multiple 
> IORESOURCE_MEM for one pin-bank")
>   Cc: 
> 

OK.

> > Signed-off-by: Sergio Prado 
> > ---
> >  drivers/pinctrl/samsung/pinctrl-s3c24xx.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/pinctrl/samsung/pinctrl-s3c24xx.c 
> > b/drivers/pinctrl/samsung/pinctrl-s3c24xx.c
> > index b82a003546ae..1b8d887796e8 100644
> > --- a/drivers/pinctrl/samsung/pinctrl-s3c24xx.c
> > +++ b/drivers/pinctrl/samsung/pinctrl-s3c24xx.c
> > @@ -356,8 +356,8 @@ static inline void s3c24xx_demux_eint(struct irq_desc 
> > *desc,
> >  {
> > struct s3c24xx_eint_data *data = irq_desc_get_handler_data(desc);
> > struct irq_chip *chip = irq_desc_get_chip(desc);
> > -   struct irq_data *irqd = irq_desc_get_irq_data(desc);
> > -   struct samsung_pin_bank *bank = irq_data_get_irq_chip_data(irqd);
> > +   struct samsung_pinctrl_drv_data *d = data->drvdata;
> > +   struct samsung_pin_bank *bank = d->pin_banks;
> 
> I think 'pin_banks' points to all banks of the given controller, not to
> the currently accessed one.

Understood. I think it worked in my tests because on s3c2440 all banks
have the same eint base address.

So what do you think is the best approach to solve this problem?
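
Would something along these lines be acceptable? A hedged, untested
sketch - resolve the bank by matching the EINT number against each
bank's eint_offset range instead of blindly taking the first bank:

/* untested: find the bank that owns a given EINT number */
static struct samsung_pin_bank *
s3c24xx_eint_to_bank(struct samsung_pinctrl_drv_data *d, unsigned int eint)
{
	struct samsung_pin_bank *bank = d->pin_banks;
	unsigned int i;

	for (i = 0; i < d->nr_banks; ++i, ++bank)
		if (eint >= bank->eint_offset &&
		    eint < bank->eint_offset + bank->nr_pins)
			return bank;

	return NULL;
}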

> 
> 
> Best regards,
> Krzysztof
> 

-- 
Sergio Prado
Embedded Labworks
Office: +55 11 2628-3461
Mobile: +55 11 97123-3420


[PATCH 5/7] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-03-06 Thread Michal Hocko
From: Michal Hocko 

kmem_zalloc_large and _xfs_buf_map_pages use the memalloc_noio_{save,restore}
API to prevent reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts. The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
provide exactly what we need here - implicit GFP_NOFS context.

Changes since v1
- s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Acked-by: Vlastimil Babka 
Reviewed-by: Brian Foster 
Reviewed-by: Darrick J. Wong 
Signed-off-by: Michal Hocko 
---
 fs/xfs/kmem.c| 12 ++--
 fs/xfs/xfs_buf.c |  8 
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index e14da724a0b5..6b7b04468aa8 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -66,7 +66,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-   unsigned noio_flag = 0;
+   unsigned nofs_flag = 0;
void*ptr;
gfp_t   lflags;
 
@@ -78,17 +78,17 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 * __vmalloc() will allocate data pages and auxillary structures (e.g.
 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
 * here. Hence we need to tell memory reclaim that we are in such a
-* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
+* context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
 * the filesystem here and potentially deadlocking.
 */
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-   noio_flag = memalloc_noio_save();
+   if (flags & KM_NOFS)
+   nofs_flag = memalloc_nofs_save();
 
lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-   if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-   memalloc_noio_restore(noio_flag);
+   if (flags & KM_NOFS)
+   memalloc_nofs_restore(nofs_flag);
 
return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index b6208728ba39..ca09061369cb 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -443,17 +443,17 @@ _xfs_buf_map_pages(
bp->b_addr = NULL;
} else {
int retried = 0;
-   unsigned noio_flag;
+   unsigned nofs_flag;
 
/*
 * vm_map_ram() will allocate auxillary structures (e.g.
 * pagetables) with GFP_KERNEL, yet we are likely to be under
 * GFP_NOFS context here. Hence we need to tell memory reclaim
-* that we are in such a context via PF_MEMALLOC_NOIO to prevent
+* that we are in such a context via PF_MEMALLOC_NOFS to prevent
 * memory reclaim re-entering the filesystem here and
 * potentially deadlocking.
 */
-   noio_flag = memalloc_noio_save();
+   nofs_flag = memalloc_nofs_save();
do {
bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
-1, PAGE_KERNEL);
@@ -461,7 +461,7 @@ _xfs_buf_map_pages(
break;
vm_unmap_aliases();
} while (retried++ <= 1);
-   memalloc_noio_restore(noio_flag);
+   memalloc_nofs_restore(nofs_flag);
 
if (!bp->b_addr)
return -ENOMEM;
-- 
2.11.0



Re: [PATCH v2 1/2] HID: reject input outside logical range only if null state is set

2017-03-06 Thread Jiri Kosina
On Tue, 14 Feb 2017, Tomasz Kramkowski wrote:

> From: Valtteri Heikkilä 
> 
> This patch fixes an issue in drivers/hid/hid-input.c where USB HID
> control null state flag is not checked upon rejecting inputs outside
> logical minimum-maximum range. The check should be made according to USB
> HID specification 1.11, section 6.2.2.5, p.31. The fix will resolve
> issues with some game controllers, such as:
> https://bugzilla.kernel.org/show_bug.cgi?id=68621
> 
> [t...@the-tk.com: shortened and fixed spelling in commit message]
> Signed-off-by: Valtteri Heikkilä 
> Signed-off-by: Tomasz Kramkowski 

Applied to for-4.12/hid-core-null-state-handling. Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH 1/2] xfs: allow kmem_zalloc_greedy to fail

2017-03-06 Thread Michal Hocko
On Sat 04-03-17 09:54:44, Dave Chinner wrote:
> On Thu, Mar 02, 2017 at 04:45:40PM +0100, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > Even though kmem_zalloc_greedy is documented as being allowed to fail,
> > the current code doesn't really implement this properly and loops on
> > the smallest allowed size for ever. This is a problem because vzalloc
> > might fail permanently - we might run out of vmalloc space, or, since
> > 5d17a73a2ebe ("vmalloc: back off when the current task is killed"),
> > the current task might be killed. The latter makes the failure scenario
> > much more probable than it used to be because it makes vmalloc()
> > failures permanent for tasks with fatal signals pending. Fix this by
> > bailing out
> > if the minimum size request failed.
> > 
> > This has been noticed by a hung generic/269 xfstest by Xiong Zhou.
> > 
> > fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, 
> > mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null)
> > fsstress cpuset=/ mems_allowed=0-1
> > CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21
> > Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 
> > 10/05/2016
> > Call Trace:
> >  dump_stack+0x63/0x87
> >  warn_alloc+0x114/0x1c0
> >  ? alloc_pages_current+0x88/0x120
> >  __vmalloc_node_range+0x250/0x2a0
> >  ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> >  ? free_hot_cold_page+0x21f/0x280
> >  vzalloc+0x54/0x60
> >  ? kmem_zalloc_greedy+0x2b/0x40 [xfs]
> >  kmem_zalloc_greedy+0x2b/0x40 [xfs]
> >  xfs_bulkstat+0x11b/0x730 [xfs]
> >  ? xfs_bulkstat_one_int+0x340/0x340 [xfs]
> >  ? selinux_capable+0x20/0x30
> >  ? security_capable+0x48/0x60
> >  xfs_ioc_bulkstat+0xe4/0x190 [xfs]
> >  xfs_file_ioctl+0x9dd/0xad0 [xfs]
> >  ? do_filp_open+0xa5/0x100
> >  do_vfs_ioctl+0xa7/0x5e0
> >  SyS_ioctl+0x79/0x90
> >  do_syscall_64+0x67/0x180
> >  entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > fsstress keeps looping inside kmem_zalloc_greedy without any way out
> > because vmalloc keeps failing due to fatal_signal_pending.
> > 
> > Reported-by: Xiong Zhou 
> > Analyzed-by: Tetsuo Handa 
> > Signed-off-by: Michal Hocko 
> > ---
> >  fs/xfs/kmem.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index 339c696bbc01..ee95f5c6db45 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t 
> > maxsize)
> > size_t  kmsize = maxsize;
> >  
> > while (!(ptr = vzalloc(kmsize))) {
> > +   if (kmsize == minsize)
> > +   break;
> > if ((kmsize >>= 1) <= minsize)
> > kmsize = minsize;
> > }
> 
> Seems wrong to me - this function used to have lots of callers and
> over time we've slowly removed them or replaced them with something
> else. I'd suggest removing it completely, replacing the call sites
> with kmem_zalloc_large().

I do not really care how this gets fixed. Dropping kmem_zalloc_greedy
sounds like a good way to go. I am not familiar enough with xfs_bulkstat
to make an educated guess about which allocation size to use, so I guess
I have to leave this to you guys if you prefer that route.
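
For the record, the shape of that replacement would presumably be
something like the following - a hedged sketch only; the buffer size
needs someone who knows the bulkstat sizing, so PAGE_SIZE * 4 below is
purely a placeholder:

	/* hypothetical: fixed-size buffer instead of greedy sizing */
	irbuf = kmem_zalloc_large(PAGE_SIZE * 4, KM_MAYFAIL);
	if (!irbuf)
		return -ENOMEM;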

Thanks!
-- 
Michal Hocko
SUSE Labs


<    2   3   4   5   6   7   8   9   10   11   >