Re: [PATCH 1/2] mm: use is_migrate_highatomic() to simplify the code
On Fri 03-03-17 15:06:19, Andrew Morton wrote:
> On Fri, 3 Mar 2017 14:18:08 +0100 Michal Hocko wrote:
>
> > On Fri 03-03-17 19:10:13, Xishi Qiu wrote:
> > > Introduce two helpers, is_migrate_highatomic() and
> > > is_migrate_highatomic_page().
> > > Simplify the code, no functional changes.
> >
> > static inline helpers would be nicer than macros
>
> Always.
>
> We made a big dependency mess in mmzone.h. internal.h works.

Just too bad we have three different header files for
	is_migrate_isolate{_page}    - include/linux/page-isolation.h
	is_migrate_cma{_page}        - include/linux/mmzone.h
	is_migrate_highatomic{_page} - mm/internal.h

I guess we want all of them in internal.h?

-- 
Michal Hocko
SUSE Labs
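The static-inline form being endorsed in this thread can be sketched outside the kernel. Everything below is a simplified stand-in for illustration only — the real enum, `struct page`, and `get_pageblock_migratetype()` live in the kernel headers discussed above and look different:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's migratetype machinery. */
enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_HIGHATOMIC,
	MIGRATE_TYPES
};

struct page {
	enum migratetype mt;	/* simplified: the kernel derives this from pageblock flags */
};

static inline enum migratetype get_pageblock_migratetype(const struct page *page)
{
	return page->mt;
}

/*
 * static inline helpers instead of macros: arguments are type-checked
 * and evaluated exactly once, and the helpers show up in debug info.
 */
static inline bool is_migrate_highatomic(enum migratetype migratetype)
{
	return migratetype == MIGRATE_HIGHATOMIC;
}

static inline bool is_migrate_highatomic_page(const struct page *page)
{
	return is_migrate_highatomic(get_pageblock_migratetype(page));
}
```

A macro version would silently accept any integer expression; the inline functions reject a wrong type at compile time, which is the "nicer than macros" point being made.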
Re: [Patch v2 03/11] s5p-mfc: Use min scratch buffer size as provided by F/W
On 03.03.2017 10:07, Smitha T Murthy wrote:
> After MFC v8.0, mfc f/w lets the driver know how much scratch buffer
> size is required for decoder. If mfc f/w has the functionality,
> E_MIN_SCRATCH_BUFFER_SIZE, driver can know how much scratch buffer size
> is required for encoder too.
>
> Signed-off-by: Smitha T Murthy

Reviewed-by: Andrzej Hajda

-- 
Regards
Andrzej
[v2 PATCH 3/3] mmc: sdhci-cadence: Update PHY delay configuration
PHY settings can differ between platforms and SoCs. The fixed PHY input
delays were replaced with SoC-specific compatible data. DT properties
are used to configure the new PHY DLL delays.

Signed-off-by: Piotr Sroka
---
Changes for v2:
- dts part was removed from this patch
- most delays were moved from dts file to data associated with an SoC
  specific compatible
- remove unrelated changes
---
 drivers/mmc/host/sdhci-cadence.c | 124 ++++++++++++++++++++++++++++++++++---
 1 file changed, 116 insertions(+), 8 deletions(-)

diff --git a/drivers/mmc/host/sdhci-cadence.c b/drivers/mmc/host/sdhci-cadence.c
index b2334ec..29b5d11 100644
--- a/drivers/mmc/host/sdhci-cadence.c
+++ b/drivers/mmc/host/sdhci-cadence.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include

 #include "sdhci-pltfm.h"

@@ -54,6 +55,9 @@
 #define SDHCI_CDNS_PHY_DLY_EMMC_LEGACY	0x06
 #define SDHCI_CDNS_PHY_DLY_EMMC_SDR	0x07
 #define SDHCI_CDNS_PHY_DLY_EMMC_DDR	0x08
+#define SDHCI_CDNS_PHY_DLY_SDCLK	0x0b
+#define SDHCI_CDNS_PHY_DLY_HSMMC	0x0c
+#define SDHCI_CDNS_PHY_DLY_STROBE	0x0d

 /*
  * The tuned val register is 6 bit-wide, but not the whole of the range is
@@ -62,10 +66,24 @@
  */
 #define SDHCI_CDNS_MAX_TUNING_LOOP	40

+static const struct of_device_id sdhci_cdns_match[];
+
 struct sdhci_cdns_priv {
 	void __iomem *hrs_addr;
 };

+struct sdhci_cdns_config {
+	u8 phy_dly_sd_highspeed;
+	u8 phy_dly_sd_legacy;
+	u8 phy_dly_sd_uhs_sdr12;
+	u8 phy_dly_sd_uhs_sdr25;
+	u8 phy_dly_sd_uhs_sdr50;
+	u8 phy_dly_sd_uhs_ddr50;
+	u8 phy_dly_emmc_legacy;
+	u8 phy_dly_emmc_sdr;
+	u8 phy_dly_emmc_ddr;
+};
+
 static int sdhci_cdns_write_phy_reg(struct sdhci_cdns_priv *priv,
 				    u8 addr, u8 data)
 {
@@ -90,13 +108,77 @@ static int sdhci_cdns_write_phy_reg(struct sdhci_cdns_priv *priv,
 	return 0;
 }

-static void sdhci_cdns_phy_init(struct sdhci_cdns_priv *priv)
+static int sdhci_cdns_phy_in_delay_init(struct sdhci_cdns_priv *priv,
+					const struct sdhci_cdns_config *config)
+{
+	int ret = 0;
+
+	ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_SD_HS,
+				       config->phy_dly_sd_highspeed);
+	if (ret)
+		return ret;
+	ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_SD_DEFAULT,
+				       config->phy_dly_sd_legacy);
+	if (ret)
+		return ret;
+	ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_UHS_SDR12,
+				       config->phy_dly_sd_uhs_sdr12);
+	if (ret)
+		return ret;
+	ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_UHS_SDR25,
+				       config->phy_dly_sd_uhs_sdr25);
+	if (ret)
+		return ret;
+	ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_UHS_SDR50,
+				       config->phy_dly_sd_uhs_sdr50);
+	if (ret)
+		return ret;
+	ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_UHS_DDR50,
+				       config->phy_dly_sd_uhs_ddr50);
+	if (ret)
+		return ret;
+	ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_EMMC_LEGACY,
+				       config->phy_dly_emmc_legacy);
+	if (ret)
+		return ret;
+	ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_EMMC_SDR,
+				       config->phy_dly_emmc_sdr);
+	if (ret)
+		return ret;
+	ret = sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_EMMC_DDR,
+				       config->phy_dly_emmc_ddr);
+	if (ret)
+		return ret;
+	return 0;
+}
+
+static int sdhci_cdns_phy_dll_delay_parse_dt(struct device_node *np,
+					     struct sdhci_cdns_priv *priv)
 {
-	sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_SD_HS, 4);
-	sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_SD_DEFAULT, 4);
-	sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_EMMC_LEGACY, 9);
-	sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_EMMC_SDR, 2);
-	sdhci_cdns_write_phy_reg(priv, SDHCI_CDNS_PHY_DLY_EMMC_DDR, 3);
+	u32 tmp;
+	int ret;
+
+	if (!of_property_read_u32(np, "phy-dll-delay-sdclk", &tmp)) {
+		ret = sdhci_cdns_write_phy_reg(priv,
+					       SDHCI_CDNS_PHY_DLY_SDCLK, tmp);
+
+		if (ret)
+			return ret;
+	}
+	if (!of_property_read_u32(np, "phy-dll-delay-sdclk-hsmmc", &tmp)) {
+		ret = sdhci_cdns_write_phy_reg(priv,
+					       SDHCI_CDNS_PHY_DLY_HSMMC, tmp);
+		if (ret)
+			return ret;
+	}
+
[PATCH v2 1/4] mmc: core: Add post_ios_power_on callback for power sequences
Currently, the ->pre_power_on() callback is called at the beginning of
the mmc_power_up() function, before the MMC_POWER_UP and MMC_POWER_ON
sequences, and the ->post_power_on() callback is called at the end of
mmc_power_up(). Some SDIO chipsets require the clock to be gated after
the vqmmc supply is powered on, and the reset line to be toggled after
that. Currently, there is no way to do this.

This commit introduces a new callback, ->post_ios_power_on(), that is
called at the end of the mmc_power_up() function, after the
mmc_set_ios() operation. In this way the entire power sequence can be
done from this function, after the power supply has been enabled.

Signed-off-by: Romain Perier
---
Changes in v2:
- Added missing declaration for mmc_pwrseq_post_ios_power_on when
  CONFIG_OF is disabled.

 drivers/mmc/core/core.c   | 1 +
 drivers/mmc/core/pwrseq.c | 8 ++++++++
 drivers/mmc/core/pwrseq.h | 3 +++
 3 files changed, 12 insertions(+)

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 1076b9d..d73a050 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -1831,6 +1831,7 @@ void mmc_power_up(struct mmc_host *host, u32 ocr)
 	 * time required to reach a stable voltage.
 	 */
 	mmc_delay(10);
+	mmc_pwrseq_post_ios_power_on(host);
 }

 void mmc_power_off(struct mmc_host *host)
diff --git a/drivers/mmc/core/pwrseq.c b/drivers/mmc/core/pwrseq.c
index 9386c47..98f50b7 100644
--- a/drivers/mmc/core/pwrseq.c
+++ b/drivers/mmc/core/pwrseq.c
@@ -68,6 +68,14 @@ void mmc_pwrseq_post_power_on(struct mmc_host *host)
 		pwrseq->ops->post_power_on(host);
 }

+void mmc_pwrseq_post_ios_power_on(struct mmc_host *host)
+{
+	struct mmc_pwrseq *pwrseq = host->pwrseq;
+
+	if (pwrseq && pwrseq->ops->post_ios_power_on)
+		pwrseq->ops->post_ios_power_on(host);
+}
+
 void mmc_pwrseq_power_off(struct mmc_host *host)
 {
 	struct mmc_pwrseq *pwrseq = host->pwrseq;
diff --git a/drivers/mmc/core/pwrseq.h b/drivers/mmc/core/pwrseq.h
index d69e751..ad6e3af 100644
--- a/drivers/mmc/core/pwrseq.h
+++ b/drivers/mmc/core/pwrseq.h
@@ -13,6 +13,7 @@ struct mmc_pwrseq_ops {
 	void (*pre_power_on)(struct mmc_host *host);
 	void (*post_power_on)(struct mmc_host *host);
+	void (*post_ios_power_on)(struct mmc_host *host);
 	void (*power_off)(struct mmc_host *host);
 };
@@ -31,6 +32,7 @@ void mmc_pwrseq_unregister(struct mmc_pwrseq *pwrseq);
 int mmc_pwrseq_alloc(struct mmc_host *host);
 void mmc_pwrseq_pre_power_on(struct mmc_host *host);
 void mmc_pwrseq_post_power_on(struct mmc_host *host);
+void mmc_pwrseq_post_ios_power_on(struct mmc_host *host);
 void mmc_pwrseq_power_off(struct mmc_host *host);
 void mmc_pwrseq_free(struct mmc_host *host);
@@ -44,6 +46,7 @@ static inline void mmc_pwrseq_unregister(struct mmc_pwrseq *pwrseq) {}
 static inline int mmc_pwrseq_alloc(struct mmc_host *host) { return 0; }
 static inline void mmc_pwrseq_pre_power_on(struct mmc_host *host) {}
 static inline void mmc_pwrseq_post_power_on(struct mmc_host *host) {}
+static inline void mmc_pwrseq_post_ios_power_on(struct mmc_host *host) {}
 static inline void mmc_pwrseq_power_off(struct mmc_host *host) {}
 static inline void mmc_pwrseq_free(struct mmc_host *host) {}
-- 
2.9.3
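The null-check-then-dispatch pattern the new hook follows can be modelled stand-alone. The types below are simplified stand-ins for the kernel structures (and the call counter is instrumentation added for this sketch only):

```c
#include <assert.h>
#include <stddef.h>

struct mmc_host;

struct mmc_pwrseq_ops {
	void (*pre_power_on)(struct mmc_host *host);
	void (*post_power_on)(struct mmc_host *host);
	void (*post_ios_power_on)(struct mmc_host *host);	/* optional */
	void (*power_off)(struct mmc_host *host);
};

struct mmc_pwrseq {
	const struct mmc_pwrseq_ops *ops;
};

struct mmc_host {
	struct mmc_pwrseq *pwrseq;
	int post_ios_calls;	/* sketch-only instrumentation */
};

/* Safe even when no pwrseq is attached or the optional hook is NULL. */
static void mmc_pwrseq_post_ios_power_on(struct mmc_host *host)
{
	struct mmc_pwrseq *pwrseq = host->pwrseq;

	if (pwrseq && pwrseq->ops->post_ios_power_on)
		pwrseq->ops->post_ios_power_on(host);
}

/* A provider implementing only the optional hook. */
static void count_post_ios(struct mmc_host *host)
{
	host->post_ios_calls++;
}
```

Because the dispatcher checks both pointers, hosts without a power sequence (or with a pwrseq that leaves the hook unset) pay nothing and cannot crash.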
Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
On Mon, Mar 06, 2017 at 05:00:28PM +0300, Dmitry Safonov wrote:
> 2017-02-21 15:42 GMT+03:00 Kirill A. Shutemov :
> > On Tue, Feb 21, 2017 at 02:54:20PM +0300, Dmitry Safonov wrote:
> >> 2017-02-17 19:50 GMT+03:00 Andy Lutomirski :
> >> > On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov wrote:
> >> >> This patch introduces two new prctl(2) handles to manage maximum virtual
> >> >> address available to userspace to map.
> >> ...
> >> > Anyway, can you and Dmitry try to reconcile your patches?
> >>
> >> So, how can I help that?
> >> Is there the patch's version, on which I could rebase?
> >> Here are BTW the last patches, which I will resend with trivial ifdef-fixup
> >> after the merge window:
> >> http://marc.info/?i=20170214183621.2537-1-dsafonov%20()%20virtuozzo%20!%20com
> >
> > Could you check if this patch collides with anything you do:
> >
> > http://lkml.kernel.org/r/20170220131515.ga9...@node.shutemov.name
>
> Ok, sorry for the late reply - it was the merge window anyway and I've got
> urgent work to do.
>
> Let's see:
>
> I'll need minor merge fixup here:
> > -#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(TASK_SIZE / 3))
> > +#define TASK_UNMAPPED_BASE	(PAGE_ALIGN(DEFAULT_MAP_WINDOW / 3))
> while in my patches:
> > +#define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))
> > +#define TASK_UNMAPPED_BASE	__TASK_UNMAPPED_BASE(TASK_SIZE)
>
> This should be just fine with my changes:
> > -	info.high_limit = end;
> > +	info.high_limit = min(end, DEFAULT_MAP_WINDOW);
>
> This will need another minor fixup:
> > -#define MAX_GAP	(TASK_SIZE/6*5)
> > +#define MAX_GAP	(DEFAULT_MAP_WINDOW/6*5)
> I've moved it from macro to mmap_base() as local var,
> which depends on task_size parameter.
>
> That's all, as far as I can see at this moment.
> Does not seem hard to fix. So I suggest sending patch sets
> in parallel; the second one accepted will rebase the set.
> Is it convenient for you?

Works for me. In fact, I've just sent v4 of the patchset.

-- 
Kirill A. Shutemov
[PATCH v2 3/4] mmc: pwrseq_simple: Add an optional pre-power-on-delay
Some devices need some time between the enablement of their clock and
the assertion of the reset line. When this delay falls between the
pre_power_on and the post_power_on callbacks, an msleep is needed at
the end of the pre_power_on callback. This commit adds an optional DT
property for such devices.

Signed-off-by: Romain Perier
---
 Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt | 2 ++
 drivers/mmc/core/pwrseq_simple.c                            | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt b/Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt
index e254368..821feaaf 100644
--- a/Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt
+++ b/Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.txt
@@ -18,6 +18,8 @@ Optional properties:
   "ext_clock" (External clock provided to the card).
 - post-power-on-delay-ms : Delay in ms after powering the card and
 	de-asserting the reset-gpios (if any)
+- pre-power-on-delay-ms : Delay in ms before powering the card and
+	asserting the reset-gpios (if any)

 Example:

diff --git a/drivers/mmc/core/pwrseq_simple.c b/drivers/mmc/core/pwrseq_simple.c
index e27019f..d8d7166 100644
--- a/drivers/mmc/core/pwrseq_simple.c
+++ b/drivers/mmc/core/pwrseq_simple.c
@@ -27,6 +27,7 @@ struct mmc_pwrseq_simple {
 	struct mmc_pwrseq pwrseq;
 	bool clk_enabled;
 	u32 post_power_on_delay_ms;
+	u32 pre_power_on_delay_ms;
 	struct clk *ext_clk;
 	struct gpio_descs *reset_gpios;
 };
@@ -60,6 +61,9 @@ static void mmc_pwrseq_simple_pre_power_on(struct mmc_host *host)
 	}

 	mmc_pwrseq_simple_set_gpios_value(pwrseq, 1);
+
+	if (pwrseq->pre_power_on_delay_ms)
+		msleep(pwrseq->pre_power_on_delay_ms);
 }

 static void mmc_pwrseq_simple_post_power_on(struct mmc_host *host)
@@ -130,6 +134,8 @@ static int mmc_pwrseq_simple_probe(struct platform_device *pdev)
 	device_property_read_u32(dev, "post-power-on-delay-ms",
 				 &pwrseq->post_power_on_delay_ms);
+	device_property_read_u32(dev, "pre-power-on-delay-ms",
+				 &pwrseq->pre_power_on_delay_ms);

 	pwrseq->pwrseq.dev = dev;
 	if (device_property_read_bool(dev, "post-ios-power-on"))
-- 
2.9.3
[PATCH v2 2/4] mmc: pwrseq-simple: Add optional op. for post_ios_power_on callback
Some devices require their entire power sequence to be performed after
the MMC power supply has been enabled. This can be done by implementing
only the optional post_ios_power_on() callback, which relies on the
pre_power_on/post_power_on functions, the other callbacks being NULL.
We therefore introduce a new DT property, "post-ios-power-on": when this
property is set, the driver uses its post_ios_power_on operations;
otherwise it falls back to the default operations with
pre_power_on/post_power_on.

Signed-off-by: Romain Perier
---
Changes in v2:
- Added missing power_off function in mmc_pwrseq_post_ios_ops

 drivers/mmc/core/pwrseq_simple.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/core/pwrseq_simple.c b/drivers/mmc/core/pwrseq_simple.c
index 1304160..e27019f 100644
--- a/drivers/mmc/core/pwrseq_simple.c
+++ b/drivers/mmc/core/pwrseq_simple.c
@@ -84,12 +84,23 @@ static void mmc_pwrseq_simple_power_off(struct mmc_host *host)
 	}
 }

+static void mmc_pwrseq_simple_post_ios_power_on(struct mmc_host *host)
+{
+	mmc_pwrseq_simple_pre_power_on(host);
+	mmc_pwrseq_simple_post_power_on(host);
+}
+
 static const struct mmc_pwrseq_ops mmc_pwrseq_simple_ops = {
 	.pre_power_on = mmc_pwrseq_simple_pre_power_on,
 	.post_power_on = mmc_pwrseq_simple_post_power_on,
 	.power_off = mmc_pwrseq_simple_power_off,
 };

+static const struct mmc_pwrseq_ops mmc_pwrseq_post_ios_ops = {
+	.post_ios_power_on = mmc_pwrseq_simple_post_ios_power_on,
+	.power_off = mmc_pwrseq_simple_power_off,
+};
+
 static const struct of_device_id mmc_pwrseq_simple_of_match[] = {
 	{ .compatible = "mmc-pwrseq-simple",},
 	{/* sentinel */},
@@ -121,7 +132,10 @@ static int mmc_pwrseq_simple_probe(struct platform_device *pdev)
 				 &pwrseq->post_power_on_delay_ms);

 	pwrseq->pwrseq.dev = dev;
-	pwrseq->pwrseq.ops = &mmc_pwrseq_simple_ops;
+	if (device_property_read_bool(dev, "post-ios-power-on"))
+		pwrseq->pwrseq.ops = &mmc_pwrseq_post_ios_ops;
+	else
+		pwrseq->pwrseq.ops = &mmc_pwrseq_simple_ops;
 	pwrseq->pwrseq.owner = THIS_MODULE;

 	platform_set_drvdata(pdev, pwrseq);
-- 
2.9.3
[PATCH v2 0/4] mmc: pwrseq: post_ios power sequence
Some devices, like the AP6335 WiFi chipset, require a specific power-up
sequence ordering before being used: you must enable the vqmmc power
supply and wait until it reaches its minimum voltage, gate the clock and
wait at least two cycles, and then assert the reset line. See the
datasheet [1]. Currently, there is no generic way of doing this with
pwrseq_simple. This set of patches proposes an approach to support this
use case. It is related to the old patch [2].

1. http://www.t-firefly.com/download/firefly-rk3288/hardware/AP6335%20datasheet_V1.3_02102014.pdf
2. http://lists.infradead.org/pipermail/linux-arm-kernel/2017-March/490681.html

Changes in v2:
- Added missing power_off function in operations for post_ios
- Fixed warning found by 0day-ci about missing mmc_pwrseq_post_ios_power_on
  when CONFIG_OF is disabled.

Romain Perier (4):
  mmc: core: Add post_ios_power_on callback for power sequences
  mmc: pwrseq-simple: Add optional op. for post_ios_power_on callback
  mmc: pwrseq_simple: Add an optional pre-power-on-delay
  arm: dts: rockchip: Enable post_ios_power_on and pre-power-on-delay-ms

 .../devicetree/bindings/mmc/mmc-pwrseq-simple.txt | 2 ++
 arch/arm/boot/dts/rk3288-rock2-square.dts         | 2 ++
 drivers/mmc/core/core.c                           | 1 +
 drivers/mmc/core/pwrseq.c                         | 8 ++++++++
 drivers/mmc/core/pwrseq.h                         | 3 +++
 drivers/mmc/core/pwrseq_simple.c                  | 22 +++++++++++++++++++++-
 6 files changed, 37 insertions(+), 1 deletion(-)

-- 
2.9.3
Re: [PATCH 01/10] x86: assembly, ENTRY for fn, GLOBAL for data
On 03/03/2017, 07:20 PM, h...@zytor.com wrote:
> On March 1, 2017 2:27:54 AM PST, Ingo Molnar wrote:
>>
>> * Thomas Gleixner wrote:
>>
>>> On Wed, 1 Mar 2017, Ingo Molnar wrote:
>>>> * Jiri Slaby wrote:
>>>>> This is a start of series to unify use of ENTRY, ENDPROC, GLOBAL, END,
>>>>> and other macros across x86. When we have all this sorted out, this will
>>>>> help to inject DWARF unwinding info by objtool later.
>>>>>
>>>>> So, let us use the macros this way:
>>>>> * ENTRY -- start of a global function
>>>>> * ENDPROC -- end of a local/global function
>>>>> * GLOBAL -- start of a globally visible data symbol
>>>>> * END -- end of local/global data symbol
>>>>
>>>> So how about using macro names that actually show the purpose, instead
>>>> of importing all the crappy, historic, essentially randomly chosen
>>>> debug symbol macro names from the binutils and older kernels?
>>>>
>>>> Something sane, like:
>>>>
>>>>	SYM__FUNCTION_START
>>>
>>> Sane would be:
>>>
>>>	SYM_FUNCTION_START
>>>
>>> The double underscore is just not giving any value.
>>
>> So the double underscore (at least in my view) has two advantages:
>>
>> 1) it helps separate the prefix from the postfix.
>>
>>    I.e. it's a 'symbols' namespace, and a 'function start', not the
>>    'start' of a 'symbol function'.
>>
>> 2) It also helps easy greppability.
>>
>>    Try this in latest -tip:
>>
>>	git grep e820__
>>
>>    To see all the E820 API calls - with no false positives!
>>
>>    'git grep e820_' on the other hand is a lot less reliable...
>>
>> But no strong feelings either way, I just try to sneak in these small
>> namespace structure tricks when nobody's looking! ;-)
>>
>> Thanks,
>>
>>	Ingo
>
> This seems needlessly verbose to me and clutters the code.
>
> How about:
>
> PROC..ENDPROC, LOCALPROC..ENDPROC and DATA..ENDDATA. Clear, unambiguous
> and balanced.

I tried this, but:

arch/x86/kernel/relocate_kernel_64.S:27:0: warning: "DATA" redefined
 #define DATA(offset)	(KEXEC_CONTROL_CODE_MAX_SIZE+(offset))

I am not saying that I cannot fix it up. I just want to say that these
names might be too generic, especially "PROC" and "DATA". So should I
really stick to these?

thanks,
-- 
js
suse labs
[PATCH 3/3] dt-bindings: mtd: Add Octal SPI support to Cadence QSPI.
This patch updates the Cadence QSPI Device Tree documentation to include
information about a new property used to indicate whether or not Octal
SPI transfers are supported by the device.

Signed-off-by: Artur Jedrysek
---
 Documentation/devicetree/bindings/mtd/cadence-quadspi.txt | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Documentation/devicetree/bindings/mtd/cadence-quadspi.txt b/Documentation/devicetree/bindings/mtd/cadence-quadspi.txt
index f248056..8438184 100644
--- a/Documentation/devicetree/bindings/mtd/cadence-quadspi.txt
+++ b/Documentation/devicetree/bindings/mtd/cadence-quadspi.txt
@@ -14,6 +14,9 @@ Required properties:

 Optional properties:
 - cdns,is-decoded-cs : Flag to indicate whether decoder is used or not.
+- cdns,octal-controller : Flag to indicate that the controller supports Octal
+	SPI transfer mode. May be intentionally omitted to
+	switch it back to Quad SPI mode.

 Optional subnodes:
 Subnodes of the Cadence Quad SPI controller are spi slave nodes with additional
@@ -44,6 +47,7 @@ Example:
 		cdns,fifo-depth = <128>;
 		cdns,fifo-width = <4>;
 		cdns,trigger-address = <0x>;
+		#cdns,octal-controller

 		flash0: n25q00@0 {
 			...
-- 
2.2.2
Re: [PATCH 1/3] futex: remove duplicated code
Hi Jiri,

On Mon, Mar 6, 2017 at 9:46 AM, Jiri Slaby wrote:
> futex: make the encoded_op decoding readable
>
> Decoding of encoded_op is a bit unreadable. It contains shifts to the
> left and to the right by some constants. Make it clearly visible what
> part of the bit mask is taken and shift the values only to the right
> appropriately. And make sure sign extension takes place using
> sign_extend32.
>
> Signed-off-by: Jiri Slaby
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 0ead0756a593..f90314bd42cb 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1461,10 +1461,10 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
>
> static int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
> {
> -	int op = (encoded_op >> 28) & 7;
> -	int cmp = (encoded_op >> 24) & 15;

At least for the two above (modulo 7 vs 15?), the old decoding code
matched the flow of operation in FUTEX_OP().

> -	int oparg = (encoded_op << 8) >> 20;
> -	int cmparg = (encoded_op << 20) >> 20;
> +	int op = (encoded_op & 0x70000000) >> 28;
> +	int cmp = (encoded_op & 0x0f000000) >> 24;
> +	int oparg = sign_extend32((encoded_op & 0x00fff000) >> 12, 12);
> +	int cmparg = sign_extend32(encoded_op & 0x00000fff, 12);
> 	int oldval, ret;

Gr{oetje,eeting}s,

						Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker.
But when I'm talking to journalists I just say "programmer" or something
like that.
				-- Linus Torvalds
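For reference, the two oparg decodings being compared can be checked against each other in userspace. `sign_extend32()` below is a local model of the kernel helper, written for this sketch; note that its second argument is the 0-based position of the sign bit, so a 12-bit field uses index 11:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the kernel's sign_extend32(): sign-extend @value
 * from bit position @index (0-based) to a full 32-bit signed value. */
static inline int32_t sign_extend32(uint32_t value, int index)
{
	uint8_t shift = 31 - index;

	return (int32_t)(value << shift) >> shift;
}

/* Pre-patch form: shift left to drop the top bits, then rely on an
 * arithmetic right shift to sign-extend. Cast through uint32_t so the
 * left shift is well defined in portable C; the kernel original shifted
 * a signed int directly, which is what the rewrite avoids. */
static int oparg_old(int encoded_op)
{
	return (int32_t)((uint32_t)encoded_op << 8) >> 20;
}

/* Post-patch form: explicit mask, shift right, then sign-extend the
 * 12-bit field from its sign bit (bit 11). */
static int oparg_new(int encoded_op)
{
	return sign_extend32(((uint32_t)encoded_op & 0x00fff000) >> 12, 11);
}
```

With index 11 the two forms agree for all 12-bit values, positive and negative, which is the property the mask-based rewrite has to preserve.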
[PATCH v4 3/7] perf/sdt: Allow recording of existing events
Add functionality to fetch matching events from uprobe_events. If no
events are found there, fetch matching events from the probe-cache and
add them to uprobe_events. If all events are already present in
uprobe_events, reuse them. If only some of them are present, add entries
for the missing events and record them. At the end of the record
session, delete the newly added entries.

Below is a detailed algorithm describing the implementation of this
patch:

  arr1 = fetch all sdt events from uprobe_events
  if (event with exact name in arr1)
    add that in sdt_event_list
    return
  if (user has used pattern)
    if (pattern matching entries found from arr1)
      add those events in sdt_event_list
      return
  arr2 = lookup probe-cache
  if (arr2 empty)
    return
  ctr = 0
  foreach (compare entries of arr1 and arr2 using filepath+address)
    if (match)
      add event from arr1 to sdt_event_list
      ctr++
      if (!pattern used)
        remove entry from arr2
  if (!pattern used || ctr == 0)
    add all entries of arr2 in sdt_event_list

Example: Consider the sdt event sdt_libpthread:mutex_release found in
/usr/lib64/libpthread-2.24.so.
$ readelf -n /usr/lib64/libpthread-2.24.so | grep -A2 Provider Provider: libpthread Name: mutex_release Location: 0xb126, Base: 0x000139cc, Semaphore: 0x -- Provider: libpthread Name: mutex_release Location: 0xb2f6, Base: 0x000139cc, Semaphore: 0x -- Provider: libpthread Name: mutex_release Location: 0xb498, Base: 0x000139cc, Semaphore: 0x -- Provider: libpthread Name: mutex_release Location: 0xb596, Base: 0x000139cc, Semaphore: 0x When no probepoint exists, $ sudo ./perf record -a -e sdt_libpthread:mutex_* Warning: Recording on 15 occurrences of sdt_libpthread:mutex_* $ sudo ./perf record -a -e sdt_libpthread:mutex_release Warning: Recording on 4 occurrences of sdt_libpthread:mutex_release $ sudo ./perf evlist sdt_libpthread:mutex_release_3 sdt_libpthread:mutex_release_2 sdt_libpthread:mutex_release_1 sdt_libpthread:mutex_release When probepoints already exists for all matching events, $ sudo ./perf probe sdt_libpthread:mutex_release Added new events: sdt_libpthread:mutex_release (on %mutex_release in /usr/lib64/libpthread-2.24.so) sdt_libpthread:mutex_release_1 (on %mutex_release in /usr/lib64/libpthread-2.24.so) sdt_libpthread:mutex_release_2 (on %mutex_release in /usr/lib64/libpthread-2.24.so) sdt_libpthread:mutex_release_3 (on %mutex_release in /usr/lib64/libpthread-2.24.so) $ sudo ./perf record -a -e sdt_libpthread:mutex_release_1 $ sudo ./perf evlist sdt_libpthread:mutex_release_1 $ sudo ./perf record -a -e sdt_libpthread:mutex_release $ sudo ./perf evlist sdt_libpthread:mutex_release $ sudo ./perf record -a -e sdt_libpthread:mutex_* Warning: Recording on 4 occurrences of sdt_libpthread:mutex_* $ sudo ./perf evlist sdt_libpthread:mutex_release_3 sdt_libpthread:mutex_release_2 sdt_libpthread:mutex_release_1 sdt_libpthread:mutex_release $ sudo ./perf record -a -e sdt_libpthread:mutex_release_* Warning: Recording on 3 occurrences of sdt_libpthread:mutex_release_* When probepoints are partially exists, $ sudo ./perf probe -d sdt_libpthread:mutex_release $ sudo 
./perf probe -d sdt_libpthread:mutex_release_2 $ sudo ./perf record -a -e sdt_libpthread:mutex_release Warning: Recording on 4 occurrences of sdt_libpthread:mutex_release $ sudo ./perf evlist sdt_libpthread:mutex_release sdt_libpthread:mutex_release_3 sdt_libpthread:mutex_release_2 sdt_libpthread:mutex_release_1 $ sudo ./perf record -a -e sdt_libpthread:mutex_release* Warning: Recording on 2 occurrences of sdt_libpthread:mutex_release* $ sudo ./perf evlist sdt_libpthread:mutex_release_3 sdt_libpthread:mutex_release_1 $ sudo ./perf record -a -e sdt_libpthread:* Warning: Recording on 2 occurrences of sdt_libpthread:* $ sudo ./perf evlist sdt_libpthread:mutex_release_3 sdt_libpthread:mutex_release_1 Signed-off-by: Ravi Bangoria --- tools/perf/util/probe-event.c | 58 +- tools/perf/util/probe-event.h | 5 ++ tools/perf/util/probe-file.c | 173 +- tools/perf/util/probe-file.h | 3 +- 4 files changed, 215 insertions(+), 24 deletions(-) diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c index b879076..947b2ec 100644 --- a/tools/perf/util/probe-event.c +++ b/tools/perf/util/probe-event.c @@ -231,7 +231,7 @@ static void clear_perf_probe_point(struct perf_probe_point *pp) free(pp->lazy_line); } -static void clear_probe_trace_events(struct probe_trac
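The reuse-or-add decision at the core of the algorithm above can be modelled as a toy in plain C. The flat arrays and names here are illustrative only; the real code matches cache entries against uprobe_events by filepath plus address and also handles name patterns:

```c
#include <assert.h>
#include <string.h>

#define MAX_EV 8

struct sdt_ev {
	const char *name;	/* e.g. "sdt_libpthread:mutex_release_1" */
	unsigned long addr;	/* probe address within the binary */
};

/*
 * Simplified selection: reuse every existing uprobe event whose address
 * matches a probe-cache entry; entries with no match would be newly
 * added (and deleted again at the end of the record session).
 * Returns the number of events to record; *n_new reports how many of
 * them are newly added.
 */
static int select_events(const struct sdt_ev *existing, int n_existing,
			 const struct sdt_ev *cache, int n_cache,
			 struct sdt_ev *out, int *n_new)
{
	int n_out = 0;

	*n_new = 0;
	for (int i = 0; i < n_cache; i++) {
		int reused = 0;

		for (int j = 0; j < n_existing; j++) {
			if (existing[j].addr == cache[i].addr) {
				out[n_out++] = existing[j];	/* reuse as-is */
				reused = 1;
				break;
			}
		}
		if (!reused) {
			out[n_out++] = cache[i];	/* would go into uprobe_events */
			(*n_new)++;
		}
	}
	return n_out;
}
```

This mirrors the "probepoints partially exist" case in the transcripts: matching addresses keep their existing event names, and only the missing locations produce new entries.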
[PATCHv4 19/33] x86: convert the rest of the code to support p4d_t
This patch converts x86 to use proper folding of new page table level with . That's a bit of kitchen sink, but I don't see how to split it further. Signed-off-by: Kirill A. Shutemov --- arch/x86/include/asm/paravirt.h | 33 +- arch/x86/include/asm/paravirt_types.h | 12 ++- arch/x86/include/asm/pgalloc.h| 35 ++- arch/x86/include/asm/pgtable.h| 59 ++- arch/x86/include/asm/pgtable_64.h | 12 ++- arch/x86/include/asm/pgtable_types.h | 10 +- arch/x86/include/asm/xen/page.h | 8 +- arch/x86/kernel/paravirt.c| 10 +- arch/x86/mm/init_64.c | 183 +++--- arch/x86/xen/mmu.c| 152 include/trace/events/xen.h| 28 +++--- 11 files changed, 401 insertions(+), 141 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 0489884fdc44..158d877ce9e9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -536,7 +536,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud) PVOP_VCALL2(pv_mmu_ops.set_pud, pudp, val); } -#if CONFIG_PGTABLE_LEVELS == 4 +#if CONFIG_PGTABLE_LEVELS >= 4 static inline pud_t __pud(pudval_t val) { pudval_t ret; @@ -565,6 +565,32 @@ static inline pudval_t pud_val(pud_t pud) return ret; } +static inline void pud_clear(pud_t *pudp) +{ + set_pud(pudp, __pud(0)); +} + +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d) +{ + p4dval_t val = native_p4d_val(p4d); + + if (sizeof(p4dval_t) > sizeof(long)) + PVOP_VCALL3(pv_mmu_ops.set_p4d, p4dp, + val, (u64)val >> 32); + else + PVOP_VCALL2(pv_mmu_ops.set_p4d, p4dp, + val); +} + +static inline void p4d_clear(p4d_t *p4dp) +{ + set_p4d(p4dp, __p4d(0)); +} + +#if CONFIG_PGTABLE_LEVELS >= 5 + +#error FIXME + static inline void set_pgd(pgd_t *pgdp, pgd_t pgd) { pgdval_t val = native_pgd_val(pgd); @@ -582,10 +608,7 @@ static inline void pgd_clear(pgd_t *pgdp) set_pgd(pgdp, __pgd(0)); } -static inline void pud_clear(pud_t *pudp) -{ - set_pud(pudp, __pud(0)); -} +#endif /* CONFIG_PGTABLE_LEVELS == 5 */ #endif /* CONFIG_PGTABLE_LEVELS == 4 */ diff --git 
a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index b060f962d581..93c49cf09b63 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -279,12 +279,18 @@ struct pv_mmu_ops { struct paravirt_callee_save pmd_val; struct paravirt_callee_save make_pmd; -#if CONFIG_PGTABLE_LEVELS == 4 +#if CONFIG_PGTABLE_LEVELS >= 4 struct paravirt_callee_save pud_val; struct paravirt_callee_save make_pud; - void (*set_pgd)(pgd_t *pudp, pgd_t pgdval); -#endif /* CONFIG_PGTABLE_LEVELS == 4 */ + void (*set_p4d)(p4d_t *p4dp, p4d_t p4dval); + +#if CONFIG_PGTABLE_LEVELS >= 5 +#error FIXME +#endif /* CONFIG_PGTABLE_LEVELS >= 5 */ + +#endif /* CONFIG_PGTABLE_LEVELS >= 4 */ + #endif /* CONFIG_PGTABLE_LEVELS >= 3 */ struct pv_lazy_ops lazy_mode; diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h index b6d425999f99..2f585054c63c 100644 --- a/arch/x86/include/asm/pgalloc.h +++ b/arch/x86/include/asm/pgalloc.h @@ -121,10 +121,10 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) #endif /* CONFIG_X86_PAE */ #if CONFIG_PGTABLE_LEVELS > 3 -static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud) +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud) { paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT); - set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pud))); + set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud))); } static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) @@ -150,6 +150,37 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pud, ___pud_free_tlb(tlb, pud); } +#if CONFIG_PGTABLE_LEVELS > 4 +static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d) +{ + paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT); + set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d))); +} + +static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr) +{ + gfp_t gfp = GFP_KERNEL_ACCOUNT; + + if (mm == 
&init_mm) + gfp &= ~__GFP_ACCOUNT; + return (p4d_t *)get_zeroed_page(gfp); +} + +static inline void p4d_free(struct mm_struct *mm, p4d_t *p4d) +{ + BUG_ON((unsigned long)p4d & (PAGE_SIZE-1)); + free_page((unsigned long)p4d); +} + +extern void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d); + +static inline void __p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d, + unsigned long address) +{ + ___p4d_free_tlb(tlb,
[PATCHv4 02/33] asm-generic: introduce 5level-fixup.h
We are going to switch core MM to 5-level paging abstraction. This is preparation step which adds As with 4level-fixup.h, the new header allows quickly make all architectures compatible with 5-level paging in core MM. In long run we would like to switch architectures to properly folded p4d level by using , but it requires more changes to arch-specific code. Signed-off-by: Kirill A. Shutemov --- include/asm-generic/4level-fixup.h | 3 ++- include/asm-generic/5level-fixup.h | 41 ++ include/linux/mm.h | 3 +++ 3 files changed, 46 insertions(+), 1 deletion(-) create mode 100644 include/asm-generic/5level-fixup.h diff --git a/include/asm-generic/4level-fixup.h b/include/asm-generic/4level-fixup.h index 5bdab6bffd23..928fd66b1271 100644 --- a/include/asm-generic/4level-fixup.h +++ b/include/asm-generic/4level-fixup.h @@ -15,7 +15,6 @@ ((unlikely(pgd_none(*(pud))) && __pmd_alloc(mm, pud, address))? \ NULL: pmd_offset(pud, address)) -#define pud_alloc(mm, pgd, address)(pgd) #define pud_offset(pgd, start) (pgd) #define pud_none(pud) 0 #define pud_bad(pud) 0 @@ -35,4 +34,6 @@ #undef pud_addr_end #define pud_addr_end(addr, end)(end) +#include + #endif diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h new file mode 100644 index ..b5ca82dc4175 --- /dev/null +++ b/include/asm-generic/5level-fixup.h @@ -0,0 +1,41 @@ +#ifndef _5LEVEL_FIXUP_H +#define _5LEVEL_FIXUP_H + +#define __ARCH_HAS_5LEVEL_HACK +#define __PAGETABLE_P4D_FOLDED + +#define P4D_SHIFT PGDIR_SHIFT +#define P4D_SIZE PGDIR_SIZE +#define P4D_MASK PGDIR_MASK +#define PTRS_PER_P4D 1 + +#define p4d_t pgd_t + +#define pud_alloc(mm, p4d, address) \ + ((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? 
\ + NULL : pud_offset(p4d, address)) + +#define p4d_alloc(mm, pgd, address)(pgd) +#define p4d_offset(pgd, start) (pgd) +#define p4d_none(p4d) 0 +#define p4d_bad(p4d) 0 +#define p4d_present(p4d) 1 +#define p4d_ERROR(p4d) do { } while (0) +#define p4d_clear(p4d) pgd_clear(p4d) +#define p4d_val(p4d) pgd_val(p4d) +#define p4d_populate(mm, p4d, pud) pgd_populate(mm, p4d, pud) +#define p4d_page(p4d) pgd_page(p4d) +#define p4d_page_vaddr(p4d)pgd_page_vaddr(p4d) + +#define __p4d(x) __pgd(x) +#define set_p4d(p4dp, p4d) set_pgd(p4dp, p4d) + +#undef p4d_free_tlb +#define p4d_free_tlb(tlb, x, addr) do { } while (0) +#define p4d_free(mm, x)do { } while (0) +#define __p4d_free_tlb(tlb, x, addr) do { } while (0) + +#undef p4d_addr_end +#define p4d_addr_end(addr, end)(end) + +#endif diff --git a/include/linux/mm.h b/include/linux/mm.h index 0d65dd72c0f4..be1fe264eb37 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1619,11 +1619,14 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address); * Remove it when 4level-fixup.h has been removed. */ #if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK) + +#ifndef __ARCH_HAS_5LEVEL_HACK static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address) { return (unlikely(pgd_none(*pgd)) && __pud_alloc(mm, pgd, address))? NULL: pud_offset(pgd, address); } +#endif /* !__ARCH_HAS_5LEVEL_HACK */ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) { -- 2.11.0
[PATCHv4 28/33] x86/mm: add support for an additional page table level during early boot
This patch adds support for 5-level paging during early boot. It generalizes boot for 4- and 5-level paging on 64-bit systems with compile-time switch between them. Signed-off-by: Kirill A. Shutemov --- arch/x86/boot/compressed/head_64.S | 23 +-- arch/x86/include/asm/pgtable.h | 2 +- arch/x86/include/asm/pgtable_64.h | 6 ++- arch/x86/include/uapi/asm/processor-flags.h | 2 + arch/x86/kernel/espfix_64.c | 2 +- arch/x86/kernel/head64.c| 40 +- arch/x86/kernel/head_64.S | 63 + arch/x86/kernel/machine_kexec_64.c | 2 +- arch/x86/mm/dump_pagetables.c | 2 +- arch/x86/mm/kasan_init_64.c | 12 +++--- arch/x86/realmode/init.c| 2 +- arch/x86/xen/mmu.c | 38 ++--- 12 files changed, 135 insertions(+), 59 deletions(-) diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S index d2ae1f821e0c..3ed26769810b 100644 --- a/arch/x86/boot/compressed/head_64.S +++ b/arch/x86/boot/compressed/head_64.S @@ -122,9 +122,12 @@ ENTRY(startup_32) addl%ebp, gdt+2(%ebp) lgdtgdt(%ebp) - /* Enable PAE mode */ + /* Enable PAE and LA57 mode */ movl%cr4, %eax orl $X86_CR4_PAE, %eax +#ifdef CONFIG_X86_5LEVEL + orl $X86_CR4_LA57, %eax +#endif movl%eax, %cr4 /* @@ -136,13 +139,24 @@ ENTRY(startup_32) movl$(BOOT_INIT_PGT_SIZE/4), %ecx rep stosl + xorl%edx, %edx + + /* Build Top Level */ + lealpgtable(%ebx,%edx,1), %edi + leal0x1007 (%edi), %eax + movl%eax, 0(%edi) + +#ifdef CONFIG_X86_5LEVEL /* Build Level 4 */ - lealpgtable + 0(%ebx), %edi + addl$0x1000, %edx + lealpgtable(%ebx,%edx), %edi leal0x1007 (%edi), %eax movl%eax, 0(%edi) +#endif /* Build Level 3 */ - lealpgtable + 0x1000(%ebx), %edi + addl$0x1000, %edx + lealpgtable(%ebx,%edx), %edi leal0x1007(%edi), %eax movl$4, %ecx 1: movl%eax, 0x00(%edi) @@ -152,7 +166,8 @@ ENTRY(startup_32) jnz 1b /* Build Level 2 */ - lealpgtable + 0x2000(%ebx), %edi + addl$0x1000, %edx + lealpgtable(%ebx,%edx), %edi movl$0x0183, %eax movl$2048, %ecx 1: movl%eax, 0(%edi) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h 
index 90f32116acd8..6cefd861ac65 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -917,7 +917,7 @@ extern pgd_t trampoline_pgd_entry; static inline void __meminit init_trampoline_default(void) { /* Default trampoline pgd value */ - trampoline_pgd_entry = init_level4_pgt[pgd_index(__PAGE_OFFSET)]; + trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)]; } # ifdef CONFIG_RANDOMIZE_MEMORY void __meminit init_trampoline(void); diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 9991224f6238..c9e41f1599dd 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -14,15 +14,17 @@ #include #include +extern p4d_t level4_kernel_pgt[512]; +extern p4d_t level4_ident_pgt[512]; extern pud_t level3_kernel_pgt[512]; extern pud_t level3_ident_pgt[512]; extern pmd_t level2_kernel_pgt[512]; extern pmd_t level2_fixmap_pgt[512]; extern pmd_t level2_ident_pgt[512]; extern pte_t level1_fixmap_pgt[512]; -extern pgd_t init_level4_pgt[]; +extern pgd_t init_top_pgt[]; -#define swapper_pg_dir init_level4_pgt +#define swapper_pg_dir init_top_pgt extern void paging_init(void); diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h index 567de50a4c2a..185f3d10c194 100644 --- a/arch/x86/include/uapi/asm/processor-flags.h +++ b/arch/x86/include/uapi/asm/processor-flags.h @@ -104,6 +104,8 @@ #define X86_CR4_OSFXSR _BITUL(X86_CR4_OSFXSR_BIT) #define X86_CR4_OSXMMEXCPT_BIT 10 /* enable unmasked SSE exceptions */ #define X86_CR4_OSXMMEXCPT _BITUL(X86_CR4_OSXMMEXCPT_BIT) +#define X86_CR4_LA57_BIT 12 /* enable 5-level page tables */ +#define X86_CR4_LA57 _BITUL(X86_CR4_LA57_BIT) #define X86_CR4_VMXE_BIT 13 /* enable VMX virtualization */ #define X86_CR4_VMXE _BITUL(X86_CR4_VMXE_BIT) #define X86_CR4_SMXE_BIT 14 /* enable safer mode (TXT) */ diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c index 8e598a1ad986..6b91e2eb8d3f 
100644 --- a/arch/x86/kernel/espfix_64.c +++ b/arch/x86/kernel/espfix_64.c @@ -125,7 +125,7 @@ void __init init_espfix_bsp(void) p4d_t *p4d; /* Install the espfix pud into the kernel page directory */ - pgd = &init_level4_pgt[pgd_index(
Re: Build regressions/improvements in v4.11-rc1
On Mon, Mar 6, 2017 at 2:59 PM, Geert Uytterhoeven wrote: > Below is the list of build error/warning regressions/improvements in > v4.11-rc1[1] compared to v4.10[2]. > > Summarized: > - build errors: +19/-1 > [1] > http://kisskb.ellerman.id.au/kisskb/head/c1ae3cfa0e89fa1a7ecc4c99031f5e9ae99d9201/ > (all 266 configs) > 19 error regressions: > + /home/kisskb/slave/src/arch/avr32/oprofile/backtrace.c: error: > dereferencing pointer to incomplete type: => 58 > + /home/kisskb/slave/src/arch/avr32/oprofile/backtrace.c: error: implicit > declaration of function 'user_mode': => 60 > + /home/kisskb/slave/src/arch/mips/cavium-octeon/cpu.c: error: implicit > declaration of function 'task_stack_page' > [-Werror=implicit-function-declaration]: => 31:3 > + /home/kisskb/slave/src/arch/mips/cavium-octeon/cpu.c: error: invalid > application of 'sizeof' to incomplete type 'struct pt_regs' : => 31:3 > + /home/kisskb/slave/src/arch/mips/cavium-octeon/crypto/octeon-crypto.c: > error: implicit declaration of function 'task_stack_page' > [-Werror=implicit-function-declaration]: => 35:6 > + /home/kisskb/slave/src/arch/mips/cavium-octeon/smp.c: error: implicit > declaration of function 'task_stack_page' > [-Werror=implicit-function-declaration]: => 214:2 > + /home/kisskb/slave/src/arch/mips/include/asm/fpu.h: error: invalid > application of 'sizeof' to incomplete type 'struct pt_regs' : => 140:3, > 188:2, 138:3, 136:2 > + /home/kisskb/slave/src/arch/mips/include/asm/processor.h: error: invalid > application of 'sizeof' to incomplete type 'struct pt_regs': => 385:31 > + /home/kisskb/slave/src/arch/mips/kernel/smp-mt.c: error: implicit > declaration of function 'task_stack_page' > [-Werror=implicit-function-declaration]: => 215:2 > + /home/kisskb/slave/src/arch/mips/sgi-ip27/ip27-berr.c: error: > dereferencing pointer to incomplete type: => 59:17, 66:13 > + /home/kisskb/slave/src/arch/mips/sgi-ip27/ip27-berr.c: error: implicit > declaration of function 'force_sig' 
[-Werror=implicit-function-declaration]: > => 75:2 > + /home/kisskb/slave/src/arch/mips/sgi-ip32/ip32-berr.c: error: implicit > declaration of function 'force_sig' [-Werror=implicit-function-declaration]: > => 31:2 > + /home/kisskb/slave/src/arch/openrisc/include/asm/atomic.h: Error: unknown > opcode2 `l.lwa'.: => 70, 107, 69 > + /home/kisskb/slave/src/arch/openrisc/include/asm/atomic.h: Error: unknown > opcode2 `l.swa'.: => 72, 71, 111 > + /home/kisskb/slave/src/arch/openrisc/include/asm/bitops/atomic.h: Error: > unknown opcode2 `l.lwa'.: => 18, 35, 70, 90 > + /home/kisskb/slave/src/arch/openrisc/include/asm/bitops/atomic.h: Error: > unknown opcode2 `l.swa'.: => 20, 37, 92, 72 > + /home/kisskb/slave/src/arch/openrisc/include/asm/cmpxchg.h: Error: > unknown opcode2 `l.lwa'.: => 68, 30 > + /home/kisskb/slave/src/arch/openrisc/include/asm/cmpxchg.h: Error: > unknown opcode2 `l.swa'.: => 34, 69 > + /home/kisskb/slave/src/drivers/char/nwbutton.c: error: implicit > declaration of function 'kill_cad_pid' > [-Werror=implicit-function-declaration]: => 134:3 CC mingo ;-) Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
[PATCHv4 22/33] x86/mm: define virtual memory map for 5-level paging
The first part of memory map (up to %esp fixup) simply scales existing map for 4-level paging by factor of 9 -- number of bits addressed by additional page table level. The rest of the map is unchanged. Signed-off-by: Kirill A. Shutemov --- Documentation/x86/x86_64/mm.txt | 33 ++--- arch/x86/Kconfig| 1 + arch/x86/include/asm/kasan.h| 9 ++--- arch/x86/include/asm/page_64_types.h| 10 ++ arch/x86/include/asm/pgtable_64_types.h | 6 ++ arch/x86/include/asm/sparsemem.h| 9 +++-- 6 files changed, 60 insertions(+), 8 deletions(-) diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt index 5724092db811..0303a47b82f8 100644 --- a/Documentation/x86/x86_64/mm.txt +++ b/Documentation/x86/x86_64/mm.txt @@ -4,7 +4,7 @@ Virtual memory map with 4 level page tables: - 7fff (=47 bits) user space, different per mm -hole caused by [48:63] sign extension +hole caused by [47:63] sign extension 8000 - 87ff (=43 bits) guard hole, reserved for hypervisor 8800 - c7ff (=64 TB) direct mapping of all phys. memory c800 - c8ff (=40 bits) hole @@ -23,12 +23,39 @@ a000 - ff5f (=1526 MB) module mapping space ff60 - ffdf (=8 MB) vsyscalls ffe0 - (=2 MB) unused hole +Virtual memory map with 5 level page tables: + + - 00ff (=56 bits) user space, different per mm +hole caused by [56:63] sign extension +ff00 - ff0f (=52 bits) guard hole, reserved for hypervisor +ff10 - ff8f (=55 bits) direct mapping of all phys. memory +ff90 - ff91 (=49 bits) hole +ff92 - ffd1 (=54 bits) vmalloc/ioremap space +ffd2 - ffd3 (=49 bits) hole +ffd4 - ffd5 (=49 bits) virtual memory map (512TB) +... unused hole ... +ffd8 - fff7 (=53 bits) kasan shadow memory (8PB) +... unused hole ... +fffe - fffe007f (=39 bits) %esp fixup stacks +... unused hole ... +ffef - fffe (=64 GB) EFI region mapping space +... unused hole ... 
+8000 - 9fff (=512 MB) kernel text mapping, from phys 0 +a000 - ff5f (=1526 MB) module mapping space +ff60 - ffdf (=8 MB) vsyscalls +ffe0 - (=2 MB) unused hole + +Architecture defines a 64-bit virtual address. Implementations can support +less. Currently supported are 48- and 57-bit virtual addresses. Bits 63 +through to the most-significant implemented bit are set to either all ones +or all zero. This causes hole between user space and kernel addresses. + The direct mapping covers all memory in the system up to the highest memory address (this means in some cases it can also include PCI memory holes). -vmalloc space is lazily synchronized into the different PML4 pages of -the processes using the page fault handler, with init_level4_pgt as +vmalloc space is lazily synchronized into the different PML4/PML5 pages of +the processes using the page fault handler, with init_top_pgt as reference. Current X86-64 implementations support up to 46 bits of address space (64 TB), diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index cc98d5a294ee..747f06f00a22 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -290,6 +290,7 @@ config ARCH_SUPPORTS_DEBUG_PAGEALLOC config KASAN_SHADOW_OFFSET hex depends on KASAN + default 0xdff8 if X86_5LEVEL default 0xdc00 config HAVE_INTEL_TXT diff --git a/arch/x86/include/asm/kasan.h b/arch/x86/include/asm/kasan.h index 1410b567ecde..f527b02a0ee3 100644 --- a/arch/x86/include/asm/kasan.h +++ b/arch/x86/include/asm/kasan.h @@ -11,9 +11,12 @@ * 'kernel address space start' >> KASAN_SHADOW_SCALE_SHIFT */ #define KASAN_SHADOW_START (KASAN_SHADOW_OFFSET + \ - (0x8000ULL >> 3)) -/* 47 bits for kernel address -> (47 - 3) bits for shadow */ -#define KASAN_SHADOW_END(KASAN_SHADOW_START + (1ULL << (47 - 3))) + ((-1UL << __VIRTUAL_MASK_SHIFT) >> 3)) +/* + * 47 bits for kernel address -> (47 - 3) bits for shadow + * 56 bits for kernel address -> (56 - 3) bits for shadow + */ +#define KASAN_SHADOW_END(KASAN_SHADOW_START + (1ULL << 
(__VIRTUAL_MASK_SHIFT - 3))) #ifndef __ASSEMBLY__ diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 9215e0527647..3f5f08b010d0 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -36,7 +36,12 @@ * hypervisor to fit. Choosing 16 slots here is arbitrary, but it's * what Xen re
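The sign-extension holes in the memory map above follow directly from the implemented address width: bits from the most-significant implemented bit up to bit 63 must all be copies of that bit. A small sketch that computes the hole boundaries for 48-bit (4-level) and 57-bit (5-level) virtual addresses:

```c
#include <assert.h>
#include <stdint.h>

/* For an implementation with `bits` of virtual address, the last
 * valid (canonical) user address and the first kernel address are
 * determined by sign extension from bit bits-1. */
static uint64_t last_user_addr(int bits)    { return (1ULL << (bits - 1)) - 1; }
static uint64_t first_kernel_addr(int bits) { return ~0ULL << (bits - 1); }
```

With bits=48 this reproduces the 0x00007fffffffffff / 0xffff800000000000 boundary of the 4-level map, and with bits=57 the 0x00ffffffffffffff / 0xff00000000000000 boundary of the 5-level map.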
[PATCHv4 23/33] x86/paravirt: make paravirt code support 5-level paging
Add operations to allocate/release p4ds. TODO: cover XEN. Not-yet-Signed-off-by: Kirill A. Shutemov --- arch/x86/include/asm/paravirt.h | 44 +++ arch/x86/include/asm/paravirt_types.h | 7 +- arch/x86/include/asm/pgalloc.h| 2 ++ arch/x86/kernel/paravirt.c| 9 +-- 4 files changed, 55 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 158d877ce9e9..677edf3b6421 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -357,6 +357,16 @@ static inline void paravirt_release_pud(unsigned long pfn) PVOP_VCALL1(pv_mmu_ops.release_pud, pfn); } +static inline void paravirt_alloc_p4d(struct mm_struct *mm, unsigned long pfn) +{ + PVOP_VCALL2(pv_mmu_ops.alloc_p4d, mm, pfn); +} + +static inline void paravirt_release_p4d(unsigned long pfn) +{ + PVOP_VCALL1(pv_mmu_ops.release_p4d, pfn); +} + static inline void pte_update(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { @@ -582,14 +592,35 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d) val); } -static inline void p4d_clear(p4d_t *p4dp) +#if CONFIG_PGTABLE_LEVELS >= 5 + +static inline p4d_t __p4d(p4dval_t val) { - set_p4d(p4dp, __p4d(0)); + p4dval_t ret; + + if (sizeof(p4dval_t) > sizeof(long)) + ret = PVOP_CALLEE2(p4dval_t, pv_mmu_ops.make_p4d, + val, (u64)val >> 32); + else + ret = PVOP_CALLEE1(p4dval_t, pv_mmu_ops.make_p4d, + val); + + return (p4d_t) { ret }; } -#if CONFIG_PGTABLE_LEVELS >= 5 +static inline p4dval_t p4d_val(p4d_t p4d) +{ + p4dval_t ret; + + if (sizeof(p4dval_t) > sizeof(long)) + ret = PVOP_CALLEE2(p4dval_t, pv_mmu_ops.p4d_val, + p4d.p4d, (u64)p4d.p4d >> 32); + else + ret = PVOP_CALLEE1(p4dval_t, pv_mmu_ops.p4d_val, + p4d.p4d); -#error FIXME + return ret; +} static inline void set_pgd(pgd_t *pgdp, pgd_t pgd) { @@ -610,6 +641,11 @@ static inline void pgd_clear(pgd_t *pgdp) #endif /* CONFIG_PGTABLE_LEVELS == 5 */ +static inline void p4d_clear(p4d_t *p4dp) +{ + set_p4d(p4dp, __p4d(0)); +} + #endif /* 
CONFIG_PGTABLE_LEVELS == 4 */ #endif /* CONFIG_PGTABLE_LEVELS >= 3 */ diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 93c49cf09b63..7465d6fe336f 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -238,9 +238,11 @@ struct pv_mmu_ops { void (*alloc_pte)(struct mm_struct *mm, unsigned long pfn); void (*alloc_pmd)(struct mm_struct *mm, unsigned long pfn); void (*alloc_pud)(struct mm_struct *mm, unsigned long pfn); + void (*alloc_p4d)(struct mm_struct *mm, unsigned long pfn); void (*release_pte)(unsigned long pfn); void (*release_pmd)(unsigned long pfn); void (*release_pud)(unsigned long pfn); + void (*release_p4d)(unsigned long pfn); /* Pagetable manipulation functions */ void (*set_pte)(pte_t *ptep, pte_t pteval); @@ -286,7 +288,10 @@ struct pv_mmu_ops { void (*set_p4d)(p4d_t *p4dp, p4d_t p4dval); #if CONFIG_PGTABLE_LEVELS >= 5 -#error FIXME + struct paravirt_callee_save p4d_val; + struct paravirt_callee_save make_p4d; + + void (*set_pgd)(pgd_t *pgdp, pgd_t pgdval); #endif /* CONFIG_PGTABLE_LEVELS >= 5 */ #endif /* CONFIG_PGTABLE_LEVELS >= 4 */ diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h index 2f585054c63c..b2d0cd8288aa 100644 --- a/arch/x86/include/asm/pgalloc.h +++ b/arch/x86/include/asm/pgalloc.h @@ -17,9 +17,11 @@ static inline void paravirt_alloc_pmd(struct mm_struct *mm, unsigned long pfn) { static inline void paravirt_alloc_pmd_clone(unsigned long pfn, unsigned long clonepfn, unsigned long start, unsigned long count) {} static inline void paravirt_alloc_pud(struct mm_struct *mm, unsigned long pfn) {} +static inline void paravirt_alloc_p4d(struct mm_struct *mm, unsigned long pfn) {} static inline void paravirt_release_pte(unsigned long pfn) {} static inline void paravirt_release_pmd(unsigned long pfn) {} static inline void paravirt_release_pud(unsigned long pfn) {} +static inline void paravirt_release_p4d(unsigned long pfn) {} #endif 
/* diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 110daf22f5c7..3586996fc50d 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -405,9 +405,11 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = { .alloc_pte = paravirt_nop, .alloc_pmd = paravirt_nop, .alloc_pud = paravirt_nop, + .alloc_p4d = paravirt_nop, .release_
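The new alloc_p4d/release_p4d hooks follow the usual paravirt pattern: a table of function pointers that default to no-ops on bare metal and can be replaced by a hypervisor. A reduced toy model of that pattern (invented `toy_` names; the real pv_mmu_ops carries many more hooks and mm/pfn arguments):

```c
#include <assert.h>
#include <stddef.h>

/* Reduced sketch of the pv_mmu_ops idea: per-level hooks, no-op by
 * default, overridable at runtime by a hypervisor backend. */
struct toy_pv_mmu_ops {
    void (*alloc_p4d)(unsigned long pfn);
    void (*release_p4d)(unsigned long pfn);
};

static void toy_nop(unsigned long pfn) { (void)pfn; }  /* plays paravirt_nop */

static unsigned long toy_last_released;
static void toy_hv_release_p4d(unsigned long pfn) { toy_last_released = pfn; }

static struct toy_pv_mmu_ops toy_ops = {
    .alloc_p4d   = toy_nop,
    .release_p4d = toy_nop,
};
```

On native hardware the no-op defaults are patched away; a hypervisor backend simply installs its own functions into the ops table.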
[PATCHv4 01/33] x86/cpufeature: Add 5-level paging detection
Look for 'la57' in /proc/cpuinfo to see if your machine supports 5-level paging. Signed-off-by: Kirill A. Shutemov --- arch/x86/include/asm/cpufeatures.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index 4e7772387c6e..b04bb6dfed7f 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -289,7 +289,8 @@ #define X86_FEATURE_PKU(16*32+ 3) /* Protection Keys for Userspace */ #define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */ #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* POPCNT for vectors of DW/QW */ -#define X86_FEATURE_RDPID (16*32+ 22) /* RDPID instruction */ +#define X86_FEATURE_LA57 (16*32+16) /* 5-level page tables */ +#define X86_FEATURE_RDPID (16*32+22) /* RDPID instruction */ /* AMD-defined CPU features, CPUID level 0x8007 (ebx), word 17 */ #define X86_FEATURE_OVERFLOW_RECOV (17*32+0) /* MCA overflow recovery support */ -- 2.11.0
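The feature encoding used above packs the CPUID word and bit into one constant: X86_FEATURE_LA57 is 16*32+16, i.e. bit 16 of feature word 16. A quick check of that arithmetic (the `TOY_` macro just restates the value from the patch):

```c
#include <assert.h>

/* Feature constants encode word*32 + bit; word is the upper part,
 * bit position is the low 5 bits. */
#define TOY_X86_FEATURE_LA57 (16*32 + 16)

static int feature_word(int f) { return f >> 5; }
static int feature_bit(int f)  { return f & 31; }
```

This is the same decomposition the mask macros rely on, e.g. `1 << (X86_FEATURE_LA57 & 31)` in later patches of the series.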
[PATCHv4 30/33] x86/mm: make kernel_physical_mapping_init() support 5-level paging
Properly populate the additional page table level if CONFIG_X86_5LEVEL is enabled. Signed-off-by: Kirill A. Shutemov --- arch/x86/mm/init_64.c | 71 --- 1 file changed, 62 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 5ba99090dc3c..ef117a69f74e 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -622,6 +622,58 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end, return paddr_last; } +static unsigned long __meminit +phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end, + unsigned long page_size_mask) +{ + unsigned long paddr_next, paddr_last = paddr_end; + unsigned long vaddr = (unsigned long)__va(paddr); + int i = p4d_index(vaddr); + + if (!IS_ENABLED(CONFIG_X86_5LEVEL)) + return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end, page_size_mask); + + for (; i < PTRS_PER_P4D; i++, paddr = paddr_next) { + p4d_t *p4d; + pud_t *pud; + + vaddr = (unsigned long)__va(paddr); + p4d = p4d_page + p4d_index(vaddr); + paddr_next = (paddr & P4D_MASK) + P4D_SIZE; + + if (paddr >= paddr_end) { + if (!after_bootmem && + !e820_any_mapped(paddr & P4D_MASK, paddr_next, +E820_RAM) && + !e820_any_mapped(paddr & P4D_MASK, paddr_next, +E820_RESERVED_KERN)) { + set_p4d(p4d, __p4d(0)); + } + continue; + } + + if (!p4d_none(*p4d)) { + pud = pud_offset(p4d, 0); + paddr_last = phys_pud_init(pud, paddr, + paddr_end, + page_size_mask); + __flush_tlb_all(); + continue; + } + + pud = alloc_low_page(); + paddr_last = phys_pud_init(pud, paddr, paddr_end, + page_size_mask); + + spin_lock(&init_mm.page_table_lock); + p4d_populate(&init_mm, p4d, pud); + spin_unlock(&init_mm.page_table_lock); + } + __flush_tlb_all(); + + return paddr_last; +} + /* * Create page table mapping for the physical memory for specific physical * addresses. 
The virtual and physical addresses have to be aligned on PMD level @@ -643,26 +695,27 @@ kernel_physical_mapping_init(unsigned long paddr_start, for (; vaddr < vaddr_end; vaddr = vaddr_next) { pgd_t *pgd = pgd_offset_k(vaddr); p4d_t *p4d; - pud_t *pud; vaddr_next = (vaddr & PGDIR_MASK) + PGDIR_SIZE; - BUILD_BUG_ON(pgd_none(*pgd)); - p4d = p4d_offset(pgd, vaddr); - if (p4d_val(*p4d)) { - pud = (pud_t *)p4d_page_vaddr(*p4d); - paddr_last = phys_pud_init(pud, __pa(vaddr), + if (pgd_val(*pgd)) { + p4d = (p4d_t *)pgd_page_vaddr(*pgd); + paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end), page_size_mask); continue; } - pud = alloc_low_page(); - paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end), + p4d = alloc_low_page(); + paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end), page_size_mask); spin_lock(&init_mm.page_table_lock); - p4d_populate(&init_mm, p4d, pud); + if (IS_ENABLED(CONFIG_X86_5LEVEL)) + pgd_populate(&init_mm, pgd, p4d); + else + p4d_populate(&init_mm, p4d_offset(pgd, vaddr), + (pud_t *) p4d); spin_unlock(&init_mm.page_table_lock); pgd_changed = true; } -- 2.11.0
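The p4d_index()/P4D_MASK arithmetic that drives the phys_p4d_init() loop is plain shift-and-mask math: with 4 KiB pages and 9 bits per level, P4D_SHIFT is 39 and each p4d table holds 512 entries. A standalone restatement of those constants (the `TOY_` names are local to this sketch):

```c
#include <assert.h>
#include <stdint.h>

/* Shifts for x86-64 with CONFIG_X86_5LEVEL: 12-bit page offset plus
 * 9 bits per level -> P4D_SHIFT = 39, PGDIR_SHIFT = 48. */
#define TOY_P4D_SHIFT    39
#define TOY_PTRS_PER_P4D 512
#define TOY_P4D_SIZE     (1ULL << TOY_P4D_SHIFT)
#define TOY_P4D_MASK     (~(TOY_P4D_SIZE - 1))

/* index of the p4d entry covering vaddr, as p4d_index() computes it */
static unsigned long toy_p4d_index(uint64_t vaddr)
{
    return (vaddr >> TOY_P4D_SHIFT) & (TOY_PTRS_PER_P4D - 1);
}
```

`paddr_next = (paddr & P4D_MASK) + P4D_SIZE` in the loop above then simply advances to the start of the next 512 GiB-aligned region.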
[PATCHv4 05/33] asm-generic: introduce <asm-generic/pgtable-nop4d.h>
Like with pgtable-nopud.h for 4-level paging, this new header is base for converting an architectures to properly folded p4d_t level. Signed-off-by: Kirill A. Shutemov --- include/asm-generic/pgtable-nop4d.h | 56 + include/asm-generic/pgtable-nopud.h | 43 ++-- include/asm-generic/tlb.h | 14 -- 3 files changed, 89 insertions(+), 24 deletions(-) create mode 100644 include/asm-generic/pgtable-nop4d.h diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h new file mode 100644 index ..de364ecb8df6 --- /dev/null +++ b/include/asm-generic/pgtable-nop4d.h @@ -0,0 +1,56 @@ +#ifndef _PGTABLE_NOP4D_H +#define _PGTABLE_NOP4D_H + +#ifndef __ASSEMBLY__ + +#define __PAGETABLE_P4D_FOLDED + +typedef struct { pgd_t pgd; } p4d_t; + +#define P4D_SHIFT PGDIR_SHIFT +#define PTRS_PER_P4D 1 +#define P4D_SIZE (1UL << P4D_SHIFT) +#define P4D_MASK (~(P4D_SIZE-1)) + +/* + * The "pgd_xxx()" functions here are trivial for a folded two-level + * setup: the p4d is never bad, and a p4d always exists (as it's folded + * into the pgd entry) + */ +static inline int pgd_none(pgd_t pgd) { return 0; } +static inline int pgd_bad(pgd_t pgd) { return 0; } +static inline int pgd_present(pgd_t pgd) { return 1; } +static inline void pgd_clear(pgd_t *pgd) { } +#define p4d_ERROR(p4d) (pgd_ERROR((p4d).pgd)) + +#define pgd_populate(mm, pgd, p4d) do { } while (0) +/* + * (p4ds are folded into pgds so this doesn't get actually called, + * but the define is needed for a generic inline function.) 
+ */ +#define set_pgd(pgdptr, pgdval)set_p4d((p4d_t *)(pgdptr), (p4d_t) { pgdval }) + +static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address) +{ + return (p4d_t *)pgd; +} + +#define p4d_val(x) (pgd_val((x).pgd)) +#define __p4d(x) ((p4d_t) { __pgd(x) }) + +#define pgd_page(pgd) (p4d_page((p4d_t){ pgd })) +#define pgd_page_vaddr(pgd)(p4d_page_vaddr((p4d_t){ pgd })) + +/* + * allocating and freeing a p4d is trivial: the 1-entry p4d is + * inside the pgd, so has no extra memory associated with it. + */ +#define p4d_alloc_one(mm, address) NULL +#define p4d_free(mm, x)do { } while (0) +#define __p4d_free_tlb(tlb, x, a) do { } while (0) + +#undef p4d_addr_end +#define p4d_addr_end(addr, end)(end) + +#endif /* __ASSEMBLY__ */ +#endif /* _PGTABLE_NOP4D_H */ diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h index 5e49430a30a4..c2b9b96d6268 100644 --- a/include/asm-generic/pgtable-nopud.h +++ b/include/asm-generic/pgtable-nopud.h @@ -6,53 +6,54 @@ #ifdef __ARCH_USE_5LEVEL_HACK #include #else +#include #define __PAGETABLE_PUD_FOLDED /* - * Having the pud type consist of a pgd gets the size right, and allows - * us to conceptually access the pgd entry that this pud is folded into + * Having the pud type consist of a p4d gets the size right, and allows + * us to conceptually access the p4d entry that this pud is folded into * without casting. 
*/ -typedef struct { pgd_t pgd; } pud_t; +typedef struct { p4d_t p4d; } pud_t; -#define PUD_SHIFT PGDIR_SHIFT +#define PUD_SHIFT P4D_SHIFT #define PTRS_PER_PUD 1 #define PUD_SIZE (1UL << PUD_SHIFT) #define PUD_MASK (~(PUD_SIZE-1)) /* - * The "pgd_xxx()" functions here are trivial for a folded two-level + * The "p4d_xxx()" functions here are trivial for a folded two-level * setup: the pud is never bad, and a pud always exists (as it's folded - * into the pgd entry) + * into the p4d entry) */ -static inline int pgd_none(pgd_t pgd) { return 0; } -static inline int pgd_bad(pgd_t pgd) { return 0; } -static inline int pgd_present(pgd_t pgd) { return 1; } -static inline void pgd_clear(pgd_t *pgd) { } -#define pud_ERROR(pud) (pgd_ERROR((pud).pgd)) +static inline int p4d_none(p4d_t p4d) { return 0; } +static inline int p4d_bad(p4d_t p4d) { return 0; } +static inline int p4d_present(p4d_t p4d) { return 1; } +static inline void p4d_clear(p4d_t *p4d) { } +#define pud_ERROR(pud) (p4d_ERROR((pud).p4d)) -#define pgd_populate(mm, pgd, pud) do { } while (0) +#define p4d_populate(mm, p4d, pud) do { } while (0) /* - * (puds are folded into pgds so this doesn't get actually called, + * (puds are folded into p4ds so this doesn't get actually called, * but the define is needed for a generic inline function.) */ -#define set_pgd(pgdptr, pgdval)set_pud((pud_t *)(pgdptr), (pud_t) { pgdval }) +#define set_p4d(p4dptr, p4dval)set_pud((pud_t *)(p4dptr), (pud_t) { p4dval }) -static inline
[PATCHv4 27/33] x86/espfix: support 5-level paging
We don't need extra virtual address space for ESPFIX, so it stays within one PUD page table for both 4- and 5-level paging. Signed-off-by: Kirill A. Shutemov --- arch/x86/kernel/espfix_64.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c index 04f89caef9c4..8e598a1ad986 100644 --- a/arch/x86/kernel/espfix_64.c +++ b/arch/x86/kernel/espfix_64.c @@ -50,11 +50,11 @@ #define ESPFIX_STACKS_PER_PAGE (PAGE_SIZE/ESPFIX_STACK_SIZE) /* There is address space for how many espfix pages? */ -#define ESPFIX_PAGE_SPACE (1UL << (PGDIR_SHIFT-PAGE_SHIFT-16)) +#define ESPFIX_PAGE_SPACE (1UL << (P4D_SHIFT-PAGE_SHIFT-16)) #define ESPFIX_MAX_CPUS(ESPFIX_STACKS_PER_PAGE * ESPFIX_PAGE_SPACE) #if CONFIG_NR_CPUS > ESPFIX_MAX_CPUS -# error "Need more than one PGD for the ESPFIX hack" +# error "Need more virtual address space for the ESPFIX hack" #endif #define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO) @@ -121,11 +121,13 @@ static void init_espfix_random(void) void __init init_espfix_bsp(void) { - pgd_t *pgd_p; + pgd_t *pgd; + p4d_t *p4d; /* Install the espfix pud into the kernel page directory */ - pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)]; - pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page); + pgd = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)]; + p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR); + p4d_populate(&init_mm, p4d, espfix_pud_page); /* Randomize the locations */ init_espfix_random(); -- 2.11.0
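The reason the patch can change PGDIR_SHIFT to P4D_SHIFT in ESPFIX_PAGE_SPACE without altering behaviour: with a folded p4d, P4D_SHIFT equals PGDIR_SHIFT (39), while with 5-level paging P4D_SHIFT stays 39 as PGDIR_SHIFT grows to 48. A quick check of the constant, assuming PAGE_SHIFT = 12 (4 KiB pages):

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12

/* ESPFIX_PAGE_SPACE = 1UL << (P4D_SHIFT - PAGE_SHIFT - 16) */
static unsigned long toy_espfix_page_space(int p4d_shift)
{
    return 1UL << (p4d_shift - TOY_PAGE_SHIFT - 16);
}
```

So the espfix region keeps the same 2048-page address space on both 4- and 5-level kernels, which is exactly what "stays within one PUD page table" means.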
[PATCHv4 21/33] x86/asm: remove __VIRTUAL_MASK_SHIFT==47 assert
We don't need it anymore. 17be0aec74fb ("x86/asm/entry/64: Implement better check for canonical addresses") made canonical address check generic wrt. address width. Signed-off-by: Kirill A. Shutemov --- arch/x86/entry/entry_64.S | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 044d18ebc43c..f07b4efb34d5 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -265,12 +265,9 @@ return_from_SYSCALL_64: * * If width of "canonical tail" ever becomes variable, this will need * to be updated to remain correct on both old and new CPUs. +* +* Change top 16 bits to be the sign-extension of 47th bit */ - .ifne __VIRTUAL_MASK_SHIFT - 47 - .error "virtual address width changed -- SYSRET checks need update" - .endif - - /* Change top 16 bits to be the sign-extension of 47th bit */ shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx -- 2.11.0
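The shl/sar pair left in place by this patch is a width-generic sign extension: shifting the address up so the top implemented bit becomes the sign bit, then arithmetic-shifting back. A C model of the two instructions:

```c
#include <assert.h>
#include <stdint.h>

/* Models:
 *   shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 *   sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 * i.e. sign-extend from bit __VIRTUAL_MASK_SHIFT, for any width. */
static uint64_t toy_canonicalize(uint64_t addr, int virtual_mask_shift)
{
    int s = 64 - (virtual_mask_shift + 1);
    return (uint64_t)(((int64_t)(addr << s)) >> s);
}
```

Canonical addresses pass through unchanged; a non-canonical one is forced canonical, which is how the SYSRET check detects it by comparison.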
[PATCHv4 14/33] x86/kexec: support p4d_t
Handle additional page table level in kexec code. Signed-off-by: Kirill A. Shutemov --- arch/x86/include/asm/kexec.h | 1 + arch/x86/kernel/machine_kexec_32.c | 4 +++- arch/x86/kernel/machine_kexec_64.c | 14 -- 3 files changed, 16 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h index 282630e4c6ea..70ef205489f0 100644 --- a/arch/x86/include/asm/kexec.h +++ b/arch/x86/include/asm/kexec.h @@ -164,6 +164,7 @@ struct kimage_arch { }; #else struct kimage_arch { + p4d_t *p4d; pud_t *pud; pmd_t *pmd; pte_t *pte; diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c index 469b23d6acc2..5f43cec296c5 100644 --- a/arch/x86/kernel/machine_kexec_32.c +++ b/arch/x86/kernel/machine_kexec_32.c @@ -103,6 +103,7 @@ static void machine_kexec_page_table_set_one( pgd_t *pgd, pmd_t *pmd, pte_t *pte, unsigned long vaddr, unsigned long paddr) { + p4d_t *p4d; pud_t *pud; pgd += pgd_index(vaddr); @@ -110,7 +111,8 @@ static void machine_kexec_page_table_set_one( if (!(pgd_val(*pgd) & _PAGE_PRESENT)) set_pgd(pgd, __pgd(__pa(pmd) | _PAGE_PRESENT)); #endif - pud = pud_offset(pgd, vaddr); + p4d = p4d_offset(pgd, vaddr); + pud = pud_offset(p4d, vaddr); pmd = pmd_offset(pud, vaddr); if (!(pmd_val(*pmd) & _PAGE_PRESENT)) set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE)); diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index 307b1f4543de..42eae96c8450 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -36,6 +36,7 @@ static struct kexec_file_ops *kexec_file_loaders[] = { static void free_transition_pgtable(struct kimage *image) { + free_page((unsigned long)image->arch.p4d); free_page((unsigned long)image->arch.pud); free_page((unsigned long)image->arch.pmd); free_page((unsigned long)image->arch.pte); @@ -43,6 +44,7 @@ static void free_transition_pgtable(struct kimage *image) static int init_transition_pgtable(struct kimage *image, pgd_t *pgd) { + 
p4d_t *p4d; pud_t *pud; pmd_t *pmd; pte_t *pte; @@ -53,13 +55,21 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd) paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE); pgd += pgd_index(vaddr); if (!pgd_present(*pgd)) { + p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL); + if (!p4d) + goto err; + image->arch.p4d = p4d; + set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE)); + } + p4d = p4d_offset(pgd, vaddr); + if (!p4d_present(*p4d)) { pud = (pud_t *)get_zeroed_page(GFP_KERNEL); if (!pud) goto err; image->arch.pud = pud; - set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE)); + set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE)); } - pud = pud_offset(pgd, vaddr); + pud = pud_offset(p4d, vaddr); if (!pud_present(*pud)) { pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL); if (!pmd) -- 2.11.0
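init_transition_pgtable() repeats one idiom per level: if the entry is not present, allocate a zeroed table and link it, then descend. A userspace sketch of that demand-allocation pattern (pointers stand in for pfn|flags entries; `toy_` names are invented):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_ENTRIES 512

/* One page-table level: an array of pointers to child tables. */
typedef struct toy_table { struct toy_table *entry[TOY_ENTRIES]; } toy_table;

/* Allocate-and-link a child table only if the slot is empty,
 * mirroring the !pgd_present()/!p4d_present() checks above. */
static toy_table *toy_populate(toy_table *parent, unsigned idx)
{
    if (!parent->entry[idx])
        parent->entry[idx] = calloc(1, sizeof(toy_table)); /* get_zeroed_page() */
    return parent->entry[idx];
}
```

Chaining toy_populate() calls pgd→p4d→pud→pmd gives the same shape as the kexec code, including the property that a second walk reuses already-present tables.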
[PATCHv4 20/33] x86: detect 5-level paging support
5-level paging support is required from hardware when compiled with CONFIG_X86_5LEVEL=y. We may implement runtime switch support later. Signed-off-by: Kirill A. Shutemov --- arch/x86/boot/cpucheck.c | 9 + arch/x86/boot/cpuflags.c | 12 ++-- arch/x86/include/asm/disabled-features.h | 8 +++- arch/x86/include/asm/required-features.h | 8 +++- 4 files changed, 33 insertions(+), 4 deletions(-) diff --git a/arch/x86/boot/cpucheck.c b/arch/x86/boot/cpucheck.c index 4ad7d70e8739..8f0c4c9fc904 100644 --- a/arch/x86/boot/cpucheck.c +++ b/arch/x86/boot/cpucheck.c @@ -44,6 +44,15 @@ static const u32 req_flags[NCAPINTS] = 0, /* REQUIRED_MASK5 not implemented in this file */ REQUIRED_MASK6, 0, /* REQUIRED_MASK7 not implemented in this file */ + 0, /* REQUIRED_MASK8 not implemented in this file */ + 0, /* REQUIRED_MASK9 not implemented in this file */ + 0, /* REQUIRED_MASK10 not implemented in this file */ + 0, /* REQUIRED_MASK11 not implemented in this file */ + 0, /* REQUIRED_MASK12 not implemented in this file */ + 0, /* REQUIRED_MASK13 not implemented in this file */ + 0, /* REQUIRED_MASK14 not implemented in this file */ + 0, /* REQUIRED_MASK15 not implemented in this file */ + REQUIRED_MASK16, }; #define A32(a, b, c, d) (((d) << 24)+((c) << 16)+((b) << 8)+(a)) diff --git a/arch/x86/boot/cpuflags.c b/arch/x86/boot/cpuflags.c index 6687ab953257..9e77c23c2422 100644 --- a/arch/x86/boot/cpuflags.c +++ b/arch/x86/boot/cpuflags.c @@ -70,16 +70,19 @@ int has_eflag(unsigned long mask) # define EBX_REG "=b" #endif -static inline void cpuid(u32 id, u32 *a, u32 *b, u32 *c, u32 *d) +static inline void cpuid_count(u32 id, u32 count, + u32 *a, u32 *b, u32 *c, u32 *d) { asm volatile(".ifnc %%ebx,%3 ; movl %%ebx,%3 ; .endif \n\t" "cpuid \n\t" ".ifnc %%ebx,%3 ; xchgl %%ebx,%3 ; .endif \n\t" : "=a" (*a), "=c" (*c), "=d" (*d), EBX_REG (*b) - : "a" (id) + : "a" (id), "c" (count) ); } +#define cpuid(id, a, b, c, d) cpuid_count(id, 0, a, b, c, d) + void get_cpuflags(void) { u32 max_intel_level, 
max_amd_level; @@ -108,6 +111,11 @@ void get_cpuflags(void) cpu.model += ((tfms >> 16) & 0xf) << 4; } + if (max_intel_level >= 0x0007) { + cpuid_count(0x0007, 0, &ignored, &ignored, + &cpu.flags[16], &ignored); + } + cpuid(0x8000, &max_amd_level, &ignored, &ignored, &ignored); diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h index 85599ad4d024..fc0960236fc3 100644 --- a/arch/x86/include/asm/disabled-features.h +++ b/arch/x86/include/asm/disabled-features.h @@ -36,6 +36,12 @@ # define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31)) #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */ +#ifdef CONFIG_X86_5LEVEL +#define DISABLE_LA57 0 +#else +#define DISABLE_LA57 (1<<(X86_FEATURE_LA57 & 31)) +#endif + /* * Make sure to add features to the correct mask */ @@ -55,7 +61,7 @@ #define DISABLED_MASK130 #define DISABLED_MASK140 #define DISABLED_MASK150 -#define DISABLED_MASK16(DISABLE_PKU|DISABLE_OSPKE) +#define DISABLED_MASK16(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57) #define DISABLED_MASK170 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18) diff --git a/arch/x86/include/asm/required-features.h b/arch/x86/include/asm/required-features.h index fac9a5c0abe9..d91ba04dd007 100644 --- a/arch/x86/include/asm/required-features.h +++ b/arch/x86/include/asm/required-features.h @@ -53,6 +53,12 @@ # define NEED_MOVBE0 #endif +#ifdef CONFIG_X86_5LEVEL +# define NEED_LA57 (1<<(X86_FEATURE_LA57 & 31)) +#else +# define NEED_LA57 0 +#endif + #ifdef CONFIG_X86_64 #ifdef CONFIG_PARAVIRT /* Paravirtualized systems may not have PSE or PGE available */ @@ -98,7 +104,7 @@ #define REQUIRED_MASK130 #define REQUIRED_MASK140 #define REQUIRED_MASK150 -#define REQUIRED_MASK160 +#define REQUIRED_MASK16(NEED_LA57) #define REQUIRED_MASK170 #define REQUIRED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18) -- 2.11.0
[PATCHv4 18/33] x86/xen: convert __xen_pgd_walk() and xen_cleanmfnmap() to support p4d
Split these helpers into a few per-level functions and add p4d support. Signed-off-by: Xiong Zhang [kirill.shute...@linux.intel.com: split off into separate patch] Signed-off-by: Kirill A. Shutemov --- arch/x86/xen/mmu.c | 243 - arch/x86/xen/mmu.h | 1 + 2 files changed, 148 insertions(+), 96 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index 37cb5aad71de..75af8da7b54f 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -593,6 +593,62 @@ static void xen_set_pgd(pgd_t *ptr, pgd_t val) } #endif /* CONFIG_PGTABLE_LEVELS == 4 */ +static int xen_pmd_walk(struct mm_struct *mm, pmd_t *pmd, + int (*func)(struct mm_struct *mm, struct page *, enum pt_level), + bool last, unsigned long limit) +{ + int i, nr, flush = 0; + + nr = last ? pmd_index(limit) + 1 : PTRS_PER_PMD; + for (i = 0; i < nr; i++) { + if (!pmd_none(pmd[i])) + flush |= (*func)(mm, pmd_page(pmd[i]), PT_PTE); + } + return flush; +} + +static int xen_pud_walk(struct mm_struct *mm, pud_t *pud, + int (*func)(struct mm_struct *mm, struct page *, enum pt_level), + bool last, unsigned long limit) +{ + int i, nr, flush = 0; + + nr = last ? pud_index(limit) + 1 : PTRS_PER_PUD; + for (i = 0; i < nr; i++) { + pmd_t *pmd; + + if (pud_none(pud[i])) + continue; + + pmd = pmd_offset(&pud[i], 0); + if (PTRS_PER_PMD > 1) + flush |= (*func)(mm, virt_to_page(pmd), PT_PMD); + xen_pmd_walk(mm, pmd, func, last && i == nr - 1, limit); + } + return flush; +} + +static int xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d, + int (*func)(struct mm_struct *mm, struct page *, enum pt_level), + bool last, unsigned long limit) +{ + int i, nr, flush = 0; + + nr = last ? p4d_index(limit) + 1 : PTRS_PER_P4D; + for (i = 0; i < nr; i++) { + pud_t *pud; + + if (p4d_none(p4d[i])) + continue; + + pud = pud_offset(&p4d[i], 0); + if (PTRS_PER_PUD > 1) + flush |= (*func)(mm, virt_to_page(pud), PT_PUD); + xen_pud_walk(mm, pud, func, last && i == nr - 1, limit); + } + return flush; +} + /* * (Yet another) pagetable walker. 
This one is intended for pinning a * pagetable. This means that it walks a pagetable and calls the @@ -613,10 +669,8 @@ static int __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd, enum pt_level), unsigned long limit) { - int flush = 0; + int i, nr, flush = 0; unsigned hole_low, hole_high; - unsigned pgdidx_limit, pudidx_limit, pmdidx_limit; - unsigned pgdidx, pudidx, pmdidx; /* The limit is the last byte to be touched */ limit--; @@ -633,65 +687,22 @@ static int __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd, hole_low = pgd_index(USER_LIMIT); hole_high = pgd_index(PAGE_OFFSET); - pgdidx_limit = pgd_index(limit); -#if PTRS_PER_PUD > 1 - pudidx_limit = pud_index(limit); -#else - pudidx_limit = 0; -#endif -#if PTRS_PER_PMD > 1 - pmdidx_limit = pmd_index(limit); -#else - pmdidx_limit = 0; -#endif - - for (pgdidx = 0; pgdidx <= pgdidx_limit; pgdidx++) { - pud_t *pud; + nr = pgd_index(limit) + 1; + for (i = 0; i < nr; i++) { + p4d_t *p4d; - if (pgdidx >= hole_low && pgdidx < hole_high) + if (i >= hole_low && i < hole_high) continue; - if (!pgd_val(pgd[pgdidx])) + if (pgd_none(pgd[i])) continue; - pud = pud_offset(&pgd[pgdidx], 0); - - if (PTRS_PER_PUD > 1) /* not folded */ - flush |= (*func)(mm, virt_to_page(pud), PT_PUD); - - for (pudidx = 0; pudidx < PTRS_PER_PUD; pudidx++) { - pmd_t *pmd; - - if (pgdidx == pgdidx_limit && - pudidx > pudidx_limit) - goto out; - - if (pud_none(pud[pudidx])) - continue; - - pmd = pmd_offset(&pud[pudidx], 0); - - if (PTRS_PER_PMD > 1) /* not folded */ - flush |= (*func)(mm, virt_to_page(pmd), PT_PMD); - - for (pmdidx = 0; pmdidx < PTRS_PER_PMD; pmdidx++) { - struct page *pte; - - if (pgdidx == pgdidx_limit && - pudidx == pudidx_limit && - pmdidx > pmdidx_limit) -
[PATCHv4 00/33] 5-level paging
Here is v4 of the 5-level paging patchset. Please review and consider applying. == Overview == x86-64 is currently limited to 256 TiB of virtual address space and 64 TiB of physical address space. We are already bumping into this limit: some vendors offer servers with 64 TiB of memory today. To overcome the limitation, upcoming hardware will introduce support for 5-level paging[1]. It is a straightforward extension of the current page table structure, adding one more layer of translation. It bumps the limits to 128 PiB of virtual address space and 4 PiB of physical address space. This "ought to be enough for anybody" ©. == Patches == The patchset is built on top of v4.11-rc1. Current QEMU upstream git supports 5-level paging. Use "-cpu qemu64,+la57" to enable it. Patch 1: Detect the la57 feature for /proc/cpuinfo. Patches 2-7: Bring 5-level paging to generic code and convert all architectures to it using Patches 8-19: Convert x86 to a properly folded p4d layer using . Patches 20-32: Enable real 5-level paging. CONFIG_X86_5LEVEL=y will enable the new paging mode. Patch 33: Introduce new prctl(2) handles -- PR_SET_MAX_VADDR and PR_GET_MAX_VADDR. This aims to address a compatibility issue. It only supports x86 for now, but should be useful for other architectures. Git: git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git la57/v4 == TODO == There is still work to do: - CONFIG_XEN is broken for 5-level paging. Xen for 5-level paging requires more work to get functional. Xen on 4-level paging works, so it's not a regression. - Boot-time switch between 4- and 5-level paging. We assume that distributions will be keen to avoid returning to the i386 days where we shipped one kernel binary for each page table layout. As the page table format is the same for 4- and 5-level paging, it should be possible to have a single kernel binary and switch between them at boot time without too much hassle. For now only a compile-time switch is implemented. The boot-time switch will come in a separate patchset. 
== Changelog == v4: - Rebased to v4.11-rc1; - Use the mmap() hint address to allocate virtual address space above 47-bits instead of prctl() handles. v3: - Rebased to v4.10-rc5; - prctl() handles for large address space opt-in; - Xen works for 4-level paging; - EFI boot fixed for both 4- and 5-level paging; - Hibernation fixed for 4-level paging; - kexec() fixed; - Couple of build fixes; v2: - Rebased to v4.10-rc1; - RLIMIT_VADDR proposal; - Fix virtual map and update documentation; - Fix a few build errors; - Rework cpuid helpers in boot code; - Fix espfix code to work with 5-level pages; [1] https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf Kirill A. Shutemov (33): x86/cpufeature: Add 5-level paging detection asm-generic: introduce 5level-fixup.h asm-generic: introduce __ARCH_USE_5LEVEL_HACK arch, mm: convert all architectures to use 5level-fixup.h asm-generic: introduce mm: convert generic code to 5-level paging mm: introduce __p4d_alloc() x86: basic changes into headers for 5-level paging x86: trivial portion of 5-level paging conversion x86/gup: add 5-level paging support x86/ident_map: add 5-level paging support x86/mm: add support of p4d_t in vmalloc_fault() x86/power: support p4d_t in hibernate code x86/kexec: support p4d_t x86/efi: handle p4d in EFI pagetables x86/mm/pat: handle additional page table x86/kasan: prepare clear_pgds() to switch to x86/xen: convert __xen_pgd_walk() and xen_cleanmfnmap() to support p4d x86: convert the rest of the code to support p4d_t x86: detect 5-level paging support x86/asm: remove __VIRTUAL_MASK_SHIFT==47 assert x86/mm: define virtual memory map for 5-level paging x86/paravirt: make paravirt code support 5-level paging x86/mm: basic defines/helpers for CONFIG_X86_5LEVEL x86/dump_pagetables: support 5-level paging x86/kasan: extend to support 5-level paging x86/espfix: support 5-level paging x86/mm: add support of additional page table level during early boot x86/mm: add 
sync_global_pgds() for configuration with 5-level paging x86/mm: make kernel_physical_mapping_init() support 5-level paging x86/mm: add support for 5-level paging for KASLR x86: enable 5-level paging support x86/mm: allow to have userspace mappings above 47-bits Documentation/x86/x86_64/mm.txt | 33 +- arch/arc/include/asm/hugepage.h | 1 + arch/arc/include/asm/pgtable.h | 1 + arch/arm/include/asm/pgtable.h | 1 + arch/arm64/include/asm/pgtable-types.h | 4 + arch/avr32/include/asm/pgtable-2level.h | 1 + arch/cris/include/asm/pgtable.h | 1 + arch/frv/include/asm/pgtable.h
[PATCHv4 09/33] x86: trivial portion of 5-level paging conversion
This patch covers simple cases only. Signed-off-by: Kirill A. Shutemov --- arch/x86/kernel/tboot.c| 6 +- arch/x86/kernel/vm86_32.c | 6 +- arch/x86/mm/fault.c| 39 +-- arch/x86/mm/init_32.c | 22 -- arch/x86/mm/ioremap.c | 3 ++- arch/x86/mm/pgtable.c | 4 +++- arch/x86/mm/pgtable_32.c | 8 +++- arch/x86/platform/efi/efi_64.c | 13 + arch/x86/power/hibernate_32.c | 7 +-- 9 files changed, 85 insertions(+), 23 deletions(-) diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c index b868fa1b812b..5db0f33cbf2c 100644 --- a/arch/x86/kernel/tboot.c +++ b/arch/x86/kernel/tboot.c @@ -118,12 +118,16 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn, pgprot_t prot) { pgd_t *pgd; + p4d_t *p4d; pud_t *pud; pmd_t *pmd; pte_t *pte; pgd = pgd_offset(&tboot_mm, vaddr); - pud = pud_alloc(&tboot_mm, pgd, vaddr); + p4d = p4d_alloc(&tboot_mm, pgd, vaddr); + if (!p4d) + return -1; + pud = pud_alloc(&tboot_mm, p4d, vaddr); if (!pud) return -1; pmd = pmd_alloc(&tboot_mm, pud, vaddr); diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c index 23ee89ce59a9..62597c300d94 100644 --- a/arch/x86/kernel/vm86_32.c +++ b/arch/x86/kernel/vm86_32.c @@ -164,6 +164,7 @@ static void mark_screen_rdonly(struct mm_struct *mm) struct vm_area_struct *vma; spinlock_t *ptl; pgd_t *pgd; + p4d_t *p4d; pud_t *pud; pmd_t *pmd; pte_t *pte; @@ -173,7 +174,10 @@ static void mark_screen_rdonly(struct mm_struct *mm) pgd = pgd_offset(mm, 0xA); if (pgd_none_or_clear_bad(pgd)) goto out; - pud = pud_offset(pgd, 0xA); + p4d = p4d_offset(pgd, 0xA); + if (p4d_none_or_clear_bad(p4d)) + goto out; + pud = pud_offset(p4d, 0xA); if (pud_none_or_clear_bad(pud)) goto out; pmd = pmd_offset(pud, 0xA); diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 428e31763cb9..605fd5e8e048 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -253,6 +253,7 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address) { unsigned index = pgd_index(address); pgd_t *pgd_k; + p4d_t 
*p4d, *p4d_k; pud_t *pud, *pud_k; pmd_t *pmd, *pmd_k; @@ -265,10 +266,15 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address) /* * set_pgd(pgd, *pgd_k); here would be useless on PAE * and redundant with the set_pmd() on non-PAE. As would -* set_pud. +* set_p4d/set_pud. */ - pud = pud_offset(pgd, address); - pud_k = pud_offset(pgd_k, address); + p4d = p4d_offset(pgd, address); + p4d_k = p4d_offset(pgd_k, address); + if (!p4d_present(*p4d_k)) + return NULL; + + pud = pud_offset(p4d, address); + pud_k = pud_offset(p4d_k, address); if (!pud_present(*pud_k)) return NULL; @@ -384,6 +390,8 @@ static void dump_pagetable(unsigned long address) { pgd_t *base = __va(read_cr3()); pgd_t *pgd = &base[pgd_index(address)]; + p4d_t *p4d; + pud_t *pud; pmd_t *pmd; pte_t *pte; @@ -392,7 +400,9 @@ static void dump_pagetable(unsigned long address) if (!low_pfn(pgd_val(*pgd) >> PAGE_SHIFT) || !pgd_present(*pgd)) goto out; #endif - pmd = pmd_offset(pud_offset(pgd, address), address); + p4d = p4d_offset(pgd, address); + pud = pud_offset(p4d, address); + pmd = pmd_offset(pud, address); printk(KERN_CONT "*pde = %0*Lx ", sizeof(*pmd) * 2, (u64)pmd_val(*pmd)); /* @@ -526,6 +536,7 @@ static void dump_pagetable(unsigned long address) { pgd_t *base = __va(read_cr3() & PHYSICAL_PAGE_MASK); pgd_t *pgd = base + pgd_index(address); + p4d_t *p4d; pud_t *pud; pmd_t *pmd; pte_t *pte; @@ -538,7 +549,15 @@ static void dump_pagetable(unsigned long address) if (!pgd_present(*pgd)) goto out; - pud = pud_offset(pgd, address); + p4d = p4d_offset(pgd, address); + if (bad_address(p4d)) + goto bad; + + printk("P4D %lx ", p4d_val(*p4d)); + if (!p4d_present(*p4d) || p4d_large(*p4d)) + goto out; + + pud = pud_offset(p4d, address); if (bad_address(pud)) goto bad; @@ -1082,6 +1101,7 @@ static noinline int spurious_fault(unsigned long error_code, unsigned long address) { pgd_t *pgd; + p4d_t *p4d; pud_t *pud; pmd_t *pmd; pte_t *pte; @@ -1104,7 +1124,14 @@ spurious_fault(unsigned long error_code, 
unsigned long address)
[PATCHv4 08/33] x86: basic changes into headers for 5-level paging
This patch extends x86 headers to enable 5-level paging support. It's still based on . We will get to the point where we can have later. Signed-off-by: Kirill A. Shutemov --- arch/x86/include/asm/pgtable-2level_types.h | 1 + arch/x86/include/asm/pgtable-3level_types.h | 1 + arch/x86/include/asm/pgtable.h | 26 - arch/x86/include/asm/pgtable_64_types.h | 1 + arch/x86/include/asm/pgtable_types.h| 30 - 5 files changed, 53 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/pgtable-2level_types.h b/arch/x86/include/asm/pgtable-2level_types.h index 392576433e77..373ab1de909f 100644 --- a/arch/x86/include/asm/pgtable-2level_types.h +++ b/arch/x86/include/asm/pgtable-2level_types.h @@ -7,6 +7,7 @@ typedef unsigned long pteval_t; typedef unsigned long pmdval_t; typedef unsigned long pudval_t; +typedef unsigned long p4dval_t; typedef unsigned long pgdval_t; typedef unsigned long pgprotval_t; diff --git a/arch/x86/include/asm/pgtable-3level_types.h b/arch/x86/include/asm/pgtable-3level_types.h index bcc89625ebe5..b8a4341faafa 100644 --- a/arch/x86/include/asm/pgtable-3level_types.h +++ b/arch/x86/include/asm/pgtable-3level_types.h @@ -7,6 +7,7 @@ typedef u64pteval_t; typedef u64pmdval_t; typedef u64pudval_t; +typedef u64p4dval_t; typedef u64pgdval_t; typedef u64pgprotval_t; diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 1cfb36b8c024..6f6f351e0a81 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -179,6 +179,17 @@ static inline unsigned long pud_pfn(pud_t pud) return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT; } +static inline unsigned long p4d_pfn(p4d_t p4d) +{ + return (p4d_val(p4d) & p4d_pfn_mask(p4d)) >> PAGE_SHIFT; +} + +static inline int p4d_large(p4d_t p4d) +{ + /* No 512 GiB pages yet */ + return 0; +} + #define pte_page(pte) pfn_to_page(pte_pfn(pte)) static inline int pmd_large(pmd_t pte) @@ -770,6 +781,16 @@ static inline int pud_large(pud_t pud) } #endif /* CONFIG_PGTABLE_LEVELS 
> 2 */ +static inline unsigned long pud_index(unsigned long address) +{ + return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1); +} + +static inline unsigned long p4d_index(unsigned long address) +{ + return (address >> P4D_SHIFT) & (PTRS_PER_P4D - 1); +} + #if CONFIG_PGTABLE_LEVELS > 3 static inline int pgd_present(pgd_t pgd) { @@ -788,11 +809,6 @@ static inline unsigned long pgd_page_vaddr(pgd_t pgd) #define pgd_page(pgd) pfn_to_page(pgd_val(pgd) >> PAGE_SHIFT) /* to find an entry in a page-table-directory. */ -static inline unsigned long pud_index(unsigned long address) -{ - return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1); -} - static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address) { return (pud_t *)pgd_page_vaddr(*pgd) + pud_index(address); diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h index 3a264200c62f..0b2797e5083c 100644 --- a/arch/x86/include/asm/pgtable_64_types.h +++ b/arch/x86/include/asm/pgtable_64_types.h @@ -13,6 +13,7 @@ typedef unsigned long pteval_t; typedef unsigned long pmdval_t; typedef unsigned long pudval_t; +typedef unsigned long p4dval_t; typedef unsigned long pgdval_t; typedef unsigned long pgprotval_t; diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 62484333673d..df08535f774a 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -272,9 +272,20 @@ static inline pgdval_t pgd_flags(pgd_t pgd) return native_pgd_val(pgd) & PTE_FLAGS_MASK; } -#if CONFIG_PGTABLE_LEVELS > 3 +#if CONFIG_PGTABLE_LEVELS > 4 + +#error FIXME + +#else #include +static inline p4dval_t native_p4d_val(p4d_t p4d) +{ + return native_pgd_val(p4d); +} +#endif + +#if CONFIG_PGTABLE_LEVELS > 3 typedef struct { pudval_t pud; } pud_t; static inline pud_t native_make_pud(pmdval_t val) @@ -318,6 +329,22 @@ static inline pmdval_t native_pmd_val(pmd_t pmd) } #endif +static inline p4dval_t p4d_pfn_mask(p4d_t p4d) +{ + /* No 512 GiB huge 
pages yet */ + return PTE_PFN_MASK; +} + +static inline p4dval_t p4d_flags_mask(p4d_t p4d) +{ + return ~p4d_pfn_mask(p4d); +} + +static inline p4dval_t p4d_flags(p4d_t p4d) +{ + return native_p4d_val(p4d) & p4d_flags_mask(p4d); +} + static inline pudval_t pud_pfn_mask(pud_t pud) { if (native_pud_val(pud) & _PAGE_PSE) @@ -461,6 +488,7 @@ enum pg_level { PG_LEVEL_4K, PG_LEVEL_2M, PG_LEVEL_1G, + PG_LEVEL_512G, PG_LEVEL_NUM }; -- 2.11.0
[PATCH] iommu/arm-smmu: Report smmu type in dmesg
ARM SMMU detection depends heavily on the system firmware. For better diagnostics, log the detected type in dmesg. The SMMU type's name is now stored in struct arm_smmu_type, and the ACPI code is modified to use that struct too. Rename the ARM_SMMU_MATCH_DATA() macro to ARM_SMMU_TYPE() for better readability. Signed-off-by: Robert Richter --- drivers/iommu/arm-smmu.c | 61 1 file changed, 30 insertions(+), 31 deletions(-) diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index abf6496843a6..5c793b3d3173 100644 --- a/drivers/iommu/arm-smmu.c +++ b/drivers/iommu/arm-smmu.c @@ -366,6 +366,7 @@ struct arm_smmu_device { u32 options; enum arm_smmu_arch_version version; enum arm_smmu_implementationmodel; + const char *name; u32 num_context_banks; u32 num_s2_context_banks; @@ -1955,19 +1956,20 @@ static int arm_smmu_device_cfg_probe(struct arm_smmu_device *smmu) return 0; } -struct arm_smmu_match_data { +struct arm_smmu_type { enum arm_smmu_arch_version version; enum arm_smmu_implementation model; + const char *name; }; -#define ARM_SMMU_MATCH_DATA(name, ver, imp)\ -static struct arm_smmu_match_data name = { .version = ver, .model = imp } +#define ARM_SMMU_TYPE(var, ver, imp, _name)\ +static struct arm_smmu_type var = { .version = ver, .model = imp, .name = _name } -ARM_SMMU_MATCH_DATA(smmu_generic_v1, ARM_SMMU_V1, GENERIC_SMMU); -ARM_SMMU_MATCH_DATA(smmu_generic_v2, ARM_SMMU_V2, GENERIC_SMMU); -ARM_SMMU_MATCH_DATA(arm_mmu401, ARM_SMMU_V1_64K, GENERIC_SMMU); -ARM_SMMU_MATCH_DATA(arm_mmu500, ARM_SMMU_V2, ARM_MMU500); -ARM_SMMU_MATCH_DATA(cavium_smmuv2, ARM_SMMU_V2, CAVIUM_SMMUV2); +ARM_SMMU_TYPE(smmu_generic_v1, ARM_SMMU_V1, GENERIC_SMMU, "smmu-generic-v1"); +ARM_SMMU_TYPE(smmu_generic_v2, ARM_SMMU_V2, GENERIC_SMMU, "smmu-generic-v2"); +ARM_SMMU_TYPE(arm_mmu401, ARM_SMMU_V1_64K, GENERIC_SMMU, "arm-mmu401"); +ARM_SMMU_TYPE(arm_mmu500, ARM_SMMU_V2, ARM_MMU500, "arm-mmu500"); +ARM_SMMU_TYPE(cavium_smmuv2, ARM_SMMU_V2, CAVIUM_SMMUV2, "cavium-smmuv2"); static const 
struct of_device_id arm_smmu_of_match[] = { { .compatible = "arm,smmu-v1", .data = &smmu_generic_v1 }, @@ -1981,29 +1983,19 @@ static const struct of_device_id arm_smmu_of_match[] = { MODULE_DEVICE_TABLE(of, arm_smmu_of_match); #ifdef CONFIG_ACPI -static int acpi_smmu_get_data(u32 model, struct arm_smmu_device *smmu) +static struct arm_smmu_type *acpi_smmu_get_type(u32 model) { - int ret = 0; - switch (model) { case ACPI_IORT_SMMU_V1: case ACPI_IORT_SMMU_CORELINK_MMU400: - smmu->version = ARM_SMMU_V1; - smmu->model = GENERIC_SMMU; - break; + return &smmu_generic_v1; case ACPI_IORT_SMMU_V2: - smmu->version = ARM_SMMU_V2; - smmu->model = GENERIC_SMMU; - break; + return &smmu_generic_v2; case ACPI_IORT_SMMU_CORELINK_MMU500: - smmu->version = ARM_SMMU_V2; - smmu->model = ARM_MMU500; - break; - default: - ret = -ENODEV; + return &arm_mmu500; } - return ret; + return NULL; } static int arm_smmu_device_acpi_probe(struct platform_device *pdev, @@ -2013,14 +2005,18 @@ static int arm_smmu_device_acpi_probe(struct platform_device *pdev, struct acpi_iort_node *node = *(struct acpi_iort_node **)dev_get_platdata(dev); struct acpi_iort_smmu *iort_smmu; - int ret; + struct arm_smmu_type *type; /* Retrieve SMMU1/2 specific data */ iort_smmu = (struct acpi_iort_smmu *)node->node_data; - ret = acpi_smmu_get_data(iort_smmu->model, smmu); - if (ret < 0) - return ret; + type = acpi_smmu_get_type(iort_smmu->model); + if (!type) + return -ENODEV; + + smmu->version = type->version; + smmu->model = type->model; + smmu->name = type->name; /* Ignore the configuration access interrupt */ smmu->num_global_irqs = 1; @@ -2041,8 +2037,8 @@ static inline int arm_smmu_device_acpi_probe(struct platform_device *pdev, static int arm_smmu_device_dt_probe(struct platform_device *pdev, struct arm_smmu_device *smmu) { - const struct arm_smmu_match_data *data; struct device *dev = &pdev->dev; + const struct arm_smmu_type *type; bool legacy_binding; if (of_property_read_u32(dev->of_node, 
"#global-interrupts", @@ -2051,9 +2047,10 @@ static int arm_smmu_device_dt_probe(struct platform_device *pdev, return -ENODEV; } - data = of_device_get_match_data(dev); - smmu->version = data->version; -
Re: [Patch v2 02/11] s5p-mfc: Adding initial support for MFC v10.10
On 03.03.2017 10:07, Smitha T Murthy wrote: > Adding the support for MFC v10.10, with new register file and > necessary hw control, decoder, encoder and structural changes. > > Signed-off-by: Smitha T Murthy Reviewed-by: Andrzej Hajda Few nitpicks below. > CC: Rob Herring > CC: devicet...@vger.kernel.org > --- > .../devicetree/bindings/media/s5p-mfc.txt |1 + > drivers/media/platform/s5p-mfc/regs-mfc-v10.h | 36 > drivers/media/platform/s5p-mfc/s5p_mfc.c | 30 + > drivers/media/platform/s5p-mfc/s5p_mfc_common.h|4 +- > drivers/media/platform/s5p-mfc/s5p_mfc_ctrl.c |4 ++ > drivers/media/platform/s5p-mfc/s5p_mfc_dec.c | 44 > +++- > drivers/media/platform/s5p-mfc/s5p_mfc_enc.c | 21 + > drivers/media/platform/s5p-mfc/s5p_mfc_opr_v6.c|9 +++- > drivers/media/platform/s5p-mfc/s5p_mfc_opr_v6.h|2 + > 9 files changed, 118 insertions(+), 33 deletions(-) > create mode 100644 drivers/media/platform/s5p-mfc/regs-mfc-v10.h > > diff --git a/Documentation/devicetree/bindings/media/s5p-mfc.txt > b/Documentation/devicetree/bindings/media/s5p-mfc.txt > index 2c90128..b83727b 100644 > --- a/Documentation/devicetree/bindings/media/s5p-mfc.txt > +++ b/Documentation/devicetree/bindings/media/s5p-mfc.txt > @@ -13,6 +13,7 @@ Required properties: > (c) "samsung,mfc-v7" for MFC v7 present in Exynos5420 SoC > (d) "samsung,mfc-v8" for MFC v8 present in Exynos5800 SoC > (e) "samsung,exynos5433-mfc" for MFC v8 present in Exynos5433 SoC > + (f) "samsung,mfc-v10" for MFC v10 present in Exynos7880 SoC > >- reg : Physical base address of the IP registers and length of memory > mapped region. > diff --git a/drivers/media/platform/s5p-mfc/regs-mfc-v10.h > b/drivers/media/platform/s5p-mfc/regs-mfc-v10.h > new file mode 100644 > index 000..bd671a5 > --- /dev/null > +++ b/drivers/media/platform/s5p-mfc/regs-mfc-v10.h > @@ -0,0 +1,36 @@ > +/* > + * Register definition file for Samsung MFC V10.x Interface (FIMV) driver > + * > + * Copyright (c) 2017 Samsung Electronics Co., Ltd. 
> + * http://www.samsung.com/ > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + */ > + > +#ifndef _REGS_MFC_V10_H > +#define _REGS_MFC_V10_H > + > +#include > +#include "regs-mfc-v8.h" > + > +/* MFCv10 register definitions*/ > +#define S5P_FIMV_MFC_CLOCK_OFF_V10 0x7120 > +#define S5P_FIMV_MFC_STATE_V10 0x7124 > + > +/* MFCv10 Context buffer sizes */ > +#define MFC_CTX_BUF_SIZE_V10 (30 * SZ_1K)/* 30KB */ > +#define MFC_H264_DEC_CTX_BUF_SIZE_V10(2 * SZ_1M) /* 2MB */ > +#define MFC_OTHER_DEC_CTX_BUF_SIZE_V10 (20 * SZ_1K)/* 20KB */ > +#define MFC_H264_ENC_CTX_BUF_SIZE_V10(100 * SZ_1K) /* 100KB */ > +#define MFC_OTHER_ENC_CTX_BUF_SIZE_V10 (15 * SZ_1K)/* 15KB */ > + > +/* MFCv10 variant defines */ > +#define MAX_FW_SIZE_V10 (SZ_1M) /* 1MB */ > +#define MAX_CPB_SIZE_V10 (3 * SZ_1M) /* 3MB */ These comments seem redundant; the definitions are clear enough. You could remove them if there is a next iteration. 
> +#define MFC_VERSION_V10 0xA0 > +#define MFC_NUM_PORTS_V101 > + > +#endif /*_REGS_MFC_V10_H*/ > + > diff --git a/drivers/media/platform/s5p-mfc/s5p_mfc.c > b/drivers/media/platform/s5p-mfc/s5p_mfc.c > index bb0a588..a043cce 100644 > --- a/drivers/media/platform/s5p-mfc/s5p_mfc.c > +++ b/drivers/media/platform/s5p-mfc/s5p_mfc.c > @@ -1542,6 +1542,33 @@ static int s5p_mfc_resume(struct device *dev) > .num_clocks = 3, > }; > > +static struct s5p_mfc_buf_size_v6 mfc_buf_size_v10 = { > + .dev_ctx= MFC_CTX_BUF_SIZE_V10, > + .h264_dec_ctx = MFC_H264_DEC_CTX_BUF_SIZE_V10, > + .other_dec_ctx = MFC_OTHER_DEC_CTX_BUF_SIZE_V10, > + .h264_enc_ctx = MFC_H264_ENC_CTX_BUF_SIZE_V10, > + .other_enc_ctx = MFC_OTHER_ENC_CTX_BUF_SIZE_V10, > +}; > + > +static struct s5p_mfc_buf_size buf_size_v10 = { > + .fw = MAX_FW_SIZE_V10, > + .cpb= MAX_CPB_SIZE_V10, > + .priv = &mfc_buf_size_v10, > +}; > + > +static struct s5p_mfc_buf_align mfc_buf_align_v10 = { > + .base = 0, > +}; > + > +static struct s5p_mfc_variant mfc_drvdata_v10 = { > + .version= MFC_VERSION_V10, > + .version_bit= MFC_V10_BIT, > + .port_num = MFC_NUM_PORTS_V10, > + .buf_size = &buf_size_v10, > + .buf_align = &mfc_buf_align_v10, > + .fw_name[0] = "s5p-mfc-v10.fw", > +}; > + > static const struct of_device_id exynos_mfc_match[] = { > { > .compatible = "samsung,mfc-v5", > @@ -1558,6 +1585,9 @@ static int s5p_mfc_resume(struct device *dev) > }, { > .compatible = "samsung,exynos5433-mfc
[PATCHv4 15/33] x86/efi: handle p4d in EFI pagetables
Allocate additional page table level and change efi_sync_low_kernel_mappings() to make syncing logic work with additional page table level. Signed-off-by: Kirill A. Shutemov Reviewed-by: Matt Fleming --- arch/x86/platform/efi/efi_64.c | 33 +++-- 1 file changed, 23 insertions(+), 10 deletions(-) diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c index 8544dae3d1b4..34d019f75239 100644 --- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -135,6 +135,7 @@ static pgd_t *efi_pgd; int __init efi_alloc_page_tables(void) { pgd_t *pgd; + p4d_t *p4d; pud_t *pud; gfp_t gfp_mask; @@ -147,15 +148,20 @@ int __init efi_alloc_page_tables(void) return -ENOMEM; pgd = efi_pgd + pgd_index(EFI_VA_END); + p4d = p4d_alloc(&init_mm, pgd, EFI_VA_END); + if (!p4d) { + free_page((unsigned long)efi_pgd); + return -ENOMEM; + } - pud = pud_alloc_one(NULL, 0); + pud = pud_alloc(&init_mm, p4d, EFI_VA_END); if (!pud) { + if (CONFIG_PGTABLE_LEVELS > 4) + free_page((unsigned long) pgd_page_vaddr(*pgd)); free_page((unsigned long)efi_pgd); return -ENOMEM; } - pgd_populate(NULL, pgd, pud); - return 0; } @@ -190,6 +196,18 @@ void efi_sync_low_kernel_mappings(void) num_entries = pgd_index(EFI_VA_END) - pgd_index(PAGE_OFFSET); memcpy(pgd_efi, pgd_k, sizeof(pgd_t) * num_entries); + /* The same story as with PGD entries */ + BUILD_BUG_ON(p4d_index(EFI_VA_END) != p4d_index(MODULES_END)); + BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & P4D_MASK)); + + pgd_efi = efi_pgd + pgd_index(EFI_VA_END); + pgd_k = pgd_offset_k(EFI_VA_END); + p4d_efi = p4d_offset(pgd_efi, 0); + p4d_k = p4d_offset(pgd_k, 0); + + num_entries = p4d_index(EFI_VA_END); + memcpy(p4d_efi, p4d_k, sizeof(p4d_t) * num_entries); + /* * We share all the PUD entries apart from those that map the * EFI regions. Copy around them. 
@@ -197,20 +215,15 @@ void efi_sync_low_kernel_mappings(void) BUILD_BUG_ON((EFI_VA_START & ~PUD_MASK) != 0); BUILD_BUG_ON((EFI_VA_END & ~PUD_MASK) != 0); - pgd_efi = efi_pgd + pgd_index(EFI_VA_END); - p4d_efi = p4d_offset(pgd_efi, 0); + p4d_efi = p4d_offset(pgd_efi, EFI_VA_END); + p4d_k = p4d_offset(pgd_k, EFI_VA_END); pud_efi = pud_offset(p4d_efi, 0); - - pgd_k = pgd_offset_k(EFI_VA_END); - p4d_k = p4d_offset(pgd_k, 0); pud_k = pud_offset(p4d_k, 0); num_entries = pud_index(EFI_VA_END); memcpy(pud_efi, pud_k, sizeof(pud_t) * num_entries); - p4d_efi = p4d_offset(pgd_efi, EFI_VA_START); pud_efi = pud_offset(p4d_efi, EFI_VA_START); - p4d_k = p4d_offset(pgd_k, EFI_VA_START); pud_k = pud_offset(p4d_k, EFI_VA_START); num_entries = PTRS_PER_PUD - pud_index(EFI_VA_START); -- 2.11.0
Re: [PATCH] mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
On Fri 03-03-17 18:53:56, Tahsin Erdogan wrote: > mem_cgroup_free() indirectly calls wb_domain_exit() which is not > prepared to deal with a struct wb_domain object that hasn't executed > wb_domain_init(). For instance, the following warning message is > printed by lockdep if alloc_percpu() fails in mem_cgroup_alloc(): > > INFO: trying to register non-static key. > the code is fine but needs lockdep annotation. > turning off the locking correctness validator. > CPU: 1 PID: 1950 Comm: mkdir Not tainted 4.10.0+ #151 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > Call Trace: >dump_stack+0x67/0x99 >register_lock_class+0x36d/0x540 >__lock_acquire+0x7f/0x1a30 >? irq_work_queue+0x73/0x90 >? wake_up_klogd+0x36/0x40 >? console_unlock+0x45d/0x540 >? vprintk_emit+0x211/0x2e0 >lock_acquire+0xcc/0x200 >? try_to_del_timer_sync+0x60/0x60 >del_timer_sync+0x3c/0xc0 >? try_to_del_timer_sync+0x60/0x60 >wb_domain_exit+0x14/0x20 >mem_cgroup_free+0x14/0x40 >mem_cgroup_css_alloc+0x3f9/0x620 >cgroup_apply_control_enable+0x190/0x390 >cgroup_mkdir+0x290/0x3d0 >kernfs_iop_mkdir+0x58/0x80 >vfs_mkdir+0x10e/0x1a0 >SyS_mkdirat+0xa8/0xd0 >SyS_mkdir+0x14/0x20 >entry_SYSCALL_64_fastpath+0x18/0xad > > Fix mem_cgroup_alloc() by doing more granular clean up in case of > failures. > > Fixes: 0b8f73e104285 ("mm: memcontrol: clean up alloc, online, offline, free > functions") > Signed-off-by: Tahsin Erdogan Please do not duplicate mem_cgroup_free code and rather add __mem_cgroup_free which does everything except for wb_domain_exit. An alternative would be to teach memcg_wb_domain_exit to not call wb_domain_exit if it hasn't been initialized yet. The first option seems easier. Thanks! 
> --- > mm/memcontrol.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c52ec893e241..9a9d5630df91 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4194,9 +4194,12 @@ static struct mem_cgroup *mem_cgroup_alloc(void) > idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); > return memcg; > fail: > + for_each_node(node) > + free_mem_cgroup_per_node_info(memcg, node); > + free_percpu(memcg->stat); > if (memcg->id.id > 0) > idr_remove(&mem_cgroup_idr, memcg->id.id); > - mem_cgroup_free(memcg); > + kfree(memcg); > return NULL; > } > > -- > 2.12.0.rc1.440.g5b76565f74-goog -- Michal Hocko SUSE Labs
[PATCHv4 12/33] x86/mm: add support of p4d_t in vmalloc_fault()
With 4-level paging, copying happens at the p4d level, because pgd_none() is always false when p4d_t is folded. Signed-off-by: Kirill A. Shutemov --- arch/x86/mm/fault.c | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 605fd5e8e048..fcc887f607c2 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -435,6 +435,7 @@ void vmalloc_sync_all(void) static noinline int vmalloc_fault(unsigned long address) { pgd_t *pgd, *pgd_ref; + p4d_t *p4d, *p4d_ref; pud_t *pud, *pud_ref; pmd_t *pmd, *pmd_ref; pte_t *pte, *pte_ref; @@ -462,13 +463,26 @@ static noinline int vmalloc_fault(unsigned long address) BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref)); } + /* With 4-level paging copying happens on p4d level. */ + p4d = p4d_offset(pgd, address); + p4d_ref = p4d_offset(pgd_ref, address); + if (p4d_none(*p4d_ref)) + return -1; + + if (p4d_none(*p4d)) { + set_p4d(p4d, *p4d_ref); + arch_flush_lazy_mmu_mode(); + } else { + BUG_ON(p4d_pfn(*p4d) != p4d_pfn(*p4d_ref)); + } + /* * Below here mismatches are bugs because these lower tables * are shared: */ - pud = pud_offset(pgd, address); - pud_ref = pud_offset(pgd_ref, address); + pud = pud_offset(p4d, address); + pud_ref = pud_offset(p4d_ref, address); if (pud_none(*pud_ref)) return -1; -- 2.11.0
[v2 PATCH 1/3] mmc: sdhci-cadence: Fix writing PHY delay
Add polling for ACK to be sure that data are written to PHY register. Signed-off-by: Piotr Sroka --- Changes for v2: - fix indent --- drivers/mmc/host/sdhci-cadence.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/mmc/host/sdhci-cadence.c b/drivers/mmc/host/sdhci-cadence.c index 316cfec..b2334ec 100644 --- a/drivers/mmc/host/sdhci-cadence.c +++ b/drivers/mmc/host/sdhci-cadence.c @@ -66,11 +66,12 @@ struct sdhci_cdns_priv { void __iomem *hrs_addr; }; -static void sdhci_cdns_write_phy_reg(struct sdhci_cdns_priv *priv, -u8 addr, u8 data) +static int sdhci_cdns_write_phy_reg(struct sdhci_cdns_priv *priv, + u8 addr, u8 data) { void __iomem *reg = priv->hrs_addr + SDHCI_CDNS_HRS04; u32 tmp; + int ret; tmp = (data << SDHCI_CDNS_HRS04_WDATA_SHIFT) | (addr << SDHCI_CDNS_HRS04_ADDR_SHIFT); @@ -79,8 +80,14 @@ static void sdhci_cdns_write_phy_reg(struct sdhci_cdns_priv *priv, tmp |= SDHCI_CDNS_HRS04_WR; writel(tmp, reg); + ret = readl_poll_timeout(reg, tmp, tmp & SDHCI_CDNS_HRS04_ACK, 0, 10); + if (ret) + return ret; + tmp &= ~SDHCI_CDNS_HRS04_WR; writel(tmp, reg); + + return 0; } static void sdhci_cdns_phy_init(struct sdhci_cdns_priv *priv) -- 2.2.2
[PATCHv4 10/33] x86/gup: add 5-level paging support
It's simply extension for one more page table level. Signed-off-by: Kirill A. Shutemov --- arch/x86/mm/gup.c | 33 +++-- 1 file changed, 27 insertions(+), 6 deletions(-) diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index 99c7805a9693..eb407cf0f6d3 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -76,9 +76,9 @@ static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages) } /* - * 'pteval' can come from a pte, pmd or pud. We only check + * 'pteval' can come from a pte, pmd, pud or p4d. We only check * _PAGE_PRESENT, _PAGE_USER, and _PAGE_RW in here which are the - * same value on all 3 types. + * same value on all 4 types. */ static inline int pte_allows_gup(unsigned long pteval, int write) { @@ -290,13 +290,13 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr, return 1; } -static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, +static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; - pudp = pud_offset(&pgd, addr); + pudp = pud_offset(&p4d, addr); do { pud_t pud = *pudp; @@ -315,6 +315,27 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end, return 1; } +static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, + int write, struct page **pages, int *nr) +{ + unsigned long next; + p4d_t *p4dp; + + p4dp = p4d_offset(&pgd, addr); + do { + p4d_t p4d = *p4dp; + + next = p4d_addr_end(addr, end); + if (p4d_none(p4d)) + return 0; + BUILD_BUG_ON(p4d_large(p4d)); + if (!gup_pud_range(p4d, addr, next, write, pages, nr)) + return 0; + } while (p4dp++, addr = next, addr != end); + + return 1; +} + /* * Like get_user_pages_fast() except its IRQ-safe in that it won't fall * back to the regular GUP. 
@@ -363,7 +384,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, next = pgd_addr_end(addr, end); if (pgd_none(pgd)) break; - if (!gup_pud_range(pgd, addr, next, write, pages, &nr)) + if (!gup_p4d_range(pgd, addr, next, write, pages, &nr)) break; } while (pgdp++, addr = next, addr != end); local_irq_restore(flags); @@ -435,7 +456,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write, next = pgd_addr_end(addr, end); if (pgd_none(pgd)) goto slow; - if (!gup_pud_range(pgd, addr, next, write, pages, &nr)) + if (!gup_p4d_range(pgd, addr, next, write, pages, &nr)) goto slow; } while (pgdp++, addr = next, addr != end); local_irq_enable(); -- 2.11.0
[PATCH v2 6/6] powerpc/perf: Add Power8 mem_access event to sysfs
This patch adds the "mem_access" event to sysfs. This is not, as-is, a raw event supported by the Power8 PMU. Instead, it is formed based on the raw event encoding specified in isa207-common.h. Primary PMU event used here is PM_MRK_INST_CMPL. This event tracks only the completed marked instructions. Random sampling mode (MMCRA[SM]) with Random Instruction Sampling (RIS) is enabled to mark the type of instructions. With Random sampling in RLS mode with the PM_MRK_INST_CMPL event, the LDST/DATA_SRC fields in SIER identify the memory hierarchy level (e.g. L1, L2 etc.) that satisfied a data-cache miss for a marked instruction. Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Sukadev Bhattiprolu Cc: Daniel Axtens Cc: Andrew Donnellan Signed-off-by: Madhavan Srinivasan --- arch/powerpc/perf/power8-events-list.h | 6 ++ arch/powerpc/perf/power8-pmu.c | 2 ++ 2 files changed, 8 insertions(+) diff --git a/arch/powerpc/perf/power8-events-list.h b/arch/powerpc/perf/power8-events-list.h index 3a2e6e8ebb92..0f1d184627cc 100644 --- a/arch/powerpc/perf/power8-events-list.h +++ b/arch/powerpc/perf/power8-events-list.h @@ -89,3 +89,9 @@ EVENT(PM_MRK_FILT_MATCH, 0x2013c) EVENT(PM_MRK_FILT_MATCH_ALT, 0x3012e) /* Alternate event code for PM_LD_MISS_L1 */ EVENT(PM_LD_MISS_L1_ALT, 0x400f0) +/* + * Memory Access Event -- mem_access + * Primary PMU event used here is PM_MRK_INST_CMPL, along with + * Random Load/Store Facility Sampling (RIS) in Random sampling mode (MMCRA[SM]).
+ */ +EVENT(MEM_ACCESS, 0x10401e0) diff --git a/arch/powerpc/perf/power8-pmu.c b/arch/powerpc/perf/power8-pmu.c index 932d7536f0eb..5463516e369b 100644 --- a/arch/powerpc/perf/power8-pmu.c +++ b/arch/powerpc/perf/power8-pmu.c @@ -90,6 +90,7 @@ GENERIC_EVENT_ATTR(branch-instructions, PM_BRU_FIN); GENERIC_EVENT_ATTR(branch-misses, PM_BR_MPRED_CMPL); GENERIC_EVENT_ATTR(cache-references, PM_LD_REF_L1); GENERIC_EVENT_ATTR(cache-misses, PM_LD_MISS_L1); +GENERIC_EVENT_ATTR(mem_access, MEM_ACCESS); CACHE_EVENT_ATTR(L1-dcache-load-misses,PM_LD_MISS_L1); CACHE_EVENT_ATTR(L1-dcache-loads, PM_LD_REF_L1); @@ -120,6 +121,7 @@ static struct attribute *power8_events_attr[] = { GENERIC_EVENT_PTR(PM_BR_MPRED_CMPL), GENERIC_EVENT_PTR(PM_LD_REF_L1), GENERIC_EVENT_PTR(PM_LD_MISS_L1), + GENERIC_EVENT_PTR(MEM_ACCESS), CACHE_EVENT_PTR(PM_LD_MISS_L1), CACHE_EVENT_PTR(PM_LD_REF_L1), -- 2.7.4
[PATCHv4 16/33] x86/mm/pat: handle additional page table
Straight-forward extension of existing code to support additional page table level. Signed-off-by: Kirill A. Shutemov --- arch/x86/mm/pageattr.c | 56 -- 1 file changed, 41 insertions(+), 15 deletions(-) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index 28d42130243c..eb0ad12cdfde 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -346,6 +346,7 @@ static inline pgprot_t static_protections(pgprot_t prot, unsigned long address, pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address, unsigned int *level) { + p4d_t *p4d; pud_t *pud; pmd_t *pmd; @@ -354,7 +355,15 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address, if (pgd_none(*pgd)) return NULL; - pud = pud_offset(pgd, address); + p4d = p4d_offset(pgd, address); + if (p4d_none(*p4d)) + return NULL; + + *level = PG_LEVEL_512G; + if (p4d_large(*p4d) || !p4d_present(*p4d)) + return (pte_t *)p4d; + + pud = pud_offset(p4d, address); if (pud_none(*pud)) return NULL; @@ -406,13 +415,18 @@ static pte_t *_lookup_address_cpa(struct cpa_data *cpa, unsigned long address, pmd_t *lookup_pmd_address(unsigned long address) { pgd_t *pgd; + p4d_t *p4d; pud_t *pud; pgd = pgd_offset_k(address); if (pgd_none(*pgd)) return NULL; - pud = pud_offset(pgd, address); + p4d = p4d_offset(pgd, address); + if (p4d_none(*p4d) || p4d_large(*p4d) || !p4d_present(*p4d)) + return NULL; + + pud = pud_offset(p4d, address); if (pud_none(*pud) || pud_large(*pud) || !pud_present(*pud)) return NULL; @@ -477,11 +491,13 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte) list_for_each_entry(page, &pgd_list, lru) { pgd_t *pgd; + p4d_t *p4d; pud_t *pud; pmd_t *pmd; pgd = (pgd_t *)page_address(page) + pgd_index(address); - pud = pud_offset(pgd, address); + p4d = p4d_offset(pgd, address); + pud = pud_offset(p4d, address); pmd = pmd_offset(pud, address); set_pte_atomic((pte_t *)pmd, pte); } @@ -836,9 +852,9 @@ static void unmap_pmd_range(pud_t *pud, unsigned long start, unsigned long end) 
pud_clear(pud); } -static void unmap_pud_range(pgd_t *pgd, unsigned long start, unsigned long end) +static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end) { - pud_t *pud = pud_offset(pgd, start); + pud_t *pud = pud_offset(p4d, start); /* * Not on a GB page boundary? @@ -1004,8 +1020,8 @@ static long populate_pmd(struct cpa_data *cpa, return num_pages; } -static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd, -pgprot_t pgprot) +static int populate_pud(struct cpa_data *cpa, unsigned long start, p4d_t *p4d, + pgprot_t pgprot) { pud_t *pud; unsigned long end; @@ -1026,7 +1042,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd, cur_pages = (pre_end - start) >> PAGE_SHIFT; cur_pages = min_t(int, (int)cpa->numpages, cur_pages); - pud = pud_offset(pgd, start); + pud = pud_offset(p4d, start); /* * Need a PMD page? @@ -1047,7 +1063,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd, if (cpa->numpages == cur_pages) return cur_pages; - pud = pud_offset(pgd, start); + pud = pud_offset(p4d, start); pud_pgprot = pgprot_4k_2_large(pgprot); /* @@ -1067,7 +1083,7 @@ static long populate_pud(struct cpa_data *cpa, unsigned long start, pgd_t *pgd, if (start < end) { long tmp; - pud = pud_offset(pgd, start); + pud = pud_offset(p4d, start); if (pud_none(*pud)) if (alloc_pmd_page(pud)) return -1; @@ -1090,33 +1106,43 @@ static int populate_pgd(struct cpa_data *cpa, unsigned long addr) { pgprot_t pgprot = __pgprot(_KERNPG_TABLE); pud_t *pud = NULL; /* shut up gcc */ + p4d_t *p4d; pgd_t *pgd_entry; long ret; pgd_entry = cpa->pgd + pgd_index(addr); + if (pgd_none(*pgd_entry)) { + p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL | __GFP_NOTRACK); + if (!p4d) + return -1; + + set_pgd(pgd_entry, __pgd(__pa(p4d) | _KERNPG_TABLE))
[RESEND PATCH v3 5/8] phy: phy-mt65xx-usb3: add support for new version phy
There are some variations from mt2701 to mt2712: 1. banks shared by multiple ports are put back into each port, such as SPLLC and U2FREQ; 2. add a new bank MISC for u2port, and CHIP for u3port; 3. bank's offset in each port are also rearranged; Signed-off-by: Chunfeng Yun --- drivers/phy/phy-mt65xx-usb3.c | 344 ++--- 1 file changed, 217 insertions(+), 127 deletions(-) diff --git a/drivers/phy/phy-mt65xx-usb3.c b/drivers/phy/phy-mt65xx-usb3.c index f4a3505..eb33499 100644 --- a/drivers/phy/phy-mt65xx-usb3.c +++ b/drivers/phy/phy-mt65xx-usb3.c @@ -23,46 +23,54 @@ #include #include -/* - * for sifslv2 register, but exclude port's; - * relative to USB3_SIF2_BASE base address - */ -#define SSUSB_SIFSLV_SPLLC 0x -#define SSUSB_SIFSLV_U2FREQ0x0100 - -/* offsets of banks in each u2phy registers */ -#define SSUSB_SIFSLV_U2PHY_COM_BASE0x -/* offsets of banks in each u3phy registers */ -#define SSUSB_SIFSLV_U3PHYD_BASE 0x -#define SSUSB_SIFSLV_U3PHYA_BASE 0x0200 - -#define U3P_USBPHYACR0 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x) +/* version V1 sub-banks offset base address */ +/* banks shared by multiple phys */ +#define SSUSB_SIFSLV_V1_SPLLC 0x000 /* shared by u3 phys */ +#define SSUSB_SIFSLV_V1_U2FREQ 0x100 /* shared by u2 phys */ +/* u2 phy bank */ +#define SSUSB_SIFSLV_V1_U2PHY_COM 0x000 +/* u3 phy banks */ +#define SSUSB_SIFSLV_V1_U3PHYD 0x000 +#define SSUSB_SIFSLV_V1_U3PHYA 0x200 + +/* version V2 sub-banks offset base address */ +/* u2 phy banks */ +#define SSUSB_SIFSLV_V2_MISC 0x000 +#define SSUSB_SIFSLV_V2_U2FREQ 0x100 +#define SSUSB_SIFSLV_V2_U2PHY_COM 0x300 +/* u3 phy banks */ +#define SSUSB_SIFSLV_V2_SPLLC 0x000 +#define SSUSB_SIFSLV_V2_CHIP 0x100 +#define SSUSB_SIFSLV_V2_U3PHYD 0x200 +#define SSUSB_SIFSLV_V2_U3PHYA 0x400 + +#define U3P_USBPHYACR0 0x000 #define PA0_RG_U2PLL_FORCE_ON BIT(15) -#define U3P_USBPHYACR2 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0008) +#define U3P_USBPHYACR2 0x008 #define PA2_RG_SIF_U2PLL_FORCE_EN BIT(18) -#define U3P_USBPHYACR5 
(SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0014) +#define U3P_USBPHYACR5 0x014 #define PA5_RG_U2_HSTX_SRCAL_ENBIT(15) #define PA5_RG_U2_HSTX_SRCTRL GENMASK(14, 12) #define PA5_RG_U2_HSTX_SRCTRL_VAL(x) ((0x7 & (x)) << 12) #define PA5_RG_U2_HS_100U_U3_ENBIT(11) -#define U3P_USBPHYACR6 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0018) +#define U3P_USBPHYACR6 0x018 #define PA6_RG_U2_BC11_SW_EN BIT(23) #define PA6_RG_U2_OTG_VBUSCMP_EN BIT(20) #define PA6_RG_U2_SQTH GENMASK(3, 0) #define PA6_RG_U2_SQTH_VAL(x) (0xf & (x)) -#define U3P_U2PHYACR4 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0020) +#define U3P_U2PHYACR4 0x020 #define P2C_RG_USB20_GPIO_CTL BIT(9) #define P2C_USB20_GPIO_MODEBIT(8) #define P2C_U2_GPIO_CTR_MSK(P2C_RG_USB20_GPIO_CTL | P2C_USB20_GPIO_MODE) -#define U3D_U2PHYDCR0 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0060) +#define U3D_U2PHYDCR0 0x060 #define P2C_RG_SIF_U2PLL_FORCE_ON BIT(24) -#define U3P_U2PHYDTM0 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0068) +#define U3P_U2PHYDTM0 0x068 #define P2C_FORCE_UART_EN BIT(26) #define P2C_FORCE_DATAIN BIT(23) #define P2C_FORCE_DM_PULLDOWN BIT(21) @@ -84,59 +92,56 @@ P2C_FORCE_TERMSEL | P2C_RG_DMPULLDOWN | \ P2C_RG_DPPULLDOWN | P2C_RG_TERMSEL) -#define U3P_U2PHYDTM1 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x006C) +#define U3P_U2PHYDTM1 0x06C #define P2C_RG_UART_EN BIT(16) #define P2C_RG_VBUSVALID BIT(5) #define P2C_RG_SESSEND BIT(4) #define P2C_RG_AVALID BIT(2) -#define U3P_U3_PHYA_REG0 (SSUSB_SIFSLV_U3PHYA_BASE + 0x) -#define P3A_RG_U3_VUSB10_ONBIT(5) - -#define U3P_U3_PHYA_REG6 (SSUSB_SIFSLV_U3PHYA_BASE + 0x0018) +#define U3P_U3_PHYA_REG6 0x018 #define P3A_RG_TX_EIDLE_CM GENMASK(31, 28) #define P3A_RG_TX_EIDLE_CM_VAL(x) ((0xf & (x)) << 28) -#define U3P_U3_PHYA_REG9 (SSUSB_SIFSLV_U3PHYA_BASE + 0x0024) +#define U3P_U3_PHYA_REG9 0x024 #define P3A_RG_RX_DAC_MUX GENMASK(5, 1) #define P3A_RG_RX_DAC_MUX_VAL(x) ((0x1f & (x)) << 1) -#define U3P_U3PHYA_DA_REG0 (SSUSB_SIFSLV_U3PHYA_BASE + 0x0100) +#define U3P_U3_PHYA_DA_REG00x100 #define P3A_RG_XTAL_EXT_EN_U3 GENMASK(11, 10) #define 
P3A_RG_XTAL_EXT_EN_U3_VAL(x) ((0x3 & (x)) << 10) -#define U3P_U3_PHYD_LFPS1 (SSUSB_SIFSLV_U3PHYD_BASE + 0x000c) +#define U3P_U3_PHYD_LFPS1 0x00c #define P3D_RG_FWAKE_THGENMASK(21, 16) #define P3D_RG_FWAKE_TH_VAL(x) ((0x3f & (x)) << 16) -#define U3P_PHYD_CDR1
[RESEND PATCH v3 3/8] phy: phy-mt65xx-usb3: split SuperSpeed port into two ones
Currently a usb3 port in fact includes two sub-ports, but this is not flexible for some cases, such as the following one: usb3 port0 includes u2port0 and u3port0; usb2 port0 includes u2port1. If we want to support only HS, we can use u2port0 or u2port1; when u2port0 is selected, u3port0 is not needed. If we want to support SS, we can combine u2port0 and u3port0, or u2port1 and u3port0; if the latter is selected, u2port0 is not needed. So it is more flexible to split the usb3 port into two ones, and it also helps to save power by disabling unnecessary ports. Signed-off-by: Chunfeng Yun --- drivers/phy/phy-mt65xx-usb3.c | 149 + 1 file changed, 75 insertions(+), 74 deletions(-) diff --git a/drivers/phy/phy-mt65xx-usb3.c b/drivers/phy/phy-mt65xx-usb3.c index 4fd47d0..7fff482 100644 --- a/drivers/phy/phy-mt65xx-usb3.c +++ b/drivers/phy/phy-mt65xx-usb3.c @@ -30,11 +30,11 @@ #define SSUSB_SIFSLV_SPLLC 0x #define SSUSB_SIFSLV_U2FREQ0x0100 -/* offsets of sub-segment in each port registers */ +/* offsets of banks in each u2phy registers */ #define SSUSB_SIFSLV_U2PHY_COM_BASE0x -#define SSUSB_SIFSLV_U3PHYD_BASE 0x0100 -#define SSUSB_USB30_PHYA_SIV_B_BASE0x0300 -#define SSUSB_SIFSLV_U3PHYA_DA_BASE0x0400 +/* offsets of banks in each u3phy registers */ +#define SSUSB_SIFSLV_U3PHYD_BASE 0x +#define SSUSB_SIFSLV_U3PHYA_BASE 0x0200 #define U3P_USBPHYACR0 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x) #define PA0_RG_U2PLL_FORCE_ON BIT(15) @@ -49,7 +49,6 @@ #define PA5_RG_U2_HS_100U_U3_ENBIT(11) #define U3P_USBPHYACR6 (SSUSB_SIFSLV_U2PHY_COM_BASE + 0x0018) -#define PA6_RG_U2_ISO_EN BIT(31) #define PA6_RG_U2_BC11_SW_EN BIT(23) #define PA6_RG_U2_OTG_VBUSCMP_EN BIT(20) #define PA6_RG_U2_SQTH GENMASK(3, 0) @@ -91,18 +90,18 @@ #define P2C_RG_SESSEND BIT(4) #define P2C_RG_AVALID BIT(2) -#define U3P_U3_PHYA_REG0 (SSUSB_USB30_PHYA_SIV_B_BASE + 0x) +#define U3P_U3_PHYA_REG0 (SSUSB_SIFSLV_U3PHYA_BASE + 0x) #define P3A_RG_U3_VUSB10_ONBIT(5) -#define U3P_U3_PHYA_REG6 (SSUSB_USB30_PHYA_SIV_B_BASE + 0x0018) +#define U3P_U3_PHYA_REG6 
(SSUSB_SIFSLV_U3PHYA_BASE + 0x0018) #define P3A_RG_TX_EIDLE_CM GENMASK(31, 28) #define P3A_RG_TX_EIDLE_CM_VAL(x) ((0xf & (x)) << 28) -#define U3P_U3_PHYA_REG9 (SSUSB_USB30_PHYA_SIV_B_BASE + 0x0024) +#define U3P_U3_PHYA_REG9 (SSUSB_SIFSLV_U3PHYA_BASE + 0x0024) #define P3A_RG_RX_DAC_MUX GENMASK(5, 1) #define P3A_RG_RX_DAC_MUX_VAL(x) ((0x1f & (x)) << 1) -#define U3P_U3PHYA_DA_REG0 (SSUSB_SIFSLV_U3PHYA_DA_BASE + 0x) +#define U3P_U3PHYA_DA_REG0 (SSUSB_SIFSLV_U3PHYA_BASE + 0x0100) #define P3A_RG_XTAL_EXT_EN_U3 GENMASK(11, 10) #define P3A_RG_XTAL_EXT_EN_U3_VAL(x) ((0x3 & (x)) << 10) @@ -160,7 +159,7 @@ struct mt65xx_phy_instance { struct mt65xx_u3phy { struct device *dev; - void __iomem *sif_base; /* include sif2, but exclude port's */ + void __iomem *sif_base; /* only shared sif */ struct clk *u3phya_ref; /* reference clock of usb3 anolog phy */ const struct mt65xx_phy_pdata *pdata; struct mt65xx_phy_instance **phys; @@ -190,7 +189,7 @@ static void hs_slew_rate_calibrate(struct mt65xx_u3phy *u3phy, tmp = readl(sif_base + U3P_U2FREQ_FMCR0); tmp &= ~(P2F_RG_CYCLECNT | P2F_RG_MONCLK_SEL); tmp |= P2F_RG_CYCLECNT_VAL(U3P_FM_DET_CYCLE_CNT); - tmp |= P2F_RG_MONCLK_SEL_VAL(instance->index); + tmp |= P2F_RG_MONCLK_SEL_VAL(instance->index >> 1); writel(tmp, sif_base + U3P_U2FREQ_FMCR0); /* enable frequency meter */ @@ -238,6 +237,56 @@ static void hs_slew_rate_calibrate(struct mt65xx_u3phy *u3phy, writel(tmp, instance->port_base + U3P_USBPHYACR5); } +static void u3_phy_instance_init(struct mt65xx_u3phy *u3phy, + struct mt65xx_phy_instance *instance) +{ + void __iomem *port_base = instance->port_base; + u32 tmp; + + /* gating PCIe Analog XTAL clock */ + tmp = readl(u3phy->sif_base + U3P_XTALCTL3); + tmp |= XC3_RG_U3_XTAL_RX_PWD | XC3_RG_U3_FRC_XTAL_RX_PWD; + writel(tmp, u3phy->sif_base + U3P_XTALCTL3); + + /* gating XSQ */ + tmp = readl(port_base + U3P_U3PHYA_DA_REG0); + tmp &= ~P3A_RG_XTAL_EXT_EN_U3; + tmp |= P3A_RG_XTAL_EXT_EN_U3_VAL(2); + writel(tmp, port_base + 
U3P_U3PHYA_DA_REG0); + + tmp = readl(port_base + U3P_U3_PHYA_REG9); + tmp &= ~P3A_RG_RX_DAC_MUX; + tmp |= P3A_RG_RX_DAC_MUX_VAL(4); + writel(tmp, port_base + U3P_U3_PHYA_REG9); + + tmp = readl(port_base + U3P_U3_PHYA_REG6); + tmp &= ~P3A_RG_TX_EIDLE_CM; + tmp |= P3A_RG_TX_EIDLE_CM_VAL(0xe); + writel(tmp, port_base + U3P_U3_PHYA_REG6); + + tmp = readl(port_base + U3P_PHYD_CDR1); + tmp &= ~(P3
Re: [PATCH V11 10/10] arm/arm64: KVM: add guest SEA support
Hello James, On 3/6/2017 3:28 AM, James Morse wrote: On 28/02/17 19:43, Baicar, Tyler wrote: On 2/24/2017 3:42 AM, James Morse wrote: On 21/02/17 21:22, Tyler Baicar wrote: Currently external aborts are unsupported by the guest abort handling. Add handling for SEAs so that the host kernel reports SEAs which occur in the guest kernel. diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index b2d57fc..403277b 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -602,6 +602,24 @@ static const char *fault_name(unsigned int esr) } /* + * Handle Synchronous External Aborts that occur in a guest kernel. + */ +int handle_guest_sea(unsigned long addr, unsigned int esr) +{ +if(IS_ENABLED(HAVE_ACPI_APEI_SEA)) { +nmi_enter(); +ghes_notify_sea(); +nmi_exit(); This nmi stuff was needed for synchronous aborts that may have interrupted APEI's interrupts-masked code. We want to avoid trying to take the same set of locks, hence taking the in_nmi() path through APEI. Here we know we interrupted a guest, so there is no risk that we have interrupted APEI on the host. ghes_notify_sea() can safely take the normal path. Makes sense, I can remove the nmi_* calls here. Just occurs to me: if we do this we need to add the rcu_read_lock() in ghes_notify_sea() as its not protected by the rcu/nmi weirdness. True, would you suggest leaving these nmi_* calls or adding the rcu_* calls? And since that's only needed for this KVM case, shouldn't the rcu_* calls just replace the nmi_* calls here (outside of ghes_notify_sea)? Thanks, Tyler -- Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
[RESEND PATCH v3 2/8] phy: phy-mt65xx-usb3: increase LFPS filter threshold
Increase the LFPS filter threshold to avoid spurious remote wakeup signals, which cause U3 link failure and a fallback to a U2-only link with about 0.01% probability. Signed-off-by: Chunfeng Yun --- drivers/phy/phy-mt65xx-usb3.c |9 + 1 file changed, 9 insertions(+) diff --git a/drivers/phy/phy-mt65xx-usb3.c b/drivers/phy/phy-mt65xx-usb3.c index fe2392a..4fd47d0 100644 --- a/drivers/phy/phy-mt65xx-usb3.c +++ b/drivers/phy/phy-mt65xx-usb3.c @@ -106,6 +106,10 @@ #define P3A_RG_XTAL_EXT_EN_U3 GENMASK(11, 10) #define P3A_RG_XTAL_EXT_EN_U3_VAL(x) ((0x3 & (x)) << 10) +#define U3P_U3_PHYD_LFPS1 (SSUSB_SIFSLV_U3PHYD_BASE + 0x000c) +#define P3D_RG_FWAKE_THGENMASK(21, 16) +#define P3D_RG_FWAKE_TH_VAL(x) ((0x3f & (x)) << 16) + #define U3P_PHYD_CDR1 (SSUSB_SIFSLV_U3PHYD_BASE + 0x005c) #define P3D_RG_CDR_BIR_LTD1GENMASK(28, 24) #define P3D_RG_CDR_BIR_LTD1_VAL(x) ((0x1f & (x)) << 24) @@ -303,6 +307,11 @@ static void phy_instance_init(struct mt65xx_u3phy *u3phy, tmp |= P3D_RG_CDR_BIR_LTD0_VAL(0xc) | P3D_RG_CDR_BIR_LTD1_VAL(0x3); writel(tmp, port_base + U3P_PHYD_CDR1); + tmp = readl(port_base + U3P_U3_PHYD_LFPS1); + tmp &= ~P3D_RG_FWAKE_TH; + tmp |= P3D_RG_FWAKE_TH_VAL(0x34); + writel(tmp, port_base + U3P_U3_PHYD_LFPS1); + tmp = readl(port_base + U3P_U3_PHYD_RXDET1); tmp &= ~P3D_RG_RXDET_STB2_SET; tmp |= P3D_RG_RXDET_STB2_SET_VAL(0x10); writel(tmp, port_base + U3P_U3_PHYD_RXDET1); -- 1.7.9.5
[RESEND PATCH v3 6/8] arm64: dts: mt8173: split usb SuperSpeed port into two ports
split the old SuperSpeed port node into a HighSpeed one and a new SuperSpeed one. Signed-off-by: Chunfeng Yun --- arch/arm64/boot/dts/mediatek/mt8173.dtsi | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/arch/arm64/boot/dts/mediatek/mt8173.dtsi b/arch/arm64/boot/dts/mediatek/mt8173.dtsi index 6922252..1dc4629 100644 --- a/arch/arm64/boot/dts/mediatek/mt8173.dtsi +++ b/arch/arm64/boot/dts/mediatek/mt8173.dtsi @@ -731,8 +731,9 @@ <0 0x11280700 0 0x0100>; reg-names = "mac", "ippc"; interrupts = ; - phys = <&phy_port0 PHY_TYPE_USB3>, - <&phy_port1 PHY_TYPE_USB2>; + phys = <&u2port0 PHY_TYPE_USB2>, + <&u3port0 PHY_TYPE_USB3>, + <&u2port1 PHY_TYPE_USB2>; power-domains = <&scpsys MT8173_POWER_DOMAIN_USB>; clocks = <&topckgen CLK_TOP_USB30_SEL>, <&clk26m>, @@ -770,14 +771,20 @@ ranges; status = "okay"; - phy_port0: port@11290800 { - reg = <0 0x11290800 0 0x800>; + u2port0: usb-phy@11290800 { + reg = <0 0x11290800 0 0x100>; #phy-cells = <1>; status = "okay"; }; - phy_port1: port@11291000 { - reg = <0 0x11291000 0 0x800>; + u3port0: usb-phy@11290900 { + reg = <0 0x11290900 0 0x700>; + #phy-cells = <1>; + status = "okay"; + }; + + u2port1: usb-phy@11291000 { + reg = <0 0x11291000 0 0x100>; #phy-cells = <1>; status = "okay"; }; -- 1.7.9.5
[RESEND PATCH v3 1/8] phy: phy-mt65xx-usb3: improve RX detection stable time
The default value of the RX detection stable time is 10us, and this margin is too big for some critical cases, causing U3 link failure and a fallback to U2 (probability is about 1%). So change it to 5us. Signed-off-by: Chunfeng Yun --- drivers/phy/phy-mt65xx-usb3.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/phy/phy-mt65xx-usb3.c b/drivers/phy/phy-mt65xx-usb3.c index d972067..fe2392a 100644 --- a/drivers/phy/phy-mt65xx-usb3.c +++ b/drivers/phy/phy-mt65xx-usb3.c @@ -112,6 +112,14 @@ #define P3D_RG_CDR_BIR_LTD0GENMASK(12, 8) #define P3D_RG_CDR_BIR_LTD0_VAL(x) ((0x1f & (x)) << 8) +#define U3P_U3_PHYD_RXDET1 (SSUSB_SIFSLV_U3PHYD_BASE + 0x128) +#define P3D_RG_RXDET_STB2_SET GENMASK(17, 9) +#define P3D_RG_RXDET_STB2_SET_VAL(x) ((0x1ff & (x)) << 9) + +#define U3P_U3_PHYD_RXDET2 (SSUSB_SIFSLV_U3PHYD_BASE + 0x12c) +#define P3D_RG_RXDET_STB2_SET_P3 GENMASK(8, 0) +#define P3D_RG_RXDET_STB2_SET_P3_VAL(x)(0x1ff & (x)) + #define U3P_XTALCTL3 (SSUSB_SIFSLV_SPLLC + 0x0018) #define XC3_RG_U3_XTAL_RX_PWD BIT(9) #define XC3_RG_U3_FRC_XTAL_RX_PWD BIT(8) @@ -295,6 +303,16 @@ static void phy_instance_init(struct mt65xx_u3phy *u3phy, tmp |= P3D_RG_CDR_BIR_LTD0_VAL(0xc) | P3D_RG_CDR_BIR_LTD1_VAL(0x3); writel(tmp, port_base + U3P_PHYD_CDR1); + tmp = readl(port_base + U3P_U3_PHYD_RXDET1); + tmp &= ~P3D_RG_RXDET_STB2_SET; + tmp |= P3D_RG_RXDET_STB2_SET_VAL(0x10); + writel(tmp, port_base + U3P_U3_PHYD_RXDET1); + + tmp = readl(port_base + U3P_U3_PHYD_RXDET2); + tmp &= ~P3D_RG_RXDET_STB2_SET_P3; + tmp |= P3D_RG_RXDET_STB2_SET_P3_VAL(0x10); + writel(tmp, port_base + U3P_U3_PHYD_RXDET2); + dev_dbg(u3phy->dev, "%s(%d)\n", __func__, index); } -- 1.7.9.5
[RESEND PATCH v3 7/8] arm64: dts: mt8173: move clock from phy node into port nodes
There is a reference clock for each port: the HighSpeed port uses 48M, and the SuperSpeed port uses 26M, which usually comes directly from the 26M oscillator, but on some SoCs it does not. It is more flexible to move the clock into the port nodes. Signed-off-by: Chunfeng Yun --- arch/arm64/boot/dts/mediatek/mt8173.dtsi |8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/arm64/boot/dts/mediatek/mt8173.dtsi b/arch/arm64/boot/dts/mediatek/mt8173.dtsi index 1dc4629..1c9e0d5 100644 --- a/arch/arm64/boot/dts/mediatek/mt8173.dtsi +++ b/arch/arm64/boot/dts/mediatek/mt8173.dtsi @@ -764,8 +764,6 @@ u3phy: usb-phy@1129 { compatible = "mediatek,mt8173-u3phy"; reg = <0 0x1129 0 0x800>; - clocks = <&apmixedsys CLK_APMIXED_REF2USB_TX>; - clock-names = "u3phya_ref"; #address-cells = <2>; #size-cells = <2>; ranges; @@ -773,18 +771,24 @@ u2port0: usb-phy@11290800 { reg = <0 0x11290800 0 0x100>; + clocks = <&apmixedsys CLK_APMIXED_REF2USB_TX>; + clock-names = "ref"; #phy-cells = <1>; status = "okay"; }; u3port0: usb-phy@11290900 { reg = <0 0x11290900 0 0x700>; + clocks = <&clk26m>; + clock-names = "ref"; #phy-cells = <1>; status = "okay"; }; u2port1: usb-phy@11291000 { reg = <0 0x11291000 0 0x100>; + clocks = <&apmixedsys CLK_APMIXED_REF2USB_TX>; + clock-names = "ref"; #phy-cells = <1>; status = "okay"; }; -- 1.7.9.5
[PATCHv4 33/33] x86/mm: allow to have userspace mappings above 47-bits
On x86, 5-level paging enables 56-bit userspace virtual address space. Not all user space is ready to handle wide addresses. It's known that at least some JIT compilers use higher bits in pointers to encode their information. It collides with valid pointers with 5-level paging and leads to crashes. To mitigate this, we are not going to allocate virtual address space above 47-bit by default. But userspace can ask for allocation from the full address space by specifying a hint address (with or without MAP_FIXED) above 47-bits. If the hint address is set above 47-bit, but MAP_FIXED is not specified, we try to look for an unmapped area at the specified address. If it's already occupied, we look for an unmapped area in the *full* address space, rather than from the 47-bit window. This approach helps to easily make an application's memory allocator aware of the large address space without manually tracking allocated virtual address space. One important case we need to handle here is interaction with MPX. MPX (without the MAWA extension) cannot handle addresses above 47-bit, so we need to make sure that MPX cannot be enabled if we already have a VMA above the boundary, and forbid creating such VMAs once MPX is enabled. Signed-off-by: Kirill A. Shutemov --- arch/x86/include/asm/elf.h | 2 +- arch/x86/include/asm/mpx.h | 9 + arch/x86/include/asm/processor.h | 9 ++--- arch/x86/kernel/sys_x86_64.c | 28 +++- arch/x86/mm/hugetlbpage.c| 31 +++ arch/x86/mm/mmap.c | 4 ++-- arch/x86/mm/mpx.c| 33 - 7 files changed, 104 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h index 9d49c18b5ea9..265625b0d6cb 100644 --- a/arch/x86/include/asm/elf.h +++ b/arch/x86/include/asm/elf.h @@ -250,7 +250,7 @@ extern int force_personality32; the loader. We need to make sure that it is out of the way of the program that it will "exec", and that there is sufficient room for the brk.
*/ -#define ELF_ET_DYN_BASE(TASK_SIZE / 3 * 2) +#define ELF_ET_DYN_BASE(DEFAULT_MAP_WINDOW / 3 * 2) /* This yields a mask that user programs can use to figure out what instruction set this CPU supports. This could be done in user space, diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h index a0d662be4c5b..7d7404756bb4 100644 --- a/arch/x86/include/asm/mpx.h +++ b/arch/x86/include/asm/mpx.h @@ -73,6 +73,9 @@ static inline void mpx_mm_init(struct mm_struct *mm) } void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long start, unsigned long end); + +unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len, + unsigned long flags); #else static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs) { @@ -94,6 +97,12 @@ static inline void mpx_notify_unmap(struct mm_struct *mm, unsigned long start, unsigned long end) { } + +static inline unsigned long mpx_unmapped_area_check(unsigned long addr, + unsigned long len, unsigned long flags) +{ + return addr; +} #endif /* CONFIG_X86_INTEL_MPX */ #endif /* _ASM_X86_MPX_H */ diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index f385eca5407a..da8ab4f2d0c7 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -799,6 +799,7 @@ static inline void spin_lock_prefetch(const void *x) */ #define TASK_SIZE PAGE_OFFSET #define TASK_SIZE_MAX TASK_SIZE +#define DEFAULT_MAP_WINDOW TASK_SIZE #define STACK_TOP TASK_SIZE #define STACK_TOP_MAX STACK_TOP @@ -838,7 +839,9 @@ static inline void spin_lock_prefetch(const void *x) * particular problem by preventing anything from being mapped * at the maximum canonical address. */ -#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE) +#define TASK_SIZE_MAX ((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE) + +#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE) /* This decides where the kernel will search for a free chunk of vm * space during mmap's. 
@@ -851,7 +854,7 @@ static inline void spin_lock_prefetch(const void *x) #define TASK_SIZE_OF(child)((test_tsk_thread_flag(child, TIF_ADDR32)) ? \ IA32_PAGE_OFFSET : TASK_SIZE_MAX) -#define STACK_TOP TASK_SIZE +#define STACK_TOP DEFAULT_MAP_WINDOW #define STACK_TOP_MAX TASK_SIZE_MAX #define INIT_THREAD { \ @@ -873,7 +876,7 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip, * This decides where the kernel will search for a free chunk of vm * space during mmap's. */ -#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3)) +#define TASK_UNMAPPED_BASE (PAGE_ALIGN(DEFAULT_MAP_WINDOW /
Re: [PATCH net] team: use ETH_MAX_MTU as max mtu
Mon, Mar 06, 2017 at 02:48:58PM CET, ja...@redhat.com wrote: >This restores the ability to set a team device's mtu to anything higher >than 1500. Similar to the reported issue with bonding, the team driver >calls ether_setup(), which sets an initial max_mtu of 1500, while the >underlying hardware can handle something much larger. Just set it to >ETH_MAX_MTU to support all possible values, and the limitations of the >underlying devices will prevent setting anything too large. > >Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra") >CC: Cong Wang >CC: Jiri Pirko >CC: net...@vger.kernel.org >Signed-off-by: Jarod Wilson Acked-by: Jiri Pirko
Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR
2017-02-21 15:42 GMT+03:00 Kirill A. Shutemov : > On Tue, Feb 21, 2017 at 02:54:20PM +0300, Dmitry Safonov wrote: >> 2017-02-17 19:50 GMT+03:00 Andy Lutomirski : >> > On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov >> > wrote: >> >> This patch introduces two new prctl(2) handles to manage maximum virtual >> >> address available to userspace to map. >> ... >> > Anyway, can you and Dmitry try to reconcile your patches? >> >> So, how can I help that? >> Is there the patch's version, on which I could rebase? >> Here are BTW the last patches, which I will resend with trivial ifdef-fixup >> after the merge window: >> http://marc.info/?i=20170214183621.2537-1-dsafonov%20()%20virtuozzo%20!%20com > > Could you check if this patch collides with anything you do: > > http://lkml.kernel.org/r/20170220131515.ga9...@node.shutemov.name Ok, sorry for the late reply - it was the merge window anyway and I've got urgent work to do. Let's see: I'll need minor merge fixup here: >-#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3)) >+#define TASK_UNMAPPED_BASE (PAGE_ALIGN(DEFAULT_MAP_WINDOW / 3)) while in my patches: >+#define __TASK_UNMAPPED_BASE(task_size)(PAGE_ALIGN(task_size / 3)) >+#define TASK_UNMAPPED_BASE __TASK_UNMAPPED_BASE(TASK_SIZE) This should be just fine with my changes: >- info.high_limit = end; >+ info.high_limit = min(end, DEFAULT_MAP_WINDOW); This will need another minor fixup: >-#define MAX_GAP (TASK_SIZE/6*5) >+#define MAX_GAP (DEFAULT_MAP_WINDOW/6*5) I've moved it from macro to mmap_base() as local var, which depends on task_size parameter. That's all, as far as I can see at this moment. Does not seems hard to fix. So I suggest sending patches sets in parallel, the second accepted will rebase the set. Is it convenient for you? If you have/will have some questions about my patches, I'll be open to answer. -- Dmitry
[PATCHv4 32/33] x86: enable 5-level paging support
Most things are in place, so we can enable support for 5-level paging. Enabling XEN with 5-level paging requires more work. The patch therefore makes XEN dependent on !X86_5LEVEL. Signed-off-by: Kirill A. Shutemov --- arch/x86/Kconfig | 5 + arch/x86/xen/Kconfig | 1 + 2 files changed, 6 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 747f06f00a22..43b3343402f5 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -317,6 +317,7 @@ config FIX_EARLYCON_MEM config PGTABLE_LEVELS int + default 5 if X86_5LEVEL default 4 if X86_64 default 3 if X86_PAE default 2 @@ -1381,6 +1382,10 @@ config X86_PAE has the cost of more pagetable lookup overhead, and also consumes more pagetable space per process. +config X86_5LEVEL + bool "Enable 5-level page tables support" + depends on X86_64 + config ARCH_PHYS_ADDR_T_64BIT def_bool y depends on X86_64 || X86_PAE diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig index 76b6dbd627df..b90d481ce5a1 100644 --- a/arch/x86/xen/Kconfig +++ b/arch/x86/xen/Kconfig @@ -5,6 +5,7 @@ config XEN bool "Xen guest support" depends on PARAVIRT + depends on !X86_5LEVEL select PARAVIRT_CLOCK select XEN_HAVE_PVMMU select XEN_HAVE_VPMU -- 2.11.0
Re: Question Regarding ERMS memcpy
On Mon, Mar 06, 2017 at 05:41:22AM -0800, h...@zytor.com wrote: > It isn't really that straightforward IMO. > > For UC memory transaction size really needs to be specified explicitly > at all times and should be part of the API, rather than implicit. > > For WC/WT/WB device memory, the ordinary memcpy is valid and > preferred. I'm practically partially reverting 6175ddf06b61 ("x86: Clean up mem*io functions.") Are you saying, this was wrong before too? Maybe it was wrong, strictly speaking, but maybe that was good enough for our purposes... -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) --
[PATCHv4 11/33] x86/ident_map: add 5-level paging support
Nothing special: just handle one more level. Signed-off-by: Kirill A. Shutemov --- arch/x86/mm/ident_map.c | 47 --- 1 file changed, 40 insertions(+), 7 deletions(-) diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c index 4473cb4f8b90..2c9a62282fb1 100644 --- a/arch/x86/mm/ident_map.c +++ b/arch/x86/mm/ident_map.c @@ -45,6 +45,34 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page, return 0; } +static int ident_p4d_init(struct x86_mapping_info *info, p4d_t *p4d_page, + unsigned long addr, unsigned long end) +{ + unsigned long next; + + for (; addr < end; addr = next) { + p4d_t *p4d = p4d_page + p4d_index(addr); + pud_t *pud; + + next = (addr & P4D_MASK) + P4D_SIZE; + if (next > end) + next = end; + + if (p4d_present(*p4d)) { + pud = pud_offset(p4d, 0); + ident_pud_init(info, pud, addr, next); + continue; + } + pud = (pud_t *)info->alloc_pgt_page(info->context); + if (!pud) + return -ENOMEM; + ident_pud_init(info, pud, addr, next); + set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE)); + } + + return 0; +} + int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page, unsigned long pstart, unsigned long pend) { @@ -55,27 +83,32 @@ int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page, for (; addr < end; addr = next) { pgd_t *pgd = pgd_page + pgd_index(addr); - pud_t *pud; + p4d_t *p4d; next = (addr & PGDIR_MASK) + PGDIR_SIZE; if (next > end) next = end; if (pgd_present(*pgd)) { - pud = pud_offset(pgd, 0); - result = ident_pud_init(info, pud, addr, next); + p4d = p4d_offset(pgd, 0); + result = ident_p4d_init(info, p4d, addr, next); if (result) return result; continue; } - pud = (pud_t *)info->alloc_pgt_page(info->context); - if (!pud) + p4d = (p4d_t *)info->alloc_pgt_page(info->context); + if (!p4d) return -ENOMEM; - result = ident_pud_init(info, pud, addr, next); + result = ident_p4d_init(info, p4d, addr, next); if (result) return result; - set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE)); + if 
(IS_ENABLED(CONFIG_X86_5LEVEL)) { + set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE)); + } else { + pud_t *pud = pud_offset(p4d, 0); + set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE)); + } } return 0; -- 2.11.0
[PATCHv4 07/33] mm: introduce __p4d_alloc()
For full 5-level paging we need a helper to allocate a p4d page table. Signed-off-by: Kirill A. Shutemov --- mm/memory.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index 7f1c2163b3ce..235ba51b2fbf 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3906,6 +3906,29 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address, } EXPORT_SYMBOL_GPL(handle_mm_fault); +#ifndef __PAGETABLE_P4D_FOLDED +/* + * Allocate p4d page table. + * We've already handled the fast-path in-line. + */ +int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address) +{ + p4d_t *new = p4d_alloc_one(mm, address); + if (!new) + return -ENOMEM; + + smp_wmb(); /* See comment in __pte_alloc */ + + spin_lock(&mm->page_table_lock); + if (pgd_present(*pgd)) /* Another has populated it */ + p4d_free(mm, new); + else + pgd_populate(mm, pgd, new); + spin_unlock(&mm->page_table_lock); + return 0; +} +#endif /* __PAGETABLE_P4D_FOLDED */ + #ifndef __PAGETABLE_PUD_FOLDED /* * Allocate page upper directory. -- 2.11.0
[PATCHv4 13/33] x86/power: support p4d_t in hibernate code
set_up_temporary_text_mapping() and relocate_restore_code() require trivial adjustments to handle additional page table level. Signed-off-by: Kirill A. Shutemov --- arch/x86/power/hibernate_64.c | 49 ++- 1 file changed, 35 insertions(+), 14 deletions(-) diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c index ded2e8272382..9ec941638932 100644 --- a/arch/x86/power/hibernate_64.c +++ b/arch/x86/power/hibernate_64.c @@ -49,6 +49,7 @@ static int set_up_temporary_text_mapping(pgd_t *pgd) { pmd_t *pmd; pud_t *pud; + p4d_t *p4d; /* * The new mapping only has to cover the page containing the image @@ -63,6 +64,13 @@ static int set_up_temporary_text_mapping(pgd_t *pgd) * the virtual address space after switching over to the original page * tables used by the image kernel. */ + + if (IS_ENABLED(CONFIG_X86_5LEVEL)) { + p4d = (p4d_t *)get_safe_page(GFP_ATOMIC); + if (!p4d) + return -ENOMEM; + } + pud = (pud_t *)get_safe_page(GFP_ATOMIC); if (!pud) return -ENOMEM; @@ -75,8 +83,15 @@ static int set_up_temporary_text_mapping(pgd_t *pgd) __pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC)); set_pud(pud + pud_index(restore_jump_address), __pud(__pa(pmd) | _KERNPG_TABLE)); - set_pgd(pgd + pgd_index(restore_jump_address), - __pgd(__pa(pud) | _KERNPG_TABLE)); + if (IS_ENABLED(CONFIG_X86_5LEVEL)) { + set_p4d(p4d + p4d_index(restore_jump_address), + __p4d(__pa(pud) | _KERNPG_TABLE)); + set_pgd(pgd + pgd_index(restore_jump_address), + __pgd(__pa(p4d) | _KERNPG_TABLE)); + } else { + set_pgd(pgd + pgd_index(restore_jump_address), + __pgd(__pa(pud) | _KERNPG_TABLE)); + } return 0; } @@ -124,7 +139,10 @@ static int set_up_temporary_mappings(void) static int relocate_restore_code(void) { pgd_t *pgd; + p4d_t *p4d; pud_t *pud; + pmd_t *pmd; + pte_t *pte; relocated_restore_code = get_safe_page(GFP_ATOMIC); if (!relocated_restore_code) @@ -134,22 +152,25 @@ static int relocate_restore_code(void) /* Make the page containing the relocated code executable */ pgd 
= (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code); - pud = pud_offset(pgd, relocated_restore_code); + p4d = p4d_offset(pgd, relocated_restore_code); + if (p4d_large(*p4d)) { + set_p4d(p4d, __p4d(p4d_val(*p4d) & ~_PAGE_NX)); + goto out; + } + pud = pud_offset(p4d, relocated_restore_code); if (pud_large(*pud)) { set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX)); - } else { - pmd_t *pmd = pmd_offset(pud, relocated_restore_code); - - if (pmd_large(*pmd)) { - set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX)); - } else { - pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code); - - set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX)); - } + goto out; + } + pmd = pmd_offset(pud, relocated_restore_code); + if (pmd_large(*pmd)) { + set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX)); + goto out; } + pte = pte_offset_kernel(pmd, relocated_restore_code); + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX)); +out: __flush_tlb_all(); - return 0; } -- 2.11.0
[PATCH net] team: use ETH_MAX_MTU as max mtu
This restores the ability to set a team device's mtu to anything higher than 1500. Similar to the reported issue with bonding, the team driver calls ether_setup(), which sets an initial max_mtu of 1500, while the underlying hardware can handle something much larger. Just set it to ETH_MAX_MTU to support all possible values, and the limitations of the underlying devices will prevent setting anything too large. Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra") CC: Cong Wang CC: Jiri Pirko CC: net...@vger.kernel.org Signed-off-by: Jarod Wilson --- drivers/net/team/team.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c index 4a24b5d15f5a..1b52520715ae 100644 --- a/drivers/net/team/team.c +++ b/drivers/net/team/team.c @@ -2072,6 +2072,7 @@ static int team_dev_type_check_change(struct net_device *dev, static void team_setup(struct net_device *dev) { ether_setup(dev); + dev->max_mtu = ETH_MAX_MTU; dev->netdev_ops = &team_netdev_ops; dev->ethtool_ops = &team_ethtool_ops; -- 2.11.0
Build regressions/improvements in v4.11-rc1
Below is the list of build error/warning regressions/improvements in v4.11-rc1[1] compared to v4.10[2]. Summarized: - build errors: +19/-1 - build warnings: +1108/-835 Happy fixing! ;-) Thanks to the linux-next team for providing the build service. [1] http://kisskb.ellerman.id.au/kisskb/head/c1ae3cfa0e89fa1a7ecc4c99031f5e9ae99d9201/ (all 266 configs) [2] http://kisskb.ellerman.id.au/kisskb/head/c470abd4fde40ea6a0846a2beab642a578c0b8cd/ (all 266 configs) *** ERRORS *** 19 error regressions: + /home/kisskb/slave/src/arch/avr32/oprofile/backtrace.c: error: dereferencing pointer to incomplete type: => 58 + /home/kisskb/slave/src/arch/avr32/oprofile/backtrace.c: error: implicit declaration of function 'user_mode': => 60 + /home/kisskb/slave/src/arch/mips/cavium-octeon/cpu.c: error: implicit declaration of function 'task_stack_page' [-Werror=implicit-function-declaration]: => 31:3 + /home/kisskb/slave/src/arch/mips/cavium-octeon/cpu.c: error: invalid application of 'sizeof' to incomplete type 'struct pt_regs' : => 31:3 + /home/kisskb/slave/src/arch/mips/cavium-octeon/crypto/octeon-crypto.c: error: implicit declaration of function 'task_stack_page' [-Werror=implicit-function-declaration]: => 35:6 + /home/kisskb/slave/src/arch/mips/cavium-octeon/smp.c: error: implicit declaration of function 'task_stack_page' [-Werror=implicit-function-declaration]: => 214:2 + /home/kisskb/slave/src/arch/mips/include/asm/fpu.h: error: invalid application of 'sizeof' to incomplete type 'struct pt_regs' : => 140:3, 188:2, 138:3, 136:2 + /home/kisskb/slave/src/arch/mips/include/asm/processor.h: error: invalid application of 'sizeof' to incomplete type 'struct pt_regs': => 385:31 + /home/kisskb/slave/src/arch/mips/kernel/smp-mt.c: error: implicit declaration of function 'task_stack_page' [-Werror=implicit-function-declaration]: => 215:2 + /home/kisskb/slave/src/arch/mips/sgi-ip27/ip27-berr.c: error: dereferencing pointer to incomplete type: => 59:17, 66:13 + 
/home/kisskb/slave/src/arch/mips/sgi-ip27/ip27-berr.c: error: implicit declaration of function 'force_sig' [-Werror=implicit-function-declaration]: => 75:2 + /home/kisskb/slave/src/arch/mips/sgi-ip32/ip32-berr.c: error: implicit declaration of function 'force_sig' [-Werror=implicit-function-declaration]: => 31:2 + /home/kisskb/slave/src/arch/openrisc/include/asm/atomic.h: Error: unknown opcode2 `l.lwa'.: => 70, 107, 69 + /home/kisskb/slave/src/arch/openrisc/include/asm/atomic.h: Error: unknown opcode2 `l.swa'.: => 72, 71, 111 + /home/kisskb/slave/src/arch/openrisc/include/asm/bitops/atomic.h: Error: unknown opcode2 `l.lwa'.: => 18, 35, 70, 90 + /home/kisskb/slave/src/arch/openrisc/include/asm/bitops/atomic.h: Error: unknown opcode2 `l.swa'.: => 20, 37, 92, 72 + /home/kisskb/slave/src/arch/openrisc/include/asm/cmpxchg.h: Error: unknown opcode2 `l.lwa'.: => 68, 30 + /home/kisskb/slave/src/arch/openrisc/include/asm/cmpxchg.h: Error: unknown opcode2 `l.swa'.: => 34, 69 + /home/kisskb/slave/src/drivers/char/nwbutton.c: error: implicit declaration of function 'kill_cad_pid' [-Werror=implicit-function-declaration]: => 134:3 1 error improvements: - error: rtnetlink.c: relocation truncated to fit: R_AVR32_11H_PCREL against `.text'+217dc: (.text+0x21bec) => *** WARNINGS *** 1108 warning regressions: [Deleted 1030 lines about "warning: -ffunction-sections disabled; it makes profiling impossible [enabled by default]" on parisc-allmodconfig] + /home/kisskb/slave/src/arch/arc/include/asm/kprobes.h: warning: 'trap_is_kprobe' defined but not used [-Wunused-function]: => 57:13 + /home/kisskb/slave/src/arch/mips/include/asm/sibyte/bcm1480_scd.h: warning: "M_SPC_CFG_CLEAR" redefined: => 274:0 + /home/kisskb/slave/src/arch/mips/include/asm/sibyte/bcm1480_scd.h: warning: "M_SPC_CFG_ENABLE" redefined: => 275:0 + /home/kisskb/slave/src/arch/x86/hyperv/hv_init.c: warning: label 'register_msr_cs' defined but not used [-Wunused-label]: => 167:1 + 
/home/kisskb/slave/src/arch/x86/kernel/e820.c: warning: 'gapstart' may be used uninitialized in this function [-Wuninitialized]: => 643:16, 645:8 + /home/kisskb/slave/src/crypto/ccm.c: warning: 'crypto_ccm_auth' uses dynamic stack allocation [enabled by default]: => 235:1 + /home/kisskb/slave/src/drivers/crypto/chelsio/chcr_algo.c: warning: 'chcr_copy_assoc.isra.20' uses dynamic stack allocation [enabled by default]: => 1336:1 + /home/kisskb/slave/src/drivers/crypto/mediatek/mtk-sha.c: warning: 'mtk_sha_finish_hmac' uses dynamic stack allocation [enabled by default]: => 371:1 + /home/kisskb/slave/src/drivers/crypto/mediatek/mtk-sha.c: warning: 'mtk_sha_setkey' uses dynamic stack allocation [enabled by default]: => 880:1 + /home/kisskb/slave/src/drivers/gpu/drm/nouveau/nvkm/subdev/secboot/acr_r352.c: warning: 'acr_r352_load' uses dynamic stack allocation [enabled by default]: => 736:1 + /home/kisskb/slave/src/d
[PATCHv4 26/33] x86/kasan: extend to support 5-level paging
This patch brings support for an additional, non-folded page table level. Signed-off-by: Kirill A. Shutemov Cc: Dmitry Vyukov = 5 && i < PTRS_PER_P4D; i++) + kasan_zero_p4d[i] = __p4d(p4d_val); + kasan_map_early_shadow(early_level4_pgt); kasan_map_early_shadow(init_level4_pgt); } -- 2.11.0
[PATCHv4 03/33] asm-generic: introduce __ARCH_USE_5LEVEL_HACK
We are going to introduce to provide abstraction for properly (in opposite to 5level-fixup.h hack) folded p4d level. The new header will be included from pgtable-nopud.h. If an architecture uses , we cannot use 5level-fixup.h directly to quickly convert the architecture to 5-level paging as it would conflict with pgtable-nop4d.h. With this patch an architecture can define __ARCH_USE_5LEVEL_HACK before inclusion to use 5level-fixup.h. Signed-off-by: Kirill A. Shutemov --- include/asm-generic/pgtable-nop4d-hack.h | 62 include/asm-generic/pgtable-nopud.h | 5 +++ 2 files changed, 67 insertions(+) create mode 100644 include/asm-generic/pgtable-nop4d-hack.h diff --git a/include/asm-generic/pgtable-nop4d-hack.h b/include/asm-generic/pgtable-nop4d-hack.h new file mode 100644 index ..752fb7511750 --- /dev/null +++ b/include/asm-generic/pgtable-nop4d-hack.h @@ -0,0 +1,62 @@ +#ifndef _PGTABLE_NOP4D_HACK_H +#define _PGTABLE_NOP4D_HACK_H + +#ifndef __ASSEMBLY__ +#include + +#define __PAGETABLE_PUD_FOLDED + +/* + * Having the pud type consist of a pgd gets the size right, and allows + * us to conceptually access the pgd entry that this pud is folded into + * without casting. + */ +typedef struct { pgd_t pgd; } pud_t; + +#define PUD_SHIFT PGDIR_SHIFT +#define PTRS_PER_PUD 1 +#define PUD_SIZE (1UL << PUD_SHIFT) +#define PUD_MASK (~(PUD_SIZE-1)) + +/* + * The "pgd_xxx()" functions here are trivial for a folded two-level + * setup: the pud is never bad, and a pud always exists (as it's folded + * into the pgd entry) + */ +static inline int pgd_none(pgd_t pgd) { return 0; } +static inline int pgd_bad(pgd_t pgd) { return 0; } +static inline int pgd_present(pgd_t pgd) { return 1; } +static inline void pgd_clear(pgd_t *pgd) { } +#define pud_ERROR(pud) (pgd_ERROR((pud).pgd)) + +#define pgd_populate(mm, pgd, pud) do { } while (0) +/* + * (puds are folded into pgds so this doesn't get actually called, + * but the define is needed for a generic inline function.) 
+ */ +#define set_pgd(pgdptr, pgdval)set_pud((pud_t *)(pgdptr), (pud_t) { pgdval }) + +static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address) +{ + return (pud_t *)pgd; +} + +#define pud_val(x) (pgd_val((x).pgd)) +#define __pud(x) ((pud_t) { __pgd(x) }) + +#define pgd_page(pgd) (pud_page((pud_t){ pgd })) +#define pgd_page_vaddr(pgd)(pud_page_vaddr((pud_t){ pgd })) + +/* + * allocating and freeing a pud is trivial: the 1-entry pud is + * inside the pgd, so has no extra memory associated with it. + */ +#define pud_alloc_one(mm, address) NULL +#define pud_free(mm, x)do { } while (0) +#define __pud_free_tlb(tlb, x, a) do { } while (0) + +#undef pud_addr_end +#define pud_addr_end(addr, end)(end) + +#endif /* __ASSEMBLY__ */ +#endif /* _PGTABLE_NOP4D_HACK_H */ diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h index 810431d8351b..5e49430a30a4 100644 --- a/include/asm-generic/pgtable-nopud.h +++ b/include/asm-generic/pgtable-nopud.h @@ -3,6 +3,10 @@ #ifndef __ASSEMBLY__ +#ifdef __ARCH_USE_5LEVEL_HACK +#include +#else + #define __PAGETABLE_PUD_FOLDED /* @@ -58,4 +62,5 @@ static inline pud_t * pud_offset(pgd_t * pgd, unsigned long address) #define pud_addr_end(addr, end)(end) #endif /* __ASSEMBLY__ */ +#endif /* !__ARCH_USE_5LEVEL_HACK */ #endif /* _PGTABLE_NOPUD_H */ -- 2.11.0
[PATCHv4 24/33] x86/mm: basic defines/helpers for CONFIG_X86_5LEVEL
Extends pagetable headers to support new paging mode. Signed-off-by: Kirill A. Shutemov --- arch/x86/include/asm/pgtable_64.h | 11 +++ arch/x86/include/asm/pgtable_64_types.h | 20 +++ arch/x86/include/asm/pgtable_types.h| 10 +- arch/x86/mm/pgtable.c | 34 - 4 files changed, 73 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 79396bfdc791..9991224f6238 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -35,6 +35,13 @@ extern void paging_init(void); #define pud_ERROR(e) \ pr_err("%s:%d: bad pud %p(%016lx)\n", \ __FILE__, __LINE__, &(e), pud_val(e)) + +#if CONFIG_PGTABLE_LEVELS >= 5 +#define p4d_ERROR(e) \ + pr_err("%s:%d: bad p4d %p(%016lx)\n", \ + __FILE__, __LINE__, &(e), p4d_val(e)) +#endif + #define pgd_ERROR(e) \ pr_err("%s:%d: bad pgd %p(%016lx)\n", \ __FILE__, __LINE__, &(e), pgd_val(e)) @@ -128,7 +135,11 @@ static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d) static inline void native_p4d_clear(p4d_t *p4d) { +#ifdef CONFIG_X86_5LEVEL + native_set_p4d(p4d, native_make_p4d(0)); +#else native_set_p4d(p4d, (p4d_t) { .pgd = native_make_pgd(0)}); +#endif } static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd) diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h index 00dc0c2b456e..7ae641fdbd07 100644 --- a/arch/x86/include/asm/pgtable_64_types.h +++ b/arch/x86/include/asm/pgtable_64_types.h @@ -23,12 +23,32 @@ typedef struct { pteval_t pte; } pte_t; #define SHARED_KERNEL_PMD 0 +#ifdef CONFIG_X86_5LEVEL + +/* + * PGDIR_SHIFT determines what a top-level page table entry can map + */ +#define PGDIR_SHIFT48 +#define PTRS_PER_PGD 512 + +/* + * 4rd level page in 5-level paging case + */ +#define P4D_SHIFT 39 +#define PTRS_PER_P4D 512 +#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT) +#define P4D_MASK (~(P4D_SIZE - 1)) + +#else /* CONFIG_X86_5LEVEL */ + /* * PGDIR_SHIFT determines what a top-level page table entry can map 
*/ #define PGDIR_SHIFT39 #define PTRS_PER_PGD 512 +#endif /* CONFIG_X86_5LEVEL */ + /* * 3rd level page */ diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 4930afe9df0a..bf9638e1ee42 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -273,9 +273,17 @@ static inline pgdval_t pgd_flags(pgd_t pgd) } #if CONFIG_PGTABLE_LEVELS > 4 +typedef struct { p4dval_t p4d; } p4d_t; -#error FIXME +static inline p4d_t native_make_p4d(pudval_t val) +{ + return (p4d_t) { val }; +} +static inline p4dval_t native_p4d_val(p4d_t p4d) +{ + return p4d.p4d; +} #else #include diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 38b6daf72deb..d26b066944a5 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -81,6 +81,14 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud) paravirt_release_pud(__pa(pud) >> PAGE_SHIFT); tlb_remove_page(tlb, virt_to_page(pud)); } + +#if CONFIG_PGTABLE_LEVELS > 4 +void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d) +{ + paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT); + tlb_remove_page(tlb, virt_to_page(p4d)); +} +#endif /* CONFIG_PGTABLE_LEVELS > 4 */ #endif /* CONFIG_PGTABLE_LEVELS > 3 */ #endif /* CONFIG_PGTABLE_LEVELS > 2 */ @@ -120,7 +128,7 @@ static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd) references from swapper_pg_dir. */ if (CONFIG_PGTABLE_LEVELS == 2 || (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) || - CONFIG_PGTABLE_LEVELS == 4) { + CONFIG_PGTABLE_LEVELS >= 4) { clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY, swapper_pg_dir + KERNEL_PGD_BOUNDARY, KERNEL_PGD_PTRS); @@ -582,6 +590,30 @@ void native_set_fixmap(enum fixed_addresses idx, phys_addr_t phys, } #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +#ifdef CONFIG_X86_5LEVEL +/** + * p4d_set_huge - setup kernel P4D mapping + * + * No 512GB pages yet -- always return 0 + * + * Returns 1 on success and 0 on failure. 
+ */ +int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot) +{ + return 0; +} + +/** + * p4d_clear_huge - clear kernel P4D mapping when it is set + * + * No 512GB pages yet -- always return 0 + */ +int p4d_clear_huge(p4d_t *p4d) +{ + return 0; +} +#endif + /** * pud_set_huge - setup kernel PUD mapping * -- 2.11.0
[PATCHv4 06/33] mm: convert generic code to 5-level paging
Convert all non-architecture-specific code to 5-level paging. It's mostly mechanical adding handling one more page table level in places where we deal with pud_t. Signed-off-by: Kirill A. Shutemov --- drivers/misc/sgi-gru/grufault.c | 9 +- fs/userfaultfd.c| 6 +- include/asm-generic/pgtable.h | 48 +- include/linux/hugetlb.h | 5 +- include/linux/kasan.h | 1 + include/linux/mm.h | 31 -- lib/ioremap.c | 39 +++- mm/gup.c| 46 +++-- mm/huge_memory.c| 7 +- mm/hugetlb.c| 29 +++--- mm/kasan/kasan_init.c | 35 ++- mm/memory.c | 207 +--- mm/mlock.c | 1 + mm/mprotect.c | 26 - mm/mremap.c | 13 ++- mm/page_vma_mapped.c| 6 +- mm/pagewalk.c | 32 ++- mm/pgtable-generic.c| 6 ++ mm/rmap.c | 7 +- mm/sparse-vmemmap.c | 22 - mm/swapfile.c | 26 - mm/userfaultfd.c| 23 +++-- mm/vmalloc.c| 81 23 files changed, 586 insertions(+), 120 deletions(-) diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c index 6fb773dbcd0c..93be82fc338a 100644 --- a/drivers/misc/sgi-gru/grufault.c +++ b/drivers/misc/sgi-gru/grufault.c @@ -219,15 +219,20 @@ static int atomic_pte_lookup(struct vm_area_struct *vma, unsigned long vaddr, int write, unsigned long *paddr, int *pageshift) { pgd_t *pgdp; - pmd_t *pmdp; + p4d_t *p4dp; pud_t *pudp; + pmd_t *pmdp; pte_t pte; pgdp = pgd_offset(vma->vm_mm, vaddr); if (unlikely(pgd_none(*pgdp))) goto err; - pudp = pud_offset(pgdp, vaddr); + p4dp = p4d_offset(pgdp, vaddr); + if (unlikely(p4d_none(*p4dp))) + goto err; + + pudp = pud_offset(p4dp, vaddr); if (unlikely(pud_none(*pudp))) goto err; diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 973607df579d..02ce3944d0f5 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -267,6 +267,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, { struct mm_struct *mm = ctx->mm; pgd_t *pgd; + p4d_t *p4d; pud_t *pud; pmd_t *pmd, _pmd; pte_t *pte; @@ -277,7 +278,10 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, pgd = pgd_offset(mm, address); if (!pgd_present(*pgd)) 
goto out; - pud = pud_offset(pgd, address); + p4d = p4d_offset(pgd, address); + if (!p4d_present(*p4d)) + goto out; + pud = pud_offset(p4d, address); if (!pud_present(*pud)) goto out; pmd = pmd_offset(pud, address); diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index f4ca23b158b3..1fad160f35de 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -10,9 +10,9 @@ #include #include -#if 4 - defined(__PAGETABLE_PUD_FOLDED) - defined(__PAGETABLE_PMD_FOLDED) != \ - CONFIG_PGTABLE_LEVELS -#error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{PUD,PMD}_FOLDED +#if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ + defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS +#error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{P4D,PUD,PMD}_FOLDED #endif /* @@ -424,6 +424,13 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) (__boundary - 1 < (end) - 1)? __boundary: (end);\ }) +#ifndef p4d_addr_end +#define p4d_addr_end(addr, end) \ +({ unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK; \ + (__boundary - 1 < (end) - 1)? __boundary: (end);\ +}) +#endif + #ifndef pud_addr_end #define pud_addr_end(addr, end) \ ({ unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK; \ @@ -444,6 +451,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot) * Do the tests inline, but report and clear the bad entry in mm/memory.c. */ void pgd_clear_bad(pgd_t *); +void p4d_clear_bad(p4d_t *); void pud_clear_bad(pud_t *); void pmd_clear_bad(pmd_t *); @@ -458,6 +466,17 @@ static inline int pgd_none_or_clear_bad(pgd_t *pgd) return 0; } +static inline int p4d_none_or_clear_bad(p4d_t *p4d) +{ + if (p4d_none(*p4d)) + return 1; + if (unlikely(p4d_bad(*p4d))) { + p4d_clear_bad(p4d); + return 1; + } + return 0; +} + static inline int pud_none_or_clear_bad(pud_t *pud) {
[PATCHv4 31/33] x86/mm: add support for 5-level paging for KASLR
With 5-level paging randomization happens on P4D level instead of PUD. Maximum amount of physical memory also bumped to 52-bits for 5-level paging. Signed-off-by: Kirill A. Shutemov --- arch/x86/mm/kaslr.c | 82 - 1 file changed, 63 insertions(+), 19 deletions(-) diff --git a/arch/x86/mm/kaslr.c b/arch/x86/mm/kaslr.c index 887e57182716..662e5c4b21c8 100644 --- a/arch/x86/mm/kaslr.c +++ b/arch/x86/mm/kaslr.c @@ -6,12 +6,12 @@ * * Entropy is generated using the KASLR early boot functions now shared in * the lib directory (originally written by Kees Cook). Randomization is - * done on PGD & PUD page table levels to increase possible addresses. The - * physical memory mapping code was adapted to support PUD level virtual - * addresses. This implementation on the best configuration provides 30,000 - * possible virtual addresses in average for each memory region. An additional - * low memory page is used to ensure each CPU can start with a PGD aligned - * virtual address (for realmode). + * done on PGD & P4D/PUD page table levels to increase possible addresses. + * The physical memory mapping code was adapted to support P4D/PUD level + * virtual addresses. This implementation on the best configuration provides + * 30,000 possible virtual addresses in average for each memory region. + * An additional low memory page is used to ensure each CPU can start with + * a PGD aligned virtual address (for realmode). * * The order of each memory region is not changed. 
The feature looks at * the available space for the regions based on different configuration @@ -70,7 +70,8 @@ static __initdata struct kaslr_memory_region { unsigned long *base; unsigned long size_tb; } kaslr_regions[] = { - { &page_offset_base, 64/* Maximum */ }, + { &page_offset_base, + 1 << (__PHYSICAL_MASK_SHIFT - TB_SHIFT) /* Maximum */ }, { &vmalloc_base, VMALLOC_SIZE_TB }, { &vmemmap_base, 1 }, }; @@ -142,7 +143,10 @@ void __init kernel_randomize_memory(void) */ entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i); prandom_bytes_state(&rand_state, &rand, sizeof(rand)); - entropy = (rand % (entropy + 1)) & PUD_MASK; + if (IS_ENABLED(CONFIG_X86_5LEVEL)) + entropy = (rand % (entropy + 1)) & P4D_MASK; + else + entropy = (rand % (entropy + 1)) & PUD_MASK; vaddr += entropy; *kaslr_regions[i].base = vaddr; @@ -151,27 +155,21 @@ void __init kernel_randomize_memory(void) * randomization alignment. */ vaddr += get_padding(&kaslr_regions[i]); - vaddr = round_up(vaddr + 1, PUD_SIZE); + if (IS_ENABLED(CONFIG_X86_5LEVEL)) + vaddr = round_up(vaddr + 1, P4D_SIZE); + else + vaddr = round_up(vaddr + 1, PUD_SIZE); remain_entropy -= entropy; } } -/* - * Create PGD aligned trampoline table to allow real mode initialization - * of additional CPUs. Consume only 1 low memory page. 
- */ -void __meminit init_trampoline(void) +static void __meminit init_trampoline_pud(void) { unsigned long paddr, paddr_next; pgd_t *pgd; pud_t *pud_page, *pud_page_tramp; int i; - if (!kaslr_memory_enabled()) { - init_trampoline_default(); - return; - } - pud_page_tramp = alloc_low_page(); paddr = 0; @@ -192,3 +190,49 @@ void __meminit init_trampoline(void) set_pgd(&trampoline_pgd_entry, __pgd(_KERNPG_TABLE | __pa(pud_page_tramp))); } + +static void __meminit init_trampoline_p4d(void) +{ + unsigned long paddr, paddr_next; + pgd_t *pgd; + p4d_t *p4d_page, *p4d_page_tramp; + int i; + + p4d_page_tramp = alloc_low_page(); + + paddr = 0; + pgd = pgd_offset_k((unsigned long)__va(paddr)); + p4d_page = (p4d_t *) pgd_page_vaddr(*pgd); + + for (i = p4d_index(paddr); i < PTRS_PER_P4D; i++, paddr = paddr_next) { + p4d_t *p4d, *p4d_tramp; + unsigned long vaddr = (unsigned long)__va(paddr); + + p4d_tramp = p4d_page_tramp + p4d_index(paddr); + p4d = p4d_page + p4d_index(vaddr); + paddr_next = (paddr & P4D_MASK) + P4D_SIZE; + + *p4d_tramp = *p4d; + } + + set_pgd(&trampoline_pgd_entry, + __pgd(_KERNPG_TABLE | __pa(p4d_page_tramp))); +} + +/* + * Create PGD aligned trampoline table to allow real mode initialization + * of additional CPUs. Consume only 1 low memory page. + */ +void __meminit init_trampoline(void) +{ + + if (!kaslr_memory_enabled()) { + init_trampoline_default(); + return; + } + + if (IS_ENABLED(CONFIG_X86_5LEVEL)) +
[PATCHv4 25/33] x86/dump_pagetables: support 5-level paging
Simple extension to support one more page table level. Signed-off-by: Kirill A. Shutemov --- arch/x86/mm/dump_pagetables.c | 49 --- 1 file changed, 42 insertions(+), 7 deletions(-) diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c index 58b5bee7ea27..0effac6989cd 100644 --- a/arch/x86/mm/dump_pagetables.c +++ b/arch/x86/mm/dump_pagetables.c @@ -110,7 +110,8 @@ static struct addr_marker address_markers[] = { #define PTE_LEVEL_MULT (PAGE_SIZE) #define PMD_LEVEL_MULT (PTRS_PER_PTE * PTE_LEVEL_MULT) #define PUD_LEVEL_MULT (PTRS_PER_PMD * PMD_LEVEL_MULT) -#define PGD_LEVEL_MULT (PTRS_PER_PUD * PUD_LEVEL_MULT) +#define P4D_LEVEL_MULT (PTRS_PER_PUD * PUD_LEVEL_MULT) +#define PGD_LEVEL_MULT (PTRS_PER_PUD * P4D_LEVEL_MULT) #define pt_dump_seq_printf(m, to_dmesg, fmt, args...) \ ({ \ @@ -347,7 +348,7 @@ static bool pud_already_checked(pud_t *prev_pud, pud_t *pud, bool checkwx) return checkwx && prev_pud && (pud_val(*prev_pud) == pud_val(*pud)); } -static void walk_pud_level(struct seq_file *m, struct pg_state *st, pgd_t addr, +static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr, unsigned long P) { int i; @@ -355,7 +356,7 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, pgd_t addr, pgprotval_t prot; pud_t *prev_pud = NULL; - start = (pud_t *) pgd_page_vaddr(addr); + start = (pud_t *) p4d_page_vaddr(addr); for (i = 0; i < PTRS_PER_PUD; i++) { st->current_address = normalize_addr(P + i * PUD_LEVEL_MULT); @@ -377,9 +378,43 @@ static void walk_pud_level(struct seq_file *m, struct pg_state *st, pgd_t addr, } #else -#define walk_pud_level(m,s,a,p) walk_pmd_level(m,s,__pud(pgd_val(a)),p) -#define pgd_large(a) pud_large(__pud(pgd_val(a))) -#define pgd_none(a) pud_none(__pud(pgd_val(a))) +#define walk_pud_level(m,s,a,p) walk_pmd_level(m,s,__pud(p4d_val(a)),p) +#define p4d_large(a) pud_large(__pud(p4d_val(a))) +#define p4d_none(a) pud_none(__pud(p4d_val(a))) +#endif + +#if PTRS_PER_P4D > 1 + +static void 
walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr, + unsigned long P) +{ + int i; + p4d_t *start; + pgprotval_t prot; + + start = (p4d_t *) pgd_page_vaddr(addr); + + for (i = 0; i < PTRS_PER_P4D; i++) { + st->current_address = normalize_addr(P + i * P4D_LEVEL_MULT); + if (!p4d_none(*start)) { + if (p4d_large(*start) || !p4d_present(*start)) { + prot = p4d_flags(*start); + note_page(m, st, __pgprot(prot), 2); + } else { + walk_pud_level(m, st, *start, + P + i * P4D_LEVEL_MULT); + } + } else + note_page(m, st, __pgprot(0), 2); + + start++; + } +} + +#else +#define walk_p4d_level(m,s,a,p) walk_pud_level(m,s,__p4d(pgd_val(a)),p) +#define pgd_large(a) p4d_large(__p4d(pgd_val(a))) +#define pgd_none(a) p4d_none(__p4d(pgd_val(a))) #endif static inline bool is_hypervisor_range(int idx) @@ -424,7 +459,7 @@ static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd, prot = pgd_flags(*start); note_page(m, &st, __pgprot(prot), 1); } else { - walk_pud_level(m, &st, *start, + walk_p4d_level(m, &st, *start, i * PGD_LEVEL_MULT); } } else -- 2.11.0
[PATCHv4 04/33] arch, mm: convert all architectures to use 5level-fixup.h
If an architecture uses 4level-fixup.h we don't need to do anything as it includes 5level-fixup.h. If an architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK before inclusion of the header. It makes asm-generic code to use 5level-fixup.h. If an architecture has 4-level paging or folds levels on its own, include 5level-fixup.h directly. Signed-off-by: Kirill A. Shutemov --- arch/arc/include/asm/hugepage.h | 1 + arch/arc/include/asm/pgtable.h | 1 + arch/arm/include/asm/pgtable.h | 1 + arch/arm64/include/asm/pgtable-types.h | 4 arch/avr32/include/asm/pgtable-2level.h | 1 + arch/cris/include/asm/pgtable.h | 1 + arch/frv/include/asm/pgtable.h | 1 + arch/h8300/include/asm/pgtable.h | 1 + arch/hexagon/include/asm/pgtable.h | 1 + arch/ia64/include/asm/pgtable.h | 2 ++ arch/metag/include/asm/pgtable.h | 1 + arch/mips/include/asm/pgtable-32.h | 1 + arch/mips/include/asm/pgtable-64.h | 1 + arch/mn10300/include/asm/page.h | 1 + arch/nios2/include/asm/pgtable.h | 1 + arch/openrisc/include/asm/pgtable.h | 1 + arch/powerpc/include/asm/book3s/32/pgtable.h | 1 + arch/powerpc/include/asm/book3s/64/pgtable.h | 3 +++ arch/powerpc/include/asm/nohash/32/pgtable.h | 1 + arch/powerpc/include/asm/nohash/64/pgtable-4k.h | 3 +++ arch/powerpc/include/asm/nohash/64/pgtable-64k.h | 1 + arch/s390/include/asm/pgtable.h | 1 + arch/score/include/asm/pgtable.h | 1 + arch/sh/include/asm/pgtable-2level.h | 1 + arch/sh/include/asm/pgtable-3level.h | 1 + arch/sparc/include/asm/pgtable_64.h | 1 + arch/tile/include/asm/pgtable_32.h | 1 + arch/tile/include/asm/pgtable_64.h | 1 + arch/um/include/asm/pgtable-2level.h | 1 + arch/um/include/asm/pgtable-3level.h | 1 + arch/unicore32/include/asm/pgtable.h | 1 + arch/x86/include/asm/pgtable_types.h | 4 arch/xtensa/include/asm/pgtable.h| 1 + 33 files changed, 44 insertions(+) diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepage.h index 317ff773e1ca..b18fcb606908 100644 --- a/arch/arc/include/asm/hugepage.h +++ 
b/arch/arc/include/asm/hugepage.h @@ -11,6 +11,7 @@ #define _ASM_ARC_HUGEPAGE_H #include +#define __ARCH_USE_5LEVEL_HACK #include static inline pte_t pmd_pte(pmd_t pmd) diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h index e94ca72b974e..ee22d40afef4 100644 --- a/arch/arc/include/asm/pgtable.h +++ b/arch/arc/include/asm/pgtable.h @@ -37,6 +37,7 @@ #include #include +#define __ARCH_USE_5LEVEL_HACK #include #include diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h index a8d656d9aec7..1c462381c225 100644 --- a/arch/arm/include/asm/pgtable.h +++ b/arch/arm/include/asm/pgtable.h @@ -20,6 +20,7 @@ #else +#define __ARCH_USE_5LEVEL_HACK #include #include #include diff --git a/arch/arm64/include/asm/pgtable-types.h b/arch/arm64/include/asm/pgtable-types.h index 69b2fd41503c..345a072b5856 100644 --- a/arch/arm64/include/asm/pgtable-types.h +++ b/arch/arm64/include/asm/pgtable-types.h @@ -55,9 +55,13 @@ typedef struct { pteval_t pgprot; } pgprot_t; #define __pgprot(x)((pgprot_t) { (x) } ) #if CONFIG_PGTABLE_LEVELS == 2 +#define __ARCH_USE_5LEVEL_HACK #include #elif CONFIG_PGTABLE_LEVELS == 3 +#define __ARCH_USE_5LEVEL_HACK #include +#elif CONFIG_PGTABLE_LEVELS == 4 +#include #endif #endif /* __ASM_PGTABLE_TYPES_H */ diff --git a/arch/avr32/include/asm/pgtable-2level.h b/arch/avr32/include/asm/pgtable-2level.h index 425dd567b5b9..d5b1c63993ec 100644 --- a/arch/avr32/include/asm/pgtable-2level.h +++ b/arch/avr32/include/asm/pgtable-2level.h @@ -8,6 +8,7 @@ #ifndef __ASM_AVR32_PGTABLE_2LEVEL_H #define __ASM_AVR32_PGTABLE_2LEVEL_H +#define __ARCH_USE_5LEVEL_HACK #include /* diff --git a/arch/cris/include/asm/pgtable.h b/arch/cris/include/asm/pgtable.h index 2a3210ba4c72..fa3a73004cc5 100644 --- a/arch/cris/include/asm/pgtable.h +++ b/arch/cris/include/asm/pgtable.h @@ -6,6 +6,7 @@ #define _CRIS_PGTABLE_H #include +#define __ARCH_USE_5LEVEL_HACK #include #ifndef __ASSEMBLY__ diff --git a/arch/frv/include/asm/pgtable.h 
b/arch/frv/include/asm/pgtable.h index a0513d463a1f..ab6e7e961b54 100644 --- a/arch/frv/include/asm/pgtable.h +++ b/arch/frv/include/asm/pgtable.h @@ -16,6 +16,7 @@ #ifndef _ASM_PGTABLE_H #define _ASM_PGTABLE_H +#include #include #include #include diff --git a/arch/h8300/include/asm/pgtable.h b/arch/h8300/include/asm/pgtable.h index 8341db67821d..7d265d28ba5e 100644 --- a/arch/h8300/include/asm/pgtable.h +++ b/arch/h8300/include/asm/pgtable.h @@ -1,5 +1,6 @@ #ifndef _H8
[PATCHv4 29/33] x86/mm: add sync_global_pgds() for configuration with 5-level paging
This basically restores a slightly modified version of the original sync_global_pgds() which we had before the folded p4d was introduced. The only modification is protection against 'address' overflow. Signed-off-by: Kirill A. Shutemov --- arch/x86/mm/init_64.c | 37 + 1 file changed, 37 insertions(+) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 7bdda6f1d135..5ba99090dc3c 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -92,6 +92,42 @@ __setup("noexec32=", nonx32_setup); * When memory was added make sure all the processes MM have * suitable PGD entries in the local PGD level page. */ +#ifdef CONFIG_X86_5LEVEL +void sync_global_pgds(unsigned long start, unsigned long end) +{ + unsigned long address; + + for (address = start; address <= end && address >= start; + address += PGDIR_SIZE) { + const pgd_t *pgd_ref = pgd_offset_k(address); + struct page *page; + + if (pgd_none(*pgd_ref)) + continue; + + spin_lock(&pgd_lock); + list_for_each_entry(page, &pgd_list, lru) { + pgd_t *pgd; + spinlock_t *pgt_lock; + + pgd = (pgd_t *)page_address(page) + pgd_index(address); + /* the pgt_lock only for Xen */ + pgt_lock = &pgd_page_get_mm(page)->page_table_lock; + spin_lock(pgt_lock); + + if (!pgd_none(*pgd_ref) && !pgd_none(*pgd)) + BUG_ON(pgd_page_vaddr(*pgd) + != pgd_page_vaddr(*pgd_ref)); + + if (pgd_none(*pgd)) + set_pgd(pgd, *pgd_ref); + + spin_unlock(pgt_lock); + } + spin_unlock(&pgd_lock); + } +} +#else void sync_global_pgds(unsigned long start, unsigned long end) { unsigned long address; @@ -135,6 +171,7 @@ void sync_global_pgds(unsigned long start, unsigned long end) spin_unlock(&pgd_lock); } } +#endif /* * NOTE: This function is marked __ref because it calls __init function -- 2.11.0
[PATCHv4 17/33] x86/kasan: prepare clear_pgds() to switch to
With folded p4d, pgd_clear() is nop. Change clear_pgds() to use p4d_clear() instead. Signed-off-by: Kirill A. Shutemov Cc: Dmitry Vyukov --- arch/x86/mm/kasan_init_64.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index 8d63d7a104c3..733f8ba6a01f 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -32,8 +32,15 @@ static int __init map_range(struct range *range) static void __init clear_pgds(unsigned long start, unsigned long end) { - for (; start < end; start += PGDIR_SIZE) - pgd_clear(pgd_offset_k(start)); + pgd_t *pgd; + + for (; start < end; start += PGDIR_SIZE) { + pgd = pgd_offset_k(start); + if (CONFIG_PGTABLE_LEVELS < 5) + p4d_clear(p4d_offset(pgd, start)); + else + pgd_clear(pgd); + } } static void __init kasan_map_early_shadow(pgd_t *pgd) -- 2.11.0
cfq-iosched: two questions about the hrtimer version of CFQ
Hi Jan and list,

When testing the hrtimer version of CFQ, we found a performance degradation problem which seems to be caused by commit 0b31c10 ("cfq-iosched: Charge at least 1 jiffie instead of 1 ns").

The following is the test process:

* filesystem and block device
  * XFS + /dev/sda mounted on /tmp/sda
* CFQ configuration
  * default configurations
* fio job configuration

[global]
bs=4k
ioengine=psync
iodepth=1
direct=1
rw=randwrite
time_based
runtime=15
cgroup_nodelete=1
group_reporting=1

[cfq_a]
filename=/tmp/sda/cfq_a.dat
size=2G
cgroup_weight=500
cgroup=cfq_a
thread=1
numjobs=2

[cfq_b]
new_group
filename=/tmp/sda/cfq_b.dat
size=2G
rate=4m
cgroup_weight=500
cgroup=cfq_b
thread=1
numjobs=2

The following is the test result:

* with 0b31c10:
  * fio report
    cfq_a: bw=5312.6KB/s, iops=1328
    cfq_b: bw=8192.6KB/s, iops=2048
  * blkcg debug files
    ./cfq_a/blkio.group_wait_time:8:0 12062571233
    ./cfq_b/blkio.group_wait_time:8:0 155841600
    ./cfq_a/blkio.io_serviced:Total 19922
    ./cfq_b/blkio.io_serviced:Total 30722
    ./cfq_a/blkio.time:8:0 19406083246
    ./cfq_b/blkio.time:8:0 19417146869
* without 0b31c10:
  * fio report
    cfq_a: bw=21670KB/s, iops=5417
    cfq_b: bw=8191.2KB/s, iops=2047
  * blkcg debug files
    ./cfq_a/blkio.group_wait_time:8:0 5798452504
    ./cfq_b/blkio.group_wait_time:8:0 5131844007
    ./cfq_a/blkio.io_serviced:8:0 Write 81261
    ./cfq_b/blkio.io_serviced:8:0 Write 30722
    ./cfq_a/blkio.time:8:0 5642608173
    ./cfq_b/blkio.time:8:0 5849949812

We want to know why you reverted the minimal used slice to 1 jiffy when the slice has not been allocated. Does it lead to some performance regressions or something similar? If not, I think we could revert the minimal slice to 1 ns again.

Another problem is about the time comparison in the CFQ code. The non-hrtimer version of CFQ uses time_after or time_before when possible; why doesn't the hrtimer version use the equivalent time_after64/time_before64? Can ktime_get_ns() ensure there will be no wrapping problem?

Thanks very much.

Regards,
Tao
[RESEND PATCH v3 4/8] phy: phy-mt65xx-usb3: move clock from phy node into port nodes
the reference clock of HighSpeed port is 48M which comes from PLL; the reference clock of SuperSpeed port is 26M which usually comes from 26M oscillator directly, but some SoCs are not, add it for compatibility, and put them into port node for flexibility. Signed-off-by: Chunfeng Yun --- drivers/phy/phy-mt65xx-usb3.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/drivers/phy/phy-mt65xx-usb3.c b/drivers/phy/phy-mt65xx-usb3.c index 7fff482..f4a3505 100644 --- a/drivers/phy/phy-mt65xx-usb3.c +++ b/drivers/phy/phy-mt65xx-usb3.c @@ -153,6 +153,7 @@ struct mt65xx_phy_pdata { struct mt65xx_phy_instance { struct phy *phy; void __iomem *port_base; + struct clk *ref_clk;/* reference clock of anolog phy */ u32 index; u8 type; }; @@ -160,7 +161,6 @@ struct mt65xx_phy_instance { struct mt65xx_u3phy { struct device *dev; void __iomem *sif_base; /* only shared sif */ - struct clk *u3phya_ref; /* reference clock of usb3 anolog phy */ const struct mt65xx_phy_pdata *pdata; struct mt65xx_phy_instance **phys; int nphys; @@ -449,9 +449,9 @@ static int mt65xx_phy_init(struct phy *phy) struct mt65xx_u3phy *u3phy = dev_get_drvdata(phy->dev.parent); int ret; - ret = clk_prepare_enable(u3phy->u3phya_ref); + ret = clk_prepare_enable(instance->ref_clk); if (ret) { - dev_err(u3phy->dev, "failed to enable u3phya_ref\n"); + dev_err(u3phy->dev, "failed to enable ref_clk\n"); return ret; } @@ -494,7 +494,7 @@ static int mt65xx_phy_exit(struct phy *phy) if (instance->type == PHY_TYPE_USB2) phy_instance_exit(u3phy, instance); - clk_disable_unprepare(u3phy->u3phya_ref); + clk_disable_unprepare(instance->ref_clk); return 0; } @@ -594,12 +594,6 @@ static int mt65xx_u3phy_probe(struct platform_device *pdev) return PTR_ERR(u3phy->sif_base); } - u3phy->u3phya_ref = devm_clk_get(dev, "u3phya_ref"); - if (IS_ERR(u3phy->u3phya_ref)) { - dev_err(dev, "error to get u3phya_ref\n"); - return PTR_ERR(u3phy->u3phya_ref); - } - port = 0; for_each_child_of_node(np, child_np) { struct 
mt65xx_phy_instance *instance; @@ -634,6 +628,13 @@ static int mt65xx_u3phy_probe(struct platform_device *pdev) goto put_child; } + instance->ref_clk = devm_clk_get(&phy->dev, "ref"); + if (IS_ERR(instance->ref_clk)) { + dev_err(dev, "failed to get ref_clk(id-%d)\n", port); + retval = PTR_ERR(instance->ref_clk); + goto put_child; + } + instance->phy = phy; instance->index = port; phy_set_drvdata(phy, instance); -- 1.7.9.5
[RESEND PATCH v3 8/8] dt-bindings: phy-mt65xx-usb: add support for new version phy
add a new compatible string for "mt2712", and move reference clock into each port node; Signed-off-by: Chunfeng Yun Acked-by: Rob Herring --- .../devicetree/bindings/phy/phy-mt65xx-usb.txt | 93 +--- 1 file changed, 80 insertions(+), 13 deletions(-) diff --git a/Documentation/devicetree/bindings/phy/phy-mt65xx-usb.txt b/Documentation/devicetree/bindings/phy/phy-mt65xx-usb.txt index 33a2b1e..0acc5a9 100644 --- a/Documentation/devicetree/bindings/phy/phy-mt65xx-usb.txt +++ b/Documentation/devicetree/bindings/phy/phy-mt65xx-usb.txt @@ -6,12 +6,11 @@ This binding describes a usb3.0 phy for mt65xx platforms of Medaitek SoC. Required properties (controller (parent) node): - compatible : should be one of "mediatek,mt2701-u3phy" + "mediatek,mt2712-u3phy" "mediatek,mt8173-u3phy" - - reg : offset and length of register for phy, exclude port's - register. - - clocks : a list of phandle + clock-specifier pairs, one for each - entry in clock-names - - clock-names : must contain + - clocks : (deprecated, use port's clocks instead) a list of phandle + + clock-specifier pairs, one for each entry in clock-names + - clock-names : (deprecated, use port's one instead) must contain "u3phya_ref": for reference clock of usb3.0 analog phy. Required nodes : a sub-node is required for each port the controller @@ -19,8 +18,19 @@ Required nodes : a sub-node is required for each port the controller 'reg' property is used inside these nodes to describe the controller's topology. +Optional properties (controller (parent) node): + - reg : offset and length of register shared by multiple ports, + exclude port's private register. It is needed on mt2701 + and mt8173, but not on mt2712. + Required properties (port (child) node): - reg : address and length of the register set for the port. 
+- clocks : a list of phandle + clock-specifier pairs, one for each + entry in clock-names +- clock-names : must contain + "ref": 48M reference clock for HighSpeed analog phy; and 26M + reference clock for SuperSpeed analog phy, sometimes is + 24M, 25M or 27M, depended on platform. - #phy-cells : should be 1 (See second example) cell after port phandle is phy type from: - PHY_TYPE_USB2 @@ -31,21 +41,31 @@ Example: u3phy: usb-phy@1129 { compatible = "mediatek,mt8173-u3phy"; reg = <0 0x1129 0 0x800>; - clocks = <&apmixedsys CLK_APMIXED_REF2USB_TX>; - clock-names = "u3phya_ref"; #address-cells = <2>; #size-cells = <2>; ranges; status = "okay"; - phy_port0: port@11290800 { - reg = <0 0x11290800 0 0x800>; + u2port0: usb-phy@11290800 { + reg = <0 0x11290800 0 0x100>; + clocks = <&apmixedsys CLK_APMIXED_REF2USB_TX>; + clock-names = "ref"; #phy-cells = <1>; status = "okay"; }; - phy_port1: port@11291000 { - reg = <0 0x11291000 0 0x800>; + u3port0: usb-phy@11290900 { + reg = <0 0x11290800 0 0x700>; + clocks = <&clk26m>; + clock-names = "ref"; + #phy-cells = <1>; + status = "okay"; + }; + + u2port1: usb-phy@11291000 { + reg = <0 0x11291000 0 0x100>; + clocks = <&apmixedsys CLK_APMIXED_REF2USB_TX>; + clock-names = "ref"; #phy-cells = <1>; status = "okay"; }; @@ -64,7 +84,54 @@ Example: usb30: usb@1127 { ... - phys = <&phy_port0 PHY_TYPE_USB3>; - phy-names = "usb3-0"; + phys = <&u2port0 PHY_TYPE_USB2>, <&u3port0 PHY_TYPE_USB3>; + phy-names = "usb2-0", "usb3-0"; ... }; + + +Layout differences of banks between mt8173/mt2701 and mt2712 +- +mt8173 and mt2701: +portoffsetbank +shared 0xSPLLC +0x0100FMREG +u2 port00x0800U2PHY_COM +u3 port00x0900U3PHYD +0x0a00U3PHYD_BANK2 +0x0b00U3PHYA +0x0c00U3PHYA_DA +u2 port10x1000U2PHY_COM +u3 port10x1100U3PHYD +0x1200U3PHYD_BANK2 +0x1300U3PHYA +0x1400U3PHYA_DA +u2 port20x1800U2PHY_COM +... 
+ +mt2712: +portoffsetbank +u2 port00xMISC +0x0100FMREG +0x0300U2PHY_COM +u3 port00x0700SPLLC +0x0800CHIP +0x0900U3PHYD +0x0a00U3PHYD_BANK2 +0x0b00U3PHYA +0x0c00U3PHY
Re: [PATCH] HID: get rid of HID_QUIRK_NO_INIT_REPORTS
On Mar 06 2017 or thereabouts, Jiri Kosina wrote: > On Thu, 5 Jan 2017, Benjamin Tissoires wrote: > > > For case 1, the hiddev documentation provides an ioctl to do the > > init manually. A solution could be to retrieve the requested report > > when EVIOCGUSAGE is called, in the same way hidraw does. I would be > > tempted to not change the behavior and hope that we won't break any > > userspace tool. > > I'd like to be applying the HID_QUIRK_NO_INIT_REPORTS removal as soon as > possible so that it gets exposure in linux-next over the whole development > cycle. > > I am however too conservative to ignore the potential hiddev breakage, I > am afraid. This has a real potential of breaking systems, and > administrators having hard time figuring out of happened; essentialy, this > is userspace-visible behavior change (regression) for which we haven't > done any long-term depreciation (such as printing a warning "please talk > to your hiddev driver vendor" in case the driver seems to assume > initialized reports) at least for a few years. > > I think that either doing it at a connect time, or during first > EVIOCGUSAGE ioctl() call is a must. Yes, that's what I was thinking to do too. Also, I think we need to keep around the list of currently "quirked" devices for hiddev to work properly. I am still wondering whether we should simply keep the list of quirked devices in hid-core, but disable the effects, or move the full list of quirked devices in hiddev. Initially I thought it was better to remove the quirk from core and move the list in hiddev, but on the other hand, that means that we will remove the ability to introduce it from the kernel boot command, so maybe keeping the list in its current state is better, and only have the effects in hiddev. Am I clear enough?) > > Otherwise, I'd be super-happy to finally get rid of this giant PITA. > Me too! Cheers, Benjamin > Thanks! > > -- > Jiri Kosina > SUSE Labs >
Re: [PATCH v2 6/9] kasan: improve slab object description
On Fri, Mar 3, 2017 at 3:39 PM, Andrey Ryabinin wrote: > > > On 03/03/2017 04:52 PM, Alexander Potapenko wrote: >> On Fri, Mar 3, 2017 at 2:31 PM, Andrey Ryabinin >> wrote: >>> On 03/02/2017 04:48 PM, Andrey Konovalov wrote: Changes slab object description from: Object at 880068388540, in cache kmalloc-128 size: 128 to: The buggy address belongs to the object at 880068388540 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 123 bytes inside of 128-byte region [880068388540, 8800683885c0) Makes it more explanatory and adds information about relative offset of the accessed address to the start of the object. >>> >>> I don't think that this is an improvement. You replaced one simple line >>> with a huge >>> and hard to parse text without giving any new/useful information. >>> Except maybe offset, it useful sometimes, so wouldn't mind adding it to >>> description. >> Agreed. >> How about: >> === >> Access 123 bytes inside of 128-byte region [880068388540, >> 8800683885c0) >> Object at 880068388540 belongs to the cache kmalloc-128 >> === >> ? >> > > I would just add the offset in the end: > Object at 880068388540, in cache kmalloc-128 size: 128 accessed > at offset y Access can be inside or outside the object, so it's better to specifically say that. I think we can do (basically what Alexander suggested): Object at 880068388540 belongs to the cache kmalloc-128 of size 128 Access 123 bytes inside of 128-byte region [880068388540, 8800683885c0) What do you think?
Re: Question Regarding ERMS memcpy
On March 6, 2017 5:33:28 AM PST, Borislav Petkov wrote: >On Mon, Mar 06, 2017 at 12:01:10AM -0700, Logan Gunthorpe wrote: >> Well honestly my issue was solved by fixing my kernel config. I have >no >> idea why I had optimize for size in there in the first place. > >I still think that we should address the iomem memcpy Linus mentioned. >So how about this partial revert. I've made 32-bit use the same special >__memcpy() version. > >Hmmm? > >--- >diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h >index 7afb0e2f07f4..9e378a10796d 100644 >--- a/arch/x86/include/asm/io.h >+++ b/arch/x86/include/asm/io.h >@@ -201,6 +201,7 @@ extern void set_iounmap_nonlazy(void); > #ifdef __KERNEL__ > > #include >+#include > > /* > * Convert a virtual cached pointer to an uncached pointer >@@ -227,12 +228,13 @@ memset_io(volatile void __iomem *addr, unsigned >char val, size_t count) > * @src: The (I/O memory) source for the data > * @count:The number of bytes to copy > * >- * Copy a block of data from I/O memory. >+ * Copy a block of data from I/O memory. IO memory is different from >+ * cached memory so we use special memcpy version. > */ > static inline void >memcpy_fromio(void *dst, const volatile void __iomem *src, size_t >count) > { >- memcpy(dst, (const void __force *)src, count); >+ __inline_memcpy(dst, (const void __force *)src, count); > } > > /** >@@ -241,12 +243,13 @@ memcpy_fromio(void *dst, const volatile void >__iomem *src, size_t count) > * @src: The (RAM) source for the data > * @count:The number of bytes to copy > * >- * Copy a block of data to I/O memory. >+ * Copy a block of data to I/O memory. IO memory is different from >+ * cached memory so we use special memcpy version. 
> */ > static inline void > memcpy_toio(volatile void __iomem *dst, const void *src, size_t count) > { >- memcpy((void __force *)dst, src, count); >+ __inline_memcpy((void __force *)dst, src, count); > } > > /* >diff --git a/arch/x86/include/asm/string_32.h >b/arch/x86/include/asm/string_32.h >index 3d3e8353ee5c..556fa4a975ff 100644 >--- a/arch/x86/include/asm/string_32.h >+++ b/arch/x86/include/asm/string_32.h >@@ -29,6 +29,7 @@ extern char *strchr(const char *s, int c); > #define __HAVE_ARCH_STRLEN > extern size_t strlen(const char *s); > >+#define __inline_memcpy __memcpy >static __always_inline void *__memcpy(void *to, const void *from, >size_t n) > { > int d0, d1, d2; It isn't really that straightforward IMO. For UC memory transaction size really needs to be specified explicitly at all times and should be part of the API, rather than implicit. For WC/WT/WB device memory, the ordinary memcpy is valid and preferred. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v5 1/2] perf sdt: add scanning of sdt probes arguments
On Wed, 14 Dec 2016 01:07:31 +0100 Alexis Berlemont wrote: > During a "perf buildid-cache --add" command, the section > ".note.stapsdt" of the "added" binary is scanned in order to list the > available SDT markers available in a binary. The parts containing the > probes arguments were left unscanned. > > The whole section is now parsed; the probe arguments are extracted for > later use. > Looks good to me. Acked-by: Masami Hiramatsu Thanks! > Signed-off-by: Alexis Berlemont > --- > tools/perf/util/symbol-elf.c | 25 +++-- > tools/perf/util/symbol.h | 1 + > 2 files changed, 24 insertions(+), 2 deletions(-) > > diff --git a/tools/perf/util/symbol-elf.c b/tools/perf/util/symbol-elf.c > index 99400b0..7725c3f 100644 > --- a/tools/perf/util/symbol-elf.c > +++ b/tools/perf/util/symbol-elf.c > @@ -1822,7 +1822,7 @@ void kcore_extract__delete(struct kcore_extract *kce) > static int populate_sdt_note(Elf **elf, const char *data, size_t len, >struct list_head *sdt_notes) > { > - const char *provider, *name; > + const char *provider, *name, *args; > struct sdt_note *tmp = NULL; > GElf_Ehdr ehdr; > GElf_Addr base_off = 0; > @@ -1881,6 +1881,25 @@ static int populate_sdt_note(Elf **elf, const char > *data, size_t len, > goto out_free_prov; > } > > + args = memchr(name, '\0', data + len - name); > + > + /* > + * There is no argument if: > + * - We reached the end of the note; > + * - There is not enough room to hold a potential string; > + * - The argument string is empty or just contains ':'. 
> + */ > + if (args == NULL || data + len - args < 2 || > + args[1] == ':' || args[1] == '\0') > + tmp->args = NULL; > + else { > + tmp->args = strdup(++args); > + if (!tmp->args) { > + ret = -ENOMEM; > + goto out_free_name; > + } > + } > + > if (gelf_getclass(*elf) == ELFCLASS32) { > memcpy(&tmp->addr, &buf, 3 * sizeof(Elf32_Addr)); > tmp->bit32 = true; > @@ -1892,7 +1911,7 @@ static int populate_sdt_note(Elf **elf, const char > *data, size_t len, > if (!gelf_getehdr(*elf, &ehdr)) { > pr_debug("%s : cannot get elf header.\n", __func__); > ret = -EBADF; > - goto out_free_name; > + goto out_free_args; > } > > /* Adjust the prelink effect : > @@ -1917,6 +1936,8 @@ static int populate_sdt_note(Elf **elf, const char > *data, size_t len, > list_add_tail(&tmp->note_list, sdt_notes); > return 0; > > +out_free_args: > + free(tmp->args); > out_free_name: > free(tmp->name); > out_free_prov: > diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h > index 6c358b7..9222c7e 100644 > --- a/tools/perf/util/symbol.h > +++ b/tools/perf/util/symbol.h > @@ -351,6 +351,7 @@ int arch__choose_best_symbol(struct symbol *syma, struct > symbol *symb); > struct sdt_note { > char *name; /* name of the note*/ > char *provider; /* provider name */ > + char *args; > bool bit32; /* whether the location is 32 bits? */ > union { /* location, base and semaphore addrs */ > Elf64_Addr a64[3]; > -- > 2.10.2 > -- Masami Hiramatsu
[PATCH 7/7] jbd2: make the whole kjournald2 kthread NOFS safe
From: Michal Hocko kjournald2 is central to the transaction commit processing. As such any potential allocation from this kernel thread has to be GFP_NOFS. Make sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save. Suggested-by: Jan Kara Reviewed-by: Jan Kara Signed-off-by: Michal Hocko --- fs/jbd2/journal.c | 8 1 file changed, 8 insertions(+) diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index a1a359bfcc9c..78433ce1db40 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -43,6 +43,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS #include @@ -206,6 +207,13 @@ static int kjournald2(void *arg) wake_up(&journal->j_wait_done_commit); /* +* Make sure that no allocations from this kernel thread will ever recurse +* to the fs layer because we are responsible for the transaction commit +* and any fs involvement might get stuck waiting for the trasn. commit. +*/ + memalloc_nofs_save(); + + /* * And now, wait forever for commit wakeup events. */ write_lock(&journal->j_state_lock); -- 2.11.0
[PATCH 3/7] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
From: Michal Hocko xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite some time ago. We would like to make this concept more generic and use it for other filesystems as well. Let's start by giving the flag a more generic name PF_MEMALLOC_NOFS which is in line with an exiting PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO contexts. Replace all PF_FSTRANS usage from the xfs code in the first step before we introduce a full API for it as xfs uses the flag directly anyway. This patch doesn't introduce any functional change. Acked-by: Vlastimil Babka Reviewed-by: Darrick J. Wong Reviewed-by: Brian Foster Signed-off-by: Michal Hocko --- fs/xfs/kmem.c | 4 ++-- fs/xfs/kmem.h | 2 +- fs/xfs/libxfs/xfs_btree.c | 2 +- fs/xfs/xfs_aops.c | 6 +++--- fs/xfs/xfs_trans.c| 12 ++-- include/linux/sched.h | 2 ++ 6 files changed, 15 insertions(+), 13 deletions(-) diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c index 2dfdc62f795e..e14da724a0b5 100644 --- a/fs/xfs/kmem.c +++ b/fs/xfs/kmem.c @@ -81,13 +81,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags) * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering * the filesystem here and potentially deadlocking. 
*/ - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS)) + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) noio_flag = memalloc_noio_save(); lflags = kmem_flags_convert(flags); ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL); - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS)) + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) memalloc_noio_restore(noio_flag); return ptr; diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h index 689f746224e7..d973dbfc2bfa 100644 --- a/fs/xfs/kmem.h +++ b/fs/xfs/kmem.h @@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags) lflags = GFP_ATOMIC | __GFP_NOWARN; } else { lflags = GFP_KERNEL | __GFP_NOWARN; - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS)) + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) lflags &= ~__GFP_FS; } diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index c3decedc9455..3059a3ec7ecb 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -2886,7 +2886,7 @@ xfs_btree_split_worker( struct xfs_btree_split_args *args = container_of(work, struct xfs_btree_split_args, work); unsigned long pflags; - unsigned long new_pflags = PF_FSTRANS; + unsigned long new_pflags = PF_MEMALLOC_NOFS; /* * we are in a transaction context here, but may also be doing work diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index bf65a9ea8642..330c6019120e 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc( * We hand off the transaction to the completion thread now, so * clear the flag here. */ - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); return 0; } @@ -252,7 +252,7 @@ xfs_setfilesize_ioend( * thus we need to mark ourselves as being in a transaction manually. * Similarly for freeze protection. 
*/ - current_set_flags_nested(&tp->t_pflags, PF_FSTRANS); + current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS); /* we abort the update if there was an IO error */ @@ -1021,7 +1021,7 @@ xfs_do_writepage( * Given that we do not allow direct reclaim to call us, we should * never be called while in a filesystem transaction. */ - if (WARN_ON_ONCE(current->flags & PF_FSTRANS)) + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS)) goto redirty; /* diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 70f42ea86dfb..f5969c8274fc 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -134,7 +134,7 @@ xfs_trans_reserve( boolrsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0; /* Mark this thread as being in a transaction */ - current_set_flags_nested(&tp->t_pflags, PF_FSTRANS); + current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); /* * Attempt to reserve the needed disk blocks by decrementing @@ -144,7 +144,7 @@ xfs_trans_reserve( if (blocks > 0) { error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd); if (error != 0) { - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS); +
Re: [PATCH v2] arm64: kvm: Use has_vhe() instead of hyp_alternate_select()
Hi Marc, On 03/06/2017 02:34 AM, Marc Zyngier wrote: Hi Shanker, On Mon, Mar 06 2017 at 2:33:18 am GMT, Shanker Donthineni wrote: Now all the cpu_hwcaps features have their own static keys. We don't need a separate function hyp_alternate_select() to patch the vhe/nvhe code. We can achieve the same functionality by using has_vhe(). It improves the code readability, uses the jump label instructions, and the compiler also generates better code with fewer instructions. How do you define "better"? Which compiler? Do you have any benchmarking data? I'm using gcc version 5.2.0. With has_vhe() it shows a smaller code size, as shown below. I tried to benchmark the code changes using Christoffer's microbench tool, but I am not seeing a noticeable difference on the QDF2400 platform. hyp_alternate_select() uses BR/BLR instructions to patch vhe/nvhe code, which is not good for branch prediction purposes. The compiler treats the patched code as a function call, so the contents of registers x0-x18 are not reusable after the vhe/nvhe call.
Current code: arch/arm64/kvm/hyp/switch.o: file format elf64-littleaarch64 Sections: Idx Name Size VMA LMA File off Algn 0 .text 0040 2**0 CONTENTS, ALLOC, LOAD, READONLY, CODE 1 .data 0040 2**0 CONTENTS, ALLOC, LOAD, DATA 2 .bss 0040 2**0 ALLOC 3 .hyp.text 0550 0040 2**3 CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE New code: arch/arm64/kvm/hyp/switch.o: file format elf64-littleaarch64 Sections: Idx Name Size VMA LMA File off Algn 0 .text 0040 2**0 CONTENTS, ALLOC, LOAD, READONLY, CODE 1 .data 0040 2**0 CONTENTS, ALLOC, LOAD, DATA 2 .bss 0040 2**0 ALLOC 3 .hyp.text 0488 0040 2**3 CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE Signed-off-by: Shanker Donthineni --- v2: removed 'Change-Id: Ia8084189833f2081ff13c392deb5070c46a64038' from commit arch/arm64/kvm/hyp/debug-sr.c | 12 ++ arch/arm64/kvm/hyp/switch.c| 50 +++--- arch/arm64/kvm/hyp/sysreg-sr.c | 23 +-- 3 files changed, 43 insertions(+), 42 deletions(-) diff --git a/arch/arm64/kvm/hyp/debug-sr.c b/arch/arm64/kvm/hyp/debug-sr.c index f5154ed..e5642c2 100644 --- a/arch/arm64/kvm/hyp/debug-sr.c +++ b/arch/arm64/kvm/hyp/debug-sr.c @@ -109,9 +109,13 @@ static void __hyp_text __debug_save_spe_nvhe(u64 *pmscr_el1) dsb(nsh); } -static hyp_alternate_select(__debug_save_spe, - __debug_save_spe_nvhe, __debug_save_spe_vhe, - ARM64_HAS_VIRT_HOST_EXTN); +static void __hyp_text __debug_save_spe(u64 *pmscr_el1) +{ + if (has_vhe()) + __debug_save_spe_vhe(pmscr_el1); + else + __debug_save_spe_nvhe(pmscr_el1); +} I have two worries about this kind of thing: - Not all compilers do support jump labels, leading to a memory access on each static key (GCC 4.8, for example). This would immediately introduce a pretty big regression - The hyp_alternate_select() method doesn't introduce a fast/slow path duality. Each path has the exact same cost. I'm not keen on choosing what is supposed to be the fast path, really. Yes, it'll require a runtime check if the compiler doesn't support ASM GOTO labels. 
Agree, hyp_alternate_select() has a constant branch overhead, but it might cause a branch prediction penalty. Thanks, M. -- Shanker Donthineni Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH v4 7/7] perf/sdt: Remove stale warning
Perf used to show a warning if the user tried to record an SDT event without first creating a probe point. Now that we allow recording directly on SDT events, remove this stale warning/hint. Signed-off-by: Ravi Bangoria --- tools/lib/api/fs/tracing_path.c | 17 - 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/tools/lib/api/fs/tracing_path.c b/tools/lib/api/fs/tracing_path.c index 3e606b9..fa52e67 100644 --- a/tools/lib/api/fs/tracing_path.c +++ b/tools/lib/api/fs/tracing_path.c @@ -103,19 +103,10 @@ int tracing_path__strerror_open_tp(int err, char *buf, size_t size, * - jirka */ if (debugfs__configured() || tracefs__configured()) { - /* sdt markers */ - if (!strncmp(filename, "sdt_", 4)) { - snprintf(buf, size, - "Error:\tFile %s/%s not found.\n" - "Hint:\tSDT event cannot be directly recorded on.\n" - "\tPlease first use 'perf probe %s:%s' before recording it.\n", - tracing_events_path, filename, sys, name); - } else { - snprintf(buf, size, -"Error:\tFile %s/%s not found.\n" -"Hint:\tPerhaps this kernel misses some CONFIG_ setting to enable this feature?.\n", -tracing_events_path, filename); - } + snprintf(buf, size, +"Error:\tFile %s/%s not found.\n" +"Hint:\tPerhaps this kernel misses some CONFIG_ setting to enable this feature?.\n", +tracing_events_path, filename); break; } snprintf(buf, size, "%s", -- 2.9.3
[PATCH 6/7] jbd2: mark the transaction context with the scope GFP_NOFS context
From: Michal Hocko Now that we have the memalloc_nofs_{save,restore} API, we can mark the whole transaction context as implicitly GFP_NOFS. All allocations will automatically inherit GFP_NOFS this way. This means that we do not have to mark any of those requests with GFP_NOFS and moreover all the ext4_kv[mz]alloc(GFP_NOFS) calls are also safe now because even the hardcoded GFP_KERNEL allocations deep inside the vmalloc will be NOFS now. Reviewed-by: Jan Kara Signed-off-by: Michal Hocko --- fs/jbd2/transaction.c | 12 include/linux/jbd2.h | 2 ++ 2 files changed, 14 insertions(+) diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index 5e659ee08d6a..d8f09f34285f 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -29,6 +29,7 @@ #include #include #include +#include #include @@ -388,6 +389,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle, rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_); jbd2_journal_free_transaction(new_transaction); + /* +* Make sure that no allocations done while the transaction is +* open is going to recurse back to the fs layer. +*/ + handle->saved_alloc_context = memalloc_nofs_save(); return 0; } @@ -466,6 +472,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks, trace_jbd2_handle_start(journal->j_fs_dev->bd_dev, handle->h_transaction->t_tid, type, line_no, nblocks); + return handle; } EXPORT_SYMBOL(jbd2__journal_start); @@ -1760,6 +1767,11 @@ int jbd2_journal_stop(handle_t *handle) if (handle->h_rsv_handle) jbd2_journal_free_reserved(handle->h_rsv_handle); free_and_exit: + /* +* scope of the GFP_NOFS context is over here and so we can
+*/ + memalloc_nofs_restore(handle->saved_alloc_context); jbd2_free_handle(handle); return err; } diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h index dfaa1f4dcb0c..606b6bce3a5b 100644 --- a/include/linux/jbd2.h +++ b/include/linux/jbd2.h @@ -491,6 +491,8 @@ struct jbd2_journal_handle unsigned long h_start_jiffies; unsigned inth_requested_credits; + + unsigned intsaved_alloc_context; }; -- 2.11.0
[PATCH 1/7] lockdep: teach lockdep about memalloc_noio_save
From: Nikolay Borisov Commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O during memory allocation") added the memalloc_noio_(save|restore) functions to enable people to modify the MM behavior by disabling I/O during memory allocation. This was further extended by commit 934f3072c17c ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set"). memalloc_noio_* functions prevent allocation paths recursing back into the filesystem without explicitly changing the flags for every allocation site. However, lockdep hasn't been keeping up with the changes and it entirely misses handling the memalloc_noio adjustments. Instead, it is left to the callers of __lockdep_trace_alloc to call the function after they have shaved the respective GFP flags, which can lead to false positives: [ 644.173373] = [ 644.174012] [ INFO: inconsistent lock state ] [ 644.174012] 4.10.0-nbor #134 Not tainted [ 644.174012] - [ 644.174012] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage. [ 644.174012] fsstress/3365 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 644.174012] (&xfs_nondir_ilock_class){?.}, at: [] xfs_ilock+0x141/0x230 [ 644.174012] {IN-RECLAIM_FS-W} state was registered at: [ 644.174012] __lock_acquire+0x62a/0x17c0 [ 644.174012] lock_acquire+0xc5/0x220 [ 644.174012] down_write_nested+0x4f/0x90 [ 644.174012] xfs_ilock+0x141/0x230 [ 644.174012] xfs_reclaim_inode+0x12a/0x320 [ 644.174012] xfs_reclaim_inodes_ag+0x2c8/0x4e0 [ 644.174012] xfs_reclaim_inodes_nr+0x33/0x40 [ 644.174012] xfs_fs_free_cached_objects+0x19/0x20 [ 644.174012] super_cache_scan+0x191/0x1a0 [ 644.174012] shrink_slab+0x26f/0x5f0 [ 644.174012] shrink_node+0xf9/0x2f0 [ 644.174012] kswapd+0x356/0x920 [ 644.174012] kthread+0x10c/0x140 [ 644.174012] ret_from_fork+0x31/0x40 [ 644.174012] irq event stamp: 173777 [ 644.174012] hardirqs last enabled at (173777): [] __local_bh_enable_ip+0x70/0xc0 [ 644.174012] hardirqs last disabled at (173775): [] __local_bh_enable_ip+0x37/0xc0 [ 644.174012] softirqs last enabled at
(173776): [] _xfs_buf_find+0x67a/0xb70 [ 644.174012] softirqs last disabled at (173774): [] _xfs_buf_find+0x5db/0xb70 [ 644.174012] [ 644.174012] other info that might help us debug this: [ 644.174012] Possible unsafe locking scenario: [ 644.174012] [ 644.174012]CPU0 [ 644.174012] [ 644.174012] lock(&xfs_nondir_ilock_class); [ 644.174012] [ 644.174012] lock(&xfs_nondir_ilock_class); [ 644.174012] [ 644.174012] *** DEADLOCK *** [ 644.174012] [ 644.174012] 4 locks held by fsstress/3365: [ 644.174012] #0: (sb_writers#10){++}, at: [] mnt_want_write+0x24/0x50 [ 644.174012] #1: (&sb->s_type->i_mutex_key#12){++}, at: [] vfs_setxattr+0x6f/0xb0 [ 644.174012] #2: (sb_internal#2){++}, at: [] xfs_trans_alloc+0xfc/0x140 [ 644.174012] #3: (&xfs_nondir_ilock_class){?.}, at: [] xfs_ilock+0x141/0x230 [ 644.174012] [ 644.174012] stack backtrace: [ 644.174012] CPU: 0 PID: 3365 Comm: fsstress Not tainted 4.10.0-nbor #134 [ 644.174012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 644.174012] Call Trace: [ 644.174012] dump_stack+0x85/0xc9 [ 644.174012] print_usage_bug.part.37+0x284/0x293 [ 644.174012] ? print_shortest_lock_dependencies+0x1b0/0x1b0 [ 644.174012] mark_lock+0x27e/0x660 [ 644.174012] mark_held_locks+0x66/0x90 [ 644.174012] lockdep_trace_alloc+0x6f/0xd0 [ 644.174012] kmem_cache_alloc_node_trace+0x3a/0x2c0 [ 644.174012] ? vm_map_ram+0x2a1/0x510 [ 644.174012] vm_map_ram+0x2a1/0x510 [ 644.174012] ? vm_map_ram+0x46/0x510 [ 644.174012] _xfs_buf_map_pages+0x77/0x140 [ 644.174012] xfs_buf_get_map+0x185/0x2a0 [ 644.174012] xfs_attr_rmtval_set+0x233/0x430 [ 644.174012] xfs_attr_leaf_addname+0x2d2/0x500 [ 644.174012] xfs_attr_set+0x214/0x420 [ 644.174012] xfs_xattr_set+0x59/0xb0 [ 644.174012] __vfs_setxattr+0x76/0xa0 [ 644.174012] __vfs_setxattr_noperm+0x5e/0xf0 [ 644.174012] vfs_setxattr+0xae/0xb0 [ 644.174012] ? __might_fault+0x43/0xa0 [ 644.174012] setxattr+0x15e/0x1a0 [ 644.174012] ? __lock_is_held+0x53/0x90 [ 644.174012] ? 
rcu_read_lock_sched_held+0x93/0xa0 [ 644.174012] ? rcu_sync_lockdep_assert+0x2f/0x60 [ 644.174012] ? __sb_start_write+0x130/0x1d0 [ 644.174012] ? mnt_want_write+0x24/0x50 [ 644.174012] path_setxattr+0x8f/0xc0 [ 644.174012] SyS_lsetxattr+0x11/0x20 [ 644.174012] entry_SYSCALL_64_fastpath+0x23/0xc6 Let's fix this by making lockdep explicitly do the shaving of respective GFP flags. Fixes: 934f3072c17c ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set") Acked-by: Michal Hocko Acked-by: Peter Zijlstra (Intel) Signed-off-by: Nikolay Borisov Signed-off-by: Michal Hocko --- kernel/locking/lockdep.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 12e38
[PATCH 2/7] lockdep: allow to disable reclaim lockup detection
From: Michal Hocko The current implementation of the reclaim lockup detection can lead to false positives and those even happen and usually lead to tweak the code to silence the lockdep by using GFP_NOFS even though the context can use __GFP_FS just fine. See http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example. = [ INFO: inconsistent lock state ] 4.5.0-rc2+ #4 Tainted: G O - inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage. kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes: (&xfs_nondir_ilock_class){-+}, at: [] xfs_ilock+0x177/0x200 [xfs] {RECLAIM_FS-ON-R} state was registered at: [] mark_held_locks+0x79/0xa0 [] lockdep_trace_alloc+0xb3/0x100 [] kmem_cache_alloc+0x33/0x230 [] kmem_zone_alloc+0x81/0x120 [xfs] [] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs] [] __xfs_refcount_find_shared+0x75/0x580 [xfs] [] xfs_refcount_find_shared+0x84/0xb0 [xfs] [] xfs_getbmap+0x608/0x8c0 [xfs] [] xfs_vn_fiemap+0xab/0xc0 [xfs] [] do_vfs_ioctl+0x498/0x670 [] SyS_ioctl+0x79/0x90 [] entry_SYSCALL_64_fastpath+0x12/0x6f CPU0 lock(&xfs_nondir_ilock_class); lock(&xfs_nondir_ilock_class); *** DEADLOCK *** 3 locks held by kswapd0/543: stack backtrace: CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O4.5.0-rc2+ #4 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 82a34f10 88003aa078d0 813a14f9 88003d8551c0 88003aa07920 8110ec65 0001 8801 000b 0008 88003d855aa0 Call Trace: [] dump_stack+0x4b/0x72 [] print_usage_bug+0x215/0x240 [] mark_lock+0x1f5/0x660 [] ? print_shortest_lock_dependencies+0x1a0/0x1a0 [] __lock_acquire+0xa80/0x1e50 [] ? kmem_cache_alloc+0x15e/0x230 [] ? kmem_zone_alloc+0x81/0x120 [xfs] [] lock_acquire+0xd8/0x1e0 [] ? xfs_ilock+0x177/0x200 [xfs] [] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs] [] down_write_nested+0x5e/0xc0 [] ? 
xfs_ilock+0x177/0x200 [xfs] [] xfs_ilock+0x177/0x200 [xfs] [] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs] [] xfs_fs_evict_inode+0xdc/0x1e0 [xfs] [] evict+0xc5/0x190 [] dispose_list+0x39/0x60 [] prune_icache_sb+0x4b/0x60 [] super_cache_scan+0x14f/0x1a0 [] shrink_slab.part.63.constprop.79+0x1e9/0x4e0 [] shrink_zone+0x15e/0x170 [] kswapd+0x4f1/0xa80 [] ? zone_reclaim+0x230/0x230 [] kthread+0xf2/0x110 [] ? kthread_create_on_node+0x220/0x220 [] ret_from_fork+0x3f/0x70 [] ? kthread_create_on_node+0x220/0x220 To quote Dave: " Ignoring whether reflink should be doing anything or not, that's a "xfs_refcountbt_init_cursor() gets called both outside and inside transactions" lockdep false positive case. The problem here is lockdep has seen this allocation from within a transaction, hence a GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context. Also note that we have an active reference to this inode. So, because the reclaim annotations overload the interrupt level detections and it's seen the inode ilock been taken in reclaim ("interrupt") context, this triggers a reclaim context warning where it thinks it is unsafe to do this allocation in GFP_KERNEL context holding the inode ilock... " This sounds like a fundamental problem of the reclaim lock detection. It is really impossible to annotate such a special use case IMHO unless the reclaim lockup detection is reworked completely. Until then it is much better to provide a way to add an "I know what I am doing" flag and mark problematic places. This would prevent abuse of the GFP_NOFS flag, which has a runtime effect even on configurations which have lockdep disabled. Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to skip the current allocation request. While we are at it also make sure that the radix tree doesn't accidentally override tags stored in the upper part of the gfp_mask.
Suggested-by: Peter Zijlstra Acked-by: Peter Zijlstra (Intel) Acked-by: Vlastimil Babka Signed-off-by: Michal Hocko --- include/linux/gfp.h | 10 +- kernel/locking/lockdep.c | 4 lib/radix-tree.c | 2 ++ 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index db373b9d3223..978232a3b4ae 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -40,6 +40,11 @@ struct vm_area_struct; #define ___GFP_DIRECT_RECLAIM 0x40u #define ___GFP_WRITE 0x80u #define ___GFP_KSWAPD_RECLAIM 0x100u +#ifdef CONFIG_LOCKDEP +#define ___GFP_NOLOCKDEP 0x400u +#else +#define ___GFP_NOLOCKDEP 0 +#endif /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -179,8 +184,11 @@ struct vm_area_struct; #define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) +/* Disable lockdep for GFP context tracking */ +#define __GFP_NOLOCKDEP ((__force gf
Re: [PATCH] x86, kasan: add KASAN checks to atomic operations
On Mon, Mar 06, 2017 at 01:58:51PM +0100, Peter Zijlstra wrote: > On Mon, Mar 06, 2017 at 01:50:47PM +0100, Dmitry Vyukov wrote: > > On Mon, Mar 6, 2017 at 1:42 PM, Dmitry Vyukov wrote: > > > KASAN uses compiler instrumentation to intercept all memory accesses. > > > But it does not see memory accesses done in assembly code. > > > One notable user of assembly code is atomic operations. Frequently, > > > for example, an atomic reference decrement is the last access to an > > > object and a good candidate for a racy use-after-free. > > > > > > Add manual KASAN checks to atomic operations. > > > Note: we need checks only before asm blocks and don't need them > > > in atomic functions composed of other atomic functions > > > (e.g. load-cmpxchg loops). > > > > Peter, also pointed me at arch/x86/include/asm/bitops.h. Will add them in > > v2. > > > > > > static __always_inline void atomic_add(int i, atomic_t *v) > > > { > > > + kasan_check_write(v, sizeof(*v)); > > > asm volatile(LOCK_PREFIX "addl %1,%0" > > > : "+m" (v->counter) > > > : "ir" (i)); > > > So the problem is doing load/stores from asm bits, and GCC > (traditionally) doesn't try and interpret APP asm bits. > > However, could we not write a GCC plugin that does exactly that? > Something that interprets the APP asm bits and generates these KASAN > bits that go with it? Another suspect is the per-cpu stuff, that's all asm foo as well.
Re: [PATCH v5 02/11] phy: exynos-ufs: add UFS PHY driver for EXYNOS SoC
Hi, On Monday 06 March 2017 05:12 PM, Alim Akhtar wrote: > Hi Kishon > > On 03/01/2017 10:07 AM, Kishon Vijay Abraham I wrote: >> Hi, >> >> On Tuesday 28 February 2017 01:51 PM, Alim Akhtar wrote: >>> Hi Kishon, >>> >>> On 02/28/2017 09:04 AM, Kishon Vijay Abraham I wrote: Hi, On Monday 27 February 2017 07:40 PM, Alim Akhtar wrote: > Hi Kishon, > > On 02/27/2017 10:56 AM, Kishon Vijay Abraham I wrote: >> Hi, >> >> On Thursday 23 February 2017 12:20 AM, Alim Akhtar wrote: >>> On Fri, Feb 3, 2017 at 2:49 PM, Alim Akhtar >>> wrote: Hi Kishon, On 11/19/2015 07:09 PM, Kishon Vijay Abraham I wrote: > > Hi, > > On Tuesday 17 November 2015 01:41 PM, Alim Akhtar wrote: >> >> Hi >> Thanks again for looking into this. >> >> On 11/17/2015 11:46 AM, Kishon Vijay Abraham I wrote: >>> >>> Hi, >>> >>> On Monday 09 November 2015 10:56 AM, Alim Akhtar wrote: From: Seungwon Jeon This patch introduces Exynos UFS PHY driver. This driver supports to deal with phy calibration and power control according to UFS host driver's behavior. Signed-off-by: Seungwon Jeon Signed-off-by: Alim Akhtar Cc: Kishon Vijay Abraham I --- drivers/phy/Kconfig|7 ++ drivers/phy/Makefile |1 + drivers/phy/phy-exynos-ufs.c | 241 drivers/phy/phy-exynos-ufs.h | 85 + drivers/phy/phy-exynos7-ufs.h | 89 + include/linux/phy/phy-exynos-ufs.h | 85 + 6 files changed, 508 insertions(+) create mode 100644 drivers/phy/phy-exynos-ufs.c create mode 100644 drivers/phy/phy-exynos-ufs.h create mode 100644 drivers/phy/phy-exynos7-ufs.h create mode 100644 include/linux/phy/phy-exynos-ufs.h diff --git a/drivers/phy/Kconfig b/drivers/phy/Kconfig index 7eb5859dd035..7d38a92e0297 100644 --- a/drivers/phy/Kconfig +++ b/drivers/phy/Kconfig @@ -389,4 +389,11 @@ config PHY_CYGNUS_PCIE Enable this to support the Broadcom Cygnus PCIe PHY. If unsure, say N. 
+config PHY_EXYNOS_UFS +tristate "EXYNOS SoC series UFS PHY driver" +depends on OF && ARCH_EXYNOS || COMPILE_TEST +select GENERIC_PHY +help + Support for UFS PHY on Samsung EXYNOS chipsets. + endmenu diff --git a/drivers/phy/Makefile b/drivers/phy/Makefile index 075db1a81aa5..9bec4d1a89e1 100644 --- a/drivers/phy/Makefile +++ b/drivers/phy/Makefile @@ -10,6 +10,7 @@ obj-$(CONFIG_ARMADA375_USBCLUSTER_PHY)+= phy-armada375-usb2.o obj-$(CONFIG_BCM_KONA_USB2_PHY)+= phy-bcm-kona-usb2.o obj-$(CONFIG_PHY_EXYNOS_DP_VIDEO)+= phy-exynos-dp-video.o obj-$(CONFIG_PHY_EXYNOS_MIPI_VIDEO)+= phy-exynos-mipi-video.o +obj-$(CONFIG_PHY_EXYNOS_UFS)+= phy-exynos-ufs.o obj-$(CONFIG_PHY_LPC18XX_USB_OTG)+= phy-lpc18xx-usb-otg.o obj-$(CONFIG_PHY_PXA_28NM_USB2)+= phy-pxa-28nm-usb2.o obj-$(CONFIG_PHY_PXA_28NM_HSIC)+= phy-pxa-28nm-hsic.o diff --git a/drivers/phy/phy-exynos-ufs.c b/drivers/phy/phy-exynos-ufs.c new file mode 100644 index ..cb1aeaa3d4eb --- /dev/null +++ b/drivers/phy/phy-exynos-ufs.c @@ -0,0 +1,241 @@ +/* + * UFS PHY driver for Samsung EXYNOS SoC + * + * Copyright (C) 2015 Samsung Electronics Co., Ltd. + * Author: Seungwon Jeon + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "phy-exynos-ufs.h" >>
Re: [RFC PATCH 10/12] staging: android: ion: Use CMA APIs directly
Hi Daniel, On Monday 06 Mar 2017 11:32:04 Daniel Vetter wrote: > On Fri, Mar 03, 2017 at 10:50:20AM -0800, Laura Abbott wrote: > > On 03/03/2017 08:41 AM, Laurent Pinchart wrote: > >> On Thursday 02 Mar 2017 13:44:42 Laura Abbott wrote: > >>> When CMA was first introduced, its primary use was for DMA allocation > >>> and the only way to get CMA memory was to call dma_alloc_coherent. This > >>> put Ion in an awkward position since there was no device structure > >>> readily available and setting one up messed up the coherency model. > >>> These days, CMA can be allocated directly from the APIs. Switch to > >>> using this model to avoid needing a dummy device. This also avoids > >>> awkward caching questions. > >> > >> If the DMA mapping API isn't suitable for today's requirements anymore, > >> I believe that's what needs to be fixed, instead of working around the > >> problem by introducing another use-case-specific API. > > > > I don't think this is a usecase specific API. CMA has been decoupled from > > DMA already because it's used in other places. The trying to go through > > DMA was just another layer of abstraction, especially since there isn't > > a device available for allocation. > > Also, we've had separation of allocation and dma-mapping since forever, > that's how it works almost everywhere. Not exactly sure why/how arm-soc > ecosystem ended up focused so much on dma_alloc_coherent. I believe because that was the easy way to specify memory constraints. The API receives a device pointer and will allocate memory suitable for DMA for that device. The fact that it maps it to the device is a side-effect in my opinion. > I think separating allocation from dma mapping/coherency is perfectly > fine, and the way to go. Especially given that in many cases we'll want to share buffers between multiple devices, so we'll need to map them multiple times. 
My point still stands though, if we want to move towards a model where allocation and mapping are decoupled, we need an allocation function that takes constraints (possibly implemented with two layers, a constraint resolution layer on top of a pool/heap/type/foo-based allocator), and a mapping API. IOMMU handling being integrated in the DMA mapping API we're currently stuck with it, which might call for brushing up that API. -- Regards, Laurent Pinchart
[PATCH] LOCAL / input: touchscreen: fix semicolon.cocci warnings
Remove unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci CC: Beomho Seo Signed-off-by: Julia Lawall Signed-off-by: Fengguang Wu --- I also received the following warning from kbuild, without any other information: drivers/input/touchscreen/fts_ts.c:750:1-6: WARNING: invalid free of devm_ allocated data tree: https://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos.git exynos-drm-next-tm2 head: 41f00580dc0f947b7788a1b5f57f793dea49ee9a commit: 15a1244b5349543dfc629b1eda799f0008dbd8bd [7/38] LOCAL / input: touchscreen: Add FTS_TS touchsreen driver fts_ts.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- a/drivers/input/touchscreen/fts_ts.c +++ b/drivers/input/touchscreen/fts_ts.c @@ -558,12 +558,12 @@ static struct fts_i2c_platform_data *fts if (of_property_read_u32(np, "x-size", &pdata->max_x)) { dev_err(dev, "failed to get x-size property\n"); return NULL; - }; + } if (of_property_read_u32(np, "y-size", &pdata->max_y)) { dev_err(dev, "failed to get y-size property\n"); return NULL; - }; + } pdata->keys_en = of_property_read_bool(np, "touch-key-connected");
Re: [PATCH] HID: usbhid: Use pr_ and remove unnecessary OOM messages
On Wed, 1 Mar 2017, Joe Perches wrote: > Use a more common logging style and remove the unnecessary > OOM messages as there is default dump_stack when OOM. > > Miscellanea: > > o Hoist an assignment in an if > o Realign arguments > o Realign a deeply indented if descendent above a printk > > Signed-off-by: Joe Perches Applied to for-4.12/upstream. Thanks, -- Jiri Kosina SUSE Labs
Re: [PATCH 0/5] perf/sdt: Argument support for x86 and powerpc
On Mon, 6 Mar 2017 13:23:30 +0530 Ravi Bangoria wrote: > > > On Tuesday 07 February 2017 08:25 AM, Masami Hiramatsu wrote: > > On Thu, 2 Feb 2017 16:41:38 +0530 > > Ravi Bangoria wrote: > > > >> The v5 patchset for sdt marker argument support for x86 [1] has > >> couple of issues. For example, it still has x86 specific code > >> in general code. It lacks support for rNN (with size postfix > >> b/w/d), %rsp, %esp, %sil etc. registers and such sdt markers > >> are failing at 'perf probe'. It also fails to convert arguments > >> having no offset but still surrounds register with parenthesis > >> for ex. 8@(%rdi) is converted to +(%di):u64 which is rejected > >> by uprobe_events. It's causing failure at 'perf probe' for all > >> SDT events on all archs except x86. With this patchset, I've > >> solved these issues. (patch 2,3) > >> > >> Also, existing perf shows misleading message when user tries to > >> record sdt event without probing it. I've prepared patch for > >> the same. (patch 1) > >> > >> Apart from that, I've also added logic to support arguments with > >> sdt marker on powerpc. (patch 4) > >> > >> There are cases where uprobe definition of sdt event goes beyond > >> current limit MAX_CMDLEN (256) and in such case perf fails with > >> seg fault. I've solve this issue. (patch 5) > >> > >> Note: This patchset is prepared on top of Alexis' v5 series.[1] > >> > >> [1] > >> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1292251.html > > Hmm, I must missed it. I'll check it... > > > > Hi Masami, > > Can you please review this. Thanks for kicking me :) -- Masami Hiramatsu
[PATCH 0/7 v5] scope GFP_NOFS api
Hi, I have posted the previous version here [1]. There are no real changes in the implementation since then. I've just added "lockdep: teach lockdep about memalloc_noio_save" from Nikolay which is a lockdep bugfix developed independently but "mm: introduce memalloc_nofs_{save,restore} API" depends on it so I added it here. Then I've rebased the series on top of 4.11-rc1 which contains sched.h split up which required to add sched/mm.h include. There didn't seem to be any real objections and so I think we should go and finally merge this - ideally in this release cycle as it doesn't really introduce any functional changes. Those were separated out and will be posted later. The risk of regressions should really be small because we do not remove any real GFP_NOFS users yet. Diffstat says fs/jbd2/journal.c | 8 fs/jbd2/transaction.c | 12 fs/xfs/kmem.c | 12 ++-- fs/xfs/kmem.h | 2 +- fs/xfs/libxfs/xfs_btree.c | 2 +- fs/xfs/xfs_aops.c | 6 +++--- fs/xfs/xfs_buf.c | 8 fs/xfs/xfs_trans.c| 12 ++-- include/linux/gfp.h | 18 +- include/linux/jbd2.h | 2 ++ include/linux/sched.h | 6 +++--- include/linux/sched/mm.h | 26 +++--- kernel/locking/lockdep.c | 11 +-- lib/radix-tree.c | 2 ++ mm/page_alloc.c | 10 ++ mm/vmscan.c | 6 +++--- 16 files changed, 106 insertions(+), 37 deletions(-) Shortlog: Michal Hocko (6): lockdep: allow to disable reclaim lockup detection xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS mm: introduce memalloc_nofs_{save,restore} API xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio* jbd2: mark the transaction context with the scope GFP_NOFS context jbd2: make the whole kjournald2 kthread NOFS safe Nikolay Borisov (1): lockdep: teach lockdep about memalloc_noio_save [1] http://lkml.kernel.org/r/20170206140718.16222-1-mho...@kernel.org [2] http://lkml.kernel.org/r/20170117030118.727jqyamjhojz...@thunk.org
[PATCH] irqchip: crossbar: Fix incorrect type of register size
The 'size' variable is unsigned according to the dt-bindings. As this variable is used as integer in other places, create a new variable that allows to fix the following sparse issue (-Wtypesign): drivers/irqchip/irq-crossbar.c:279:52: warning: incorrect type in argument 3 (different signedness) drivers/irqchip/irq-crossbar.c:279:52:expected unsigned int [usertype] *out_value drivers/irqchip/irq-crossbar.c:279:52:got int * Signed-off-by: Franck Demathieu --- drivers/irqchip/irq-crossbar.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/irqchip/irq-crossbar.c b/drivers/irqchip/irq-crossbar.c index 05bbf17..1070b7b 100644 --- a/drivers/irqchip/irq-crossbar.c +++ b/drivers/irqchip/irq-crossbar.c @@ -199,7 +199,7 @@ static const struct irq_domain_ops crossbar_domain_ops = { static int __init crossbar_of_init(struct device_node *node) { int i, size, reserved = 0; - u32 max = 0, entry; + u32 max = 0, entry, reg_size; const __be32 *irqsr; int ret = -ENOMEM; @@ -276,9 +276,9 @@ static int __init crossbar_of_init(struct device_node *node) if (!cb->register_offsets) goto err_irq_map; - of_property_read_u32(node, "ti,reg-size", &size); + of_property_read_u32(node, "ti,reg-size", ®_size); - switch (size) { + switch (reg_size) { case 1: cb->write = crossbar_writeb; break; @@ -304,7 +304,7 @@ static int __init crossbar_of_init(struct device_node *node) continue; cb->register_offsets[i] = reserved; - reserved += size; + reserved += reg_size; } of_property_read_u32(node, "ti,irqs-safe-map", &cb->safe_map); -- 2.10.1
Re: [PATCH] HID: i2c-hid: Fix error handling
On Sun, 19 Feb 2017, Christophe JAILLET wrote: > According to error handling in this function, it is likely that some > resources should be freed before returning. > Replace 'return ret', with 'goto err'. > > While at it, remove some spaces at the beginning of the lines to be more > consistent. > > > Fixes: ead0687fe304a ("HID: i2c-hid: support regulator power on/off") > > Signed-off-by: Christophe JAILLET > --- > drivers/hid/i2c-hid/i2c-hid.c | 14 +++--- > 1 file changed, 7 insertions(+), 7 deletions(-) > > diff --git a/drivers/hid/i2c-hid/i2c-hid.c b/drivers/hid/i2c-hid/i2c-hid.c > index d5288f3fb5ee..1a57ac2d8524 100644 > --- a/drivers/hid/i2c-hid/i2c-hid.c > +++ b/drivers/hid/i2c-hid/i2c-hid.c > @@ -1058,13 +1058,13 @@ static int i2c_hid_probe(struct i2c_client *client, > } > > ihid->pdata.supply = devm_regulator_get(&client->dev, "vdd"); > - if (IS_ERR(ihid->pdata.supply)) { > - ret = PTR_ERR(ihid->pdata.supply); > - if (ret != -EPROBE_DEFER) > - dev_err(&client->dev, "Failed to get regulator: %d\n", > - ret); > - return ret; > - } > + if (IS_ERR(ihid->pdata.supply)) { > + ret = PTR_ERR(ihid->pdata.supply); > + if (ret != -EPROBE_DEFER) > + dev_err(&client->dev, "Failed to get regulator: %d\n", > + ret); > + goto err; > + } I don't see any spaces at the beginning of lines in the version that's in my tree ... o_O? Therefore I've converted this patch into simple 'return ret -> goto err' transformation and applied on top for-4.12/i2c-hid. Thanks, -- Jiri Kosina SUSE Labs
Re: [PATCH v17 2/3] usb: USB Type-C connector class
Hi Mats, On Fri, Mar 03, 2017 at 08:27:08PM +0100, Mats Karrman wrote: > On 2017-03-03 13:59, Heikki Krogerus wrote: > > > On Fri, Mar 03, 2017 at 08:29:18AM +0100, Mats Karrman wrote: > > > > > How would something like that sound to you guys? > > Complicated... Need to marinate on that for a while ;) Sorry about the bad explanation :-). Let me try again.. I'm simply looking for a method that is as scalable as possible to handle the alternate modes, basically how to couple the different components involved. Bus would feel like the best approach at the moment. > > > My system is a bit different. It's an i.MX6 SoC with the typec phy and DP > > > controller connected > > > directly to the SoC and it's using DTB/OF. > > Is this "DP controller" a controller that is capable of taking care of > > the USB Power Delivery communication with the partner regarding > > DisplayPort alternate mode? > > No, the "DP controller" just talks DP and knows nothing about Type-C or USB > PD. > It takes a video stream from the SoC and turns it into a DP link, set up and > orchestrated > by the corresponding driver. And all the driver needs from Type-C is the > plugged in / interrupt / > plugged out events. Got it. > The analog switching between USB / safe / DP signal levels in the Type-C > connector is, I think, > best handled by the software doing the USB PD negotiation / Altmode handling > (using some GPIOs). > > > > Do we need to further standardize attributes under (each) specific > > > alternate mode to > > > include things such as HPD for the DP mode? > > I'm not completely sure what kind of system you have, but I would > > imagine that if we had the bus, your DP controller driver would be the > > port (and partner) alternate mode driver. The bus would bind you to > > the typec phy. > > So, both the DP controller and the USB PD phy are I2C devices, and now I have > to make them both > attach to the AM bus as well? 
The DP controller would provide the driver and the USB PD phy (actually, the typec class) the device. Would it be a problem to register these I2C devices with some other subsystem, was it extcon or something like AM bus? It really would not be that uncommon. Or have I misunderstood your question? Thanks, -- heikki
[PATCH v2 2/8] irqchip/gic-v3-its: Initialize MSIs with subsys_initcalls
This allows us to use kernel core functionality (e.g. cma) for ITS initialization. MSIs must be up before the device_initcalls (pci and platform device probe) and after arch_initcalls (dma init), so subsys_initcall is fine. Signed-off-by: Robert Richter --- drivers/irqchip/irq-gic-v3-its-pci-msi.c | 2 +- drivers/irqchip/irq-gic-v3-its-platform-msi.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/irqchip/irq-gic-v3-its-pci-msi.c b/drivers/irqchip/irq-gic-v3-its-pci-msi.c index aee1c60d7ab5..dace9bc4ef8d 100644 --- a/drivers/irqchip/irq-gic-v3-its-pci-msi.c +++ b/drivers/irqchip/irq-gic-v3-its-pci-msi.c @@ -194,4 +194,4 @@ static int __init its_pci_msi_init(void) return 0; } -early_initcall(its_pci_msi_init); +subsys_initcall(its_pci_msi_init); diff --git a/drivers/irqchip/irq-gic-v3-its-platform-msi.c b/drivers/irqchip/irq-gic-v3-its-platform-msi.c index 470b4aa7d62c..7d8c19973766 100644 --- a/drivers/irqchip/irq-gic-v3-its-platform-msi.c +++ b/drivers/irqchip/irq-gic-v3-its-platform-msi.c @@ -103,4 +103,4 @@ static int __init its_pmsi_init(void) return 0; } -early_initcall(its_pmsi_init); +subsys_initcall(its_pmsi_init); -- 2.11.0
Re: perf: use-after-free in perf_release
On Mon, Mar 6, 2017 at 2:14 PM, Peter Zijlstra wrote: > On Mon, Mar 06, 2017 at 10:57:07AM +0100, Dmitry Vyukov wrote: > >> == >> BUG: KASAN: use-after-free in atomic_dec_and_test >> arch/x86/include/asm/atomic.h:123 [inline] at addr 880079c30158 >> BUG: KASAN: use-after-free in put_task_struct >> include/linux/sched/task.h:93 [inline] at addr 880079c30158 >> BUG: KASAN: use-after-free in put_ctx+0xcf/0x110 > > FWIW, this output is very confusing, is this a result of your > post-processing replicating the line for every 'inlined' part? Yes. We probably should not do this inlining in the header line. But the problem is that it is very difficult to understand that it is a header line in general. >> kernel/events/core.c:1131 at addr 880079c30158 >> Write of size 4 by task syz-executor6/25698 > >> atomic_dec_and_test arch/x86/include/asm/atomic.h:123 [inline] >> put_task_struct include/linux/sched/task.h:93 [inline] >> put_ctx+0xcf/0x110 kernel/events/core.c:1131 >> perf_event_release_kernel+0x3ad/0xc90 kernel/events/core.c:4322 >> perf_release+0x37/0x50 kernel/events/core.c:4338 >> __fput+0x332/0x800 fs/file_table.c:209 >> fput+0x15/0x20 fs/file_table.c:245 >> task_work_run+0x197/0x260 kernel/task_work.c:116 >> exit_task_work include/linux/task_work.h:21 [inline] >> do_exit+0xb38/0x29c0 kernel/exit.c:880 >> do_group_exit+0x149/0x420 kernel/exit.c:984 >> get_signal+0x7e0/0x1820 kernel/signal.c:2318 >> do_signal+0xd2/0x2190 arch/x86/kernel/signal.c:808 >> exit_to_usermode_loop+0x200/0x2a0 arch/x86/entry/common.c:157 >> syscall_return_slowpath arch/x86/entry/common.c:191 [inline] >> do_syscall_64+0x6fc/0x930 arch/x86/entry/common.c:286 >> entry_SYSCALL64_slow_path+0x25/0x25 > > So this is fput().. 
> > >> Freed: >> PID = 25681 >> save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 >> save_stack+0x43/0xd0 mm/kasan/kasan.c:513 >> set_track mm/kasan/kasan.c:525 [inline] >> kasan_slab_free+0x6f/0xb0 mm/kasan/kasan.c:589 >> __cache_free mm/slab.c:3514 [inline] >> kmem_cache_free+0x71/0x240 mm/slab.c:3774 >> free_task_struct kernel/fork.c:158 [inline] >> free_task+0x151/0x1d0 kernel/fork.c:370 >> copy_process.part.38+0x18e5/0x4aa0 kernel/fork.c:1931 >> copy_process kernel/fork.c:1531 [inline] >> _do_fork+0x200/0x1010 kernel/fork.c:1994 >> SYSC_clone kernel/fork.c:2104 [inline] >> SyS_clone+0x37/0x50 kernel/fork.c:2098 >> do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281 >> return_from_SYSCALL_64+0x0/0x7a > > and this is a failed fork(). > > > However, inherited events don't have a filedesc to fput(), and > similarly, a task that fails for has never been visible to attach a perf > event to because it never hits the pid-hash. > > Or so it is assumed. > > I'm forever getting lost in the PID code. Oleg, is there any way > find_task_by_vpid() can return a task that can still fail fork() ? FWIW here are 2 syzkaller programs that triggered the bug: https://gist.githubusercontent.com/dvyukov/d67f980050589775237a7fbdff226bec/raw/4bca72861cb2ede64059b6dad403e19f425a361f/gistfile1.txt They look very similar, so most likely they are a mutation of the same program. Which may suggest that there is something in that program that provokes the bug. Note that the calls in these programs are executed potentially in multiple threads. But at least it can give some idea wrt e.g. flags passed to perf_event_open.
Re: [RFC PATCH 00/12] Ion cleanup in preparation for moving out of staging
On Mon 06-03-17 11:40:41, Daniel Vetter wrote: > On Mon, Mar 06, 2017 at 08:42:59AM +0100, Michal Hocko wrote: > > On Fri 03-03-17 09:37:55, Laura Abbott wrote: > > > On 03/03/2017 05:29 AM, Michal Hocko wrote: > > > > On Thu 02-03-17 13:44:32, Laura Abbott wrote: > > > >> Hi, > > > >> > > > >> There's been some recent discussions[1] about Ion-like frameworks. > > > >> There's > > > >> apparently interest in just keeping Ion since it works reasonablly > > > >> well. > > > >> This series does what should be the final clean ups for it to possibly > > > >> be > > > >> moved out of staging. > > > >> > > > >> This includes the following: > > > >> - Some general clean up and removal of features that never got a lot > > > >> of use > > > >> as far as I can tell. > > > >> - Fixing up the caching. This is the series I proposed back in > > > >> December[2] > > > >> but never heard any feedback on. It will certainly break existing > > > >> applications that rely on the implicit caching. I'd rather make an > > > >> effort > > > >> to move to a model that isn't going directly against the > > > >> establishement > > > >> though. > > > >> - Fixing up the platform support. The devicetree approach was never > > > >> well > > > >> recieved by DT maintainers. The proposal here is to think of Ion > > > >> less as > > > >> specifying requirements and more of a framework for exposing memory > > > >> to > > > >> userspace. > > > >> - CMA allocations now happen without the need of a dummy device > > > >> structure. > > > >> This fixes a bunch of the reasons why I attempted to add devicetree > > > >> support before. > > > >> > > > >> I've had problems getting feedback in the past so if I don't hear any > > > >> major > > > >> objections I'm going to send out with the RFC dropped to be picked up. > > > >> The only reason there isn't a patch to come out of staging is to > > > >> discuss any > > > >> other changes to the ABI people might want. 
Once this comes out of > > > >> staging, > > > >> I really don't want to mess with the ABI. > > > > > > > > Could you recapitulate concerns preventing the code being merged > > > > normally rather than through the staging tree and how they were > > > > addressed? > > > > > > > > > > Sorry, I'm really not understanding your question here, can you > > > clarify? > > > > There must have been a reason why this code ended up in the staging > > tree, right? So my question is what those reasons were and how they were > > handled in order to move the code from the staging subtree. > > No one gave a thing about android in upstream, so Greg KH just dumped it > all into staging/android/. We've discussed ION a bunch of times, recorded > anything we'd like to fix in staging/android/TODO, and Laura's patch > series here addresses a big chunk of that. Thanks for the TODO reference. I was looking exactly at something like that in drivers/staging/android/ion/. To bad I didn't look one directory up. Thanks for the clarification! -- Michal Hocko SUSE Labs
Re: Question Regarding ERMS memcpy
On Mon, Mar 06, 2017 at 12:01:10AM -0700, Logan Gunthorpe wrote: > Well honestly my issue was solved by fixing my kernel config. I have no > idea why I had optimize for size in there in the first place. I still think that we should address the iomem memcpy Linus mentioned. So how about this partial revert. I've made 32-bit use the same special __memcpy() version. Hmmm? --- diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h index 7afb0e2f07f4..9e378a10796d 100644 --- a/arch/x86/include/asm/io.h +++ b/arch/x86/include/asm/io.h @@ -201,6 +201,7 @@ extern void set_iounmap_nonlazy(void); #ifdef __KERNEL__ #include +#include /* * Convert a virtual cached pointer to an uncached pointer @@ -227,12 +228,13 @@ memset_io(volatile void __iomem *addr, unsigned char val, size_t count) * @src: The (I/O memory) source for the data * @count: The number of bytes to copy * - * Copy a block of data from I/O memory. + * Copy a block of data from I/O memory. IO memory is different from + * cached memory so we use special memcpy version. */ static inline void memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count) { - memcpy(dst, (const void __force *)src, count); + __inline_memcpy(dst, (const void __force *)src, count); } /** @@ -241,12 +243,13 @@ memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count) * @src: The (RAM) source for the data * @count: The number of bytes to copy * - * Copy a block of data to I/O memory. + * Copy a block of data to I/O memory. IO memory is different from + * cached memory so we use special memcpy version. 
*/ static inline void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count) { - memcpy((void __force *)dst, src, count); + __inline_memcpy((void __force *)dst, src, count); } /* diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h index 3d3e8353ee5c..556fa4a975ff 100644 --- a/arch/x86/include/asm/string_32.h +++ b/arch/x86/include/asm/string_32.h @@ -29,6 +29,7 @@ extern char *strchr(const char *s, int c); #define __HAVE_ARCH_STRLEN extern size_t strlen(const char *s); +#define __inline_memcpy __memcpy static __always_inline void *__memcpy(void *to, const void *from, size_t n) { int d0, d1, d2; -- Regards/Gruss, Boris. SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg) --
Re: perf: use-after-free in perf_release
On Mon, Mar 06, 2017 at 10:57:07AM +0100, Dmitry Vyukov wrote: > == > BUG: KASAN: use-after-free in atomic_dec_and_test > arch/x86/include/asm/atomic.h:123 [inline] at addr 880079c30158 > BUG: KASAN: use-after-free in put_task_struct > include/linux/sched/task.h:93 [inline] at addr 880079c30158 > BUG: KASAN: use-after-free in put_ctx+0xcf/0x110 FWIW, this output is very confusing, is this a result of your post-processing replicating the line for every 'inlined' part? > kernel/events/core.c:1131 at addr 880079c30158 > Write of size 4 by task syz-executor6/25698 > atomic_dec_and_test arch/x86/include/asm/atomic.h:123 [inline] > put_task_struct include/linux/sched/task.h:93 [inline] > put_ctx+0xcf/0x110 kernel/events/core.c:1131 > perf_event_release_kernel+0x3ad/0xc90 kernel/events/core.c:4322 > perf_release+0x37/0x50 kernel/events/core.c:4338 > __fput+0x332/0x800 fs/file_table.c:209 > fput+0x15/0x20 fs/file_table.c:245 > task_work_run+0x197/0x260 kernel/task_work.c:116 > exit_task_work include/linux/task_work.h:21 [inline] > do_exit+0xb38/0x29c0 kernel/exit.c:880 > do_group_exit+0x149/0x420 kernel/exit.c:984 > get_signal+0x7e0/0x1820 kernel/signal.c:2318 > do_signal+0xd2/0x2190 arch/x86/kernel/signal.c:808 > exit_to_usermode_loop+0x200/0x2a0 arch/x86/entry/common.c:157 > syscall_return_slowpath arch/x86/entry/common.c:191 [inline] > do_syscall_64+0x6fc/0x930 arch/x86/entry/common.c:286 > entry_SYSCALL64_slow_path+0x25/0x25 So this is fput().. 
> Freed:
> PID = 25681
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_slab_free+0x6f/0xb0 mm/kasan/kasan.c:589
>  __cache_free mm/slab.c:3514 [inline]
>  kmem_cache_free+0x71/0x240 mm/slab.c:3774
>  free_task_struct kernel/fork.c:158 [inline]
>  free_task+0x151/0x1d0 kernel/fork.c:370
>  copy_process.part.38+0x18e5/0x4aa0 kernel/fork.c:1931
>  copy_process kernel/fork.c:1531 [inline]
>  _do_fork+0x200/0x1010 kernel/fork.c:1994
>  SYSC_clone kernel/fork.c:2104 [inline]
>  SyS_clone+0x37/0x50 kernel/fork.c:2098
>  do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281
>  return_from_SYSCALL_64+0x0/0x7a

and this is a failed fork().

However, inherited events don't have a filedesc to fput(), and
similarly, a task that fails fork() has never been visible to attach a
perf event to because it never hits the pid-hash.

Or so it is assumed.

I'm forever getting lost in the PID code. Oleg, is there any way
find_task_by_vpid() can return a task that can still fail fork() ?
Re: [PATCH v17 2/3] usb: USB Type-C connector class
Hi Peter,

On Mon, Mar 06, 2017 at 09:15:51AM +0800, Peter Chen wrote:
> > > What interface you use when you receive this event to handle
> > > dual-role switch? I am wonder if a common dual-role class is
> > > needed, then we can have a common user utility.
> > >
> > > Eg, if "data_role" has changed, the udev can echo "data_role" to
> > > /sys/class/usb-dual-role/role
> >
> > No. If the partner executes successfully for example DR_Swap message,
> > the kernel has to take care everything that is needed for the role to
> > be what ever was negotiated on its own. User space can't be involved
> > with that.
>
> Would you give me an example how kernel handle this? How type-C event
> triggers role switch?

On our boards, the firmware or EC (or ACPI) configures the hardware as
needed and also notifies the components using ACPI if needed. It's
often not even possible to directly configure the components/hardware
for a particular role.

I'm not commenting on Roger's dual role patch series, but I don't
really think it should be mixed with Type-C. USB Type-C and USB Power
Delivery define their own ways of handling the roles, and they are not
limited to the data role only. Things like OTG, for example, will not,
and actually can not, be supported. With Type-C we will have competing
state machines compared to OTG.

The dual-role framework may be useful on systems that provide more
traditional connectors, which possibly have the ID-pin like micro-AB,
and possibly also support OTG. It can also be something that exists in
parallel with the Type-C class, but there just can not be any
dependencies between the two.

Thanks,

--
heikki
[PATCH v2] f2fs: combine nat_bits and free_nid_bitmap cache
Both nat_bits cache and free_nid_bitmap cache provide the same functionality as an intermediate cache between free nid cache and disk, but with different granularity of indicating free nid range, and different persistence policy. nat_bits cache provides better persistence ability, and free_nid_bitmap provides better granularity. In this patch we combine the advantages of both caches, so finally the policy of the intermediate cache would be: - init: load free nid status from nat_bits into free_nid_bitmap - lookup: scan free_nid_bitmap before loading NAT blocks - update: update free_nid_bitmap in real-time - persistence: update and persist nat_bits in checkpoint Signed-off-by: Chao Yu --- fs/f2fs/node.c | 105 +++-- 1 file changed, 35 insertions(+), 70 deletions(-) diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index 1a759d45b7e4..625b46bc55ad 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -338,9 +338,6 @@ static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni, set_nat_flag(e, IS_CHECKPOINTED, false); __set_nat_cache_dirty(nm_i, e); - if (enabled_nat_bits(sbi, NULL) && new_blkaddr == NEW_ADDR) - clear_bit_le(NAT_BLOCK_OFFSET(ni->nid), nm_i->empty_nat_bits); - /* update fsync_mark if its inode nat entry is still alive */ if (ni->nid != ni->ino) e = __lookup_nat_cache(nm_i, ni->ino); @@ -1920,58 +1917,6 @@ static void scan_free_nid_bits(struct f2fs_sb_info *sbi) up_read(&nm_i->nat_tree_lock); } -static int scan_nat_bits(struct f2fs_sb_info *sbi) -{ - struct f2fs_nm_info *nm_i = NM_I(sbi); - struct page *page; - unsigned int i = 0; - nid_t nid; - - if (!enabled_nat_bits(sbi, NULL)) - return -EAGAIN; - - down_read(&nm_i->nat_tree_lock); -check_empty: - i = find_next_bit_le(nm_i->empty_nat_bits, nm_i->nat_blocks, i); - if (i >= nm_i->nat_blocks) { - i = 0; - goto check_partial; - } - - for (nid = i * NAT_ENTRY_PER_BLOCK; nid < (i + 1) * NAT_ENTRY_PER_BLOCK; - nid++) { - if (unlikely(nid >= nm_i->max_nid)) - break; - add_free_nid(sbi, nid, true); - } - - if
(nm_i->nid_cnt[FREE_NID_LIST] >= MAX_FREE_NIDS) - goto out; - i++; - goto check_empty; - -check_partial: - i = find_next_zero_bit_le(nm_i->full_nat_bits, nm_i->nat_blocks, i); - if (i >= nm_i->nat_blocks) { - disable_nat_bits(sbi, true); - up_read(&nm_i->nat_tree_lock); - return -EINVAL; - } - - nid = i * NAT_ENTRY_PER_BLOCK; - page = get_current_nat_page(sbi, nid); - scan_nat_page(sbi, page, nid); - f2fs_put_page(page, 1); - - if (nm_i->nid_cnt[FREE_NID_LIST] < MAX_FREE_NIDS) { - i++; - goto check_partial; - } -out: - up_read(&nm_i->nat_tree_lock); - return 0; -} - static void __build_free_nids(struct f2fs_sb_info *sbi, bool sync, bool mount) { struct f2fs_nm_info *nm_i = NM_I(sbi); @@ -1993,21 +1938,6 @@ static void __build_free_nids(struct f2fs_sb_info *sbi, bool sync, bool mount) if (nm_i->nid_cnt[FREE_NID_LIST]) return; - - /* try to find free nids with nat_bits */ - if (!scan_nat_bits(sbi) && nm_i->nid_cnt[FREE_NID_LIST]) - return; - } - - /* find next valid candidate */ - if (enabled_nat_bits(sbi, NULL)) { - int idx = find_next_zero_bit_le(nm_i->full_nat_bits, - nm_i->nat_blocks, 0); - - if (idx >= nm_i->nat_blocks) - set_sbi_flag(sbi, SBI_NEED_FSCK); - else - nid = idx * NAT_ENTRY_PER_BLOCK; } /* readahead nat pages to be scanned */ @@ -2590,6 +2520,38 @@ static int __get_nat_bitmaps(struct f2fs_sb_info *sbi) return 0; } +inline void load_free_nid_bitmap(struct f2fs_sb_info *sbi) +{ + struct f2fs_nm_info *nm_i = NM_I(sbi); + unsigned int i = 0; + nid_t nid, last_nid; + + if (!enabled_nat_bits(sbi, NULL)) + return; + + for (i = 0; i < nm_i->nat_blocks; i++) { + i = find_next_bit_le(nm_i->empty_nat_bits, nm_i->nat_blocks, i); + if (i >= nm_i->nat_blocks) + break; + + __set_bit_le(i, nm_i->nat_block_bitmap); + + nid = i * NAT_ENTRY_PER_BLOCK; + last_nid = (i + 1) * NAT_ENTRY_PER_BLOCK; + + for (; nid < last_nid; nid++) + update_free_nid_bitmap(sbi, nid, true, true); + } + + for (i = 0; i < nm_i->nat_blocks; i++) { + i = find_next_bit_le(nm_i->full_nat_bits,
[PATCH 4/7] mm: introduce memalloc_nofs_{save,restore} API
From: Michal Hocko GFP_NOFS context is used for the following 5 reasons currently - to prevent from deadlocks when the lock held by the allocation context would be needed during the memory reclaim - to prevent from stack overflows during the reclaim because the allocation is performed from a deep context already - to prevent lockups when the allocation context depends on other reclaimers to make a forward progress indirectly - just in case because this would be safe from the fs POV - silence lockdep false positives Unfortunately overuse of this allocation context brings some problems to the MM. Memory reclaim is much weaker (especially during heavy FS metadata workloads), OOM killer cannot be invoked because the MM layer doesn't have enough information about how much memory is freeable by the FS layer. In many cases it is far from clear why the weaker context is even used and so it might be used unnecessarily. We would like to get rid of those as much as possible. One way to do that is to use the flag in scopes rather than isolated cases. Such a scope is declared when really necessary, tracked per task and all the allocation requests from within the context will simply inherit the GFP_NOFS semantic. Not only this is easier to understand and maintain because there are much less problematic contexts than specific allocation requests, this also helps code paths where FS layer interacts with other layers (e.g. crypto, security modules, MM etc...) and there is no easy way to convey the allocation context between the layers. Introduce memalloc_nofs_{save,restore} API to control the scope of GFP_NOFS allocation context. This is basically copying memalloc_noio_{save,restore} API we have for other restricted allocation context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is just an alias for PF_FSTRANS which has been xfs specific until recently. There are no more PF_FSTRANS users anymore so let's just drop it. 
PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags is renamed to current_gfp_context because it now cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve their semantic. kmem_flags_convert() doesn't need to evaluate the flag anymore. This patch shouldn't introduce any functional changes. Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS) usage as much as possible and only use a properly documented memalloc_nofs_{save,restore} checkpoints where they are appropriate. Acked-by: Vlastimil Babka Signed-off-by: Michal Hocko --- fs/xfs/kmem.h| 2 +- include/linux/gfp.h | 8 include/linux/sched.h| 8 +++- include/linux/sched/mm.h | 26 +++--- kernel/locking/lockdep.c | 6 +++--- mm/page_alloc.c | 10 ++ mm/vmscan.c | 6 +++--- 7 files changed, 47 insertions(+), 19 deletions(-) diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h index d973dbfc2bfa..ae08cfd9552a 100644 --- a/fs/xfs/kmem.h +++ b/fs/xfs/kmem.h @@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags) lflags = GFP_ATOMIC | __GFP_NOWARN; } else { lflags = GFP_KERNEL | __GFP_NOWARN; - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) + if (flags & KM_NOFS) lflags &= ~__GFP_FS; } diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 978232a3b4ae..2bfcfd33e476 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -210,8 +210,16 @@ struct vm_area_struct; * * GFP_NOIO will use direct reclaim to discard clean pages or slab pages * that do not require the starting of any physical IO. + * Please try to avoid using this flag directly and instead use + * memalloc_noio_{save,restore} to mark the whole scope which cannot + * perform any IO with a short explanation why. All allocation requests + * will inherit GFP_NOIO implicitly. * * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces. 
+ * Please try to avoid using this flag directly and instead use + * memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't + * recurse into the FS layer with a short explanation why. All allocation + * requests will inherit GFP_NOFS implicitly. * * GFP_USER is for userspace allocations that also need to be directly * accessibly by the kernel or hardware. It is typically used by hardware diff --git a/include/linux/sched.h b/include/linux/sched.h index 4528f7c9789f..9c3ee2281a56 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1211,9 +1211,9 @@ extern struct pid *cad_pid; #define PF_USED_ASYNC 0x4000 /* Used async_schedule*(), used by module init */ #define PF_NOFREEZE0x8000 /* This thread should not be f
Re: [PATCH] pinctrl: samsung: fix segfault when using external interrupts on s3c24xx
Hi Krzysztof, > > This is a regression from commit 8b1bd11c1f8f529057369c5b3702d13fd24e2765. > > Checkpatch should complain here about commit format. > > > > > Tested on FriendlyARM mini2440. > > > > Please add: > Fixes: 8b1bd11c1f8f ("pinctrl: samsung: Add the support the multiple > IORESOURCE_MEM for one pin-bank") > Cc: > OK. > > Signed-off-by: Sergio Prado > > --- > > drivers/pinctrl/samsung/pinctrl-s3c24xx.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/pinctrl/samsung/pinctrl-s3c24xx.c > > b/drivers/pinctrl/samsung/pinctrl-s3c24xx.c > > index b82a003546ae..1b8d887796e8 100644 > > --- a/drivers/pinctrl/samsung/pinctrl-s3c24xx.c > > +++ b/drivers/pinctrl/samsung/pinctrl-s3c24xx.c > > @@ -356,8 +356,8 @@ static inline void s3c24xx_demux_eint(struct irq_desc > > *desc, > > { > > struct s3c24xx_eint_data *data = irq_desc_get_handler_data(desc); > > struct irq_chip *chip = irq_desc_get_chip(desc); > > - struct irq_data *irqd = irq_desc_get_irq_data(desc); > > - struct samsung_pin_bank *bank = irq_data_get_irq_chip_data(irqd); > > + struct samsung_pinctrl_drv_data *d = data->drvdata; > > + struct samsung_pin_bank *bank = d->pin_banks; > > I think 'pin_banks' point to all banks of given controller not to the > currently accessed one. Understood. I think it worked in my tests because on s3c2440 all banks have the same eint base address. So what do you think is the best approach to solve this problem? > > > Best regards, > Krzysztof > -- Sergio Prado Embedded Labworks Office: +55 11 2628-3461 Mobile: +55 11 97123-3420
[PATCH 5/7] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
From: Michal Hocko kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore} API to prevent from reclaim recursion into the fs because vmalloc can invoke unconditional GFP_KERNEL allocations and these functions might be called from the NOFS contexts. The memalloc_noio_save will enforce GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should provide exactly what we need here - implicit GFP_NOFS context. Changes since v1 - s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages as per Brian Foster Acked-by: Vlastimil Babka Reviewed-by: Brian Foster Reviewed-by: Darrick J. Wong Signed-off-by: Michal Hocko --- fs/xfs/kmem.c| 12 ++-- fs/xfs/xfs_buf.c | 8 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c index e14da724a0b5..6b7b04468aa8 100644 --- a/fs/xfs/kmem.c +++ b/fs/xfs/kmem.c @@ -66,7 +66,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags) void * kmem_zalloc_large(size_t size, xfs_km_flags_t flags) { - unsigned noio_flag = 0; + unsigned nofs_flag = 0; void*ptr; gfp_t lflags; @@ -78,17 +78,17 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags) * __vmalloc() will allocate data pages and auxillary structures (e.g. * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context * here. Hence we need to tell memory reclaim that we are in such a -* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering +* context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering * the filesystem here and potentially deadlocking. 
*/ - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) - noio_flag = memalloc_noio_save(); + if (flags & KM_NOFS) + nofs_flag = memalloc_nofs_save(); lflags = kmem_flags_convert(flags); ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL); - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS)) - memalloc_noio_restore(noio_flag); + if (flags & KM_NOFS) + memalloc_nofs_restore(nofs_flag); return ptr; } diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index b6208728ba39..ca09061369cb 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -443,17 +443,17 @@ _xfs_buf_map_pages( bp->b_addr = NULL; } else { int retried = 0; - unsigned noio_flag; + unsigned nofs_flag; /* * vm_map_ram() will allocate auxillary structures (e.g. * pagetables) with GFP_KERNEL, yet we are likely to be under * GFP_NOFS context here. Hence we need to tell memory reclaim -* that we are in such a context via PF_MEMALLOC_NOIO to prevent +* that we are in such a context via PF_MEMALLOC_NOFS to prevent * memory reclaim re-entering the filesystem here and * potentially deadlocking. */ - noio_flag = memalloc_noio_save(); + nofs_flag = memalloc_nofs_save(); do { bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, -1, PAGE_KERNEL); @@ -461,7 +461,7 @@ _xfs_buf_map_pages( break; vm_unmap_aliases(); } while (retried++ <= 1); - memalloc_noio_restore(noio_flag); + memalloc_nofs_restore(nofs_flag); if (!bp->b_addr) return -ENOMEM; -- 2.11.0
Re: [PATCH v2 1/2] HID: reject input outside logical range only if null state is set
On Tue, 14 Feb 2017, Tomasz Kramkowski wrote: > From: Valtteri Heikkilä > > This patch fixes an issue in drivers/hid/hid-input.c where USB HID > control null state flag is not checked upon rejecting inputs outside > logical minimum-maximum range. The check should be made according to USB > HID specification 1.11, section 6.2.2.5, p.31. The fix will resolve > issues with some game controllers, such as: > https://bugzilla.kernel.org/show_bug.cgi?id=68621 > > [t...@the-tk.com: shortened and fixed spelling in commit message] > Signed-off-by: Valtteri Heikkilä > Signed-off-by: Tomasz Kramkowski Applied to for-4.12/hid-core-null-state-handling. Thanks, -- Jiri Kosina SUSE Labs
Re: [PATCH 1/2] xfs: allow kmem_zalloc_greedy to fail
On Sat 04-03-17 09:54:44, Dave Chinner wrote: > On Thu, Mar 02, 2017 at 04:45:40PM +0100, Michal Hocko wrote: > > From: Michal Hocko > > > > Even though kmem_zalloc_greedy is documented it might fail the current > > code doesn't really implement this properly and loops on the smallest > > allowed size for ever. This is a problem because vzalloc might fail > > permanently - we might run out of vmalloc space or since 5d17a73a2ebe > > ("vmalloc: back off when the current task is killed") when the current > > task is killed. The later one makes the failure scenario much more > > probable than it used to be because it makes vmalloc() failures > > permanent for tasks with fatal signals pending.. Fix this by bailing out > > if the minimum size request failed. > > > > This has been noticed by a hung generic/269 xfstest by Xiong Zhou. > > > > fsstress: vmalloc: allocation failure, allocated 12288 of 20480 bytes, > > mode:0x14080c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_ZERO), nodemask=(null) > > fsstress cpuset=/ mems_allowed=0-1 > > CPU: 1 PID: 23460 Comm: fsstress Not tainted 4.10.0-master-45554b2+ #21 > > Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 > > 10/05/2016 > > Call Trace: > > dump_stack+0x63/0x87 > > warn_alloc+0x114/0x1c0 > > ? alloc_pages_current+0x88/0x120 > > __vmalloc_node_range+0x250/0x2a0 > > ? kmem_zalloc_greedy+0x2b/0x40 [xfs] > > ? free_hot_cold_page+0x21f/0x280 > > vzalloc+0x54/0x60 > > ? kmem_zalloc_greedy+0x2b/0x40 [xfs] > > kmem_zalloc_greedy+0x2b/0x40 [xfs] > > xfs_bulkstat+0x11b/0x730 [xfs] > > ? xfs_bulkstat_one_int+0x340/0x340 [xfs] > > ? selinux_capable+0x20/0x30 > > ? security_capable+0x48/0x60 > > xfs_ioc_bulkstat+0xe4/0x190 [xfs] > > xfs_file_ioctl+0x9dd/0xad0 [xfs] > > ? 
> > do_filp_open+0xa5/0x100
> >  do_vfs_ioctl+0xa7/0x5e0
> >  SyS_ioctl+0x79/0x90
> >  do_syscall_64+0x67/0x180
> >  entry_SYSCALL64_slow_path+0x25/0x25
> >
> > fsstress keeps looping inside kmem_zalloc_greedy without any way out
> > because vmalloc keeps failing due to fatal_signal_pending.
> >
> > Reported-by: Xiong Zhou
> > Analyzed-by: Tetsuo Handa
> > Signed-off-by: Michal Hocko
> > ---
> >  fs/xfs/kmem.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index 339c696bbc01..ee95f5c6db45 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -34,6 +34,8 @@ kmem_zalloc_greedy(size_t *size, size_t minsize, size_t maxsize)
> >  	size_t kmsize = maxsize;
> >
> >  	while (!(ptr = vzalloc(kmsize))) {
> > +		if (kmsize == minsize)
> > +			break;
> >  		if ((kmsize >>= 1) <= minsize)
> >  			kmsize = minsize;
> >  	}
>
> Seems wrong to me - this function used to have lots of callers and
> over time we've slowly removed them or replaced them with something
> else. I'd suggest removing it completely, replacing the call sites
> with kmem_zalloc_large().

I do not really care how this gets fixed. Dropping kmem_zalloc_greedy
sounds like a way to go. I am not familiar enough with xfs_bulkstat to
make an educated guess which allocation size to use. So I guess I have
to postpone this to you guys if you prefer that route though.

Thanks!
--
Michal Hocko
SUSE Labs