Re: [PATCH -next] crypto: marvell - Remove set but not used variable 'ivsize'
On Mon, Feb 18, 2019 at 08:59:47AM +, YueHaibing wrote: > Fixes gcc '-Wunused-but-set-variable' warning: > > drivers/crypto/marvell/cipher.c: In function 'mv_cesa_skcipher_dma_req_init': > drivers/crypto/marvell/cipher.c:325:15: warning: > variable 'ivsize' set but not used [-Wunused-but-set-variable] > > It's not used any more after 0c99620f0ac1 ("crypto: marvell - Use an unique > pool to copy results of requests") > > Signed-off-by: YueHaibing > --- > drivers/crypto/marvell/cipher.c | 2 -- > 1 file changed, 2 deletions(-) Patch applied. Thanks. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: [PATCH V7 1/2] arm64: dts: freescale: imx8qxp: add cpu opp table
Hi, Shawn Best Regards! Anson Huang > -Original Message- > From: Shawn Guo [mailto:shawn...@kernel.org] > Sent: 2019年2月28日 11:19 > To: Anson Huang > Cc: robh...@kernel.org; mark.rutl...@arm.com; s.ha...@pengutronix.de; > ker...@pengutronix.de; feste...@gmail.com; mturque...@baylibre.com; > sb...@kernel.org; Aisheng Dong ; Daniel Baluta > ; devicet...@vger.kernel.org; linux-arm- > ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux- > c...@vger.kernel.org; dl-linux-imx > Subject: Re: [PATCH V7 1/2] arm64: dts: freescale: imx8qxp: add cpu opp > table > > On Tue, Feb 26, 2019 at 05:17:31AM +, Anson Huang wrote: > > Add i.MX8QXP CPU opp table to support cpufreq. > > > > Signed-off-by: Anson Huang > > Acked-by: Viresh Kumar > > Prefix 'arm64: dts: imx8qxp: ' would already be clear enough. I dropped > 'freescale' from there and applied patch. > > > --- > > No changes since V6. > > --- > > arch/arm64/boot/dts/freescale/imx8qxp.dtsi | 30 > > ++ > > 1 file changed, 30 insertions(+) > > > > diff --git a/arch/arm64/boot/dts/freescale/imx8qxp.dtsi > > b/arch/arm64/boot/dts/freescale/imx8qxp.dtsi > > index 4c3dd95..41bf0ce 100644 > > --- a/arch/arm64/boot/dts/freescale/imx8qxp.dtsi > > +++ b/arch/arm64/boot/dts/freescale/imx8qxp.dtsi > > @@ -34,6 +34,9 @@ > > reg = <0x0 0x0>; > > enable-method = "psci"; > > next-level-cache = <&A35_L2>; > > + clocks = <&clk IMX_A35_CLK>; > > + operating-points-v2 = <&a35_0_opp_table>; > > + #cooling-cells = <2>; > > }; > > > > A35_1: cpu@1 { > > @@ -42,6 +45,9 @@ > > reg = <0x0 0x1>; > > enable-method = "psci"; > > next-level-cache = <&A35_L2>; > > + clocks = <&clk IMX_A35_CLK>; > > + operating-points-v2 = <&a35_0_opp_table>; > > + #cooling-cells = <2>; > > }; > > > > A35_2: cpu@2 { > > @@ -50,6 +56,9 @@ > > reg = <0x0 0x2>; > > enable-method = "psci"; > > next-level-cache = <&A35_L2>; > > + clocks = <&clk IMX_A35_CLK>; > > + operating-points-v2 = <&a35_0_opp_table>; > > + #cooling-cells = <2>; > > }; > > > > A35_3: cpu@3 { > > @@ -58,6 +67,9 @@ > > reg = <0x0 0x3>; > > enable-method = "psci"; > > next-level-cache = <&A35_L2>; > > + clocks = <&clk IMX_A35_CLK>; > > + operating-points-v2 = <&a35_0_opp_table>; > > + #cooling-cells = <2>; > > }; > > > > A35_L2: l2-cache0 { > > @@ -65,6 +77,24 @@ > > }; > > }; > > > > + a35_0_opp_table: opp-table { > > What does the '0' in the label mean? Looks like the '0' in the label is NOT necessary, we can just use 'a35_opp_table', do you want me resend the patch to remove '0'? Anson. > > Shawn > > > + compatible = "operating-points-v2"; > > + opp-shared; > > + > > + opp-9 { > > + opp-hz = /bits/ 64 <9>; > > + opp-microvolt = <100>; > > + clock-latency-ns = <15>; > > + }; > > + > > + opp-12 { > > + opp-hz = /bits/ 64 <12>; > > + opp-microvolt = <110>; > > + clock-latency-ns = <15>; > > + opp-suspend; > > + }; > > + }; > > + > > gic: interrupt-controller@51a0 { > > compatible = "arm,gic-v3"; > > reg = <0x0 0x51a0 0 0x1>, /* GIC Dist */ > > -- > > 2.7.4 > > > > > > ___ > > linux-arm-kernel mailing list > > linux-arm-ker...@lists.infradead.org > > https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Flist > > s.infradead.org%2Fmailman%2Flistinfo%2Flinux-arm- > kernel&data=02%7C > > > 01%7Canson.huang%40nxp.com%7Cfe4ff92a639e4739c2a108d69d2b8e12%7 > C686ea1 > > > d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C636869207728398808&sd > ata=oz%2 > > FPyAki6fkfQZqaSYkjWfo3m8S48sCzpDf2lxDIjKs%3D&reserved=0
RE: [PATCH V3 2/4] watchdog: imx_sc: Add i.MX system controller watchdog support
Hi, Guenter Best Regards! Anson Huang > -Original Message- > From: Guenter Roeck [mailto:groe...@gmail.com] On Behalf Of Guenter > Roeck > Sent: 2019年2月27日 6:39 > To: Anson Huang > Cc: robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; > s.ha...@pengutronix.de; ker...@pengutronix.de; feste...@gmail.com; > catalin.mari...@arm.com; will.dea...@arm.com; w...@linux-watchdog.org; > Aisheng Dong ; ulf.hans...@linaro.org; Daniel > Baluta ; Andy Gross ; > horms+rene...@verge.net.au; he...@sntech.de; a...@arndb.de; > o...@lixom.net; bjorn.anders...@linaro.org; ja...@amarulasolutions.com; > enric.balle...@collabora.com; marc.w.gonza...@free.fr; > devicet...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-arm- > ker...@lists.infradead.org; linux-watch...@vger.kernel.org; dl-linux-imx > > Subject: Re: [PATCH V3 2/4] watchdog: imx_sc: Add i.MX system controller > watchdog support > > On Mon, Feb 25, 2019 at 02:19:10AM +, Anson Huang wrote: > > i.MX8QXP is an ARMv8 SoC which has a Cortex-M4 system controller > > inside, the system controller is in charge of controlling power, clock > > and watchdog etc.. > > > > This patch adds i.MX system controller watchdog driver support, > > watchdog operation needs to be done in secure EL3 mode via > > ARM-Trusted-Firmware, using SMC call, CPU will trap into > > ARM-Trusted-Firmware and then it will request system controller to do > > watchdog operation via IPC. > > > > Signed-off-by: Anson Huang > > --- > > Changes since V2: > > - improve watchdog_init_timeout() operation and error check, > setting it > > via module parameter "timeout", if it is invalid, roll back to use > defaul > > timeout value DEFAULT_TIMEOUT; > > - change compatible string to "fsl,imx-sc-wdt" to make driver more > generic > > for all i.MX platforms with system controller watchdog inside. > > --- > > drivers/watchdog/Kconfig | 13 +++ > > drivers/watchdog/Makefile | 1 + > > drivers/watchdog/imx_sc_wdt.c | 183 > > ++ > > 3 files changed, 197 insertions(+) > > create mode 100644 drivers/watchdog/imx_sc_wdt.c > > > > diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig index > > 65c3c42..5c5b8ba 100644 > > --- a/drivers/watchdog/Kconfig > > +++ b/drivers/watchdog/Kconfig > > @@ -625,6 +625,19 @@ config IMX2_WDT > > To compile this driver as a module, choose M here: the > > module will be called imx2_wdt. > > > > +config IMX_SC_WDT > > + tristate "IMX SC Watchdog" > > + depends on ARCH_MXC || COMPILE_TEST > > Wondering: Did you test on other architectures ? The code calls > arm_smccc_smc(), which is unlikely to be available on non-ARM platforms, > and I don't see a dummy declaration for those. Although the ARCH_MXC means for Freescale i.MX SoCs which are all with ARM platforms, but since this driver is targeted for i.MX SoC ARMv8 with system controller inside, so I will add ARM64 dependency here. + depends on (ARCH_MXC && ARM64) || COMPILE_TEST > > > + select WATCHDOG_CORE > > + help > > + This is the driver for the system controller watchdog > > + on the NXP i.MX SoCs with system controller inside. > > + If you have one of these processors and wish to have > > + watchdog support enabled, say Y, otherwise say N. > > + > > + To compile this driver as a module, choose M here: the > > + module will be called imx_sc_wdt. > > + > > config UX500_WATCHDOG > > tristate "ST-Ericsson Ux500 watchdog" > > depends on MFD_DB8500_PRCMU > > diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile > > index 4e78a8c..0c9da63 100644 > > --- a/drivers/watchdog/Makefile > > +++ b/drivers/watchdog/Makefile > > @@ -68,6 +68,7 @@ obj-$(CONFIG_NUC900_WATCHDOG) += > nuc900_wdt.o > > obj-$(CONFIG_TS4800_WATCHDOG) += ts4800_wdt.o > > obj-$(CONFIG_TS72XX_WATCHDOG) += ts72xx_wdt.o > > obj-$(CONFIG_IMX2_WDT) += imx2_wdt.o > > +obj-$(CONFIG_IMX_SC_WDT) += imx_sc_wdt.o > > obj-$(CONFIG_UX500_WATCHDOG) += ux500_wdt.o > > obj-$(CONFIG_RETU_WATCHDOG) += retu_wdt.o > > obj-$(CONFIG_BCM2835_WDT) += bcm2835_wdt.o diff --git > > a/drivers/watchdog/imx_sc_wdt.c b/drivers/watchdog/imx_sc_wdt.c new > > file mode 100644 index 000..7b22575 > > --- /dev/null > > +++ b/drivers/watchdog/imx_sc_wdt.c > > @@ -0,0 +1,183 @@ > > +// SPDX-License-Identifier: GPL-2.0+ > > +/* > > + * Copyright 2018-2019 NXP. > > + */ > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#define DEFAULT_TIMEOUT 60 > > +/* > > + * Software timer tick implemented in scfw side, support 10ms to > > +0x ms > > + * in theory, but for normal case, 1s~128s is enough, you can change > > +this max > > + * value in case it's not enough. > > + */ > > +#define MAX_TIMEOUT 128 > > + > > +#define IMX_SIP_TIMER 0xC202 > > +#define IMX_SIP_TIMER_STA
Re: [RFC PATCH 1/4] X.509: Parse public key parameters from x509 for akcipher
On Sun, Feb 24, 2019 at 09:48:40AM +0300, Vitaly Chikunov wrote: > > If we pass SubjectPublicKeyInfo into set_pub_key itself (making > set_params not needed) we will break ABI and compatibility with RSA > drivers, because whole SubjectPublicKeyInfo is not expected by the This compatibility does not matter. We can always add translating layers into the crypto API to deal with this. The only ABI that matters is the one to user-space. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH v3 5/5] tracing/probe: Support user-space dereference
On Wed, 27 Feb 2019 21:42:45 -0500 Steven Rostedt wrote: > On Wed, 27 Feb 2019 23:44:42 +0900 > Masami Hiramatsu wrote: > > > diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c > > index a7012de37a00..0efef172db17 100644 > > --- a/kernel/trace/trace_probe.c > > +++ b/kernel/trace/trace_probe.c > > @@ -239,6 +239,7 @@ parse_probe_arg(char *arg, const struct fetch_type > > *type, > > { > > struct fetch_insn *code = *pcode; > > unsigned long param; > > + int deref = FETCH_OP_DEREF; > > long offset = 0; > > char *tmp; > > int ret = 0; > > @@ -301,8 +302,17 @@ parse_probe_arg(char *arg, const struct fetch_type > > *type, > > break; > > > > case '+': /* deref memory */ > > - arg++; /* Skip '+', because kstrtol() rejects it. */ > > case '-': > > + if (arg[0] == '+') { > > + arg++; /* Skip '+', because kstrtol() rejects it. */ > > + if (arg[0] == 'u') { > > + deref = FETCH_OP_UDEREF; > > + arg++; > > + } > > + } else if (arg[1] == 'u') { /* Start with "-u" */ > > + deref = FETCH_OP_UDEREF; > > + *(++arg) = '-'; > > + } > > What about: > > if (arg[1] == 'u') { > deref = FETCH_OP_UDEREF; > arg[1] = arg[0]; > arg++; > } > if (arg[0] == '+') > arg++; /* Skip '+', because kstrtol() rejects it. */ > > A bit less messy. Ah, thanks! I'll take it. > > > > tmp = strchr(arg, '('); > > if (!tmp) > > return -EINVAL; > > @@ -328,7 +338,7 @@ parse_probe_arg(char *arg, const struct fetch_type > > *type, > > return -E2BIG; > > *pcode = code; > > > > - code->op = FETCH_OP_DEREF; > > + code->op = deref; > > code->offset = offset; > > } > > break; > > @@ -444,13 +454,14 @@ static int traceprobe_parse_probe_arg_body(char *arg, > > ssize_t *size, > > /* Store operation */ > > if (!strcmp(parg->type->name, "string") || > > !strcmp(parg->type->name, "ustring")) { > > - if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_IMM && > > - code->op != FETCH_OP_COMM) { > > + if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_UDEREF > > + && code->op != FETCH_OP_IMM && code->op != FETCH_OP_COMM) { > > pr_info("string only accepts memory or address.\n"); > > ret = -EINVAL; > > goto fail; > > } > > - if (code->op != FETCH_OP_DEREF || parg->count) { > > + if ((code->op == FETCH_OP_IMM && code->op == FETCH_OP_COMM) > > + || parg->count) { > > How would "code->op == FETCH_OP_IMM && code->op == FETCH_OP_COMM" ever be > true? > > Did you mean || ? Oops, yes. It is a simple mistake... Thank you! > > -- Steve > > > /* > > * IMM and COMM is pointing actual address, those must > > * be kept, and if parg->count != 0, this is an array > > @@ -463,7 +474,8 @@ static int traceprobe_parse_probe_arg_body(char *arg, > > ssize_t *size, > > } > > } > > /* If op == DEREF, replace it with STRING */ > > - if (!strcmp(parg->type->name, "ustring")) > > + if (!strcmp(parg->type->name, "ustring") || > > + code->op == FETCH_OP_UDEREF) > > code->op = FETCH_OP_ST_USTRING; > > else > > code->op = FETCH_OP_ST_STRING; -- Masami Hiramatsu
Re: [PATCH 0/3] RPMPD for QCS404
On 2019-02-25 18:18, Rajendra Nayak wrote: On 2/21/2019 12:14 PM, Bjorn Andersson wrote: Reworkd the macros of the rpmpd driver and add qcs404 power domains, then add this to the dts. For the entire series, Reviewed-by: Rajendra Nayak I'll post v2 of this series with support for msm8998 and a few fixes. Bjorn Andersson (3): soc: qcom: rpmpd: Modify corner defining macros soc: qcom: rpmpd: Add QCS404 corners arm64: dts: qcom: qcs404: Add rpmpd node .../devicetree/bindings/power/qcom,rpmpd.txt | 1 + arch/arm64/boot/dts/qcom/qcs404.dtsi | 35 +++ drivers/soc/qcom/rpmpd.c | 63 +-- include/dt-bindings/power/qcom-rpmpd.h| 9 +++ 4 files changed, 88 insertions(+), 20 deletions(-) -- -- Sibi Sankar -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH v2] cifs: use correct format characters
merged into cifs-2.6.git for-next On Wed, Feb 27, 2019 at 4:27 PM Louis Taylor wrote: > > When compiling with -Wformat, clang emits the following warnings: > > fs/cifs/smb1ops.c:312:20: warning: format specifies type 'unsigned > short' but the argument has type 'unsigned int' [-Wformat] > tgt_total_cnt, total_in_tgt); > ^~~~ > > fs/cifs/cifs_dfs_ref.c:289:4: warning: format specifies type 'short' > but the argument has type 'int' [-Wformat] > ref->flags, ref->server_type); > ^~ > > fs/cifs/cifs_dfs_ref.c:289:16: warning: format specifies type 'short' > but the argument has type 'int' [-Wformat] > ref->flags, ref->server_type); > ^~~~ > > fs/cifs/cifs_dfs_ref.c:291:4: warning: format specifies type 'short' > but the argument has type 'int' [-Wformat] > ref->ref_flag, ref->path_consumed); > ^ > > fs/cifs/cifs_dfs_ref.c:291:19: warning: format specifies type 'short' > but the argument has type 'int' [-Wformat] > ref->ref_flag, ref->path_consumed); > ^~ > The types of these arguments are unconditionally defined, so this patch > updates the format character to the correct ones for ints and unsigned > ints. > > Link: https://github.com/ClangBuiltLinux/linux/issues/378 > Signed-off-by: Louis Taylor > --- > fs/cifs/cifs_dfs_ref.c | 4 ++-- > fs/cifs/smb1ops.c | 2 +- > 2 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/fs/cifs/cifs_dfs_ref.c b/fs/cifs/cifs_dfs_ref.c > index d9b99abe1243..5d83c924cc47 100644 > --- a/fs/cifs/cifs_dfs_ref.c > +++ b/fs/cifs/cifs_dfs_ref.c > @@ -285,9 +285,9 @@ static void dump_referral(const struct dfs_info3_param > *ref) > { > cifs_dbg(FYI, "DFS: ref path: %s\n", ref->path_name); > cifs_dbg(FYI, "DFS: node path: %s\n", ref->node_name); > - cifs_dbg(FYI, "DFS: fl: %hd, srv_type: %hd\n", > + cifs_dbg(FYI, "DFS: fl: %d, srv_type: %d\n", > ref->flags, ref->server_type); > - cifs_dbg(FYI, "DFS: ref_flags: %hd, path_consumed: %hd\n", > + cifs_dbg(FYI, "DFS: ref_flags: %d, path_consumed: %d\n", > ref->ref_flag, ref->path_consumed); > } > > diff --git a/fs/cifs/smb1ops.c b/fs/cifs/smb1ops.c > index 32a6c020478f..20a88776f04d 100644 > --- a/fs/cifs/smb1ops.c > +++ b/fs/cifs/smb1ops.c > @@ -308,7 +308,7 @@ coalesce_t2(char *second_buf, struct smb_hdr *target_hdr) > remaining = tgt_total_cnt - total_in_tgt; > > if (remaining < 0) { > - cifs_dbg(FYI, "Server sent too much data. tgt_total_cnt=%hu > total_in_tgt=%hu\n", > + cifs_dbg(FYI, "Server sent too much data. tgt_total_cnt=%hu > total_in_tgt=%u\n", > tgt_total_cnt, total_in_tgt); > return -EPROTO; > } > -- > 2.20.1 > -- Thanks, Steve
[GIT] Crypto Fixes for 5.0
Hi Linus: This push fixes a compiler warning introduced by a previous fix, as well as two crash bugs on ARM. Please pull from git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git linus Ard Biesheuvel (2): crypto: sha256/arm - fix crash bug in Thumb2 build crypto: sha512/arm - fix crash bug in Thumb2 build Gilad Ben-Yossef (1): crypto: ccree - add missing inline qualifier arch/arm/crypto/sha256-armv4.pl | 3 ++- arch/arm/crypto/sha256-core.S_shipped | 3 ++- arch/arm/crypto/sha512-armv4.pl | 3 ++- arch/arm/crypto/sha512-core.S_shipped | 3 ++- drivers/crypto/ccree/cc_pm.h | 2 +- 5 files changed, 9 insertions(+), 5 deletions(-) Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] powernv: powercap: Add hard minimum powercap
Hi, On 02/28/2019 10:14 AM, Daniel Axtens wrote: > Shilpasri G Bhat writes: > >> In POWER9, OCC(On-Chip-Controller) provides for hard and soft system >> powercapping range. The hard powercap range is guaranteed while soft >> powercap may or may not be asserted due to various power-thermal >> reasons based on system configuration and workloads. This patch adds >> a sysfs file to export the hard minimum powercap limit to allow the >> user to set the appropriate powercap value that can be managed by the >> system. > > Maybe it's common terminology and I'm just not aware of it, but what do > you mean by "asserted"? It doesn't appear elsewhere in the documentation > you're patching, and it's not a use of assert that I'm familiar with... > > Regards, > Daniel > I meant to say powercap will not be assured in the soft powercap range, i.e, system may or may not be throttled of CPU frequency to remain within the powercap. I can reword the document and commit message. Thanks and Regards, Shilpa >> >> Signed-off-by: Shilpasri G Bhat >> --- >> .../ABI/testing/sysfs-firmware-opal-powercap | 10 >> arch/powerpc/platforms/powernv/opal-powercap.c | 66 >> +- >> 2 files changed, 37 insertions(+), 39 deletions(-) >> >> diff --git a/Documentation/ABI/testing/sysfs-firmware-opal-powercap >> b/Documentation/ABI/testing/sysfs-firmware-opal-powercap >> index c9b66ec..65db4c1 100644 >> --- a/Documentation/ABI/testing/sysfs-firmware-opal-powercap >> +++ b/Documentation/ABI/testing/sysfs-firmware-opal-powercap >> @@ -29,3 +29,13 @@ Description: System powercap directory and >> attributes applicable for >>creates a request for setting a new-powercap. The >>powercap requested must be between powercap-min >>and powercap-max. >> + >> +What: >> /sys/firmware/opal/powercap/system-powercap/powercap-hard-min >> +Date: Feb 2019 >> +Contact:Linux for PowerPC mailing list >> +Description:Hard minimum powercap >> + >> +This file provides the hard minimum powercap limit in >> +Watts. The powercap value above hard minimum is always >> +guaranteed to be asserted and the powercap value below >> +the hard minimum limit may or may not be guaranteed. >> diff --git a/arch/powerpc/platforms/powernv/opal-powercap.c >> b/arch/powerpc/platforms/powernv/opal-powercap.c >> index d90ee4f..38408e7 100644 >> --- a/arch/powerpc/platforms/powernv/opal-powercap.c >> +++ b/arch/powerpc/platforms/powernv/opal-powercap.c >> @@ -139,10 +139,24 @@ static void powercap_add_attr(int handle, const char >> *name, >> attr->handle = handle; >> sysfs_attr_init(&attr->attr.attr); >> attr->attr.attr.name = name; >> -attr->attr.attr.mode = 0444; >> + >> +if (!strncmp(name, "powercap-current", strlen(name))) { >> +attr->attr.attr.mode = 0664; >> +attr->attr.store = powercap_store; >> +} else { >> +attr->attr.attr.mode = 0444; >> +} >> + >> attr->attr.show = powercap_show; >> } >> >> +static const char * const powercap_strs[] = { >> +"powercap-max", >> +"powercap-min", >> +"powercap-current", >> +"powercap-hard-min", >> +}; >> + >> void __init opal_powercap_init(void) >> { >> struct device_node *powercap, *node; >> @@ -167,60 +181,34 @@ void __init opal_powercap_init(void) >> >> i = 0; >> for_each_child_of_node(powercap, node) { >> -u32 cur, min, max; >> -int j = 0; >> -bool has_cur = false, has_min = false, has_max = false; >> +u32 id; >> +int j, count = 0; >> >> -if (!of_property_read_u32(node, "powercap-min", &min)) { >> -j++; >> -has_min = true; >> -} >> - >> -if (!of_property_read_u32(node, "powercap-max", &max)) { >> -j++; >> -has_max = true; >> -} >> +for (j = 0; j < ARRAY_SIZE(powercap_strs); j++) >> +if (!of_property_read_u32(node, powercap_strs[j], &id)) >> +count++; >> >> -if (!of_property_read_u32(node, "powercap-current", &cur)) { >> -j++; >> -has_cur = true; >> -} >> - >> -pcaps[i].pattrs = kcalloc(j, sizeof(struct powercap_attr), >> +pcaps[i].pattrs = kcalloc(count, sizeof(struct powercap_attr), >>GFP_KERNEL); >> if (!pcaps[i].pattrs) >> goto out_pcaps_pattrs; >> >> -pcaps[i].pg.attrs = kcalloc(j + 1, sizeof(struct attribute *), >> +pcaps[i].pg.attrs = kcalloc(count + 1, >> +sizeof(struct attribute *), >> GFP_KERNEL); >> if (!pcaps[i].pg.attr
Re: [PATCH net-next] net: hns: use struct_size() in devm_kzalloc()
From: "Gustavo A. R. Silva" Date: Mon, 25 Feb 2019 18:27:57 -0600 > One of the more common cases of allocation size calculations is finding > the size of a structure that has a zero-sized array at the end, along > with memory for some number of elements for that array. For example: > > struct foo { > int stuff; > struct boo entry[]; > }; > > instance = devm_kzalloc(dev, sizeof(struct foo) + sizeof(struct boo) * count, > GFP_KERNEL); > > Instead of leaving these open-coded and prone to type mistakes, we can > now use the new struct_size() helper: > > instance = devm_kzalloc(dev, struct_size(instance, entry, count), GFP_KERNEL); > > This code was detected with the help of Coccinelle. > > Signed-off-by: Gustavo A. R. Silva Applied.
[PATCH] soc: qcom: cmd-db: Fix an error code in cmd_db_dev_probe()
The memremap() function doesn't return error pointers, it returns NULL. This code is returning "ret = PTR_ERR(NULL);" which is success, but it should return -ENOMEM. Fixes: 312416d9171a ("drivers: qcom: add command DB driver") Signed-off-by: Dan Carpenter --- drivers/soc/qcom/cmd-db.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/soc/qcom/cmd-db.c b/drivers/soc/qcom/cmd-db.c index c701b3b010f1..f6c3d17b05c7 100644 --- a/drivers/soc/qcom/cmd-db.c +++ b/drivers/soc/qcom/cmd-db.c @@ -248,8 +248,8 @@ static int cmd_db_dev_probe(struct platform_device *pdev) } cmd_db_header = memremap(rmem->base, rmem->size, MEMREMAP_WB); - if (IS_ERR_OR_NULL(cmd_db_header)) { - ret = PTR_ERR(cmd_db_header); + if (!cmd_db_header) { + ret = -ENOMEM; cmd_db_header = NULL; return ret; } -- 2.17.1
Re: Request for suggestion on net device re-naming/re-ordering based on DT alias
On Thu, Feb 28, 2019 at 12:10 AM Stephen Hemminger wrote: > > On Wed, 27 Feb 2019 17:24:03 +0530 > Harini Katakam wrote: > > > Device naming is a hard problem, and there is no perfect solution. > > Device tree should be providing hints to userspace policy for naming, not > trying to do it in the kernel. > See: > https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/ Thanks Stephen - I'll try this. Regards, Harini
Re: [PATCH v6 8/8] arm64: dts: qcom: sdm845: Add Q6V5 MSS node
Hey Doug, On 2019-02-28 02:33, Doug Anderson wrote: Hi, On Tue, Feb 26, 2019 at 3:54 PM Doug Anderson wrote: Hi, On Tue, Feb 5, 2019 at 9:13 PM Bjorn Andersson wrote: > > From: Sibi Sankar > > This patch adds Q6V5 MSS remoteproc node for SDM845 SoCs. > > Signed-off-by: Sibi Sankar > Reviewed-by: Douglas Anderson > Signed-off-by: Bjorn Andersson > --- > > Changes since v5: > - None > > arch/arm64/boot/dts/qcom/sdm845.dtsi | 58 > 1 file changed, 58 insertions(+) > > diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi > index 560c16616ee6..5c41f6fe3e1b 100644 > --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi > +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi > @@ -1612,6 +1612,64 @@ > }; > }; > > + mss_pil: remoteproc@408 { > + compatible = "qcom,sdm845-mss-pil"; > + reg = <0 0x0408 0 0x408>, <0 0x0418 0 0x48>; > + reg-names = "qdsp6", "rmb"; I found that when I disabled IOMMU bypass by booting with "arm-smmu.disable_bypass=y" that I'd get this failure: --- [ 13.633776] qcom-q6v5-mss 408.remoteproc: MBA booted, loading mpss [ 13.647694] arm-smmu 1500.iommu: Unexpected global fault, this could be serious [ 13.660278] arm-smmu 1500.iommu: GFSR 0x8002, GFSYNR0 0x, GFSYNR1 0x0781, GFSYNR2 0x ... [ 14.648830] qcom-q6v5-mss 408.remoteproc: MPSS header authentication timed out [ 14.657141] qcom-q6v5-mss 408.remoteproc: port failed halt [ 14.664983] remoteproc remoteproc0: can't start rproc 408.remoteproc: -110 --- Adding "iommus = <&apps_smmu 0x781 0>;" here fixed my problem. NOTE that I'm no expert on IOMMUs so you should confirm that this is right, but if it is then maybe you could include it in the next spin of the series? I got the "0x781" just by looking at the value of the GFSYNR1 in the above splat. I wasn't sure what to put for the mask so I put 0x0. Upon more testing the "iommus" line that I came up with avoids the global fault but doesn't actually work. I just get: qcom-q6v5-mss 408.remoteproc: failed to allocate mdt buffer AFAIK modem does not have a smmu in front of it. We have reserved memory nodes for mba and mpss, however for mdt authentication we just rely on dma_alloc_attrs with DMA_ATTR_FORCE_CONTIGUOUS for allocation which seems to fail here. I'm hoping someone from Qualcomm can help out here and say how this should be solved. Thanks! Yeah I'll enable arm-smmu.disable_bypass and have a go at booting modem. -Doug -- -- Sibi Sankar -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH net-next 0/2] net: phy: marvell10g: Clean .get_features by using C45 helpers
From: Maxime Chevallier Date: Mon, 25 Feb 2019 17:14:05 +0100 > Recent work on C45 helpers by Heiner made the > genphy_c45_pma_read_abilities function generic enough to use as a > default .get_featutes implementation. > > This series removes the remaining redundant code in > mv3310_get_features(), and makes the 2110 PHY use > genphy_c45_pma_read_abilities() directly, since it doesn't have the > issue with the wrong abilities being reported. Series applied, thanks Maxime.
Re: [PATCH v5 04/10] crypto: akcipher - new verify API for public key algorithms
On Wed, Feb 27, 2019 at 06:28:37PM -0500, Mimi Zohar wrote: > > On Sun, 2019-02-24 at 09:08 +0300, Vitaly Chikunov wrote: > > Previous akcipher .verify() just `decrypts' (using RSA encrypt which is > > using public key) signature to uncover message hash, which was then > > compared in upper level public_key_verify_signature() with the expected > > hash value, which itself was never passed into verify(). > > > > This approach was incompatible with EC-DSA family of algorithms, > > because, to verify a signature EC-DSA algorithm also needs a hash value > > as input; then it's used (together with a signature divided into halves > > `r||s') to produce a witness value, which is then compared with `r' to > > determine if the signature is correct. Thus, for EC-DSA, nor > > requirements of .verify() itself, nor its output expectations in > > public_key_verify_signature() wasn't sufficient. > > > > Make improved .verify() call which gets hash value as input and produce > > complete signature check without any output besides status. > > > > Now for the top level verification only crypto_akcipher_verify() needs > > to be called. > > All but this patch apply. Which branch is this patch based against? This patchest is over 920d7f7215d87005beb4aa2b90b9cb0b74b36947 in cryptodev/master. > > Mimi
Re: [RFC v9 5/5] Documentation: pstore/blk: create document for pstore_blk
On 2/19/19 3:52 AM, liaoweixiong wrote: > The document, at Documentation/admin-guide/pstore-block.rst, > tells user how to use pstore_blk and the attentions about panic > read/write > > Signed-off-by: liaoweixiong > --- > Documentation/admin-guide/pstore-block.rst | 233 > + > MAINTAINERS| 1 + > fs/pstore/Kconfig | 4 + > 3 files changed, 238 insertions(+) > create mode 100644 Documentation/admin-guide/pstore-block.rst > > diff --git a/Documentation/admin-guide/pstore-block.rst > b/Documentation/admin-guide/pstore-block.rst > new file mode 100644 > index 000..a828274 > --- /dev/null > +++ b/Documentation/admin-guide/pstore-block.rst > @@ -0,0 +1,233 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +Pstore block oops/panic logger > +== > + > +Introduction > + > + > +Pstore block (pstore_blk) is an oops/panic logger that write its logs to > block to a block > +device before the system crashes. Pstore_blk needs block device driver needs the block > +registering a partition path of the block device, like /dev/mmcblk0p7 for mmc to register for MMC > +driver, and read/write APIs for this partition when on panic. > + > +Pstore block concepts > +- > + > +Pstore block begins at function ``blkz_register``, by which block driver by which a block driver > +registers to pstore_blk. Note that, block driver should register to > pstore_blk Note that the block driver should > +after block device has registered. Block driver transfers a structure The block driver > +``blkz_info`` which is defined in *linux/pstore_blk.h*. > + > +The following key members of ``struct blkz_info`` may be of interest to you. > + > +blkdev > +~~ > + > +The block device to use. Most of the time, it is a partition of block device. > +It's ok to keep it as NULL if you passing ``read`` and ``write`` in > blkz_info as if you are passing > +``blkdev`` is used by blkz_default_general_read/write. If both of ``blkdev``, > +``read`` and ``write`` are NULL, no block device is effective and the data > will > +be saved in ddr buffer. what is ddr buffer? > + > +It accept the following variants: > + > +1. device number in hexadecimal represents itself no itself; no > + leading 0x, for example b302. > +#. /dev/ represents the device number of disk > +#. /dev/ represents the device number of partition - > device > + number of disk plus the partition number > +#. /dev/p - same as the above, that form is used when > disk above; this form > + name of partitioned disk ends on a digit. ends with a digit. > +#. PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF representing the unique id > of > + a partition if the partition table provides it. The UUID may be either an > + EFI/GPT UUID, or refer to an MSDOS partition using the format -PP, > + where is a zero-filled hex representation of the 32-bit > + "NT disk signature", and PP is a zero-filled hex representation of the > + 1-based partition number. > +#. PARTUUID=/PARTNROFF= to select a partition in relation to a > + partition with a known unique id. > +#. : major and minor number of the device separated by a colon. > + > +See more on section **read/write**. in section > + > +total_size > +~~ > + > +The total size in bytes of block device used for pstore_blk. It **MUST** be > less > +than or equal to size of block device if ``blkdev`` valid. It **MUST** be a > +multiple of 4096. If ``total_size`` is zero with ``blkdev``, ``total_size`` > will be > +set to equal to size of ``blkdev``. > + > +The block device area is divided into many chunks, and each event writes a > chunk > +of information. > + > +dmesg_size > +~~ > + > +The chunk size in bytes for dmesg(oops/panic). It **MUST** be a multiple of > +SECTOR_SIZE (Most of the time, the SECTOR_SIZE is 512). If you don't need > dmesg, > +you are safely to set it to 0. you can safely > + > +NOTE that, the remaining space, except ``pmsg_size`` and others, belongs to > +dmesg. It means that there are multiple chunks for dmesg. > + > +Psotre_blk will log to dmesg chunks one by one, and always overwrite the > oldest Pstore_blk > +chunk if no free chunk. > + > +pmsg_size > +~ > + > +The chunk size in bytes for pmsg. It **MUST** be a multiple of SECTOR_SIZE > (Most > +of the time, the SECTOR_SIZE is 512). If you don't need pmsg, you ar
[PATCH][v2] time: Introduce jiffies64_to_msecs()
there is a similar helper in net/netfilter/nf_tables_api.c, this maybe become a common request someday, so move it to time.c Signed-off-by: Zhang Yu Signed-off-by: Li RongQing --- v1-->v2: using jiffies64_to_msecs in nf_tables_api.c include/linux/jiffies.h | 1 + kernel/time/time.c| 10 ++ net/netfilter/nf_tables_api.c | 4 +--- 3 files changed, 12 insertions(+), 3 deletions(-) diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h index fa928242567d..1b6d31da7cbc 100644 --- a/include/linux/jiffies.h +++ b/include/linux/jiffies.h @@ -297,6 +297,7 @@ static inline u64 jiffies_to_nsecs(const unsigned long j) } extern u64 jiffies64_to_nsecs(u64 j); +extern u64 jiffies64_to_msecs(u64 j); extern unsigned long __msecs_to_jiffies(const unsigned int m); #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ) diff --git a/kernel/time/time.c b/kernel/time/time.c index 2edb5088a70b..0083eb711fb7 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -719,6 +719,16 @@ u64 jiffies64_to_nsecs(u64 j) } EXPORT_SYMBOL(jiffies64_to_nsecs); +u64 jiffies64_to_msecs(const u64 j) +{ +#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ) + return (MSEC_PER_SEC / HZ) * j; +#else + return div_u64(j * HZ_TO_MSEC_NUM, HZ_TO_MSEC_DEN); +#endif +} +EXPORT_SYMBOL(jiffies64_to_msecs); + /** * nsecs_to_jiffies64 - Convert nsecs in u64 to jiffies64 * diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index e1a88ba2249e..8763b2798788 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -3184,9 +3184,7 @@ static int nf_msecs_to_jiffies64(const struct nlattr *nla, u64 *result) static __be64 nf_jiffies64_to_msecs(u64 input) { - u64 ms = jiffies64_to_nsecs(input); - - return cpu_to_be64(div_u64(ms, NSEC_PER_MSEC)); + return cpu_to_be64(jiffies64_to_msecs(input)); } static int nf_tables_fill_set(struct sk_buff *skb, const struct nft_ctx *ctx, -- 2.16.2
[PATCH v5 perf,bpf 02/15] bpf: libbpf: introduce bpf_program__get_prog_info_linear()
Currently, bpf_prog_info includes 9 arrays. The user has the option to fetch any combination of these arrays. However, this requires a lot of handling of these arrays. This work becomes more tricky when we need to store bpf_prog_info to a file, because these arrays are allocated independently. This patch introduces struct bpf_prog_info_linear, which stores arrays of bpf_prog_info in continues memory. Helper functions are introduced to unify the work to get different information of bpf_prog_info. Specifically, bpf_program__get_prog_info_linear() allows the user to select which arrays to fetch, and handles details for the user. Plesae see the comments before enum bpf_prog_info_array for more details and examples. Cc: Daniel Borkmann Cc: Alexei Starovoitov Signed-off-by: Song Liu --- tools/lib/bpf/libbpf.c | 251 +++ tools/lib/bpf/libbpf.h | 63 ++ tools/lib/bpf/libbpf.map | 3 + 3 files changed, 317 insertions(+) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index b38dcbe7460a..fa12729de283 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -112,6 +112,11 @@ void libbpf_print(enum libbpf_print_level level, const char *format, ...) # define LIBBPF_ELF_C_READ_MMAP ELF_C_READ #endif +static inline __u64 ptr_to_u64(const void *ptr) +{ + return (__u64) (unsigned long) ptr; +} + struct bpf_capabilities { /* v4.14: kernel support for program & map names. */ __u32 name:1; @@ -2997,3 +3002,249 @@ bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t page_size, ring_buffer_write_tail(header, data_tail); return ret; } + +struct bpf_prog_info_array_desc { + int array_offset; /* e.g. offset of jited_prog_insns */ + int count_offset; /* e.g. offset of jited_prog_len */ + int size_offset;/* > 0: offset of rec size, +* < 0: fix size of -size_offset +*/ +}; + +static struct bpf_prog_info_array_desc bpf_prog_info_array_desc[] = { + [BPF_PROG_INFO_JITED_INSNS] = { + offsetof(struct bpf_prog_info, jited_prog_insns), + offsetof(struct bpf_prog_info, jited_prog_len), + -1, + }, + [BPF_PROG_INFO_XLATED_INSNS] = { + offsetof(struct bpf_prog_info, xlated_prog_insns), + offsetof(struct bpf_prog_info, xlated_prog_len), + -1, + }, + [BPF_PROG_INFO_MAP_IDS] = { + offsetof(struct bpf_prog_info, map_ids), + offsetof(struct bpf_prog_info, nr_map_ids), + -(int)sizeof(__u32), + }, + [BPF_PROG_INFO_JITED_KSYMS] = { + offsetof(struct bpf_prog_info, jited_ksyms), + offsetof(struct bpf_prog_info, nr_jited_ksyms), + -(int)sizeof(__u64), + }, + [BPF_PROG_INFO_JITED_FUNC_LENS] = { + offsetof(struct bpf_prog_info, jited_func_lens), + offsetof(struct bpf_prog_info, nr_jited_func_lens), + -(int)sizeof(__u32), + }, + [BPF_PROG_INFO_FUNC_INFO] = { + offsetof(struct bpf_prog_info, func_info), + offsetof(struct bpf_prog_info, nr_func_info), + offsetof(struct bpf_prog_info, func_info_rec_size), + }, + [BPF_PROG_INFO_LINE_INFO] = { + offsetof(struct bpf_prog_info, line_info), + offsetof(struct bpf_prog_info, nr_line_info), + offsetof(struct bpf_prog_info, line_info_rec_size), + }, + [BPF_PROG_INFO_JITED_LINE_INFO] = { + offsetof(struct bpf_prog_info, jited_line_info), + offsetof(struct bpf_prog_info, nr_jited_line_info), + offsetof(struct bpf_prog_info, jited_line_info_rec_size), + }, + [BPF_PROG_INFO_PROG_TAGS] = { + offsetof(struct bpf_prog_info, prog_tags), + offsetof(struct bpf_prog_info, nr_prog_tags), + -(int)sizeof(__u8) * BPF_TAG_SIZE, + }, + +}; + +static __u32 bpf_prog_info_read_offset_u32(struct bpf_prog_info *info, int offset) +{ + __u32 *array = (__u32 *)info; + + if (offset >= 0) + return array[offset / sizeof(__u32)]; + return -(int)offset; +} + +static __u64 bpf_prog_info_read_offset_u64(struct bpf_prog_info *info, int offset) +{ + __u64 *array = (__u64 *)info; + + if (offset >= 0) + return array[offset / sizeof(__u64)]; + return -(int)offset; +} + +static void bpf_prog_info_set_offset_u32(struct bpf_prog_info *info, int offset, +__u32 val) +{ + __u32 *array = (__u32 *)info; + + if (offset >= 0) + array[offset / sizeof(__u32)] = val; +} + +static void bpf_prog_info_set_offset_u64(struct bpf_prog_info *info, int offset, +__u64 val) +{ + __u6
[PATCH v5 perf,bpf 01/15] perf, bpf: consider events with attr.bpf_event as side-band events
Events with bpf_event should be considered as side-band event, as they carry information about BPF programs. Fixes: 6ee52e2a3fe4 ("perf, bpf: Introduce PERF_RECORD_BPF_EVENT") Signed-off-by: Song Liu --- kernel/events/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 795bfab6f8a3..6a17584810ca 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -4238,7 +4238,8 @@ static bool is_sb_event(struct perf_event *event) if (attr->mmap || attr->mmap_data || attr->mmap2 || attr->comm || attr->comm_exec || attr->task || attr->ksymbol || - attr->context_switch) + attr->context_switch || + attr->bpf_event) return true; return false; } -- 2.17.1
[PATCH v5 perf,bpf 05/15] perf: change prototype of perf_event__synthesize_bpf_events()
This patch changes the arguments of perf_event__synthesize_bpf_events() to include perf_session* instead of perf_tool*. perf_session will be used in the next patch. Signed-off-by: Song Liu --- tools/perf/builtin-record.c | 2 +- tools/perf/builtin-top.c| 2 +- tools/perf/util/bpf-event.c | 8 +--- tools/perf/util/bpf-event.h | 4 ++-- 4 files changed, 9 insertions(+), 7 deletions(-) diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c index 88ea11d57c6f..2355e0a9eda0 100644 --- a/tools/perf/builtin-record.c +++ b/tools/perf/builtin-record.c @@ -1083,7 +1083,7 @@ static int record__synthesize(struct record *rec, bool tail) return err; } - err = perf_event__synthesize_bpf_events(tool, process_synthesized_event, + err = perf_event__synthesize_bpf_events(session, process_synthesized_event, machine, opts); if (err < 0) pr_warning("Couldn't synthesize bpf events.\n"); diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c index 5a486d4de56e..27d8d42e0a4d 100644 --- a/tools/perf/builtin-top.c +++ b/tools/perf/builtin-top.c @@ -1216,7 +1216,7 @@ static int __cmd_top(struct perf_top *top) init_process_thread(top); - ret = perf_event__synthesize_bpf_events(&top->tool, perf_event__process, + ret = perf_event__synthesize_bpf_events(top->session, perf_event__process, &top->session->machines.host, &top->record_opts); if (ret < 0) diff --git a/tools/perf/util/bpf-event.c b/tools/perf/util/bpf-event.c index e6dfb95029e5..ff7ee149ec46 100644 --- a/tools/perf/util/bpf-event.c +++ b/tools/perf/util/bpf-event.c @@ -10,6 +10,7 @@ #include "debug.h" #include "symbol.h" #include "machine.h" +#include "session.h" #define ptr_to_u64(ptr)((__u64)(unsigned long)(ptr)) @@ -42,7 +43,7 @@ int machine__process_bpf_event(struct machine *machine __maybe_unused, * -1 for failures; * -2 for lack of kernel support. */ -static int perf_event__synthesize_one_bpf_prog(struct perf_tool *tool, +static int perf_event__synthesize_one_bpf_prog(struct perf_session *session, perf_event__handler_t process, struct machine *machine, int fd, @@ -52,6 +53,7 @@ static int perf_event__synthesize_one_bpf_prog(struct perf_tool *tool, struct ksymbol_event *ksymbol_event = &event->ksymbol_event; struct bpf_event *bpf_event = &event->bpf_event; struct bpf_prog_info_linear *info_linear; + struct perf_tool *tool = session->tool; struct bpf_prog_info *info; struct btf *btf = NULL; bool has_btf = false; @@ -175,7 +177,7 @@ static int perf_event__synthesize_one_bpf_prog(struct perf_tool *tool, return err ? -1 : 0; } -int perf_event__synthesize_bpf_events(struct perf_tool *tool, +int perf_event__synthesize_bpf_events(struct perf_session *session, perf_event__handler_t process, struct machine *machine, struct record_opts *opts) @@ -209,7 +211,7 @@ int perf_event__synthesize_bpf_events(struct perf_tool *tool, continue; } - err = perf_event__synthesize_one_bpf_prog(tool, process, + err = perf_event__synthesize_one_bpf_prog(session, process, machine, fd, event, opts); close(fd); diff --git a/tools/perf/util/bpf-event.h b/tools/perf/util/bpf-event.h index 7890067e1a37..6698683612a7 100644 --- a/tools/perf/util/bpf-event.h +++ b/tools/perf/util/bpf-event.h @@ -15,7 +15,7 @@ struct record_opts; int machine__process_bpf_event(struct machine *machine, union perf_event *event, struct perf_sample *sample); -int perf_event__synthesize_bpf_events(struct perf_tool *tool, +int perf_event__synthesize_bpf_events(struct perf_session *session, perf_event__handler_t process, struct machine *machine, struct record_opts *opts); @@ -27,7 +27,7 @@ static inline int machine__process_bpf_event(struct machine *machine __maybe_unu return 0; } -static inline int perf_event__synthesize_bpf_events(struct perf_tool *tool __maybe_unused, +static inline int perf_event__synthesize_bpf_events(struct perf_session *session __maybe_unused, perf_event__handler_t process __maybe_unused, struct machine *machine __maybe_unused,
[PATCH v5 perf,bpf 00/15] perf annotation of BPF programs
Changes v4 to v5: 1. Rebase to latest bpf-next; 2. Add dependency of 94816add0005 from Arnaldo's tree; 3. More details in change logs; 4. Add perf_env__init() to init bpf related lock and rbtrees; 5. Small clean ups. Changes v3 to v4: 1. Incorporate feedbacks from Jiri and Namhyung; 2. Fixed compilation error with different feature-disassembler-four-args; 3. Split some patches to smaller patches; 4. Improved error handleing in symbol__disassemble_bpf(); 5. Made side band thread more generic; 6. Added comments as suggested. Changes v2 to v3: 1. Remove unnecessary include in header files; 2. Improved error handling; 3. Better naming of functions, variables, etc.; 4. Enable bpf events by default for perf-top. Changes v1 to v2: 1. Fix compilation error with different feature-disassembler-four-args; 2. Fix a segfault in perf-record; 3. Split patches 5/9 and 6/9 so that perf_env changes and perf.data changes are in separate patches. This series enables annotation of BPF programs in perf. perf tool gathers information via sys_bpf and (optionally) stores them in perf.data as headers. Patch 1/15 fixes a minor issue in kernel; Patch 2/15 to 4/15 introduce new helper functions and use them in perf and bpftool; Patch 5/15 to 9/15 saves information of bpf program in perf_env; Patch 10/15 adds --bpf-event options to perf-top; Patch 11/15 to 13/15 enables annotation of bpf progs based on information gathered in 5/15 to 9/15; Patch 14/15 introduces side band polling thread that gathers information for special kernel events during perf-record or perf-top. Patch 15/15 handles information of short living BPF program using the new side band polling thread. Commands tested during developments are perf-top, perf-record, perf-report, and perf-annotate. = Note on patch dependency This set has dependency in both bpf-next tree and tip/perf/core. Current version is developed on bpf-next tree with the following commits cherry-picked from tip/perf/core (or acme/perf/core): (from 1/11 to 11/11) commit 76193a94522f ("perf, bpf: Introduce PERF_RECORD_KSYMBOL") commit d764ac646491 ("tools headers uapi: Sync tools/include/uapi/linux/perf_event.h") commit 6ee52e2a3fe4 ("perf, bpf: Introduce PERF_RECORD_BPF_EVENT") commit df063c83aa2c ("tools headers uapi: Sync tools/include/uapi/linux/perf_event.h") commit 9aa0bfa370b2 ("perf tools: Handle PERF_RECORD_KSYMBOL") commit 45178a928a4b ("perf tools: Handle PERF_RECORD_BPF_EVENT") commit 7b612e291a5a ("perf tools: Synthesize PERF_RECORD_* for loaded BPF programs") commit a40b95bcd30c ("perf top: Synthesize BPF events for pre-existing loaded BPF programs") commit 6934058d9fb6 ("bpf: Add module name [bpf] to ksymbols for bpf programs") commit 811184fb6977 ("perf bpf: Fix synthesized PERF_RECORD_KSYMBOL/BPF_EVENT") comimt 94816add0005 ("perf tools: Add perf_exe() helper to find perf binary") This set is also available at: https://github.com/liu-song-6/linux/tree/bpf-annotation Thanks!! Song Liu (15): perf, bpf: consider events with attr.bpf_event as side-band events bpf: libbpf: introduce bpf_program__get_prog_info_linear() bpf: bpftool: use bpf_program__get_prog_info_linear() in prog.c:do_dump() perf, bpf: synthesize bpf events with bpf_program__get_prog_info_linear() perf: change prototype of perf_event__synthesize_bpf_events() perf, bpf: save bpf_prog_info in a rbtree in perf_env perf, bpf: save bpf_prog_info information as headers to perf.data perf, bpf: save btf in a rbtree in perf_env perf, bpf: save btf information as headers to perf.data perf-top: add option --no-bpf-event perf: add -lopcodes to feature-libbfd perf, bpf: enable annotation of bpf program perf, bpf: process PERF_BPF_EVENT_PROG_LOAD for annotation perf: introduce side band thread perf, bpf: save bpf_prog_info and btf of short living bpf programs kernel/events/core.c | 3 +- tools/bpf/bpftool/prog.c | 266 +++ tools/build/Makefile.feature | 6 +- tools/lib/bpf/libbpf.c | 251 + tools/lib/bpf/libbpf.h | 63 tools/lib/bpf/libbpf.map | 3 + tools/perf/Makefile.config | 6 +- tools/perf/builtin-record.c | 12 +- tools/perf/builtin-top.c | 15 +- tools/perf/perf.c| 1 + tools/perf/util/annotate.c | 150 - tools/perf/util/bpf-event.c | 301 +-- tools/perf/util/bpf-event.h | 36 - tools/perf/util/dso.c| 1 + tools/perf/util/dso.h| 33 ++-- tools/perf/util/env.c| 149 + tools/perf/util/env.h| 22 +++ tools/perf/util/evlist.c | 105 tools/perf/util/evlist.h | 13 ++ tools/perf/util/header.c | 249 - tools/perf/util/header.h | 2 + tools/perf/util/session
[PATCH v5 perf,bpf 09/15] perf, bpf: save btf information as headers to perf.data
This patch enables perf-record to save btf information as headers to perf.data A new header type HEADER_BPF_BTF is introduced for this data. Signed-off-by: Song Liu --- tools/perf/util/header.c | 108 ++- tools/perf/util/header.h | 1 + 2 files changed, 108 insertions(+), 1 deletion(-) diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c index 16f5bedb0b7d..3df8428c4c91 100644 --- a/tools/perf/util/header.c +++ b/tools/perf/util/header.c @@ -1121,6 +1121,45 @@ static int write_bpf_prog_info(struct feat_fd *ff, return ret; } +static int write_bpf_btf(struct feat_fd *ff, +struct perf_evlist *evlist __maybe_unused) +{ + struct perf_env *env = &ff->ph->env; + struct rb_root *root; + struct rb_node *next; + u32 count = 0; + int ret; + + down_read(&env->bpf_progs.lock); + + root = &env->bpf_progs.btfs; + next = rb_first(root); + while (next) { + ++count; + next = rb_next(next); + } + + ret = do_write(ff, &count, sizeof(count)); + + if (ret < 0) + goto out; + + next = rb_first(root); + while (next) { + struct btf_node *node; + + node = rb_entry(next, struct btf_node, rb_node); + next = rb_next(&node->rb_node); + ret = do_write(ff, &node->id, + sizeof(u32) * 2 + node->data_size); + if (ret < 0) + goto out; + } +out: + up_read(&env->bpf_progs.lock); + return ret; +} + static int cpu_cache_level__sort(const void *a, const void *b) { struct cpu_cache_level *cache_a = (struct cpu_cache_level *)a; @@ -1624,6 +1663,28 @@ static void print_bpf_prog_info(struct feat_fd *ff, FILE *fp) up_read(&env->bpf_progs.lock); } +static void print_bpf_btf(struct feat_fd *ff, FILE *fp) +{ + struct perf_env *env = &ff->ph->env; + struct rb_root *root; + struct rb_node *next; + + down_read(&env->bpf_progs.lock); + + root = &env->bpf_progs.btfs; + next = rb_first(root); + + while (next) { + struct btf_node *node; + + node = rb_entry(next, struct btf_node, rb_node); + next = rb_next(&node->rb_node); + fprintf(fp, "# btf info of id %u\n", node->id); + } + + up_read(&env->bpf_progs.lock); +} + static void free_event_desc(struct perf_evsel *events) { struct perf_evsel *evsel; @@ -2726,6 +2787,50 @@ static int process_bpf_prog_info(struct feat_fd *ff, return err; } +static int process_bpf_btf(struct feat_fd *ff, void *data __maybe_unused) +{ + struct perf_env *env = &ff->ph->env; + u32 count, i; + + if (do_read_u32(ff, &count)) + return -1; + + down_write(&env->bpf_progs.lock); + + for (i = 0; i < count; ++i) { + struct btf_node *node; + u32 id, data_size; + + if (do_read_u32(ff, &id)) + return -1; + if (do_read_u32(ff, &data_size)) + return -1; + + node = malloc(sizeof(struct btf_node) + data_size); + if (!node) + return -1; + + node->id = id; + node->data_size = data_size; + + if (__do_read(ff, node->data, data_size)) { + free(node); + return -1; + } + + /* endian mismatch, drop the btf, continue */ + if (ff->ph->needs_swap) { + free(node); + continue; + } + + perf_env__insert_btf(env, node); + } + + up_write(&env->bpf_progs.lock); + return 0; +} + struct feature_ops { int (*write)(struct feat_fd *ff, struct perf_evlist *evlist); void (*print)(struct feat_fd *ff, FILE *fp); @@ -2786,7 +2891,8 @@ static const struct feature_ops feat_ops[HEADER_LAST_FEATURE] = { FEAT_OPR(SAMPLE_TIME, sample_time,false), FEAT_OPR(MEM_TOPOLOGY, mem_topology, true), FEAT_OPR(CLOCKID, clockid,false), - FEAT_OPR(BPF_PROG_INFO, bpf_prog_info, false) + FEAT_OPR(BPF_PROG_INFO, bpf_prog_info, false), + FEAT_OPR(BPF_BTF, bpf_btf,false) }; struct header_print_data { diff --git a/tools/perf/util/header.h b/tools/perf/util/header.h index 0785c91b4c3a..9e7d931f7c0d 100644 --- a/tools/perf/util/header.h +++ b/tools/perf/util/header.h @@ -40,6 +40,7 @@ enum { HEADER_MEM_TOPOLOGY, HEADER_CLOCKID, HEADER_BPF_PROG_INFO, + HEADER_BPF_BTF, HEADER_LAST_FEATURE, HEADER_FEAT_BITS= 256, }; -- 2.17.1
[PATCH v5 perf,bpf 08/15] perf, bpf: save btf in a rbtree in perf_env
btf contains information necessary to annotate bpf programs. This patch saves btf for bpf programs loaded in the system. Signed-off-by: Song Liu --- tools/perf/util/bpf-event.c | 24 ++ tools/perf/util/bpf-event.h | 7 tools/perf/util/env.c | 65 + tools/perf/util/env.h | 4 +++ 4 files changed, 100 insertions(+) diff --git a/tools/perf/util/bpf-event.c b/tools/perf/util/bpf-event.c index ce81b2c43a51..370b830f2433 100644 --- a/tools/perf/util/bpf-event.c +++ b/tools/perf/util/bpf-event.c @@ -34,6 +34,29 @@ int machine__process_bpf_event(struct machine *machine __maybe_unused, return 0; } +static int perf_env__fetch_btf(struct perf_env *env, + u32 btf_id, + struct btf *btf) +{ + struct btf_node *node; + u32 data_size; + const void *data; + + data = btf__get_raw_data(btf, &data_size); + + node = malloc(data_size + sizeof(struct btf_node)); + + if (!node) + return -1; + + node->id = btf_id; + node->data_size = data_size; + memcpy(node->data, data, data_size); + + perf_env__insert_btf(env, node); + return 0; +} + /* * Synthesize PERF_RECORD_KSYMBOL and PERF_RECORD_BPF_EVENT for one bpf * program. One PERF_RECORD_BPF_EVENT is generated for the program. And @@ -113,6 +136,7 @@ static int perf_event__synthesize_one_bpf_prog(struct perf_session *session, goto out; } has_btf = true; + perf_env__fetch_btf(env, info->btf_id, btf); } /* Synthesize PERF_RECORD_KSYMBOL */ diff --git a/tools/perf/util/bpf-event.h b/tools/perf/util/bpf-event.h index fad932f7404f..b9ec394dc7c7 100644 --- a/tools/perf/util/bpf-event.h +++ b/tools/perf/util/bpf-event.h @@ -16,6 +16,13 @@ struct bpf_prog_info_node { struct rb_node rb_node; }; +struct btf_node { + struct rb_node rb_node; + u32 id; + u32 data_size; + chardata[]; +}; + #ifdef HAVE_LIBBPF_SUPPORT int machine__process_bpf_event(struct machine *machine, union perf_event *event, struct perf_sample *sample); diff --git a/tools/perf/util/env.c b/tools/perf/util/env.c index cdda46f2633f..9e4ee7b0a3d1 100644 --- a/tools/perf/util/env.c +++ b/tools/perf/util/env.c @@ -63,6 +63,57 @@ struct bpf_prog_info_node *perf_env__find_bpf_prog_info(struct perf_env *env, return node; } +void perf_env__insert_btf(struct perf_env *env, struct btf_node *btf_node) +{ + struct rb_node *parent = NULL; + __u32 btf_id = btf_node->id; + struct btf_node *node; + struct rb_node **p; + + down_write(&env->bpf_progs.lock); + p = &env->bpf_progs.btfs.rb_node; + + while (*p != NULL) { + parent = *p; + node = rb_entry(parent, struct btf_node, rb_node); + if (btf_id < node->id) { + p = &(*p)->rb_left; + } else if (btf_id > node->id) { + p = &(*p)->rb_right; + } else { + pr_debug("duplicated btf %u\n", btf_id); + goto out; + } + } + + rb_link_node(&btf_node->rb_node, parent, p); + rb_insert_color(&btf_node->rb_node, &env->bpf_progs.btfs); +out: + up_write(&env->bpf_progs.lock); +} + +struct btf_node *perf_env__find_btf(struct perf_env *env, __u32 btf_id) +{ + struct btf_node *node = NULL; + struct rb_node *n; + + down_read(&env->bpf_progs.lock); + n = env->bpf_progs.btfs.rb_node; + + while (n) { + node = rb_entry(n, struct btf_node, rb_node); + if (btf_id < node->id) + n = n->rb_left; + else if (btf_id > node->id) + n = n->rb_right; + else + break; + } + + up_read(&env->bpf_progs.lock); + return node; +} + /* purge data in bpf_progs.infos tree */ static void perf_env__purge_bpf(struct perf_env *env) { @@ -82,6 +133,19 @@ static void perf_env__purge_bpf(struct perf_env *env) rb_erase_init(&node->rb_node, root); free(node); } + + root = &env->bpf_progs.btfs; + next = rb_first(root); + + while (next) { + struct btf_node *node; + + node = rb_entry(next, struct btf_node, rb_node); + next = rb_next(&node->rb_node); + rb_erase_init(&node->rb_node, root); + free(node); + } + up_write(&env->bpf_progs.lock); } @@ -119,6 +183,7 @@ void perf_env__exit(struct perf_env *env) void perf_env__init(struct perf_env *env) { env->bpf_progs.infos = RB_ROOT; + env->bpf_progs.btfs = RB_ROOT; init_rwsem(&env->bpf_progs.lock); }
[PATCH v5 perf,bpf 07/15] perf, bpf: save bpf_prog_info information as headers to perf.data
This patch enables perf-record to save bpf_prog_info information as headers to perf.data. A new header type HEADER_BPF_PROG_INFO is introduced for this data. Signed-off-by: Song Liu --- tools/perf/util/header.c | 143 ++- tools/perf/util/header.h | 1 + 2 files changed, 143 insertions(+), 1 deletion(-) diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c index 4b88de5e9192..16f5bedb0b7d 100644 --- a/tools/perf/util/header.c +++ b/tools/perf/util/header.c @@ -18,6 +18,7 @@ #include #include #include +#include #include "evlist.h" #include "evsel.h" @@ -39,6 +40,7 @@ #include "tool.h" #include "time-utils.h" #include "units.h" +#include "bpf-event.h" #include "sane_ctype.h" @@ -1074,6 +1076,51 @@ static int write_clockid(struct feat_fd *ff, sizeof(ff->ph->env.clockid_res_ns)); } +static int write_bpf_prog_info(struct feat_fd *ff, + struct perf_evlist *evlist __maybe_unused) +{ + struct perf_env *env = &ff->ph->env; + struct rb_root *root; + struct rb_node *next; + u32 count = 0; + int ret; + + down_read(&env->bpf_progs.lock); + + root = &env->bpf_progs.infos; + next = rb_first(root); + while (next) { + ++count; + next = rb_next(next); + } + + ret = do_write(ff, &count, sizeof(count)); + + if (ret < 0) + goto out; + + next = rb_first(root); + while (next) { + struct bpf_prog_info_node *node; + size_t len; + + node = rb_entry(next, struct bpf_prog_info_node, rb_node); + next = rb_next(&node->rb_node); + len = sizeof(struct bpf_prog_info_linear) + + node->info_linear->data_len; + + /* before writing to file, translate address to offset */ + bpf_program__bpil_addr_to_offs(node->info_linear); + ret = do_write(ff, node->info_linear, len); + bpf_program__bpil_offs_to_addr(node->info_linear); + if (ret < 0) + goto out; + } +out: + up_read(&env->bpf_progs.lock); + return ret; +} + static int cpu_cache_level__sort(const void *a, const void *b) { struct cpu_cache_level *cache_a = (struct cpu_cache_level *)a; @@ -1554,6 +1601,29 @@ static void print_clockid(struct feat_fd *ff, FILE *fp) ff->ph->env.clockid_res_ns * 1000); } +static void print_bpf_prog_info(struct feat_fd *ff, FILE *fp) +{ + struct perf_env *env = &ff->ph->env; + struct rb_root *root; + struct rb_node *next; + + down_read(&env->bpf_progs.lock); + + root = &env->bpf_progs.infos; + next = rb_first(root); + + while (next) { + struct bpf_prog_info_node *node; + + node = rb_entry(next, struct bpf_prog_info_node, rb_node); + next = rb_next(&node->rb_node); + fprintf(fp, "# bpf_prog_info of id %u\n", + node->info_linear->info.id); + } + + up_read(&env->bpf_progs.lock); +} + static void free_event_desc(struct perf_evsel *events) { struct perf_evsel *evsel; @@ -2586,6 +2656,76 @@ static int process_clockid(struct feat_fd *ff, return 0; } +static int process_bpf_prog_info(struct feat_fd *ff, +void *data __maybe_unused) +{ + struct bpf_prog_info_linear *info_linear; + struct bpf_prog_info_node *info_node; + struct perf_env *env = &ff->ph->env; + u32 count, i; + int err = -1; + + if (do_read_u32(ff, &count)) + return -1; + + down_write(&env->bpf_progs.lock); + + for (i = 0; i < count; ++i) { + u32 info_len, data_len; + + info_linear = NULL; + info_node = NULL; + if (do_read_u32(ff, &info_len)) + goto out; + if (do_read_u32(ff, &data_len)) + goto out; + + if (info_len > sizeof(struct bpf_prog_info)) { + pr_warning("detected invalid bpf_prog_info\n"); + goto out; + } + + info_linear = malloc(sizeof(struct bpf_prog_info_linear) + +data_len); + if (!info_linear) + goto out; + info_linear->info_len = sizeof(struct bpf_prog_info); + info_linear->data_len = data_len; + if (do_read_u64(ff, (u64 *)(&info_linear->arrays))) + goto out; + if (__do_read(ff, &info_linear->info, info_len)) + goto out; + if (info_len < sizeof(struct bpf_prog_info)) + memset(((void *)(&info_linear->info)) + info_len, 0, + sizeof(struct bpf_prog_info) -
[PATCH v5 perf,bpf 06/15] perf, bpf: save bpf_prog_info in a rbtree in perf_env
bpf_prog_info contains information necessary to annotate bpf programs. This patch saves bpf_prog_info for bpf programs loaded in the system. Some big picture of the next few patches: To fully annotate BPF programs with source code mapping, 4 different information are needed: 1) PERF_RECORD_KSYMBOL 2) PERF_RECORD_BPF_EVENT 3) bpf_prog_info 4) btf Before this set, 1) and 2) in the list are already saved to perf.data file. For BPF programs that are already loaded before perf run, 1) and 2) are synthesized by perf_event__synthesize_bpf_events(). For short living BPF programs, 1) and 2) are generated by kernel. This set handles 3) and 4) from the list. Again, it is necessary to handle existing BPF program and short living program separately. This patch handles 3) for exising BPF programs while synthesizing 1) and 2) in perf_event__synthesize_bpf_events(). These data are stored in perf_env. The next patch saves these data from perf_env to perf.data as headers. Similarly, the two patches after the next saves 4) of existing BPF programs to perf_env and perf.data. Another patch later will handle 3) and 4) for short living BPF programs by monitoring 1) and 2) in a dedicate thread. Signed-off-by: Song Liu --- tools/perf/perf.c | 1 + tools/perf/util/bpf-event.c | 32 +- tools/perf/util/bpf-event.h | 7 +++- tools/perf/util/env.c | 84 + tools/perf/util/env.h | 18 tools/perf/util/session.c | 1 + 6 files changed, 141 insertions(+), 2 deletions(-) diff --git a/tools/perf/perf.c b/tools/perf/perf.c index a11cb006f968..72df4b6fa36f 100644 --- a/tools/perf/perf.c +++ b/tools/perf/perf.c @@ -298,6 +298,7 @@ static int run_builtin(struct cmd_struct *p, int argc, const char **argv) use_pager = 1; commit_pager_choice(); + perf_env__init(&perf_env); perf_env__set_cmdline(&perf_env, argc, argv); status = p->fn(argc, argv); perf_config__exit(); diff --git a/tools/perf/util/bpf-event.c b/tools/perf/util/bpf-event.c index ff7ee149ec46..ce81b2c43a51 100644 --- a/tools/perf/util/bpf-event.c +++ b/tools/perf/util/bpf-event.c @@ -10,6 +10,7 @@ #include "debug.h" #include "symbol.h" #include "machine.h" +#include "env.h" #include "session.h" #define ptr_to_u64(ptr)((__u64)(unsigned long)(ptr)) @@ -54,17 +55,28 @@ static int perf_event__synthesize_one_bpf_prog(struct perf_session *session, struct bpf_event *bpf_event = &event->bpf_event; struct bpf_prog_info_linear *info_linear; struct perf_tool *tool = session->tool; + struct bpf_prog_info_node *info_node; struct bpf_prog_info *info; struct btf *btf = NULL; bool has_btf = false; + struct perf_env *env; u32 sub_prog_cnt, i; int err = 0; u64 arrays; + /* +* for perf-record and perf-report use header.env; +* otherwise, use global perf_env. +*/ + env = session->data ? &session->header.env : &perf_env; + arrays = 1UL << BPF_PROG_INFO_JITED_KSYMS; arrays |= 1UL << BPF_PROG_INFO_JITED_FUNC_LENS; arrays |= 1UL << BPF_PROG_INFO_FUNC_INFO; arrays |= 1UL << BPF_PROG_INFO_PROG_TAGS; + arrays |= 1UL << BPF_PROG_INFO_JITED_INSNS; + arrays |= 1UL << BPF_PROG_INFO_LINE_INFO; + arrays |= 1UL << BPF_PROG_INFO_JITED_LINE_INFO; info_linear = bpf_program__get_prog_info_linear(fd, arrays); if (IS_ERR_OR_NULL(info_linear)) { @@ -153,8 +165,8 @@ static int perf_event__synthesize_one_bpf_prog(struct perf_session *session, machine, process); } - /* Synthesize PERF_RECORD_BPF_EVENT */ if (opts->bpf_event) { + /* Synthesize PERF_RECORD_BPF_EVENT */ *bpf_event = (struct bpf_event){ .header = { .type = PERF_RECORD_BPF_EVENT, @@ -167,6 +179,24 @@ static int perf_event__synthesize_one_bpf_prog(struct perf_session *session, memcpy(bpf_event->tag, info->tag, BPF_TAG_SIZE); memset((void *)event + event->header.size, 0, machine->id_hdr_size); event->header.size += machine->id_hdr_size; + + /* save bpf_prog_info to env */ + info_node = malloc(sizeof(struct bpf_prog_info_node)); + + /* +* Do not bail out for !info_node, as we still want to +* call perf_tool__process_synth_event() +*/ + if (info_node) { + info_node->info_linear = info_linear; + perf_env__insert_bpf_prog_info(env, info_node); + info_linear = NULL; + } + + /* +* process after saving bpf_prog_info to env, so that +* required information is ready for look up +
[PATCH v5 perf,bpf 14/15] perf: introduce side band thread
This patch introduces side band thread that captures extended information for events like PERF_RECORD_BPF_EVENT. This new thread uses its own evlist that uses ring buffer with very low watermark for lower latency. In the next patch, we uses this thread to handle PERF_RECORD_BPF_EVENT. Signed-off-by: Song Liu --- tools/perf/builtin-record.c | 7 +++ tools/perf/builtin-top.c| 7 +++ tools/perf/util/evlist.c| 100 tools/perf/util/evlist.h| 13 + 4 files changed, 127 insertions(+) diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c index 2355e0a9eda0..d10c1d5a9e89 100644 --- a/tools/perf/builtin-record.c +++ b/tools/perf/builtin-record.c @@ -1106,6 +1106,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv) struct perf_data *data = &rec->data; struct perf_session *session; bool disabled = false, draining = false; + struct perf_evlist_sb_poll_args poll_args; int fd; atexit(record__sig_exit); @@ -1206,6 +1207,10 @@ static int __cmd_record(struct record *rec, int argc, const char **argv) goto out_child; } + poll_args.env = &session->header.env; + poll_args.done = &done; + perf_evlist__start_polling_thread(&rec->opts.target, &poll_args); + err = record__synthesize(rec, false); if (err < 0) goto out_child; @@ -1456,6 +1461,8 @@ static int __cmd_record(struct record *rec, int argc, const char **argv) out_delete_session: perf_session__delete(session); + + perf_evlist__stop_polling_thread(); return status; } diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c index ccdf5689452f..f41545445917 100644 --- a/tools/perf/builtin-top.c +++ b/tools/perf/builtin-top.c @@ -1524,6 +1524,7 @@ int cmd_top(int argc, const char **argv) "number of thread to run event synthesize"), OPT_END() }; + struct perf_evlist_sb_poll_args poll_args; const char * const top_usage[] = { "perf top []", NULL @@ -1654,8 +1655,14 @@ int cmd_top(int argc, const char **argv) top.record_opts.bpf_event = !top.no_bpf_event; + poll_args.env = &perf_env; + poll_args.done = &done; + perf_evlist__start_polling_thread(target, &poll_args); + status = __cmd_top(&top); + perf_evlist__stop_polling_thread(); + out_delete_evlist: perf_evlist__delete(top.evlist); diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c index 8c902276d4b4..61b87c8111e6 100644 --- a/tools/perf/util/evlist.c +++ b/tools/perf/util/evlist.c @@ -19,6 +19,7 @@ #include "debug.h" #include "units.h" #include "asm/bug.h" +#include "bpf-event.h" #include #include @@ -1841,3 +1842,102 @@ struct perf_evsel *perf_evlist__reset_weak_group(struct perf_evlist *evsel_list, } return leader; } + +static struct perf_evlist *sb_evlist; +pthread_t poll_thread; + +int perf_evlist__new_side_band_event(struct perf_event_attr *attr) +{ + struct perf_evsel *evsel; + + if (!sb_evlist) + sb_evlist = perf_evlist__new(); + + if (!sb_evlist) + return -1; + + evsel = perf_evsel__new_idx(attr, sb_evlist->nr_entries); + if (!evsel) + goto out_err; + + perf_evlist__add(sb_evlist, evsel); + return 0; + +out_err: + perf_evlist__delete(sb_evlist); + return -1; +} + +static void *perf_evlist__poll_thread(void *arg) +{ + struct perf_evlist_sb_poll_args *args = arg; + int i; + + while (!*(args->done)) { + perf_evlist__poll(sb_evlist, 1000); + + for (i = 0; i < sb_evlist->nr_mmaps; i++) { + struct perf_mmap *map = &sb_evlist->mmap[i]; + union perf_event *event; + + if (perf_mmap__read_init(map)) + continue; + while ((event = perf_mmap__read_event(map)) != NULL) { + pr_debug("processing vip event of type %d\n", +event->header.type); + switch (event->header.type) { + default: + break; + } + perf_mmap__consume(map); + } + perf_mmap__read_done(map); + } + } + return NULL; +} + +int perf_evlist__start_polling_thread(struct target *target, + struct perf_evlist_sb_poll_args *args) +{ + struct perf_evsel *counter; + + if (sb_evlist == NULL) + return 0; + + if (perf_evlist__create_maps(sb_evlist, target)) + goto out_delete_evlist; + + evlist__for_each_entry(sb_
[PATCH v5 perf,bpf 12/15] perf, bpf: enable annotation of bpf program
This patch enables the annotation of bpf program. A new dso type DSO_BINARY_TYPE__BPF_PROG_INFO is introduced to for BPF programs. In symbol__disassemble(), DSO_BINARY_TYPE__BPF_PROG_INFO dso calls into a new function symbol__disassemble_bpf(), where annotation line information is filled based bpf_prog_info and btf saved in given perf_env. symbol__disassemble_bpf() uses libbfd to disassemble bpf programs. Signed-off-by: Song Liu --- tools/build/Makefile.feature | 6 +- tools/perf/Makefile.config | 4 + tools/perf/util/annotate.c | 150 ++- tools/perf/util/dso.c| 1 + tools/perf/util/dso.h| 33 +--- tools/perf/util/symbol.c | 1 + 6 files changed, 181 insertions(+), 14 deletions(-) diff --git a/tools/build/Makefile.feature b/tools/build/Makefile.feature index 5467c6bf9ceb..4f35e9ff1e00 100644 --- a/tools/build/Makefile.feature +++ b/tools/build/Makefile.feature @@ -71,7 +71,8 @@ FEATURE_TESTS_BASIC := \ sdt\ setns \ libopencsd \ -libaio +libaio \ +disassembler-four-args # FEATURE_TESTS_BASIC + FEATURE_TESTS_EXTRA is the complete list # of all feature tests @@ -118,7 +119,8 @@ FEATURE_DISPLAY ?= \ lzma \ get_cpuid \ bpf \ - libaio + libaio\ + disassembler-four-args # Set FEATURE_CHECK_(C|LD)FLAGS-all for all FEATURE_TESTS features. # If in the future we need per-feature checks/flags for features not diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config index e0bafbc273af..ab223239f1fb 100644 --- a/tools/perf/Makefile.config +++ b/tools/perf/Makefile.config @@ -796,6 +796,10 @@ ifdef HAVE_KVM_STAT_SUPPORT CFLAGS += -DHAVE_KVM_STAT_SUPPORT endif +ifeq ($(feature-disassembler-four-args), 1) +CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE +endif + ifeq (${IS_64_BIT}, 1) ifndef NO_PERF_READ_VDSO32 $(call feature_check,compile-32) diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c index 70de8f6b3aee..e467aa0ef0c2 100644 --- a/tools/perf/util/annotate.c +++ b/tools/perf/util/annotate.c @@ -9,6 +9,10 @@ #include #include +#include +#include +#include +#include #include "util.h" #include "ui/ui.h" #include "sort.h" @@ -22,6 +26,7 @@ #include "annotate.h" #include "evsel.h" #include "evlist.h" +#include "bpf-event.h" #include "block-range.h" #include "string2.h" #include "arch/common.h" @@ -29,6 +34,9 @@ #include #include #include +#include +#include +#include /* FIXME: For the HE_COLORSET */ #include "ui/browser.h" @@ -1672,6 +1680,144 @@ static int dso__disassemble_filename(struct dso *dso, char *filename, size_t fil return 0; } +static int symbol__disassemble_bpf(struct symbol *sym, + struct annotate_args *args) +{ + struct annotation *notes = symbol__annotation(sym); + struct annotation_options *opts = args->options; + struct bpf_prog_info_linear *info_linear; + struct bpf_prog_linfo *prog_linfo = NULL; + struct bpf_prog_info_node *info_node; + int len = sym->end - sym->start; + disassembler_ftype disassemble; + struct map *map = args->ms.map; + struct disassemble_info info; + struct dso *dso = map->dso; + int pc = 0, count, sub_id; + struct btf *btf = NULL; + char tpath[PATH_MAX]; + size_t buf_size; + int nr_skip = 0; + int ret = -1; + char *buf; + bfd *bfdf; + FILE *s; + + if (dso->binary_type != DSO_BINARY_TYPE__BPF_PROG_INFO) + return -1; + + pr_debug("%s: handling sym %s addr %lx len %lx\n", __func__, +sym->name, sym->start, sym->end - sym->start); + + memset(tpath, 0, sizeof(tpath)); + perf_exe(tpath, sizeof(tpath)); + + bfdf = bfd_openr(tpath, NULL); + assert(bfdf); + assert(bfd_check_format(bfdf, bfd_object)); + + s = open_memstream(&buf, &buf_size); + if (!s) + goto out; + init_disassemble_info(&info, s, + (fprintf_ftype) fprintf); + + info.arch = bfd_get_arch(bfdf); + info.mach = bfd_get_mach(bfdf); + + info_node = perf_env__find_bpf_prog_info(dso->bpf_prog.env, +dso->bpf_prog.id); + if (!info_node) + goto out; + info_linear = info_node->info_linear; + sub_id = dso->bpf_prog.sub_id; + + info.buffer = (void *)(info_linear->info.jited_prog_insns); + info.buffer_length = info_linear->info.jited_prog_len; + + if (info_linear->info.nr_line_info) + prog_linfo = bpf_prog_linfo__new(&info_linear->info); + + if (info_linear->info.btf_id)
[PATCH v5 perf,bpf 15/15] perf, bpf: save bpf_prog_info and btf of short living bpf programs
To fully annotate BPF programs with source code mapping, 4 different information are needed: 1) PERF_RECORD_KSYMBOL 2) PERF_RECORD_BPF_EVENT 3) bpf_prog_info 4) btf This patch handles 3) and 4) for short living BPF programs. For timely process of these information, a dedicated event is added to the side band evlist. When PERF_RECORD_BPF_EVENT is received via the side band event, the polling thread gathers 3) and 4) vis sys_bpf and store them in perf_env. These information are saved to perf.data at the end of perf-record. Signed-off-by: Song Liu --- tools/perf/builtin-record.c | 3 ++ tools/perf/builtin-top.c| 3 ++ tools/perf/util/bpf-event.c | 66 + tools/perf/util/bpf-event.h | 18 ++ tools/perf/util/evlist.c| 5 +++ 5 files changed, 95 insertions(+) diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c index d10c1d5a9e89..ce26d37c2871 100644 --- a/tools/perf/builtin-record.c +++ b/tools/perf/builtin-record.c @@ -1207,6 +1207,9 @@ static int __cmd_record(struct record *rec, int argc, const char **argv) goto out_child; } + if (opts->bpf_event) + bpf_event__add_polling_event(); + poll_args.env = &session->header.env; poll_args.done = &done; perf_evlist__start_polling_thread(&rec->opts.target, &poll_args); diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c index f41545445917..851c4dcce66f 100644 --- a/tools/perf/builtin-top.c +++ b/tools/perf/builtin-top.c @@ -1655,6 +1655,9 @@ int cmd_top(int argc, const char **argv) top.record_opts.bpf_event = !top.no_bpf_event; + if (top.record_opts.bpf_event) + bpf_event__add_polling_event(); + poll_args.env = &perf_env; poll_args.done = &done; perf_evlist__start_polling_thread(target, &poll_args); diff --git a/tools/perf/util/bpf-event.c b/tools/perf/util/bpf-event.c index 048ef00371ad..664906237ffb 100644 --- a/tools/perf/util/bpf-event.c +++ b/tools/perf/util/bpf-event.c @@ -12,6 +12,7 @@ #include "machine.h" #include "env.h" #include "session.h" +#include "evlist.h" #define ptr_to_u64(ptr)((__u64)(unsigned long)(ptr)) @@ -332,3 +333,68 @@ int perf_event__synthesize_bpf_events(struct perf_session *session, free(event); return err; } + +void perf_env__add_bpf_info(struct perf_env *env, u32 id) +{ + struct bpf_prog_info_linear *info_linear; + struct bpf_prog_info_node *info_node; + struct btf *btf = NULL; + u64 arrays; + u32 btf_id; + int fd; + + fd = bpf_prog_get_fd_by_id(id); + if (fd < 0) + return; + + arrays = 1UL << BPF_PROG_INFO_JITED_KSYMS; + arrays |= 1UL << BPF_PROG_INFO_JITED_FUNC_LENS; + arrays |= 1UL << BPF_PROG_INFO_FUNC_INFO; + arrays |= 1UL << BPF_PROG_INFO_PROG_TAGS; + arrays |= 1UL << BPF_PROG_INFO_JITED_INSNS; + arrays |= 1UL << BPF_PROG_INFO_LINE_INFO; + arrays |= 1UL << BPF_PROG_INFO_JITED_LINE_INFO; + + info_linear = bpf_program__get_prog_info_linear(fd, arrays); + if (IS_ERR_OR_NULL(info_linear)) { + pr_debug("%s: failed to get BPF program info. aborting\n", __func__); + goto out; + } + + btf_id = info_linear->info.btf_id; + + info_node = malloc(sizeof(struct bpf_prog_info_node)); + if (info_node) { + info_node->info_linear = info_linear; + perf_env__insert_bpf_prog_info(env, info_node); + } else + free(info_linear); + + if (btf_id == 0) + goto out; + + if (btf__get_from_id(btf_id, &btf)) { + pr_debug("%s: failed to get BTF of id %u, aborting\n", +__func__, btf_id); + goto out; + } + perf_env__fetch_btf(env, btf_id, btf); + +out: + free(btf); + close(fd); +} + +int bpf_event__add_polling_event(void) +{ + struct perf_event_attr attr = { + .type = PERF_TYPE_SOFTWARE, + .config = PERF_COUNT_SW_DUMMY, + .watermark= 1, + .bpf_event= 1, + .wakeup_watermark = 1, + .size = sizeof(attr), /* to capture ABI version */ + }; + + return perf_evlist__new_side_band_event(&attr); +} diff --git a/tools/perf/util/bpf-event.h b/tools/perf/util/bpf-event.h index b9ec394dc7c7..03a6f018e219 100644 --- a/tools/perf/util/bpf-event.h +++ b/tools/perf/util/bpf-event.h @@ -4,12 +4,17 @@ #include #include +#include +#include #include "event.h" struct machine; union perf_event; +struct perf_env; struct perf_sample; struct record_opts; +struct evlist; +struct target; struct bpf_prog_info_node { struct bpf_prog_info_linear *info_linear; @@ -31,6 +36,9 @@ int perf_event__synthesize_bpf_events(struct perf_session *session,
[PATCH v5 perf,bpf 11/15] perf: add -lopcodes to feature-libbfd
Both libbfd and libopcodes are distributed with binutil-dev/devel. When libbfd presents, it is OK to assume libopcodes also presents. This has been a safe assumption for bpftool. This patch adds -lopcodes to perf/Makefile.config. libopcodes will be used in the next commit for bpf annotation. Signed-off-by: Song Liu --- tools/perf/Makefile.config | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config index b441c88cafa1..e0bafbc273af 100644 --- a/tools/perf/Makefile.config +++ b/tools/perf/Makefile.config @@ -701,7 +701,7 @@ else endif ifeq ($(feature-libbfd), 1) - EXTLIBS += -lbfd + EXTLIBS += -lbfd -lopcodes else # we are on a system that requires -liberty and (maybe) -lz # to link against -lbfd; test each case individually here -- 2.17.1
[PATCH v5 perf,bpf 13/15] perf, bpf: process PERF_BPF_EVENT_PROG_LOAD for annotation
This patch adds processing of PERF_BPF_EVENT_PROG_LOAD, which sets proper DSO type/id/etc of memory regions mapped to BPF programs to DSO_BINARY_TYPE__BPF_PROG_INFO Signed-off-by: Song Liu --- tools/perf/util/bpf-event.c | 53 + 1 file changed, 53 insertions(+) diff --git a/tools/perf/util/bpf-event.c b/tools/perf/util/bpf-event.c index 370b830f2433..048ef00371ad 100644 --- a/tools/perf/util/bpf-event.c +++ b/tools/perf/util/bpf-event.c @@ -25,12 +25,65 @@ static int snprintf_hex(char *buf, size_t size, unsigned char *data, size_t len) return ret; } +static int machine__process_bpf_event_load(struct machine *machine __maybe_unused, + union perf_event *event, + struct perf_sample *sample __maybe_unused) +{ + struct bpf_prog_info_linear *info_linear; + struct bpf_prog_info_node *info_node; + struct perf_env *env = machine->env; + int id = event->bpf_event.id; + unsigned int i; + + /* perf-record, no need to handle bpf-event */ + if (env == NULL) + return 0; + + info_node = perf_env__find_bpf_prog_info(env, id); + if (!info_node) + return 0; + info_linear = info_node->info_linear; + + for (i = 0; i < info_linear->info.nr_jited_ksyms; i++) { + u64 *addrs = (u64 *)(info_linear->info.jited_ksyms); + u64 addr = addrs[i]; + struct map *map; + + map = map_groups__find(&machine->kmaps, addr); + + if (map) { + map->dso->binary_type = DSO_BINARY_TYPE__BPF_PROG_INFO; + map->dso->bpf_prog.id = id; + map->dso->bpf_prog.sub_id = i; + map->dso->bpf_prog.env = env; + } + } + return 0; +} + int machine__process_bpf_event(struct machine *machine __maybe_unused, union perf_event *event, struct perf_sample *sample __maybe_unused) { if (dump_trace) perf_event__fprintf_bpf_event(event, stdout); + + switch (event->bpf_event.type) { + case PERF_BPF_EVENT_PROG_LOAD: + return machine__process_bpf_event_load(machine, event, sample); + + case PERF_BPF_EVENT_PROG_UNLOAD: + /* +* Do not free bpf_prog_info and btf of the program here, +* as annotation still need them. They will be freed at +* the end of the session. +*/ + break; + default: + pr_debug("unexpected bpf_event type of %d\n", +event->bpf_event.type); + break; + } return 0; } -- 2.17.1
[PATCH v5 perf,bpf 03/15] bpf: bpftool: use bpf_program__get_prog_info_linear() in prog.c:do_dump()
This patches uses bpf_program__get_prog_info_linear() to simplify the logic in prog.c do_dump(). Cc: Daniel Borkmann Cc: Alexei Starovoitov Signed-off-by: Song Liu --- tools/bpf/bpftool/prog.c | 266 +-- 1 file changed, 59 insertions(+), 207 deletions(-) diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c index 0c35dd543d49..c09b960c0b67 100644 --- a/tools/bpf/bpftool/prog.c +++ b/tools/bpf/bpftool/prog.c @@ -401,41 +401,31 @@ static int do_show(int argc, char **argv) static int do_dump(int argc, char **argv) { - unsigned int finfo_rec_size, linfo_rec_size, jited_linfo_rec_size; - void *func_info = NULL, *linfo = NULL, *jited_linfo = NULL; - unsigned int nr_finfo, nr_linfo = 0, nr_jited_linfo = 0; + struct bpf_prog_info_linear *info_linear; struct bpf_prog_linfo *prog_linfo = NULL; - unsigned long *func_ksyms = NULL; - struct bpf_prog_info info = {}; - unsigned int *func_lens = NULL; + enum {DUMP_JITED, DUMP_XLATED} mode; const char *disasm_opt = NULL; - unsigned int nr_func_ksyms; - unsigned int nr_func_lens; + struct bpf_prog_info *info; struct dump_data dd = {}; - __u32 len = sizeof(info); + void *func_info = NULL; struct btf *btf = NULL; - unsigned int buf_size; char *filepath = NULL; bool opcodes = false; bool visual = false; char func_sig[1024]; unsigned char *buf; bool linum = false; - __u32 *member_len; - __u64 *member_ptr; + __u32 member_len; + __u64 arrays; ssize_t n; - int err; int fd; if (is_prefix(*argv, "jited")) { if (disasm_init()) return -1; - - member_len = &info.jited_prog_len; - member_ptr = &info.jited_prog_insns; + mode = DUMP_JITED; } else if (is_prefix(*argv, "xlated")) { - member_len = &info.xlated_prog_len; - member_ptr = &info.xlated_prog_insns; + mode = DUMP_XLATED; } else { p_err("expected 'xlated' or 'jited', got: %s", *argv); return -1; @@ -474,175 +464,50 @@ static int do_dump(int argc, char **argv) return -1; } - err = bpf_obj_get_info_by_fd(fd, &info, &len); - if (err) { - p_err("can't get prog info: %s", strerror(errno)); - return -1; - } - - if (!*member_len) { - p_info("no instructions returned"); - close(fd); - return 0; - } + if (mode == DUMP_JITED) + arrays = 1UL << BPF_PROG_INFO_JITED_INSNS; + else + arrays = 1UL << BPF_PROG_INFO_XLATED_INSNS; - buf_size = *member_len; + arrays |= 1UL << BPF_PROG_INFO_JITED_KSYMS; + arrays |= 1UL << BPF_PROG_INFO_JITED_FUNC_LENS; + arrays |= 1UL << BPF_PROG_INFO_FUNC_INFO; + arrays |= 1UL << BPF_PROG_INFO_LINE_INFO; + arrays |= 1UL << BPF_PROG_INFO_JITED_LINE_INFO; - buf = malloc(buf_size); - if (!buf) { - p_err("mem alloc failed"); - close(fd); + info_linear = bpf_program__get_prog_info_linear(fd, arrays); + close(fd); + if (IS_ERR_OR_NULL(info_linear)) { + p_err("can't get prog info: %s", strerror(errno)); return -1; } - nr_func_ksyms = info.nr_jited_ksyms; - if (nr_func_ksyms) { - func_ksyms = malloc(nr_func_ksyms * sizeof(__u64)); - if (!func_ksyms) { - p_err("mem alloc failed"); - close(fd); - goto err_free; - } - } - - nr_func_lens = info.nr_jited_func_lens; - if (nr_func_lens) { - func_lens = malloc(nr_func_lens * sizeof(__u32)); - if (!func_lens) { - p_err("mem alloc failed"); - close(fd); + info = &info_linear->info; + if (mode == DUMP_JITED) { + if (info->jited_prog_len == 0) { + p_info("no instructions returned"); goto err_free; } - } - - nr_finfo = info.nr_func_info; - finfo_rec_size = info.func_info_rec_size; - if (nr_finfo && finfo_rec_size) { - func_info = malloc(nr_finfo * finfo_rec_size); - if (!func_info) { - p_err("mem alloc failed"); - close(fd); + buf = (unsigned char *)(info->jited_prog_insns); + member_len = info->jited_prog_len; + } else {/* DUMP_XLATED */ + if (info->xlated_prog_len == 0) { + p_err("error retrieving insn dump: kernel.kptr_restrict set?"); goto err_free;
[PATCH v5 perf,bpf 10/15] perf-top: add option --no-bpf-event
bpf events should be tracked by default for perf-top. This patch makes it on by default, and adds option to disable bpf events. Signed-off-by: Song Liu --- tools/perf/builtin-top.c | 3 +++ tools/perf/util/top.h| 1 + 2 files changed, 4 insertions(+) diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c index 27d8d42e0a4d..ccdf5689452f 100644 --- a/tools/perf/builtin-top.c +++ b/tools/perf/builtin-top.c @@ -1492,6 +1492,7 @@ int cmd_top(int argc, const char **argv) "Display raw encoding of assembly instructions (default)"), OPT_BOOLEAN(0, "demangle-kernel", &symbol_conf.demangle_kernel, "Enable kernel symbol demangling"), + OPT_BOOLEAN(0, "no-bpf-event", &top.no_bpf_event, "do not record bpf events"), OPT_STRING(0, "objdump", &top.annotation_opts.objdump_path, "path", "objdump binary to use for disassembly and annotations"), OPT_STRING('M', "disassembler-style", &top.annotation_opts.disassembler_style, "disassembler style", @@ -1651,6 +1652,8 @@ int cmd_top(int argc, const char **argv) signal(SIGWINCH, winch_sig); } + top.record_opts.bpf_event = !top.no_bpf_event; + status = __cmd_top(&top); out_delete_evlist: diff --git a/tools/perf/util/top.h b/tools/perf/util/top.h index 19f95eaf75c8..862a37bd27ea 100644 --- a/tools/perf/util/top.h +++ b/tools/perf/util/top.h @@ -32,6 +32,7 @@ struct perf_top { bool use_tui, use_stdio; bool vmlinux_warned; bool dump_symtab; + bool no_bpf_event; struct hist_entry *sym_filter_entry; struct perf_evsel *sym_evsel; struct perf_session *session; -- 2.17.1
[PATCH v5 perf,bpf 04/15] perf, bpf: synthesize bpf events with bpf_program__get_prog_info_linear()
With bpf_program__get_prog_info_linear, we can simplify the logic that synthesizes bpf events. This patch doesn't change the behavior of the code. Signed-off-by: Song Liu --- tools/perf/util/bpf-event.c | 118 1 file changed, 40 insertions(+), 78 deletions(-) diff --git a/tools/perf/util/bpf-event.c b/tools/perf/util/bpf-event.c index 796ef793f4ce..e6dfb95029e5 100644 --- a/tools/perf/util/bpf-event.c +++ b/tools/perf/util/bpf-event.c @@ -3,7 +3,9 @@ #include #include #include +#include #include +#include #include "bpf-event.h" #include "debug.h" #include "symbol.h" @@ -49,99 +51,62 @@ static int perf_event__synthesize_one_bpf_prog(struct perf_tool *tool, { struct ksymbol_event *ksymbol_event = &event->ksymbol_event; struct bpf_event *bpf_event = &event->bpf_event; - u32 sub_prog_cnt, i, func_info_rec_size = 0; - u8 (*prog_tags)[BPF_TAG_SIZE] = NULL; - struct bpf_prog_info info = { .type = 0, }; - u32 info_len = sizeof(info); - void *func_infos = NULL; - u64 *prog_addrs = NULL; + struct bpf_prog_info_linear *info_linear; + struct bpf_prog_info *info; struct btf *btf = NULL; - u32 *prog_lens = NULL; bool has_btf = false; - char errbuf[512]; + u32 sub_prog_cnt, i; int err = 0; + u64 arrays; - /* Call bpf_obj_get_info_by_fd() to get sizes of arrays */ - err = bpf_obj_get_info_by_fd(fd, &info, &info_len); + arrays = 1UL << BPF_PROG_INFO_JITED_KSYMS; + arrays |= 1UL << BPF_PROG_INFO_JITED_FUNC_LENS; + arrays |= 1UL << BPF_PROG_INFO_FUNC_INFO; + arrays |= 1UL << BPF_PROG_INFO_PROG_TAGS; - if (err) { - pr_debug("%s: failed to get BPF program info: %s, aborting\n", -__func__, str_error_r(errno, errbuf, sizeof(errbuf))); + info_linear = bpf_program__get_prog_info_linear(fd, arrays); + if (IS_ERR_OR_NULL(info_linear)) { + info_linear = NULL; + pr_debug("%s: failed to get BPF program info. aborting\n", __func__); return -1; } - if (info_len < offsetof(struct bpf_prog_info, prog_tags)) { + + if (info_linear->info_len < offsetof(struct bpf_prog_info, prog_tags)) { pr_debug("%s: the kernel is too old, aborting\n", __func__); return -2; } + info = &info_linear->info; + /* number of ksyms, func_lengths, and tags should match */ - sub_prog_cnt = info.nr_jited_ksyms; - if (sub_prog_cnt != info.nr_prog_tags || - sub_prog_cnt != info.nr_jited_func_lens) + sub_prog_cnt = info->nr_jited_ksyms; + if (sub_prog_cnt != info->nr_prog_tags || + sub_prog_cnt != info->nr_jited_func_lens) return -1; /* check BTF func info support */ - if (info.btf_id && info.nr_func_info && info.func_info_rec_size) { + if (info->btf_id && info->nr_func_info && info->func_info_rec_size) { /* btf func info number should be same as sub_prog_cnt */ - if (sub_prog_cnt != info.nr_func_info) { + if (sub_prog_cnt != info->nr_func_info) { pr_debug("%s: mismatch in BPF sub program count and BTF function info count, aborting\n", __func__); - return -1; - } - if (btf__get_from_id(info.btf_id, &btf)) { - pr_debug("%s: failed to get BTF of id %u, aborting\n", __func__, info.btf_id); - return -1; + err = -1; + goto out; } - func_info_rec_size = info.func_info_rec_size; - func_infos = calloc(sub_prog_cnt, func_info_rec_size); - if (!func_infos) { - pr_debug("%s: failed to allocate memory for func_infos, aborting\n", __func__); - return -1; + if (btf__get_from_id(info->btf_id, &btf)) { + pr_debug("%s: failed to get BTF of id %u, aborting\n", __func__, info->btf_id); + err = -1; + btf = NULL; + goto out; } has_btf = true; } - /* -* We need address, length, and tag for each sub program. -* Allocate memory and call bpf_obj_get_info_by_fd() again -*/ - prog_addrs = calloc(sub_prog_cnt, sizeof(u64)); - if (!prog_addrs) { - pr_debug("%s: failed to allocate memory for prog_addrs, aborting\n", __func__); - goto out; - } - prog_lens = calloc(sub_prog_cnt, sizeof(u32)); - if (!prog_lens) { - pr_debug("%s: failed to allocate memory for prog_lens, aborting\n", __func__); - goto out; - } - prog_tags = calloc(sub_prog_cnt, BPF_TAG_SIZE); - if (
Re: [PATCH 4/5] ocxl: Remove superfluous 'extern' from headers
On 27/2/19 3:57 pm, Alastair D'Silva wrote: From: Alastair D'Silva The 'extern' keyword adds no value here. Signed-off-by: Alastair D'Silva Acked-by: Andrew Donnellan --- drivers/misc/ocxl/ocxl_internal.h | 54 +++ include/misc/ocxl.h | 36 ++--- 2 files changed, 44 insertions(+), 46 deletions(-) diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/misc/ocxl/ocxl_internal.h index a32f2151029f..321b29e77f45 100644 --- a/drivers/misc/ocxl/ocxl_internal.h +++ b/drivers/misc/ocxl/ocxl_internal.h @@ -16,7 +16,6 @@ extern struct pci_driver ocxl_pci_driver; - struct ocxl_fn { struct device dev; int bar_used[3]; @@ -92,41 +91,40 @@ struct ocxl_process_element { __be32 software_state; }; +struct ocxl_afu *ocxl_afu_get(struct ocxl_afu *afu); +void ocxl_afu_put(struct ocxl_afu *afu); -extern struct ocxl_afu *ocxl_afu_get(struct ocxl_afu *afu); -extern void ocxl_afu_put(struct ocxl_afu *afu); - -extern int ocxl_create_cdev(struct ocxl_afu *afu); -extern void ocxl_destroy_cdev(struct ocxl_afu *afu); -extern int ocxl_register_afu(struct ocxl_afu *afu); -extern void ocxl_unregister_afu(struct ocxl_afu *afu); +int ocxl_create_cdev(struct ocxl_afu *afu); +void ocxl_destroy_cdev(struct ocxl_afu *afu); +int ocxl_register_afu(struct ocxl_afu *afu); +void ocxl_unregister_afu(struct ocxl_afu *afu); -extern int ocxl_file_init(void); -extern void ocxl_file_exit(void); +int ocxl_file_init(void); +void ocxl_file_exit(void); -extern int ocxl_pasid_afu_alloc(struct ocxl_fn *fn, u32 size); -extern void ocxl_pasid_afu_free(struct ocxl_fn *fn, u32 start, u32 size); -extern int ocxl_actag_afu_alloc(struct ocxl_fn *fn, u32 size); -extern void ocxl_actag_afu_free(struct ocxl_fn *fn, u32 start, u32 size); +int ocxl_pasid_afu_alloc(struct ocxl_fn *fn, u32 size); +void ocxl_pasid_afu_free(struct ocxl_fn *fn, u32 start, u32 size); +int ocxl_actag_afu_alloc(struct ocxl_fn *fn, u32 size); +void ocxl_actag_afu_free(struct ocxl_fn *fn, u32 start, u32 size); -extern struct ocxl_context *ocxl_context_alloc(void); -extern int ocxl_context_init(struct ocxl_context *ctx, struct ocxl_afu *afu, +struct ocxl_context *ocxl_context_alloc(void); +int ocxl_context_init(struct ocxl_context *ctx, struct ocxl_afu *afu, struct address_space *mapping); -extern int ocxl_context_attach(struct ocxl_context *ctx, u64 amr); -extern int ocxl_context_mmap(struct ocxl_context *ctx, +int ocxl_context_attach(struct ocxl_context *ctx, u64 amr); +int ocxl_context_mmap(struct ocxl_context *ctx, struct vm_area_struct *vma); -extern int ocxl_context_detach(struct ocxl_context *ctx); -extern void ocxl_context_detach_all(struct ocxl_afu *afu); -extern void ocxl_context_free(struct ocxl_context *ctx); +int ocxl_context_detach(struct ocxl_context *ctx); +void ocxl_context_detach_all(struct ocxl_afu *afu); +void ocxl_context_free(struct ocxl_context *ctx); -extern int ocxl_sysfs_add_afu(struct ocxl_afu *afu); -extern void ocxl_sysfs_remove_afu(struct ocxl_afu *afu); +int ocxl_sysfs_add_afu(struct ocxl_afu *afu); +void ocxl_sysfs_remove_afu(struct ocxl_afu *afu); -extern int ocxl_afu_irq_alloc(struct ocxl_context *ctx, u64 *irq_offset); -extern int ocxl_afu_irq_free(struct ocxl_context *ctx, u64 irq_offset); -extern void ocxl_afu_irq_free_all(struct ocxl_context *ctx); -extern int ocxl_afu_irq_set_fd(struct ocxl_context *ctx, u64 irq_offset, +int ocxl_afu_irq_alloc(struct ocxl_context *ctx, u64 *irq_offset); +int ocxl_afu_irq_free(struct ocxl_context *ctx, u64 irq_offset); +void ocxl_afu_irq_free_all(struct ocxl_context *ctx); +int ocxl_afu_irq_set_fd(struct ocxl_context *ctx, u64 irq_offset, int eventfd); -extern u64 ocxl_afu_irq_get_addr(struct ocxl_context *ctx, u64 irq_offset); +u64 ocxl_afu_irq_get_addr(struct ocxl_context *ctx, u64 irq_offset); #endif /* _OCXL_INTERNAL_H_ */ diff --git a/include/misc/ocxl.h b/include/misc/ocxl.h index 9ff6ddc28e22..4544573cc93c 100644 --- a/include/misc/ocxl.h +++ b/include/misc/ocxl.h @@ -53,7 +53,7 @@ struct ocxl_fn_config { * Read the configuration space of a function and fill in a * ocxl_fn_config structure with all the function details */ -extern int ocxl_config_read_function(struct pci_dev *dev, +int ocxl_config_read_function(struct pci_dev *dev, struct ocxl_fn_config *fn); /* @@ -62,14 +62,14 @@ extern int ocxl_config_read_function(struct pci_dev *dev, * AFU indexes can be sparse, so a driver should check all indexes up * to the maximum found in the function description */ -extern int ocxl_config_check_afu_index(struct pci_dev *dev, +int ocxl_config_check_afu_index(struct pci_dev *dev, struct ocxl_fn_config *fn, int afu_idx); /* * Read the configuration space of a function for the AFU specified by * the index 'afu_idx'. F
Re: [PATCH 3/5] ocxl: read_pasid never returns an error, so make it void
On 27/2/19 3:57 pm, Alastair D'Silva wrote: From: Alastair D'Silva No need for a return value in read_pasid as it only returns 0. Signed-off-by: Alastair D'Silva Reviewed-by: Greg Kurz Acked-by: Andrew Donnellan --- drivers/misc/ocxl/config.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c index 0ee7856b033d..026ac2ac4f9c 100644 --- a/drivers/misc/ocxl/config.c +++ b/drivers/misc/ocxl/config.c @@ -68,7 +68,7 @@ static int find_dvsec_afu_ctrl(struct pci_dev *dev, u8 afu_idx) return 0; } -static int read_pasid(struct pci_dev *dev, struct ocxl_fn_config *fn) +static void read_pasid(struct pci_dev *dev, struct ocxl_fn_config *fn) { u16 val; int pos; @@ -89,7 +89,6 @@ static int read_pasid(struct pci_dev *dev, struct ocxl_fn_config *fn) out: dev_dbg(&dev->dev, "PASID capability:\n"); dev_dbg(&dev->dev, " Max PASID log = %d\n", fn->max_pasid_log); - return 0; } static int read_dvsec_tl(struct pci_dev *dev, struct ocxl_fn_config *fn) @@ -205,11 +204,7 @@ int ocxl_config_read_function(struct pci_dev *dev, struct ocxl_fn_config *fn) { int rc; - rc = read_pasid(dev, fn); - if (rc) { - dev_err(&dev->dev, "Invalid PASID configuration: %d\n", rc); - return -ENODEV; - } + read_pasid(dev, fn); rc = read_dvsec_tl(dev, fn); if (rc) { -- Andrew Donnellan OzLabs, ADL Canberra andrew.donnel...@au1.ibm.com IBM Australia Limited
Re: [PATCH 2/5] ocxl: Clean up printf formats
On 27/2/19 3:57 pm, Alastair D'Silva wrote: From: Alastair D'Silva Use %# instead of using a literal '0x' Signed-off-by: Alastair D'Silva Not hugely fussed either way, but today I learned about %#... Acked-by: Andrew Donnellan --- drivers/misc/ocxl/config.c | 6 +++--- drivers/misc/ocxl/context.c | 2 +- drivers/misc/ocxl/trace.h | 10 +- 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c index 8f2c5d8bd2ee..0ee7856b033d 100644 --- a/drivers/misc/ocxl/config.c +++ b/drivers/misc/ocxl/config.c @@ -178,9 +178,9 @@ static int read_dvsec_vendor(struct pci_dev *dev) pci_read_config_dword(dev, pos + OCXL_DVSEC_VENDOR_DLX_VERS, &dlx); dev_dbg(&dev->dev, "Vendor specific DVSEC:\n"); - dev_dbg(&dev->dev, " CFG version = 0x%x\n", cfg); - dev_dbg(&dev->dev, " TLX version = 0x%x\n", tlx); - dev_dbg(&dev->dev, " DLX version = 0x%x\n", dlx); + dev_dbg(&dev->dev, " CFG version = %#x\n", cfg); + dev_dbg(&dev->dev, " TLX version = %#x\n", tlx); + dev_dbg(&dev->dev, " DLX version = %#x\n", dlx); return 0; } diff --git a/drivers/misc/ocxl/context.c b/drivers/misc/ocxl/context.c index c10a940e3b38..3498a0199bde 100644 --- a/drivers/misc/ocxl/context.c +++ b/drivers/misc/ocxl/context.c @@ -134,7 +134,7 @@ static vm_fault_t ocxl_mmap_fault(struct vm_fault *vmf) vm_fault_t ret; offset = vmf->pgoff << PAGE_SHIFT; - pr_debug("%s: pasid %d address 0x%lx offset 0x%llx\n", __func__, + pr_debug("%s: pasid %d address %#lx offset %#llx\n", __func__, ctx->pasid, vmf->address, offset); if (offset < ctx->afu->irq_base_offset) diff --git a/drivers/misc/ocxl/trace.h b/drivers/misc/ocxl/trace.h index bcb7ff330c1e..8d2f53812edd 100644 --- a/drivers/misc/ocxl/trace.h +++ b/drivers/misc/ocxl/trace.h @@ -28,7 +28,7 @@ DECLARE_EVENT_CLASS(ocxl_context, __entry->tidr = tidr; ), - TP_printk("linux pid=%d spa=0x%p pasid=0x%x pidr=0x%x tidr=0x%x", + TP_printk("linux pid=%d spa=%p pasid=%#x pidr=%#x tidr=%#x", __entry->pid, __entry->spa, __entry->pasid, @@ -61,7 +61,7 @@ TRACE_EVENT(ocxl_terminate_pasid, __entry->rc = rc; ), - TP_printk("pasid=0x%x rc=%d", + TP_printk("pasid=%#x rc=%d", __entry->pasid, __entry->rc ) @@ -87,7 +87,7 @@ DECLARE_EVENT_CLASS(ocxl_fault_handler, __entry->tfc = tfc; ), - TP_printk("spa=%p pe=0x%llx dsisr=0x%llx dar=0x%llx tfc=0x%llx", + TP_printk("spa=%p pe=%#llx dsisr=%#llx dar=%#llx tfc=%#llx", __entry->spa, __entry->pe, __entry->dsisr, @@ -127,7 +127,7 @@ TRACE_EVENT(ocxl_afu_irq_alloc, __entry->irq_offset = irq_offset; ), - TP_printk("pasid=0x%x irq_id=%d virq=%u hw_irq=%d irq_offset=0x%llx", + TP_printk("pasid=%#x irq_id=%d virq=%u hw_irq=%d irq_offset=0x%llx", __entry->pasid, __entry->irq_id, __entry->virq, @@ -150,7 +150,7 @@ TRACE_EVENT(ocxl_afu_irq_free, __entry->irq_id = irq_id; ), - TP_printk("pasid=0x%x irq_id=%d", + TP_printk("pasid=%#x irq_id=%d", __entry->pasid, __entry->irq_id ) -- Andrew Donnellan OzLabs, ADL Canberra andrew.donnel...@au1.ibm.com IBM Australia Limited
linux-next: manual merge of the staging tree with the net-next tree
Hi all, Today's linux-next merge of the staging tree got a conflict in: drivers/staging/fsl-dpaa2/ethsw/ethsw.c between commit: 570b68c8ddde ("staging: fsl-dpaa2: ethsw: Handle SWITCHDEV_PORT_ATTR_SET") from the net-next tree and commit: 11f27765f611 ("staging: fsl-dpaa2: ethsw: Add missing netdevice check") from the staging tree. I fixed it up (the former is a superset of the latter) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell pgp7qjxvVA5lb.pgp Description: OpenPGP digital signature
[LKP] [rcu] 7e28c5af4e: will-it-scale.per_process_ops 84.7% improvement
Greeting, FYI, we noticed a 84.7% improvement of will-it-scale.per_process_ops due to commit: commit: 7e28c5af4ef6b539334aa5de40feca0c041c94df ("rcu: Eliminate ->rcu_qs_ctr from the rcu_dynticks structure") https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master in testcase: will-it-scale on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory with following parameters: nr_task: 100% mode: process test: brk1 cpufreq_governor: performance test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two. test-url: https://github.com/antonblanchard/will-it-scale Details are as below: --> To reproduce: git clone https://github.com/intel/lkp-tests.git cd lkp-tests bin/lkp install job.yaml # job file is attached in this email bin/lkp run job.yaml = compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase: gcc-7/performance/x86_64-rhel-7.2/process/100%/debian-x86_64-2018-04-03.cgz/lkp-knm01/brk1/will-it-scale commit: c5bacd9417 ("rcu: Motivate Tiny RCU forward progress") 7e28c5af4e ("rcu: Eliminate ->rcu_qs_ctr from the rcu_dynticks structure") c5bacd94173ec49d 7e28c5af4ef6b539334aa5de40 -- %stddev %change %stddev \ |\ 48605 +84.7% 89756 ± 4% will-it-scale.per_process_ops 13998509 +84.7% 25850040 ± 4% will-it-scale.workload 0.01 ± 63% -0.00.00 ±131% mpstat.cpu.soft% 11.86 ± 3% -1.2 10.65 ± 4% mpstat.cpu.usr% 662910 +15.8% 767829 ± 3% numa-numastat.node0.local_node 662889 +15.8% 767801 ± 3% numa-numastat.node0.numa_hit 2149 ± 4% -16.9% 1785 ± 2% vmstat.system.cs 727237 ± 3% -14.9% 618668vmstat.system.in 284156 ± 4% +32.5% 376625numa-meminfo.node0.Active 283908 ± 4% +32.6% 376374numa-meminfo.node0.Active(anon) 110629 ± 15% -77.7% 24625 ± 18% numa-meminfo.node0.Inactive 110393 ± 15% -77.9% 24400 ± 19% numa-meminfo.node0.Inactive(anon) 284255 ± 4% +32.5% 37meminfo.Active 284007 ± 4% +32.5% 376415meminfo.Active(anon) 110855 ± 15% -77.8% 24620 ± 18% meminfo.Inactive 110619 ± 15% -77.9% 24395 ± 19% meminfo.Inactive(anon) 8.00 ±173% +11303.1% 912.25 ± 87% meminfo.Mlocked 11072 -15.7% 9338 ± 3% meminfo.max_used_kB 71004 ± 4% +32.6% 94124numa-vmstat.node0.nr_active_anon 27812 ± 15% -78.1% 6100 ± 19% numa-vmstat.node0.nr_inactive_anon 1.00 ±173% +13100.0% 132.00 ± 87% numa-vmstat.node0.nr_mlock 71002 ± 4% +32.6% 94122 numa-vmstat.node0.nr_zone_active_anon 27811 ± 15% -78.1% 6100 ± 19% numa-vmstat.node0.nr_zone_inactive_anon 1.00 ±173% +9450.0% 95.50 ± 87% numa-vmstat.node1.nr_mlock 70939 ± 4% +32.8% 94186proc-vmstat.nr_active_anon 27523 ± 15% -77.8% 6101 ± 19% proc-vmstat.nr_inactive_anon 2.25 ±173% +10044.4% 228.25 ± 86% proc-vmstat.nr_mlock 70939 ± 4% +32.8% 94186proc-vmstat.nr_zone_active_anon 27523 ± 15% -77.8% 6101 ± 19% proc-vmstat.nr_zone_inactive_anon 680060 +15.8% 787707 ± 2% proc-vmstat.numa_hit 680060 +15.8% 787707 ± 2% proc-vmstat.numa_local 140509 ± 10% -45.3% 76820 ± 24% proc-vmstat.numa_pte_updates 895.50 ± 30%+480.0% 5193 ± 36% proc-vmstat.pgactivate 774283 +13.9% 882060 ± 3% proc-vmstat.pgalloc_normal 653457 +21.9% 796258 ± 3% proc-vmstat.pgfault 610954 ± 2% +32.4% 808662 ± 3% proc-vmstat.pgfree 5299 ± 6%+106.4% 10939 ± 36% softirqs.CPU10.RCU 106293 ± 5% +28.2% 136274 ± 15% softirqs.CPU126.TIMER 108234 ± 6% +26.3% 136696 ± 15% softirqs.CPU127.TIMER 109747 ± 7% +26.4% 138696 ± 11% softirqs.CPU135.TIMER 111732 ± 8% +24.4% 138967 ± 11% softirqs.CPU138.TIMER 107031 ± 8% +26.7% 135558 ± 14% softirqs.CPU148.TIMER 100841 +23.1% 124155 ± 15% softirqs.CPU15.TIMER 104640 ± 4% +27.8% 133719 ± 16% softirqs.CPU172.TIMER 103142 ± 3% +30.9% 135023 ± 13% softirqs.CPU190.TIMER 106698 ± 10% +27.5% 136023 ± 13% softirqs.CPU197.TIMER
Re: [PATCH] powernv: powercap: Add hard minimum powercap
Shilpasri G Bhat writes: > In POWER9, OCC(On-Chip-Controller) provides for hard and soft system > powercapping range. The hard powercap range is guaranteed while soft > powercap may or may not be asserted due to various power-thermal > reasons based on system configuration and workloads. This patch adds > a sysfs file to export the hard minimum powercap limit to allow the > user to set the appropriate powercap value that can be managed by the > system. Maybe it's common terminology and I'm just not aware of it, but what do you mean by "asserted"? It doesn't appear elsewhere in the documentation you're patching, and it's not a use of assert that I'm familiar with... Regards, Daniel > > Signed-off-by: Shilpasri G Bhat > --- > .../ABI/testing/sysfs-firmware-opal-powercap | 10 > arch/powerpc/platforms/powernv/opal-powercap.c | 66 > +- > 2 files changed, 37 insertions(+), 39 deletions(-) > > diff --git a/Documentation/ABI/testing/sysfs-firmware-opal-powercap > b/Documentation/ABI/testing/sysfs-firmware-opal-powercap > index c9b66ec..65db4c1 100644 > --- a/Documentation/ABI/testing/sysfs-firmware-opal-powercap > +++ b/Documentation/ABI/testing/sysfs-firmware-opal-powercap > @@ -29,3 +29,13 @@ Description: System powercap directory and > attributes applicable for > creates a request for setting a new-powercap. The > powercap requested must be between powercap-min > and powercap-max. > + > +What: > /sys/firmware/opal/powercap/system-powercap/powercap-hard-min > +Date:Feb 2019 > +Contact: Linux for PowerPC mailing list > +Description: Hard minimum powercap > + > + This file provides the hard minimum powercap limit in > + Watts. The powercap value above hard minimum is always > + guaranteed to be asserted and the powercap value below > + the hard minimum limit may or may not be guaranteed. > diff --git a/arch/powerpc/platforms/powernv/opal-powercap.c > b/arch/powerpc/platforms/powernv/opal-powercap.c > index d90ee4f..38408e7 100644 > --- a/arch/powerpc/platforms/powernv/opal-powercap.c > +++ b/arch/powerpc/platforms/powernv/opal-powercap.c > @@ -139,10 +139,24 @@ static void powercap_add_attr(int handle, const char > *name, > attr->handle = handle; > sysfs_attr_init(&attr->attr.attr); > attr->attr.attr.name = name; > - attr->attr.attr.mode = 0444; > + > + if (!strncmp(name, "powercap-current", strlen(name))) { > + attr->attr.attr.mode = 0664; > + attr->attr.store = powercap_store; > + } else { > + attr->attr.attr.mode = 0444; > + } > + > attr->attr.show = powercap_show; > } > > +static const char * const powercap_strs[] = { > + "powercap-max", > + "powercap-min", > + "powercap-current", > + "powercap-hard-min", > +}; > + > void __init opal_powercap_init(void) > { > struct device_node *powercap, *node; > @@ -167,60 +181,34 @@ void __init opal_powercap_init(void) > > i = 0; > for_each_child_of_node(powercap, node) { > - u32 cur, min, max; > - int j = 0; > - bool has_cur = false, has_min = false, has_max = false; > + u32 id; > + int j, count = 0; > > - if (!of_property_read_u32(node, "powercap-min", &min)) { > - j++; > - has_min = true; > - } > - > - if (!of_property_read_u32(node, "powercap-max", &max)) { > - j++; > - has_max = true; > - } > + for (j = 0; j < ARRAY_SIZE(powercap_strs); j++) > + if (!of_property_read_u32(node, powercap_strs[j], &id)) > + count++; > > - if (!of_property_read_u32(node, "powercap-current", &cur)) { > - j++; > - has_cur = true; > - } > - > - pcaps[i].pattrs = kcalloc(j, sizeof(struct powercap_attr), > + pcaps[i].pattrs = kcalloc(count, sizeof(struct powercap_attr), > GFP_KERNEL); > if (!pcaps[i].pattrs) > goto out_pcaps_pattrs; > > - pcaps[i].pg.attrs = kcalloc(j + 1, sizeof(struct attribute *), > + pcaps[i].pg.attrs = kcalloc(count + 1, > + sizeof(struct attribute *), > GFP_KERNEL); > if (!pcaps[i].pg.attrs) { > kfree(pcaps[i].pattrs); > goto out_pcaps_pattrs; > } > > - j = 0; > pcaps[i].pg.name = kasprintf(GFP_KERNEL, "%pOFn", node); > - if (has_min) { > - powercap_add_attr(min, "powercap-min", > - &pcaps[i]
Re: [PATCH] printk: Ratelimit messages printed by console drivers
On (02/26/19 19:24), Tetsuo Handa wrote: > Does memory allocation by network stack / driver while trying to emit > the messages include __GFP_DIRECT_RECLAIM flag (e.g. GFP_KERNEL) ? > Commit 400e22499dd92613821 handles only memory allocations with > __GFP_DIRECT_RECLAIM flag. If memory allocation when trying to emit > the messages does not include __GFP_DIRECT_RECLAIM flag (e.g. > GFP_ATOMIC / GFP_NOWAIT), doesn't this particular problem still exist? Console drivers are always called from atomic contexts (including IRQ->printk->console_driver paths); if any console driver does a GFP_KERNEL allocation then it should be fixed. -ss
Re: [PATCH net-next 4/6] ethernet: eth: add default vid len for all ehternet kind devices
On 2/26/2019 10:45 AM, Ivan Khoronzhuk wrote: > IVDF - individual virtual device filtering. Allows to set per vlan > l2 address filters on end real network device (for unicast and for > multicast) and drop redundant not expected packet income. > > If CONFIG_VLAN_8021Q_IVDF is enabled the following changes are > applied, and only for ethernet network devices. > > By default every ethernet netdev needs vid len = 2 bytes to be able to > hold up to 4096 vids. So set it for every eth device to be correct, > except vlan devs. > > In order to shrink all addresses of devices above vlan, the vid_len > for vlan dev = 0, as result all suckers sync their addresses to common > base not taking in to account vid part (vid_len of "to" devices is > important only). And only vlan device is the source of addresses with > actual its vid set, propagating it to parent devices while rx_mode(). > > Also, don't bother those ethernet devices that at this moment are not > moved to vlan addressing scheme, so while end ethernet device is > created - set vid_len to 0, thus, while syncing, its address space is > concatenated to one dimensional like usual, and who needs IVDF - set > it to NET_8021Q_VID_TSIZE. > > There is another decision - is to inherit vid_len or some feature flag > from end root device in order to all upper devices have vlan extended > address space only if exact end real device have such capability. But > I didn't, because it requires more changes and probably I'm not > familiar with all places where it should be inherited, I would > appreciate if someone can guid where it's applicable, then it could > become a little bit more limited. I would think that a call to vlan_dev_ivdf_set() would be enough to indicate that the underlying network device driver supports IVDF and wants to make use of it. The infrastructure itself that you added costs little memory, it is once the call to vlan_dev_ivdf_set() is made that the memory consumption increases which is fine, since we want to make use of that feature. While I appreciate the thoughts given to making this a configurable option, I feel that enabling it unconditionally and having the underlying driver decide would be more manageable. We have had that conversation before, but let me ask again when we call dev_{uc,mc}_sync() and ultimately the network device's ndo_set_rx_mode(), by the time the ndo_set_rx_mode() function is called, we lost track of the call chain, like which virtual device was it originating from. If we somehow added a notification information about the network device stack (and we could use netdevice notifiers for that), then maybe we don't really need to add all of this code and we can just derive the necessary bits of information we want by checking: is this a VLAN network device? It is, okay what's your VLAN ID, etc.? Either approach would get us our cookie anyway :) > > Signed-off-by: Ivan Khoronzhuk > --- [snip] > @@ -404,8 +405,13 @@ EXPORT_SYMBOL(ether_setup); > struct net_device *alloc_etherdev_mqs(int sizeof_priv, unsigned int txqs, > unsigned int rxqs) > { > - return alloc_netdev_mqs(sizeof_priv, "eth%d", NET_NAME_UNKNOWN, > - ether_setup, txqs, rxqs); > + struct net_device *dev; > + > + dev = alloc_netdev_mqs(sizeof_priv, "eth%d", NET_NAME_UNKNOWN, > +ether_setup, txqs, rxqs); You need to check the return value of alloc_netdev_mqs() now, otherwise you could be doing a NPD in the call right under. Since that is the default though, do you really need to call vlan_dev_ivdf_set() below? -- Florian
Re: [PATCH v2] posix-cpu-timers: Avoid undefined behaviour in timespec64_to_ns()
On Tue, Feb 26, 2019 at 11:52 PM Xiongfeng Wang wrote: > > When I ran Syzkaller testsuite, I got the following call trace. > > UBSAN: Undefined behaviour in ./include/linux/time64.h:120:27 > signed integer overflow: > 8243129037239968815 * 10 cannot be represented in type 'long long int' > CPU: 5 PID: 28854 Comm: syz-executor.1 Not tainted 4.19.24 #4 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0xca/0x13e lib/dump_stack.c:113 > ubsan_epilogue+0xe/0x81 lib/ubsan.c:159 > handle_overflow+0x193/0x1e2 lib/ubsan.c:190 > timespec64_to_ns include/linux/time64.h:120 [inline] > posix_cpu_timer_set+0x95a/0xb70 kernel/time/posix-cpu-timers.c:687 > do_timer_settime+0x198/0x2a0 kernel/time/posix-timers.c:892 > __do_sys_timer_settime kernel/time/posix-timers.c:918 [inline] > __se_sys_timer_settime kernel/time/posix-timers.c:904 [inline] > __x64_sys_timer_settime+0x18d/0x260 kernel/time/posix-timers.c:904 > do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x462eb9 > Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 > 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 > 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 > RSP: 002b:7f14e4127c58 EFLAGS: 0246 ORIG_RAX: 00df > RAX: ffda RBX: 0073bfa0 RCX: 00462eb9 > RDX: 2080 RSI: RDI: > RBP: 0004 R08: R09: > R10: R11: 0246 R12: 7f14e41286bc > R13: 004c54cc R14: 00704278 R15: > > > It is because 'it_interval.tv_sec' is larger than 'KTIME_SEC_MAX' and > 'it_interval.tv_sec * NSEC_PER_SEC' overflows in 'timespec64_to_ns()'. > > This patch use 'timespec64_valid_restrict()' to check whether > 'it_interval.tv_sec' is larger than 'KTIME_SEC_MAX'. > > Signed-off-by: Xiongfeng Wang > --- > kernel/time/posix-timers.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c > index 0e84bb7..97b773c 100644 > --- a/kernel/time/posix-timers.c > +++ b/kernel/time/posix-timers.c > @@ -853,8 +853,8 @@ static int do_timer_settime(timer_t timer_id, int flags, > unsigned long flag; > int error = 0; > > - if (!timespec64_valid(&new_spec64->it_interval) || > - !timespec64_valid(&new_spec64->it_value)) > + if (!timespec64_valid_strict(&new_spec64->it_interval) || > + !timespec64_valid_strict(&new_spec64->it_value)) > return -EINVAL; > > if (old_spec64) sys_timer_settime() is a POSIX interface: http://pubs.opengroup.org/onlinepubs/7908799/xsh/timer_settime.html The timer_settime() function will fail if: [EINVAL] The timerid argument does not correspond to an id returned by timer_create() but not yet deleted by timer_delete(). [EINVAL] A value structure specified a nanosecond value less than zero or greater than or equal to 1000 million. So we cannot return EINVAL here if we want to maintain POSIX compatibility. Maybe we should check for limit and saturate here at the syscall interface? -Deepa
Re: [PATCH net-next 1/6] net: core: dev_addr_lists: add VID to device address
On 2/26/2019 10:45 AM, Ivan Khoronzhuk wrote: > Despite this is supposed to be used for Ethernet VLANs, not Ethernet > addresses with space for VID also can reuse this, so VID is considered > as virtual ID extension, not belonging strictly to Ethernet VLAN VIDs, > and overall change can be named individual virtual device filtering > (IVDF). > > This patch adds VID tag at the end of each address. The actual > reserved address size is 32 bytes. For Ethernet addresses with 6 bytes > long that's possible to add tag w/o increasing address size. Thus, > each address for the case has 32 - 6 = 26 bytes to hold additional > info, say VID for virtual device addresses. > > Therefore, when addresses are synced to the address list of parent > device the address list of latter can contain separate addresses for > virtual devices. It allows to track separate address tables for > virtual devices if they present and the device can be placed on > any place of device tree as the address is propagated to to the end > real device thru *_sync()/ndo_set_rx_mode() APIs. Also it simplifies > handling VID addresses at real device when it supports IVDF. > > If parent device doesn't want to have virtual addresses in its address > space the vid_len has to be 0, thus its address space is "shrunk" to > the state as before this patch. For now it's 0 for every device. It > allows two devices with and w/o IVDF to be part of same bond device > for instance. > > The end real device supporting IVDF can retrieve VID tag from an > address and set it for a given virtual device only. By default, vid 0 > is used for real devices to distinguish it from virtual addresses. > > See next patches to see how it's used. > > Signed-off-by: Ivan Khoronzhuk > --- [snip] > @@ -1889,6 +1890,7 @@ struct net_device { > unsigned char perm_addr[MAX_ADDR_LEN]; > unsigned char addr_assign_type; > unsigned char addr_len; > + unsigned char vid_len; Have not compiled or tested this patch series yet, but did you check that adding this member does not change the structure layout (you can use pahole for that purpose). > unsigned short neigh_priv_len; > unsigned short dev_id; > unsigned short dev_port; > @@ -4141,8 +4143,10 @@ int dev_addr_init(struct net_device *dev); > > /* Functions used for unicast addresses handling */ > int dev_uc_add(struct net_device *dev, const unsigned char *addr); > +int dev_vid_uc_add(struct net_device *dev, const unsigned char *addr); > int dev_uc_add_excl(struct net_device *dev, const unsigned char *addr); > int dev_uc_del(struct net_device *dev, const unsigned char *addr); > +int dev_vid_uc_del(struct net_device *dev, const unsigned char *addr); > int dev_uc_sync(struct net_device *to, struct net_device *from); > int dev_uc_sync_multiple(struct net_device *to, struct net_device *from); > void dev_uc_unsync(struct net_device *to, struct net_device *from); > diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c > index a6723b306717..e3c80e044b8c 100644 > --- a/net/core/dev_addr_lists.c > +++ b/net/core/dev_addr_lists.c > @@ -545,6 +545,26 @@ int dev_addr_del(struct net_device *dev, const unsigned > char *addr, > } > EXPORT_SYMBOL(dev_addr_del); > > +static int get_addr_len(struct net_device *dev) > +{ > + return dev->addr_len + dev->vid_len; > +} > + > +static int set_vid_addr(struct net_device *dev, const unsigned char *addr, > + unsigned char *naddr) Having some kernel doc comments here would be nice to indicate that the return value is dev->addr_len, it was not obvious until I saw in the next function how you used it. > +{ > + int i; > + > + if (!dev->vid_len) > + return dev->addr_len; > + > + memcpy(naddr, addr, dev->addr_len); > + for (i = 0; i < dev->vid_len; i++) > + naddr[dev->addr_len + i] = 0; memset(naddr + dev->addr_len, 0, dev->vid_len) would be more compact and maybe a little less error prone too? -- Florian
Re: [RFC v4 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework
On Fri, Feb 22, 2019 at 12:53 PM Thiago Jung Bauermann wrote: > > > Frank Rowand writes: > > > On 2/19/19 10:34 PM, Brendan Higgins wrote: > >> On Mon, Feb 18, 2019 at 12:02 PM Frank Rowand > >> wrote: > >> > >>> I have not read through the patches in any detail. I have read some of > >>> the code to try to understand the patches to the devicetree unit tests. > >>> So that may limit how valid my comments below are. > >> > >> No problem. > >> > >>> > >>> I found the code difficult to read in places where it should have been > >>> much simpler to read. Structuring the code in a pseudo object oriented > >>> style meant that everywhere in a code path that I encountered a dynamic > >>> function call, I had to go find where that dynamic function call was > >>> initialized (and being the cautious person that I am, verify that > >>> no where else was the value of that dynamic function call). With > >>> primitive vi and tags, that search would have instead just been a > >>> simple key press (or at worst a few keys) if hard coded function > >>> calls were done instead of dynamic function calls. In the code paths > >>> that I looked at, I did not see any case of a dynamic function being > >>> anything other than the value it was originally initialized as. > >>> There may be such cases, I did not read the entire patch set. There > >>> may also be cases envisioned in the architects mind of how this > >>> flexibility may be of future value. Dunno. > >> > >> Yeah, a lot of it is intended to make architecture specific > >> implementations and some other future work easier. Some of it is also > >> for testing purposes. Admittedly some is for neither reason, but given > >> the heavy usage elsewhere, I figured there was no harm since it was > >> all private internal usage anyway. > >> > > > > Increasing the cost for me (and all the other potential code readers) > > to read the code is harm. > > Dynamic function calls aren't necessary for arch-specific > implementations either. See for example arch_kexec_image_load() in > kernel/kexec_file.c, which uses a weak symbol that is overriden by > arch-specific code. Not everybody likes weak symbols, so another > alternative (which admitedly not everybody likes either) is to use a > macro with the name of the arch-specific function, as used by > arch_kexec_post_alloc_pages() in for instance. I personally have a strong preference for dynamic function calls over weak symbols or macros, but I can change it if it really makes anyone's eyes bleed.
Re: [PATCH net-next 5/6] net: ethernet: ti: cpsw: update mc filtering to use IVDF
On 2/26/2019 10:45 AM, Ivan Khoronzhuk wrote: > The cpsw can filter multicast addresses only per vlan. Thus if mcast > address is set for one of them or only for real device it must be > added for every created vlan consuming ALE table w/o reason. In order to > simplify dispatching vlan filters, the IVDF recently added is resused. > > In case IVDF is disabled - mc is updated only for real device as before. > The previous method is harder to reuse and vlan filtering is limited > only for vlans directly connected to real netdev, so drop it in flavor > of IVDF decision. > > Signed-off-by: Ivan Khoronzhuk > --- > drivers/net/ethernet/ti/Kconfig | 1 + > drivers/net/ethernet/ti/cpsw.c | 113 > 2 files changed, 13 insertions(+), 101 deletions(-) Nice diffstat! -- Florian
Re: [RFC v4 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework
On Tue, Feb 19, 2019 at 10:46 PM Frank Rowand wrote: > > On 2/19/19 10:34 PM, Brendan Higgins wrote: > > On Mon, Feb 18, 2019 at 12:02 PM Frank Rowand > > wrote: > > > >> I have not read through the patches in any detail. I have read some of > >> the code to try to understand the patches to the devicetree unit tests. > >> So that may limit how valid my comments below are. > > > > No problem. > > > >> > >> I found the code difficult to read in places where it should have been > >> much simpler to read. Structuring the code in a pseudo object oriented > >> style meant that everywhere in a code path that I encountered a dynamic > >> function call, I had to go find where that dynamic function call was > >> initialized (and being the cautious person that I am, verify that > >> no where else was the value of that dynamic function call). With > >> primitive vi and tags, that search would have instead just been a > >> simple key press (or at worst a few keys) if hard coded function > >> calls were done instead of dynamic function calls. In the code paths > >> that I looked at, I did not see any case of a dynamic function being > >> anything other than the value it was originally initialized as. > >> There may be such cases, I did not read the entire patch set. There > >> may also be cases envisioned in the architects mind of how this > >> flexibility may be of future value. Dunno. > > > > Yeah, a lot of it is intended to make architecture specific > > implementations and some other future work easier. Some of it is also > > for testing purposes. Admittedly some is for neither reason, but given > > the heavy usage elsewhere, I figured there was no harm since it was > > all private internal usage anyway. > > > > Increasing the cost for me (and all the other potential code readers) > to read the code is harm. You are right. I like the object oriented C style; I didn't think it hurt readability. In any case, I will go through and replace instances where I am not using it for one of the above reasons.
Re: [PATCH net-next 3/6] net: 8021q: vlan_dev: add vid tag for vlan device own address
On 2/26/2019 10:45 AM, Ivan Khoronzhuk wrote: > The vlan device address is held separately from uc/mc lists and > handled differently. The vlan dev address is bound with real device > address only if it's inherited from init, in all other cases it's > separate address entry in uc list. With vid set, the address becomes > not inherited from real device after it's set manually as before, but > is part of uc list any way, with appropriate vid tag set. If vid_len > for real device is 0, the behaviour is the same as before this change, > so shouldn't be any impact on systems w/o individual virtual device > filtering (IVDF) enabled. This allows to control and sync vlan device > address and disable concrete vlan packet income when vlan interface is > down. > > Signed-off-by: Ivan Khoronzhuk > --- [snip] > > +static int vlan_dev_add_addr(struct net_device *dev, u8 *addr) > +{ > + struct net_device *real_dev = vlan_dev_real_dev(dev); > + unsigned char naddr[ETH_ALEN + NET_8021Q_VID_TSIZE]; > + > + if (real_dev->vid_len) { Don't you need to check that real_dev->vid_len is >= NET_8021Q_VID_TSIZE here? > + memcpy(naddr, addr, dev->addr_len); > + vlan_dev_set_addr_vid(dev, naddr); > + return dev_vid_uc_add(real_dev, naddr); > + } > + > + if (ether_addr_equal(addr, real_dev->dev_addr)) > + return 0; > + > + return dev_uc_add(real_dev, addr); > +} > + > +static void vlan_dev_del_addr(struct net_device *dev, u8 *addr) > +{ > + struct net_device *real_dev = vlan_dev_real_dev(dev); > + unsigned char naddr[ETH_ALEN + NET_8021Q_VID_TSIZE]; > + > + if (real_dev->vid_len) { Same here. > + memcpy(naddr, addr, dev->addr_len); > + vlan_dev_set_addr_vid(dev, naddr); > + dev_vid_uc_del(real_dev, naddr); > + return; > + } > + > + if (!ether_addr_equal(dev->dev_addr, real_dev->dev_addr)) > + dev_uc_del(real_dev, addr); > +} > + > +static int vlan_dev_subs_addr(struct net_device *dev, u8 *addr) > +{ > + int err; > + > + err = vlan_dev_add_addr(dev, addr); > + if (err < 0) > + return err; > + > + vlan_dev_del_addr(dev, dev->dev_addr); > + return err; > +} > + > bool vlan_dev_inherit_address(struct net_device *dev, > struct net_device *real_dev) > { > if (dev->addr_assign_type != NET_ADDR_STOLEN) > return false; > > + if (real_dev->vid_len) > + if (vlan_dev_subs_addr(dev, real_dev->dev_addr)) > + return false; The check on real_dev->vid_len can be absorbed into vlan_dev_subs_addr()? > + > ether_addr_copy(dev->dev_addr, real_dev->dev_addr); > call_netdevice_notifiers(NETDEV_CHANGEADDR, dev); > return true; > @@ -278,9 +327,10 @@ static int vlan_dev_open(struct net_device *dev) > !(vlan->flags & VLAN_FLAG_LOOSE_BINDING)) > return -ENETDOWN; > > - if (!ether_addr_equal(dev->dev_addr, real_dev->dev_addr) && > - !vlan_dev_inherit_address(dev, real_dev)) { > - err = dev_uc_add(real_dev, dev->dev_addr); > + if (ether_addr_equal(dev->dev_addr, real_dev->dev_addr) || > + (!ether_addr_equal(dev->dev_addr, real_dev->dev_addr) && > + !vlan_dev_inherit_address(dev, real_dev))) { Should this condition simply become if !vlan_dev_inherit_address() now? -- Florian
Re: [PATCH] xfrm: policy: Fix possible user after free in __xfrm_policy_unlink
On Thu, Feb 28, 2019 at 11:16:23AM +0800, Yue Haibing wrote: > > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c > index 8d1a898..b27eb742 100644 > --- a/net/xfrm/xfrm_policy.c > +++ b/net/xfrm/xfrm_policy.c > @@ -316,6 +316,8 @@ static void xfrm_policy_timer(struct timer_list *t) > goto out; > > dir = xfrm_policy_id2dir(xp->index); > + if (dir >= XFRM_POLICY_MAX * 2) > + dir = dir & XFRM_POLICY_MAX; This is still wrong. We shouldn't be allowing bogus policies to be in the system at all. I have digged deeper and the problem was introduced by: commit e682adf021be796940be6cc10c07be7f7398c220 Author: Fan Du Date: Thu Nov 7 17:47:48 2013 +0800 xfrm: Try to honor policy index if it's supplied by user Where the check for the user-supplied index is simply wrong in verify_newpolicy_info. So please fix it there. Thanks! -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH V3] cpufreq: kryo: Release OPP tables on module removal
Commit 5ad7346b4ae2 ("cpufreq: kryo: Add module remove and exit") made it possible to build the kryo cpufreq driver as a module, but it failed to release all the resources, i.e. OPP tables, when the module is unloaded. This patch fixes it by releasing the OPP tables, by calling dev_pm_opp_put_supported_hw() for them, from the qcom_cpufreq_kryo_remove() routine. The array of pointers to the OPP tables is also allocated dynamically now in qcom_cpufreq_kryo_probe(), as the pointers will be required while releasing the resources. Compile tested only. Cc: 4.18+ # v4.18+ Fixes: 5ad7346b4ae2 ("cpufreq: kryo: Add module remove and exit") Reviewed-by: Georgi Djakov Signed-off-by: Viresh Kumar --- V2->V3: - s/kyro/kryo/ - s/unsigned/unsigned int/ - Added Reviewed-by tag drivers/cpufreq/qcom-cpufreq-kryo.c | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/drivers/cpufreq/qcom-cpufreq-kryo.c b/drivers/cpufreq/qcom-cpufreq-kryo.c index 2a3675c24032..a472b814058f 100644 --- a/drivers/cpufreq/qcom-cpufreq-kryo.c +++ b/drivers/cpufreq/qcom-cpufreq-kryo.c @@ -75,7 +75,7 @@ static enum _msm8996_version qcom_cpufreq_kryo_get_msm_id(void) static int qcom_cpufreq_kryo_probe(struct platform_device *pdev) { - struct opp_table *opp_tables[NR_CPUS] = {0}; + struct opp_table **opp_tables; enum _msm8996_version msm8996_version; struct nvmem_cell *speedbin_nvmem; struct device_node *np; @@ -133,6 +133,10 @@ static int qcom_cpufreq_kryo_probe(struct platform_device *pdev) } kfree(speedbin); + opp_tables = kcalloc(num_possible_cpus(), sizeof(*opp_tables), GFP_KERNEL); + if (!opp_tables) + return -ENOMEM; + for_each_possible_cpu(cpu) { cpu_dev = get_cpu_device(cpu); if (NULL == cpu_dev) { @@ -151,8 +155,10 @@ static int qcom_cpufreq_kryo_probe(struct platform_device *pdev) cpufreq_dt_pdev = platform_device_register_simple("cpufreq-dt", -1, NULL, 0); - if (!IS_ERR(cpufreq_dt_pdev)) + if (!IS_ERR(cpufreq_dt_pdev)) { + platform_set_drvdata(pdev, opp_tables); return 0; + } ret = PTR_ERR(cpufreq_dt_pdev); dev_err(cpu_dev, "Failed to register platform device\n"); @@ -163,13 +169,23 @@ static int qcom_cpufreq_kryo_probe(struct platform_device *pdev) break; dev_pm_opp_put_supported_hw(opp_tables[cpu]); } + kfree(opp_tables); return ret; } static int qcom_cpufreq_kryo_remove(struct platform_device *pdev) { + struct opp_table **opp_tables = platform_get_drvdata(pdev); + unsigned int cpu; + platform_device_unregister(cpufreq_dt_pdev); + + for_each_possible_cpu(cpu) + dev_pm_opp_put_supported_hw(opp_tables[cpu]); + + kfree(opp_tables); + return 0; } -- 2.21.0.rc0.269.g1a574e7a288b
Re: [PATCH net-next 2/6] net: 8021q: vlan_dev: add vid tag to addresses of uc and mc lists
On 2/26/2019 10:45 AM, Ivan Khoronzhuk wrote: > Update vlan mc and uc addresses with VID tag while propagating > addresses to lower devices, do this only if address is not synced. > It allows at end driver level to distinguish addresses belonging > to vlan devices. > > Signed-off-by: Ivan Khoronzhuk > --- [snip] > > +u16 vlan_dev_get_addr_vid(struct net_device *dev, const u8 *addr) Having some kernel doc comment here would also be nice. > +{ > + u16 vid = 0; > + > + if (dev->vid_len != NET_8021Q_VID_TSIZE) > + return vid; > + > + vid = addr[dev->addr_len]; > + vid |= (addr[dev->addr_len + 1] & 0xf) << 8; This uses knowledge of the maximum VLAN ID is 4095, which is fine, might be a good idea to add a check on VID not exceeding the maximum VLAN ID number instead of doing a silent truncation? [snip] > +static void vlan_dev_align_addr_vid(struct net_device *vlan_dev) > +{ > + struct net_device *real_dev = vlan_dev_real_dev(vlan_dev); > + struct netdev_hw_addr *ha; > + > + if (!real_dev->vid_len) > + return; Should not this check be moved to dev_{mc,uc}_sync? It does not seem to me like this would scale really well across different stacked devices (VLAN, bond, macvlan) as well as underlying drivers (cpsw, dsa, etc.). Or maybe the check should be if vlan_dev->vid_len > real_dev->vid_len -> error, right? -- Florian
Re: [PATCH v2 6/8] modpost: Add modinfo flag to livepatch modules
On Wed, Feb 20, 2019 at 04:50:08PM +0100, Miroslav Benes wrote: > Adding CCs... Where was this posted? I don't see it on lkml. It would be good to be able to see the whole series for review. > > On Wed, 30 Jan 2019, Joao Moreira wrote: > > > From: Miroslav Benes > > > > Currently, livepatch infrastructure in the kernel relies on > > MODULE_INFO(livepatch, "Y") statement in a livepatch module. Then the > > kernel module loader knows a module is indeed livepatch module and can > > behave accordingly. > > > > klp-convert, on the other hand relies on LIVEPATCH_* statement in the > > module's Makefile for exactly the same reason. > > > > Remove dependency on modinfo and generate MODULE_INFO flag > > automatically in modpost when LIVEPATCH_* is defined in the module's > > Makefile. Generate a list of all built livepatch modules based on > > the .livepatch file and store it in (MODVERDIR)/livepatchmods. Give > > this list as an argument for modpost which will use it to identify > > livepatch modules. > > > > As MODULE_INFO is no longer needed, remove it. > > > > [jmoreira: > > * fix modpost.c (add_livepatch_flag) to update module structure with > > livepatch flag and prevent modpost from breaking due to unresolved > > symbols] > > > > Signed-off-by: Miroslav Benes > > Signed-off-by: Joao Moreira > > --- > > samples/livepatch/livepatch-sample.c | 1 - > > scripts/Makefile.modpost | 8 +++- > > scripts/mod/modpost.c| 84 > > +--- > > 3 files changed, 85 insertions(+), 8 deletions(-) > > > > diff --git a/samples/livepatch/livepatch-sample.c > > b/samples/livepatch/livepatch-sample.c > > index 2d554dd930e2..20a5891ebf7b 100644 > > --- a/samples/livepatch/livepatch-sample.c > > +++ b/samples/livepatch/livepatch-sample.c > > @@ -90,4 +90,3 @@ static void livepatch_exit(void) > > module_init(livepatch_init); > > module_exit(livepatch_exit); > > MODULE_LICENSE("GPL"); > > -MODULE_INFO(livepatch, "Y"); > > Again, we shoud do the same for the other samples we have in > samples/livepatch/ directory. > > > diff --git a/scripts/Makefile.modpost b/scripts/Makefile.modpost > > index da779a185218..0149a72ea7ae 100644 > > --- a/scripts/Makefile.modpost > > +++ b/scripts/Makefile.modpost > > @@ -65,6 +65,11 @@ MODLISTCMD := find $(MODVERDIR) -name '*.mod' | xargs -r > > grep -h '\.ko$$' | sort > > __modules := $(shell $(MODLISTCMD)) > > modules := $(patsubst %.o,%.ko, $(wildcard $(__modules:.ko=.o))) > > > > +# find all .livepatch files listed in $(MODVERDIR)/ > > +ifdef CONFIG_LIVEPATCH > > +$(shell find $(MODVERDIR) -name '*.livepatch' | xargs -r -I{} basename {} > > .livepatch > $(MODVERDIR)/livepatchmods) > > +endif > > + > > # Stop after building .o files if NOFINAL is set. Makes compile tests > > quicker > > _modpost: $(if $(KBUILD_MODPOST_NOFINAL), $(modules:.ko:.o),$(modules)) > > > > @@ -79,7 +84,8 @@ modpost = scripts/mod/modpost\ > > $(if $(KBUILD_EXTMOD),-o $(modulesymfile)) \ > > $(if $(CONFIG_DEBUG_SECTION_MISMATCH),,-S) \ > > $(if $(CONFIG_SECTION_MISMATCH_WARN_ONLY),,-E) \ > > - $(if $(KBUILD_EXTMOD)$(KBUILD_MODPOST_WARN),-w) > > + $(if $(KBUILD_EXTMOD)$(KBUILD_MODPOST_WARN),-w) \ > > + $(if $(CONFIG_LIVEPATCH),-l $(MODVERDIR)/livepatchmods) > > > > MODPOST_OPT=$(subst -i,-n,$(filter -i,$(MAKEFLAGS))) > > > > diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c > > index 1dfc34d8b668..2fd1d409ca5a 100644 > > --- a/scripts/mod/modpost.c > > +++ b/scripts/mod/modpost.c > > @@ -1979,10 +1979,6 @@ static void read_symbols(const char *modname) > > license = get_next_modinfo(&info, "license", license); > > } > > > > - /* Livepatch modules have unresolved symbols resolved by klp-convert */ > > - if (get_modinfo(info.modinfo, info.modinfo_len, "livepatch")) > > - mod->livepatch = 1; > > - > > for (sym = info.symtab_start; sym < info.symtab_stop; sym++) { > > symname = remove_dot(info.strtab + sym->st_name); > > > > @@ -2421,6 +2417,76 @@ static void write_dump(const char *fname) > > free(buf.p); > > } > > > > +struct livepatch_mod_list { > > + struct livepatch_mod_list *next; > > + char *livepatch_mod; > > +}; > > + > > +static struct livepatch_mod_list *load_livepatch_mods(const char *fname) > > +{ > > + struct livepatch_mod_list *list_iter, *list_start = NULL; > > + unsigned long size, pos = 0; > > + void *file = grab_file(fname, &size); > > + char *line; > > + > > + if (!file) > > + return NULL; > > + > > + while ((line = get_next_line(&pos, file, size))) { > > + list_iter = NOFAIL(malloc(sizeof(*list_iter))); > > + list_iter->next = list_start; > > + list_iter->livepatch_mod = NOFAIL(strdup(line)); > > + list_start = list_iter; > > + } > > + release_file(file, size); > > + > > + return list_start; > > +} > > + > > +static void free_livepatch_mods(
Re: [RFC v3 18/19] of: unittest: split out a couple of test cases from unittest
On Wed, Feb 20, 2019 at 12:45 PM Frank Rowand wrote: > > On 2/18/19 2:25 PM, Frank Rowand wrote: > > On 2/15/19 2:56 AM, Brendan Higgins wrote: > >> On Thu, Feb 14, 2019 at 6:05 PM Frank Rowand > >> wrote: > >>> > >>> On 2/14/19 4:56 PM, Brendan Higgins wrote: > On Thu, Feb 14, 2019 at 3:57 PM Frank Rowand > wrote: > > > > On 12/5/18 3:54 PM, Brendan Higgins wrote: > >> On Tue, Dec 4, 2018 at 2:58 AM Frank Rowand > >> wrote: > >>> > > < snip > > > > > > It makes it harder for me to read the source of the tests and > > understand the order they will execute. It also makes it harder > > for me to read through the actual tests (in this example the > > tests that are currently grouped in of_unittest_find_node_by_name()) > > because of all the extra function headers injected into the > > existing single function to break it apart into many smaller > > functions. > > < snip > > > This is not something I feel particularly strongly about, it is just > pretty atypical from my experience to have so many unrelated test > cases in a single file. > > Maybe you would prefer that I break up the test cases first, and then > we split up the file as appropriate? > >>> > >>> I prefer that the test cases not be broken up arbitrarily. There _may_ > > I expect that I created confusion by putting this in a reply to patch 18/19. > It is actually a comment about patch 19/19. Sorry about that. > No worries. > > >> > >> I wasn't trying to break them up arbitrarily. I thought I was doing it > >> according to a pattern (breaking up the file, that is), but maybe I > >> just hadn't looked at enough examples. > > > > This goes back to the kunit model of putting each test into a separate > > function that can be a KUNIT_CASE(). That is a model that I do not agree > > with for devicetree. > > So now that I am actually talking about patch 19/19, let me give a concrete > example. I will cut and paste (after my comments), the beginning portion > of base-test.c, after applying patch 19/19 (the "base version". Then I > will cut and paste my alternative version which does not break the tests > down into individual functions (the "frank version"). Awesome, thanks for putting the comparison together! > > I will also reply to this email with the base version and the frank version > as attachments, which will make it easier to save as separate versions > for easier viewing. I'm not sure if an email with attachments will make > it through the list servers, but I am cautiously optimistic. > > I am using v4 of the patch series because I never got v3 to cleanly apply > and it is not a constructive use of my time to do so since I have v4 applied. > > One of the points I was trying to make is that readability suffers from the > approach taken by patches 18/19 and 19/19. I understood that point. > > The base version contains the extra text of a function header for each > unit test. This is visual noise and makes the file larger. It is also > one more possible location of an error (although not likely). I don't see how it is much more visual noise than a comment. Admittedly, a space versus an underscore might be nice, but I think it is also more likely that a function name is more likely to be kept up to date than a comment even if they are both informational. It also forces the user to actually name all the tests. Even then, I am not married to doing it this exact way. The thing I really care about is isolating the code in each test case so that it can be executed separately. A side thought, when I was proofreading this, it occurred to me that you might not like the function name over comment partly because you think about them differently. You aren't used to seeing a function used to frame things or communicate information in this way. Is this true? Admittedly, I have gotten used to a lot of unit test frameworks that break up test cases by function, so I wondering if part of the difference in comfortability with this framing might come from the fact that I am really used to seeing it this way and you are not? If this is the case, maybe it would be better if we had something like: KUNIT_DECLARE_CASE(case_id, "Test case description") { KUNIT_EXPECT_EQ(kunit, ...); ... } Just a thought. > > The frank version has converted each of the new function headers into > a comment, using the function name with '_' converted to ' '. The > comments are more readable than the function headers. Note that I added > an extra blank line before each comment, which violates the kernel > coding standards, but I feel this makes the code more readable. I agree that the extra space is an improvement, but I think any sufficient visual separation would work. > > The base version needs to declare each of the individual test functions > in of_test_find_node_by_name_cases[]. It is possible that a test function > could be left out of of_test_find_node_by_name_cases[], in error. Th
Re: [PATCH v3] f2fs: rebuild nat_bits during umount
Ping, Could you please try this new version? On 2019/2/17 6:26, Chao Yu wrote: > From: Chao Yu > > If all free_nat_bitmap are available, we can rebuild nat_bits from > free_nat_bitmap entirely during umount, let's make another chance > to reenable nat_bits for image. > > Signed-off-by: Chao Yu > --- > v3: > - fix to remove nat journal during umount. > fs/f2fs/checkpoint.c | 23 + > fs/f2fs/f2fs.h | 29 ++-- > fs/f2fs/node.c | 82 +--- > 3 files changed, 89 insertions(+), 45 deletions(-) > > diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c > index 10a3ada28715..b4c9de4e6eb0 100644 > --- a/fs/f2fs/checkpoint.c > +++ b/fs/f2fs/checkpoint.c > @@ -1225,12 +1225,22 @@ static void update_ckpt_flags(struct f2fs_sb_info > *sbi, struct cp_control *cpc) > struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); > unsigned long flags; > > - spin_lock_irqsave(&sbi->cp_lock, flags); > + if (cpc->reason & CP_UMOUNT) { > + if (le32_to_cpu(ckpt->cp_pack_total_block_count) > > + sbi->blocks_per_seg - NM_I(sbi)->nat_bits_blocks) { > + clear_ckpt_flags(sbi, CP_NAT_BITS_FLAG); > + f2fs_msg(sbi->sb, KERN_NOTICE, > + "Disable nat_bits due to no space"); > + } else if (!is_set_ckpt_flags(sbi, CP_NAT_BITS_FLAG) && > + f2fs_nat_bitmap_enabled(sbi)) { > + f2fs_enable_nat_bits(sbi); > + set_ckpt_flags(sbi, CP_NAT_BITS_FLAG); > + f2fs_msg(sbi->sb, KERN_NOTICE, > + "Rebuild and enable nat_bits"); > + } > + } > > - if ((cpc->reason & CP_UMOUNT) && > - le32_to_cpu(ckpt->cp_pack_total_block_count) > > - sbi->blocks_per_seg - NM_I(sbi)->nat_bits_blocks) > - disable_nat_bits(sbi, false); > + spin_lock_irqsave(&sbi->cp_lock, flags); > > if (cpc->reason & CP_TRIMMED) > __set_ckpt_flags(ckpt, CP_TRIMMED_FLAG); > @@ -1395,7 +1405,8 @@ static int do_checkpoint(struct f2fs_sb_info *sbi, > struct cp_control *cpc) > start_blk = __start_cp_next_addr(sbi); > > /* write nat bits */ > - if (enabled_nat_bits(sbi, cpc)) { > + if ((cpc->reason & CP_UMOUNT) && > + is_set_ckpt_flags(sbi, CP_NAT_BITS_FLAG)) { > __u64 cp_ver = cur_cp_version(ckpt); > block_t blk; > > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > index c0d30810ef8a..a543354fc6bf 100644 > --- a/fs/f2fs/f2fs.h > +++ b/fs/f2fs/f2fs.h > @@ -1615,33 +1615,6 @@ static inline void clear_ckpt_flags(struct > f2fs_sb_info *sbi, unsigned int f) > spin_unlock_irqrestore(&sbi->cp_lock, flags); > } > > -static inline void disable_nat_bits(struct f2fs_sb_info *sbi, bool lock) > -{ > - unsigned long flags; > - > - /* > - * In order to re-enable nat_bits we need to call fsck.f2fs by > - * set_sbi_flag(sbi, SBI_NEED_FSCK). But it may give huge cost, > - * so let's rely on regular fsck or unclean shutdown. > - */ > - > - if (lock) > - spin_lock_irqsave(&sbi->cp_lock, flags); > - __clear_ckpt_flags(F2FS_CKPT(sbi), CP_NAT_BITS_FLAG); > - kvfree(NM_I(sbi)->nat_bits); > - NM_I(sbi)->nat_bits = NULL; > - if (lock) > - spin_unlock_irqrestore(&sbi->cp_lock, flags); > -} > - > -static inline bool enabled_nat_bits(struct f2fs_sb_info *sbi, > - struct cp_control *cpc) > -{ > - bool set = is_set_ckpt_flags(sbi, CP_NAT_BITS_FLAG); > - > - return (cpc) ? (cpc->reason & CP_UMOUNT) && set : set; > -} > - > static inline void f2fs_lock_op(struct f2fs_sb_info *sbi) > { > down_read(&sbi->cp_rwsem); > @@ -2970,6 +2943,7 @@ int f2fs_truncate_inode_blocks(struct inode *inode, > pgoff_t from); > int f2fs_truncate_xattr_node(struct inode *inode); > int f2fs_wait_on_node_pages_writeback(struct f2fs_sb_info *sbi, > unsigned int seq_id); > +bool f2fs_nat_bitmap_enabled(struct f2fs_sb_info *sbi); > int f2fs_remove_inode_page(struct inode *inode); > struct page *f2fs_new_inode_page(struct inode *inode); > struct page *f2fs_new_node_page(struct dnode_of_data *dn, unsigned int ofs); > @@ -2993,6 +2967,7 @@ int f2fs_recover_xattr_data(struct inode *inode, struct > page *page); > int f2fs_recover_inode_page(struct f2fs_sb_info *sbi, struct page *page); > int f2fs_restore_node_summary(struct f2fs_sb_info *sbi, > unsigned int segno, struct f2fs_summary_block *sum); > +void f2fs_enable_nat_bits(struct f2fs_sb_info *sbi); > int f2fs_flush_nat_entries(struct f2fs_sb_info *sbi, struct cp_control *cpc); > int f2fs_build_node_manager(struct f2fs_sb_info *sbi); > void f2fs_destroy_node_manager(struct f2fs_sb_info *sbi); > diff --git a/fs
Re: [PATCH v4] f2fs: add bio cache for IPU
Ping, On 2019/2/19 16:15, Chao Yu wrote: > SQLite in Wal mode may trigger sequential IPU write in db-wal file, after > commit d1b3e72d5490 ("f2fs: submit bio of in-place-update pages"), we > lost the chance of merging page in inner managed bio cache, result in > submitting more small-sized IO. > > So let's add temporary bio in writepages() to cache mergeable write IO as > much as possible. > > Test case: > 1. xfs_io -f /mnt/f2fs/file -c "pwrite 0 65536" -c "fsync" > 2. xfs_io -f /mnt/f2fs/file -c "pwrite 0 65536" -c "fsync" > > Before: > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65544, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65552, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65560, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65568, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65576, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65584, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65592, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65600, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65608, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65616, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65624, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65632, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65640, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65648, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65656, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65664, size = 4096 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), NODE, sector = > 57352, size = 4096 > > After: > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), DATA, sector = > 65544, size = 65536 > f2fs_submit_write_bio: dev = (251,0)/(251,0), rw = WRITE(S), NODE, sector = > 57368, size = 4096 > > Signed-off-by: Chao Yu > --- > v4: > - fix to set *bio to NULL in f2fs_submit_ipu_bio() > - spread f2fs_submit_ipu_bio() > - fix to count dirty encrypted page correctly in f2fs_merge_page_bio() > - remove redundant assignment of fio->last_block > fs/f2fs/data.c| 88 ++- > fs/f2fs/f2fs.h| 3 ++ > fs/f2fs/segment.c | 5 ++- > 3 files changed, 86 insertions(+), 10 deletions(-) > > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c > index 35910ff23582..e4c183e85de8 100644 > --- a/fs/f2fs/data.c > +++ b/fs/f2fs/data.c > @@ -296,20 +296,20 @@ static void __submit_merged_bio(struct f2fs_bio_info > *io) > io->bio = NULL; > } > > -static bool __has_merged_page(struct f2fs_bio_info *io, struct inode *inode, > +static bool __has_merged_page(struct bio *bio, struct inode *inode, > struct page *page, nid_t ino) > { > struct bio_vec *bvec; > struct page *target; > int i; > > - if (!io->bio) > + if (!bio) > return false; > > if (!inode && !page && !ino) > return true; > > - bio_for_each_segment_all(bvec, io->bio, i) { > + bio_for_each_segment_all(bvec, bio, i) { > > if (bvec->bv_page->mapping) > target = bvec->bv_page; > @@ -360,7 +360,7 @@ static void __submit_merged_write_cond(struct > f2fs_sb_info *sbi, > struct f2fs_bio_info *io = sbi->write_io[btype] + temp; > > down_read(&io->io_rwsem); > - ret = __has_merged_page(io, inode, page, ino); > + ret = __has_merged_page(io->bio, inode, page, ino); > up_read(&io->io_rwsem); > } > if (ret) > @@ -429,6 +429,61 @@ int f2fs_submit_page_bio(struct f2fs_io_info *fio) > return 0; > } > > +int f2fs_merge_page_bio(struct f2fs_io_info *fio) > +{ > + struct bio *bio = *fio->bio; > + struct page *page = fio->encrypted_page ? > + fio->encrypted_page : fio->page; > + > + if (!f2fs_is_valid_blkaddr(fio->sbi, fio->new_blkaddr, > + __is_meta_io(fio) ? META_GENERIC : DATA_GENERIC)) > + return -EFAULT; > + > + trace_f2fs_submit_page_bio(page, fio); > + f2fs_trace_ios(fio, 0); > + > + if (bio && (*fio->last_block + 1 != fio->new_blkaddr || > + !__same_bdev(fio->sbi, fio->new_blkaddr, bio))) { > + __submit_bio(
RE: [PATCH] can: flexcan: add TX support for variable payload size
> -Original Message- > From: Marc Kleine-Budde > Sent: 2019年2月27日 22:03 > To: Joakim Zhang ; linux-...@vger.kernel.org > Cc: w...@grandegger.com; net...@vger.kernel.org; > linux-kernel@vger.kernel.org; dl-linux-imx > Subject: Re: [PATCH] can: flexcan: add TX support for variable payload size > > On 12/12/18 7:46 AM, Joakim Zhang wrote: > > Now the FlexCAN driver always use last mailbox for TX, it will work > > well when MB payload size is 8/16 bytes. > > TX mailbox would change to 13 when MB payload size is 64 bytes to > > support CANFD. So we may need to set iflag register to add support for > > variable payload size. > > > > Signed-off-by: Joakim Zhang > > --- > > drivers/net/can/flexcan.c | 42 > > +-- > > 1 file changed, 32 insertions(+), 10 deletions(-) > > > > diff --git a/drivers/net/can/flexcan.c b/drivers/net/can/flexcan.c > > index 0f36eafe3ac1..13fd085fcf84 100644 > > --- a/drivers/net/can/flexcan.c > > +++ b/drivers/net/can/flexcan.c > > @@ -141,7 +141,9 @@ > > #define FLEXCAN_TX_MB_RESERVED_OFF_FIFO8 > > #define FLEXCAN_TX_MB_RESERVED_OFF_TIMESTAMP 0 > > #define FLEXCAN_RX_MB_OFF_TIMESTAMP_FIRST > (FLEXCAN_TX_MB_RESERVED_OFF_TIMESTAMP + 1) > > -#define FLEXCAN_IFLAG_MB(x)BIT((x) & 0x1f) > > +#define FLEXCAN_IFLAG1_MB_NUM 32 > > +#define FLEXCAN_IFLAG1_MB(x) BIT(x) > > +#define FLEXCAN_IFLAG2_MB(x) BIT((x) & 0x1f) > > #define FLEXCAN_IFLAG_RX_FIFO_OVERFLOW BIT(7) > > #define FLEXCAN_IFLAG_RX_FIFO_WARN BIT(6) > > #define FLEXCAN_IFLAG_RX_FIFO_AVAILABLEBIT(5) > > @@ -822,9 +824,15 @@ static inline u64 flexcan_read_reg_iflag_rx(struct > flexcan_priv *priv) > > struct flexcan_regs __iomem *regs = priv->regs; > > u32 iflag1, iflag2; > > > > - iflag2 = priv->read(®s->iflag2) & priv->reg_imask2_default & > > - ~FLEXCAN_IFLAG_MB(priv->tx_mb_idx); > > - iflag1 = priv->read(®s->iflag1) & priv->reg_imask1_default; > > + if (priv->tx_mb_idx >= FLEXCAN_IFLAG1_MB_NUM) { > > + iflag2 = priv->read(®s->iflag2) & priv->reg_imask2_default & > > + ~FLEXCAN_IFLAG2_MB(priv->tx_mb_idx); > > + iflag1 = priv->read(®s->iflag1) & priv->reg_imask1_default; > > + } else { > > + iflag2 = priv->read(®s->iflag2) & priv->reg_imask2_default; > > + iflag1 = priv->read(®s->iflag1) & priv->reg_imask1_default & > > + ~FLEXCAN_IFLAG1_MB(priv->tx_mb_idx); > > + } > > I just noticed, that FLEXCAN_IFLAGx_MB(priv->tx_mb_idx) should already be > part of the corresponding imaskx_default. See flexcan_open(). So we can > remove it completely here, right? Hi Marc, flexcan_read_reg_iflag_rx() is the function to confirm the irq which RX mailbox generated, if we remove it completely here, how can we exclude that it is not a TX mailbox irq? > > > > return (u64)iflag2 << 32 | iflag1; > > } > > @@ -836,7 +844,8 @@ static irqreturn_t flexcan_irq(int irq, void *dev_id) > > struct flexcan_priv *priv = netdev_priv(dev); > > struct flexcan_regs __iomem *regs = priv->regs; > > irqreturn_t handled = IRQ_NONE; > > - u32 reg_iflag2, reg_esr; > > + u32 reg_tx_iflag, tx_iflag_idx, reg_esr; > > "tx_iflag_idx" is not an index (going from 0...63) but a bit-mask. > > > + void __iomem *reg_iflag; > > "reg_iflag" is not a register but a pointer to a register, better rename it. > There is > a "u64 reg_iflag" in the same function already, but in a different scope. Why > not make it a u32 instead of a void? Of course, we can make it a u32. Best Regards, Joakim Zhang > > enum can_state last_state = priv->can.state; > > > > /* reception interrupt */ > > @@ -870,10 +879,18 @@ static irqreturn_t flexcan_irq(int irq, void *dev_id) > > } > > } > > > > - reg_iflag2 = priv->read(®s->iflag2); > > + if (priv->tx_mb_idx >= FLEXCAN_IFLAG1_MB_NUM) { > > + reg_tx_iflag = priv->read(®s->iflag2); > > + tx_iflag_idx = FLEXCAN_IFLAG2_MB(priv->tx_mb_idx); > > + reg_iflag = ®s->iflag2; > > + } else { > > + reg_tx_iflag = priv->read(®s->iflag1); > > + tx_iflag_idx = FLEXCAN_IFLAG1_MB(priv->tx_mb_idx); > > + reg_iflag = ®s->iflag1; > > + } > > > > /* transmission complete interrupt */ > > - if (reg_iflag2 & FLEXCAN_IFLAG_MB(priv->tx_mb_idx)) { > > + if (reg_tx_iflag & tx_iflag_idx) { > > u32 reg_ctrl = priv->read(&priv->tx_mb->can_ctrl); > > > > handled = IRQ_HANDLED; > > @@ -885,7 +902,7 @@ static irqreturn_t flexcan_irq(int irq, void *dev_id) > > /* after sending a RTR frame MB is in RX mode */ > > priv->write(FLEXCAN_MB_CODE_TX_INACTIVE, > > &priv->tx_mb->can_ctrl); > > - priv->write(FLEXCAN_IFLAG_MB(priv->tx_mb_idx), ®s->iflag2); > > + priv->write(tx_iflag_idx, reg_iflag); > > netif_wake_queue(dev); > >
Re: [PATCH] acpi_pm: Reduce PMTMR counter read contention
On 2019/2/18 11:48, Zhenzhong Duan wrote: On 2019/2/11 5:08, Thomas Gleixner wrote: On Sat, 2 Feb 2019, Zhenzhong Duan wrote: On 2019/1/31 22:26, Thomas Gleixner wrote: I'm not against the change per se, but I really want to understand why we need all the complexity for something which should never be used in a real world deployment. Hmm, it's a strong word of "never be used". Customers may happen to use nohpet(sanity test?) and report bug to us. Sometimes they does report a bug that reproduce with their customed config. There may also be BIOS setting HPET disabled. And because the customer MAY do completely nonsensical things (and there are a lot more than the HPET) the kernel has to handle all of them? Ok, then. I don't have more suggestion to convince you. You give up too fast :) Ah, because I thought of a simple fix. The point is, that we really want proper justifications for changes like this. Some 'may, could and more handwaving' simply does not cut it. So if you can just describe a realistic scenario, which does not involve thoughtless flipping of BIOS options, then this becomes way more palatable. I indeed don't see a realistic scenario in a product env needing to use nohpet. My only justification is now that we have nohpet as kernel parameter, we should fix the softlockup in large machines for enterprise use. I just think of a simple fix as below. I think it will work for both hpet and pmtmr. We will test it when the env is available. --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -1353,6 +1353,7 @@ static int change_clocksource(void *data) write_seqcount_end(&tk_core.seq); raw_spin_unlock_irqrestore(&timekeeper_lock, flags); + tick_clock_notify(); return 0; } @@ -1371,7 +1372,6 @@ int timekeeping_notify(struct clocksource *clock) if (tk->tkr_mono.clock == clock) return 0; stop_machine(change_clocksource, clock, NULL); - tick_clock_notify(); return tk->tkr_mono.clock == clock ? 0 : -1; } This won't resolve the concurrency issues of HPET or PMTIMER in any way. Just got chance to test and Kin confirmed it fix the softlockup of PMTMR(with nohpet) and HPET(without nohpet, revert previous hpet commit) at bootup stage. My understandig is, at bootup stage tick device is firstly initialized in periodic mode and then switch to one-shot mode. In periodic mode clock event interrupt is triggered every 1ms(HZ=1000), contention in HPET or PMTIMER exceeds 1ms and delayed the clock interrupt. Then CPUs continue to process interrupt one by one without a break, tick_clock_notify() have no chance to be called and we never switch to one-shot mode. In one-shot mode, the contention is still there but next event is always set with a future value. We may missed some ticks, but the timer code is smart enough to pick up those missed ticks. By moving tick_clock_notify() in stop_machine, kernel changes to one-shot mode early before the contention accumulate and lockup system. Instead it breaks the careful orchestrated mechanism of clocksource change. Sorry, I didn't get a idea how it breaks, tick_clock_notify() is a simple function setting bitmask in percpu variable. Could you explain a bit? Hi Thomas, May I have your further comments? I think applying a simple patch to fix both hpet and pmtmr softlockup is better? Thanks Zhenzhong
Re: [LKP] [page cache] eb797a8ee0: vm-scalability.throughput -16.5% regression
Waiman Long writes: > On 02/27/2019 08:18 PM, Huang, Ying wrote: >> Waiman Long writes: >> >>> On 02/26/2019 12:30 PM, Linus Torvalds wrote: On Tue, Feb 26, 2019 at 12:17 AM Huang, Ying wrote: > As for fixing. Should we care about the cache line alignment of struct > inode? Or its size is considered more important because there may be a > huge number of struct inode in the system? Thanks for the great analysis. I suspect we _would_ like to make sure inodes are as small as possible, since they are everywhere. Also, they are usually embedded in other structures (ie "struct inode" is embedded into "struct ext4_inode_info"), and unless we force alignment (and thus possibly lots of padding), the actual alignment of 'struct inode' will vary depending on filesystem. So I would suggest we *not* do cacheline alignment, because it will result in random padding. But it sounds like maybe the solution is to make sure that the different fields of the inode can and should be packed differently? So one thing to look at is to see what fields in 'struct inode' might be best moved together, to minimize cache accesses. And in particular, if this is *only* an issue of "struct rw_semaphore", maybe we should look at the layout of *that*. In particular, I'm getting the feeling that we should put the "owner" field right next to the "count" field, because the normal non-contended path only touches those two fields. >>> That is true. Putting the two next to each other reduces the chance of >>> needing to touch 2 cachelines to acquire a rwsem. >>> Right now those two fields are pretty far from each other in 'struct rw_semaphore', which then makes the "oops they got allocated in different cachelines" much more likely. So even if 'struct inode' layout itself isn't changed, maybe just optimizing the layout of 'struct rw_semaphore' a bit for the common case might fix it all up. Waiman, I didn't check if your rewrite already possibly fixes this? >>> My current patch doesn't move the owner field, but I will add one to do >>> it. That change alone probably won't solve the regression we see here. >>> The optimistic spinner is spinning on the on_cpu flag of the task >>> structure as well as the rwsem->owner value (looking for change). The >>> lock holder only need to touch the count/owner values once at unlock. >>> However, if other hot data variables are in the same cacheline as >>> rwsem->owner, we will have cacaheline bouncing problem. So we need to >>> pad some rarely touched variables right before the rwsem in order to >>> reduce the chance of false cacheline sharing. >> Yes. And if my understanding were correct, if the rwsem is locked, the >> new rw_sem users (which calls down_write()) will write rwsem->count and >> some other fields of rwsem. This will cause cache ping-pong between >> lock holder and the new users too if the memory accessed by lock holder >> shares the same cache line with rwsem->count, thus hurt the system >> performance. For the regression reported, the rwsem holder will change >> address_space->i_mmap, if I put i_mmap and rwsem->count in the same >> cache line and rwsem->owner in a different cache line, the performance >> can improve ~8.3%. While if I put i_mmap in one cache line and all >> fields of rwsem in another different cache line, the performance can >> improve ~12.9% (in another machine, where the regression is ~14%). > > So it is better to have i_mmap and the rwsem in separate cachelines. Right? Yes. >> So I think in the heavily contended situation, we should put the fields >> accessed by rwsem holder in a different cache line of rwsem. But in >> un-contended situation, we should put the fields accessed in rwsem >> holder and rwsem in the same cache line to reduce the cache footprint. >> The requirement of un-contended and heavily contended situation is >> contradicted. > > Write to the rwsem's count mostly happens at lock and unlock times. It > is the constant spinning on owner by the optimistic waiter that is > likely to cause the most problem when its cacheline is shared with > another piece of data outside of the rwsem that is rewritten to fairly > frequently. Perhaps moving i_mmap further away from i_mmap_rwsem may help. Yes. I think rwsem->owner is more important too. rwsem->count has measurable effect too. And yes, moving i_mmap further away should help rwsem->count sharing too. Best Regards, Huang, Ying > Cheers, > Longman
Re: [PATCH V7 1/2] arm64: dts: freescale: imx8qxp: add cpu opp table
On Tue, Feb 26, 2019 at 05:17:31AM +, Anson Huang wrote: > Add i.MX8QXP CPU opp table to support cpufreq. > > Signed-off-by: Anson Huang > Acked-by: Viresh Kumar Prefix 'arm64: dts: imx8qxp: ' would already be clear enough. I dropped 'freescale' from there and applied patch. > --- > No changes since V6. > --- > arch/arm64/boot/dts/freescale/imx8qxp.dtsi | 30 > ++ > 1 file changed, 30 insertions(+) > > diff --git a/arch/arm64/boot/dts/freescale/imx8qxp.dtsi > b/arch/arm64/boot/dts/freescale/imx8qxp.dtsi > index 4c3dd95..41bf0ce 100644 > --- a/arch/arm64/boot/dts/freescale/imx8qxp.dtsi > +++ b/arch/arm64/boot/dts/freescale/imx8qxp.dtsi > @@ -34,6 +34,9 @@ > reg = <0x0 0x0>; > enable-method = "psci"; > next-level-cache = <&A35_L2>; > + clocks = <&clk IMX_A35_CLK>; > + operating-points-v2 = <&a35_0_opp_table>; > + #cooling-cells = <2>; > }; > > A35_1: cpu@1 { > @@ -42,6 +45,9 @@ > reg = <0x0 0x1>; > enable-method = "psci"; > next-level-cache = <&A35_L2>; > + clocks = <&clk IMX_A35_CLK>; > + operating-points-v2 = <&a35_0_opp_table>; > + #cooling-cells = <2>; > }; > > A35_2: cpu@2 { > @@ -50,6 +56,9 @@ > reg = <0x0 0x2>; > enable-method = "psci"; > next-level-cache = <&A35_L2>; > + clocks = <&clk IMX_A35_CLK>; > + operating-points-v2 = <&a35_0_opp_table>; > + #cooling-cells = <2>; > }; > > A35_3: cpu@3 { > @@ -58,6 +67,9 @@ > reg = <0x0 0x3>; > enable-method = "psci"; > next-level-cache = <&A35_L2>; > + clocks = <&clk IMX_A35_CLK>; > + operating-points-v2 = <&a35_0_opp_table>; > + #cooling-cells = <2>; > }; > > A35_L2: l2-cache0 { > @@ -65,6 +77,24 @@ > }; > }; > > + a35_0_opp_table: opp-table { What does the '0' in the label mean? Shawn > + compatible = "operating-points-v2"; > + opp-shared; > + > + opp-9 { > + opp-hz = /bits/ 64 <9>; > + opp-microvolt = <100>; > + clock-latency-ns = <15>; > + }; > + > + opp-12 { > + opp-hz = /bits/ 64 <12>; > + opp-microvolt = <110>; > + clock-latency-ns = <15>; > + opp-suspend; > + }; > + }; > + > gic: interrupt-controller@51a0 { > compatible = "arm,gic-v3"; > reg = <0x0 0x51a0 0 0x1>, /* GIC Dist */ > -- > 2.7.4 > > > ___ > linux-arm-kernel mailing list > linux-arm-ker...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
[PATCH] xfrm: policy: Fix possible user after free in __xfrm_policy_unlink
From: YueHaibing UBSAN report this: UBSAN: Undefined behaviour in net/xfrm/xfrm_policy.c:1289:24 index 6 is out of range for type 'unsigned int [6]' CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.4.162-514.55.6.9.x86_64+ #13 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 1466cf39b41b23c9 8801f6b07a58 81cb35f4 41b58ab3 83230f9c 81cb34e0 8801f6b07a80 8801f6b07a20 1466cf39b41b23c9 851706e0 8801f6b07ae8 Call Trace: [] __dump_stack lib/dump_stack.c:15 [inline] [] dump_stack+0x114/0x1a0 lib/dump_stack.c:51 [] ubsan_epilogue+0x12/0x8f lib/ubsan.c:164 [] __ubsan_handle_out_of_bounds+0x16e/0x1b2 lib/ubsan.c:382 [] __xfrm_policy_unlink+0x3dd/0x5b0 net/xfrm/xfrm_policy.c:1289 [] xfrm_policy_delete+0x52/0xb0 net/xfrm/xfrm_policy.c:1309 [] xfrm_policy_timer+0x30b/0x590 net/xfrm/xfrm_policy.c:243 [] call_timer_fn+0x237/0x990 kernel/time/timer.c:1144 [] __run_timers kernel/time/timer.c:1218 [inline] [] run_timer_softirq+0x6ce/0xb80 kernel/time/timer.c:1401 [] __do_softirq+0x299/0xe10 kernel/softirq.c:273 [] invoke_softirq kernel/softirq.c:350 [inline] [] irq_exit+0x216/0x2c0 kernel/softirq.c:391 [] exiting_irq arch/x86/include/asm/apic.h:652 [inline] [] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926 [] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:735 [] ? native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:52 [] arch_safe_halt arch/x86/include/asm/paravirt.h:111 [inline] [] default_idle+0x27/0x430 arch/x86/kernel/process.c:446 [] arch_cpu_idle+0x15/0x20 arch/x86/kernel/process.c:437 [] default_idle_call+0x53/0x90 kernel/sched/idle.c:92 [] cpuidle_idle_call kernel/sched/idle.c:156 [inline] [] cpu_idle_loop kernel/sched/idle.c:251 [inline] [] cpu_startup_entry+0x60d/0x9a0 kernel/sched/idle.c:299 [] start_secondary+0x3c9/0x560 arch/x86/kernel/smpboot.c:245 The issue is triggered as this: xfrm_add_policy -->verify_newpolicy_info //check the index provided by user with XFRM_POLICY_MAX //In my case, the index is 0x6E6BB6, so it pass the check. -->xfrm_policy_construct //copy the user's policy and set xfrm_policy_timer -->xfrm_policy_insert --> __xfrm_policy_link //use the orgin dir, in my case is 2 --> xfrm_gen_index //generate policy index, there is 0x6E6BB6 then xfrm_policy_timer be fired xfrm_policy_timer --> xfrm_policy_id2dir //get dir from (policy index & 7), in my case is 6 --> xfrm_policy_delete --> __xfrm_policy_unlink //access policy_count[dir], trigger out of range access Reported-by: Hulk Robot Fixes: e682adf021be ("xfrm: Try to honor policy index if it's supplied by user") Signed-off-by: YueHaibing --- net/xfrm/xfrm_policy.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 8d1a898..b27eb742 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -316,6 +316,8 @@ static void xfrm_policy_timer(struct timer_list *t) goto out; dir = xfrm_policy_id2dir(xp->index); + if (dir >= XFRM_POLICY_MAX * 2) + dir = dir & XFRM_POLICY_MAX; if (xp->lft.hard_add_expires_seconds) { time64_t tmo = xp->lft.hard_add_expires_seconds + -- 2.7.0
Re: [PATCH] tools lib traceevent: Fix buffer overflow in arg_eval
On Wed, 27 Feb 2019 17:55:32 -0800 Tony Jones wrote: > Fix buffer overflow observed when running perf test. > > The overflow is when trying to evaluate "1ULL << (64 - 1)" which > is resulting in -9223372036854775808 which overflows the 20 character > buffer. > > If is possible this bug has been reported before but I still don't > see any fix checked in: > > See: https://www.spinics.net/lists/linux-perf-users/msg07714.html > > Cc: Arnaldo Carvalho de Melo > Cc: linux-perf-us...@vger.kernel.org > Cc: Steven Rostedt > Signed-off-by: Tony Jones Acked-by: Steven Rostedt (VMware) I have to say I've let this slide and it is not the first time a patch went out with this fix. But this one has the correct fix because we should use a buffer with a multiple of 4. Anyway, Tony I believe was the first to report this anyway. For reference we have: I first heard about Tony's complaint on a post to linux-perf-users on Jan 18. But then we had after that: Michael Sartain reported it on 1/24 (and fixed by Tzvetomir) https://lore.kernel.org/linux-trace-devel/20190125102014.19600-1-tstoya...@vmware.com/ It was later fixed again by Mathias Krause https://lore.kernel.org/linux-trace-devel/20190223122404.21137-1-mini...@googlemail.com/ But since Tony was first to report it, and we discussed that it should be 24 bytes, I would say this is the patch to take. Again, sorry for not getting this acknowledged earlier and everyone doing the same thing multiple times. :-/ Arnaldo, please take this patch. But also add: Reported-by: Michael Sartain Reported-by: Mathias Krause Thanks, -- Steve > --- > tools/lib/traceevent/event-parse.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tools/lib/traceevent/event-parse.c > b/tools/lib/traceevent/event-parse.c index abd4fa5d3088..87494c7c619d > 100644 --- a/tools/lib/traceevent/event-parse.c > +++ b/tools/lib/traceevent/event-parse.c > @@ -2457,7 +2457,7 @@ static int arg_num_eval(struct tep_print_arg > *arg, long long *val) static char *arg_eval (struct tep_print_arg > *arg) { > long long val; > - static char buf[20]; > + static char buf[24]; > > switch (arg->type) { > case TEP_PRINT_ATOM:
RE: [PATCH V2 2/3] ARM: dts: imx7ulp: add mmdc support
Hi, Fabio Best Regards! Anson Huang > -Original Message- > From: Fabio Estevam [mailto:feste...@gmail.com] > Sent: 2019年2月28日 9:57 > To: Anson Huang > Cc: robh...@kernel.org; mark.rutl...@arm.com; shawn...@kernel.org; > s.ha...@pengutronix.de; ker...@pengutronix.de; > devicet...@vger.kernel.org; linux-kernel@vger.kernel.org; linux-arm- > ker...@lists.infradead.org; dl-linux-imx > Subject: Re: [PATCH V2 2/3] ARM: dts: imx7ulp: add mmdc support > > Hi Anson, > > On Wed, Feb 27, 2019 at 10:54 PM Anson Huang > wrote: > > > If so, should I change other i.MX6 SoCs with single MMDC node as well? > > Remove them as well? > > Only imx6qp needs the mmdc label. > > I can send a patch removing it from the other i.MX SoCs, so don't need to > worry about it in your series. Oops, I already sent out V3 patch series including all i.MX SoC MMDC label removed before seeing your mail, so you can remove it from your TODO list, please help review it, thanks! Anson. > > Thanks
(no subject)
-- Good Day, I am Mr. Alfred Cheuk Yu Chow, the Director for Credit & Marketing Chong Hing Bank, Hong Kong, Chong Hing Bank Centre, 24 Des Voeux Road Central, Hong Kong. I have a business proposal of $38,980,369.00. All confirmable documents to back up the claims will be made available to you prior to your acceptance and as soon as I receive your return mail. Email me for more details: Best Regards,
[PATCH v6 3/3] dt-bindings: PCI: altera: Add altr,pcie-root-port-2.0
Add support for altr,pcie-root-port-2.0. Signed-off-by: Ley Foon Tan Reviewed-by: Rob Herring --- Documentation/devicetree/bindings/pci/altera-pcie.txt | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/Documentation/devicetree/bindings/pci/altera-pcie.txt b/Documentation/devicetree/bindings/pci/altera-pcie.txt index 6c396f17c91a..816b244a221e 100644 --- a/Documentation/devicetree/bindings/pci/altera-pcie.txt +++ b/Documentation/devicetree/bindings/pci/altera-pcie.txt @@ -1,11 +1,13 @@ * Altera PCIe controller Required properties: -- compatible : should contain "altr,pcie-root-port-1.0" +- compatible : should contain "altr,pcie-root-port-1.0" or "altr,pcie-root-port-2.0" - reg: a list of physical base address and length for TXS and CRA. + For "altr,pcie-root-port-2.0", additional HIP base address and length. - reg-names: must include the following entries: "Txs": TX slave port region "Cra": Control register access region + "Hip": Hard IP region (if "altr,pcie-root-port-2.0") - interrupts: specifies the interrupt source of the parent interrupt controller. The format of the interrupt specifier depends on the parent interrupt controller. -- 2.19.0
[PATCH v6 2/3] PCI: altera: Enable driver on ARM64
Enable PCIE_ALTERA on ARM64 platform. Signed-off-by: Ley Foon Tan --- drivers/pci/controller/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/pci/controller/Kconfig b/drivers/pci/controller/Kconfig index 6671946dbf66..6012f3059acd 100644 --- a/drivers/pci/controller/Kconfig +++ b/drivers/pci/controller/Kconfig @@ -175,7 +175,7 @@ config PCIE_IPROC_MSI config PCIE_ALTERA bool "Altera PCIe controller" - depends on ARM || NIOS2 || COMPILE_TEST + depends on ARM || NIOS2 || ARM64 || COMPILE_TEST help Say Y here if you want to enable PCIe controller support on Altera FPGA. -- 2.19.0
[PATCH v6 1/3] PCI: altera: Add Stratix 10 PCIe support
Add PCIe Root Port support for Stratix 10 device. Main differences compare with PCIe Root Port IP on Cyclone V and Arria 10 devices: - HIP interface to access Root Port configuration register. - TLP programming flow: - One REG0 register - Don't need to check alignment Signed-off-by: Ley Foon Tan --- drivers/pci/controller/pcie-altera.c | 264 --- 1 file changed, 240 insertions(+), 24 deletions(-) diff --git a/drivers/pci/controller/pcie-altera.c b/drivers/pci/controller/pcie-altera.c index 7d05e51205b3..c57fd7f4e848 100644 --- a/drivers/pci/controller/pcie-altera.c +++ b/drivers/pci/controller/pcie-altera.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -37,7 +38,12 @@ #define RP_LTSSM_MASK 0x1f #define LTSSM_L0 0xf -#define PCIE_CAP_OFFSET0x80 +#define S10_RP_TX_CNTRL0x2004 +#define S10_RP_RXCPL_REG 0x2008 +#define S10_RP_RXCPL_STATUS0x200C +#define S10_RP_CFG_ADDR(pcie, reg) \ + (((pcie)->hip_base) + (reg) + (1 << 20)) + /* TLP configuration type 0 and 1 */ #define TLP_FMTTYPE_CFGRD0 0x04/* Configuration Read Type 0 */ #define TLP_FMTTYPE_CFGWR0 0x44/* Configuration Write Type 0 */ @@ -49,18 +55,19 @@ #define RP_DEVFN 0 #define TLP_REQ_ID(bus, devfn) (((bus) << 8) | (devfn)) #define TLP_CFGRD_DW0(pcie, bus) \ -bus == pcie->root_bus_nr) ? TLP_FMTTYPE_CFGRD0 \ - : TLP_FMTTYPE_CFGRD1) << 24) | \ - TLP_PAYLOAD_SIZE) + bus == pcie->root_bus_nr) ? pcie->pcie_data->cfgrd0 \ + : pcie->pcie_data->cfgrd1) << 24) | \ + TLP_PAYLOAD_SIZE) #define TLP_CFGWR_DW0(pcie, bus) \ -bus == pcie->root_bus_nr) ? TLP_FMTTYPE_CFGWR0 \ - : TLP_FMTTYPE_CFGWR1) << 24) | \ - TLP_PAYLOAD_SIZE) + bus == pcie->root_bus_nr) ? pcie->pcie_data->cfgwr0 \ + : pcie->pcie_data->cfgwr1) << 24) | \ + TLP_PAYLOAD_SIZE) #define TLP_CFG_DW1(pcie, tag, be) \ -(((TLP_REQ_ID(pcie->root_bus_nr, RP_DEVFN)) << 16) | (tag << 8) | (be)) + (((TLP_REQ_ID(pcie->root_bus_nr, RP_DEVFN)) << 16) | (tag << 8) | (be)) #define TLP_CFG_DW2(bus, devfn, offset)\ (((bus) << 24) | ((devfn) << 16) | (offset)) #define TLP_COMP_STATUS(s) (((s) >> 13) & 7) +#define TLP_BYTE_COUNT(s) (((s) >> 0) & 0xfff) #define TLP_HDR_SIZE 3 #define TLP_LOOP 500 @@ -69,14 +76,47 @@ #define DWORD_MASK 3 +#define S10_TLP_FMTTYPE_CFGRD0 0x05 +#define S10_TLP_FMTTYPE_CFGRD1 0x04 +#define S10_TLP_FMTTYPE_CFGWR0 0x45 +#define S10_TLP_FMTTYPE_CFGWR1 0x44 + +enum altera_pcie_version { + ALTERA_PCIE_V1 = 0, + ALTERA_PCIE_V2, +}; + struct altera_pcie { struct platform_device *pdev; - void __iomem*cra_base; /* DT Cra */ + void __iomem*cra_base; + void __iomem*hip_base; int irq; u8 root_bus_nr; struct irq_domain *irq_domain; struct resource bus_range; struct list_headresources; + const struct altera_pcie_data *pcie_data; +}; + +struct altera_pcie_ops { + int (*tlp_read_pkt)(struct altera_pcie *pcie, u32 *value); + void (*tlp_write_pkt)(struct altera_pcie *pcie, u32 *headers, + u32 data, bool align); + bool (*get_link_status)(struct altera_pcie *pcie); + int (*rp_read_cfg)(struct altera_pcie *pcie, int where, + int size, u32 *value); + int (*rp_write_cfg)(struct altera_pcie *pcie, u8 busno, + int where, int size, u32 value); +}; + +struct altera_pcie_data { + const struct altera_pcie_ops *ops; + enum altera_pcie_version version; + u32 cap_offset; /* PCIe capability structure register offset */ + u32 cfgrd0; + u32 cfgrd1; + u32 cfgwr0; + u32 cfgwr1; }; struct tlp_rp_regpair_t { @@ -101,6 +141,15 @@ static bool altera_pcie_link_up(struct altera_pcie *pcie) return !!((cra_readl(pcie, RP_LTSSM) & RP_LTSSM_MASK) == LTSSM_L0); } +static bool s10_altera_pcie_link_up(struct altera_pcie *pcie) +{ + void __iomem *addr = S10_RP_CFG_ADDR(pcie, + pcie->pcie_data->cap_offset + + PCI_EXP_LNKSTA); + + return !!(readw(addr) & PCI_EXP_LNKSTA_DLLLA); +} + /*
[PATCH v6 0/3] Add Stratix 10 PCIe Root Port support
Add PCIe Root Port support for Stratix 10 device and also update device tree binding documentation. v5 -> v6: - - Move udelay to SOP polling - Move count checking to for loop condition - Change dw[count++] to dw[0] for first DW v4 -> v5: - - Add struct altera_pcie_ops - Add count checking in s10_tlp_read_packet() v3 -> v4: - - Separate Kconfig change to a patch - Change cast to mask v2 -> v3: - - Rename Stratix10 to Stratix 10. - Change bool s10_flag to enum version. v1 -> v2: - - Add define S10_TLP_FMTTYPE_* macros. - Remove initialize structure members to NULL/zero. - Rename *_funcs to *_data. - Update comment and fix coding style warning from checkpatch.pl. - Rename StratixXX to stratix10. History: [v1]: https://lkml.org/lkml/2018/12/26/68 [v2]: https://lkml.org/lkml/2018/12/31/46 [v3]: https://lkml.org/lkml/2019/1/2/16 [v4]: https://lkml.org/lkml/2019/2/14/58 [v5]: https://lkml.org/lkml/2019/2/26/200 Ley Foon Tan (3): PCI: altera: Add Stratix 10 PCIe support PCI: altera: Enable driver on ARM64 dt-bindings: PCI: altera: Add altr,pcie-root-port-2.0 .../devicetree/bindings/pci/altera-pcie.txt | 4 +- drivers/pci/controller/Kconfig| 2 +- drivers/pci/controller/pcie-altera.c | 264 -- 3 files changed, 244 insertions(+), 26 deletions(-) -- 2.19.0
Re: [PATCH v2] MIPS: fix memory setup for platforms with PHYS_OFFSET != 0
Hello, Thomas Bogendoerfer wrote: > For platforms, which use a PHYS_OFFSET != 0, symbol _end also > contains that offset. So when calling memblock_reserve() for > reserving kernel the size argument needs to be adjusted. > > Fixes: bcec54bf3118 ("mips: switch to NO_BOOTMEM") > Acked-by: Mike Rapoport > Signed-off-by: Thomas Bogendoerfer Applied to mips-fixes. Thanks, Paul [ This message was auto-generated; if you believe anything is incorrect then please email paul.bur...@mips.com to report it. ]
[PATCH V3 2/3] ARM: dts: imx7ulp: add mmdc support
i.MX7ULP has a MMDC module to control DDR, it reuses i.MX6Q's MMDC module, add support for it. Signed-off-by: Anson Huang --- Changes since V2: - remove mmdc label. --- arch/arm/boot/dts/imx7ulp.dtsi | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/arm/boot/dts/imx7ulp.dtsi b/arch/arm/boot/dts/imx7ulp.dtsi index fca6e50..eb349fd 100644 --- a/arch/arm/boot/dts/imx7ulp.dtsi +++ b/arch/arm/boot/dts/imx7ulp.dtsi @@ -286,6 +286,12 @@ status = "disabled"; }; + memory-controller@40ab { + compatible = "fsl,imx7ulp-mmdc", "fsl,imx6q-mmdc"; + reg = <0x40ab 0x1000>; + clocks = <&pcc3 IMX7ULP_CLK_MMDC>; + }; + iomuxc1: pinctrl@40ac { compatible = "fsl,imx7ulp-iomuxc1"; reg = <0x40ac 0x1000>; -- 2.7.4
[PATCH V3 1/3] dt-bindings: memory-controllers: freescale: add MMDC binding doc
Freescale MMDC (Multi Mode DDR Controller) driver is supported since i.MX6Q, but not yet documented, this patch adds binding doc for MMDC module driver. Signed-off-by: Anson Huang --- No changes since V2. --- .../bindings/memory-controllers/fsl/mmdc.txt | 31 ++ 1 file changed, 31 insertions(+) create mode 100644 Documentation/devicetree/bindings/memory-controllers/fsl/mmdc.txt diff --git a/Documentation/devicetree/bindings/memory-controllers/fsl/mmdc.txt b/Documentation/devicetree/bindings/memory-controllers/fsl/mmdc.txt new file mode 100644 index 000..5750274 --- /dev/null +++ b/Documentation/devicetree/bindings/memory-controllers/fsl/mmdc.txt @@ -0,0 +1,31 @@ +Freescale Multi Mode DDR controller (MMDC) + +Required properties : +- compatible : should be one of following: + for i.MX6Q/i.MX6DL: + - "fsl,imx6q-mmdc"; + for i.MX6SL: + - "fsl,imx6sl-mmdc", "fsl,imx6q-mmdc"; + for i.MX6SLL: + - "fsl,imx6sll-mmdc", "fsl,imx6q-mmdc"; + for i.MX6SX: + - "fsl,imx6sx-mmdc", "fsl,imx6q-mmdc"; + for i.MX6UL/i.MX6ULL/i.MX6ULZ: + - "fsl,imx6ul-mmdc", "fsl,imx6q-mmdc"; + for i.MX7ULP: + - "fsl,imx7ulp-mmdc", "fsl,imx6q-mmdc"; +- reg : address and size of MMDC DDR controller registers + +Optional properties : +- clocks : the clock provided by the SoC to access the MMDC registers + +Example : + mmdc0: memory-controller@21b { /* MMDC0 */ + compatible = "fsl,imx6q-mmdc"; + reg = <0x021b 0x4000>; + clocks = <&clks IMX6QDL_CLK_MMDC_P0_IPG>; + }; + + mmdc1: memory-controller@21b4000 { /* MMDC1 */ + reg = <0x021b4000 0x4000>; + }; -- 2.7.4
[PATCH V3 3/3] ARM: dts: imx: make MMDC node name generic
Node name should be generic, so use "memory-controller" instead of "mmdc" for MMDC node name, also remove "mmdc" label for platforms with single MMDC node. Signed-off-by: Anson Huang --- Changes since V2: - remove "mmdc" label for single MMDC node. --- arch/arm/boot/dts/imx6qdl.dtsi | 4 ++-- arch/arm/boot/dts/imx6sl.dtsi | 2 +- arch/arm/boot/dts/imx6sx.dtsi | 2 +- arch/arm/boot/dts/imx6ul.dtsi | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/arm/boot/dts/imx6qdl.dtsi b/arch/arm/boot/dts/imx6qdl.dtsi index fe17a34..4c7278b 100644 --- a/arch/arm/boot/dts/imx6qdl.dtsi +++ b/arch/arm/boot/dts/imx6qdl.dtsi @@ -1129,13 +1129,13 @@ reg = <0x021ac000 0x4000>; }; - mmdc0: mmdc@21b { /* MMDC0 */ + mmdc0: memory-controller@21b { /* MMDC0 */ compatible = "fsl,imx6q-mmdc"; reg = <0x021b 0x4000>; clocks = <&clks IMX6QDL_CLK_MMDC_P0_IPG>; }; - mmdc1: mmdc@21b4000 { /* MMDC1 */ + mmdc1: memory-controller@21b4000 { /* MMDC1 */ reg = <0x021b4000 0x4000>; }; diff --git a/arch/arm/boot/dts/imx6sl.dtsi b/arch/arm/boot/dts/imx6sl.dtsi index 4b4813f..733ea50 100644 --- a/arch/arm/boot/dts/imx6sl.dtsi +++ b/arch/arm/boot/dts/imx6sl.dtsi @@ -922,7 +922,7 @@ status = "disabled"; }; - mmdc: mmdc@21b { + memory-controller@21b { compatible = "fsl,imx6sl-mmdc", "fsl,imx6q-mmdc"; reg = <0x021b 0x4000>; clocks = <&clks IMX6SL_CLK_MMDC_P0_IPG>; diff --git a/arch/arm/boot/dts/imx6sx.dtsi b/arch/arm/boot/dts/imx6sx.dtsi index 5b16e65..df0c595 100644 --- a/arch/arm/boot/dts/imx6sx.dtsi +++ b/arch/arm/boot/dts/imx6sx.dtsi @@ -1017,7 +1017,7 @@ status = "disabled"; }; - mmdc: mmdc@21b { + memory-controller@21b { compatible = "fsl,imx6sx-mmdc", "fsl,imx6q-mmdc"; reg = <0x021b 0x4000>; clocks = <&clks IMX6SX_CLK_MMDC_P0_IPG>; diff --git a/arch/arm/boot/dts/imx6ul.dtsi b/arch/arm/boot/dts/imx6ul.dtsi index 62ed30c..a77bbca 100644 --- a/arch/arm/boot/dts/imx6ul.dtsi +++ b/arch/arm/boot/dts/imx6ul.dtsi @@ -914,7 +914,7 @@ status = "disabled"; }; - mmdc: mmdc@21b { + memory-controller@21b { compatible = "fsl,imx6ul-mmdc", "fsl,imx6q-mmdc"; reg = <0x021b 0x4000>; clocks = <&clks IMX6UL_CLK_MMDC_P0_IPG>; -- 2.7.4
Re: [PATCH -next] drm/amdgpu: remove set but not used variable 'vbi_time_out'
Applied. thanks! On Tue, Feb 26, 2019 at 12:36 AM YueHaibing wrote: > > Fixes gcc '-Wunused-but-set-variable' warning: > > drivers/gpu/drm/amd/amdgpu/si_dpm.c: In function 'si_program_response_times': > drivers/gpu/drm/amd/amdgpu/si_dpm.c:4101:29: warning: > variable 'backbias_response_time' set but not used > [-Wunused-but-set-variable] > > It's never used since introduction in 841686df9f7d ("drm/amdgpu: add SI DPM > support (v4)"), so can be removed > > Signed-off-by: YueHaibing > --- > drivers/gpu/drm/amd/amdgpu/si_dpm.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/si_dpm.c > b/drivers/gpu/drm/amd/amdgpu/si_dpm.c > index 41e01a7f57a4..d57e75e5c71f 100644 > --- a/drivers/gpu/drm/amd/amdgpu/si_dpm.c > +++ b/drivers/gpu/drm/amd/amdgpu/si_dpm.c > @@ -4098,14 +4098,13 @@ static int si_notify_smc_display_change(struct > amdgpu_device *adev, > > static void si_program_response_times(struct amdgpu_device *adev) > { > - u32 voltage_response_time, backbias_response_time, acpi_delay_time, > vbi_time_out; > + u32 voltage_response_time, acpi_delay_time, vbi_time_out; > u32 vddc_dly, acpi_dly, vbi_dly; > u32 reference_clock; > > si_write_smc_soft_register(adev, SI_SMC_SOFT_REGISTER_mvdd_chg_time, > 1); > > voltage_response_time = (u32)adev->pm.dpm.voltage_response_time; > - backbias_response_time = (u32)adev->pm.dpm.backbias_response_time; > > if (voltage_response_time == 0) > voltage_response_time = 1000; > > > > ___ > amd-gfx mailing list > amd-...@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH v2 23/26] userfaultfd: wp: don't wake up when doing write protect
On Tue, Feb 26, 2019 at 10:00:29AM +0200, Mike Rapoport wrote: > On Tue, Feb 26, 2019 at 03:41:17PM +0800, Peter Xu wrote: > > On Tue, Feb 26, 2019 at 09:29:33AM +0200, Mike Rapoport wrote: > > > On Tue, Feb 26, 2019 at 02:24:52PM +0800, Peter Xu wrote: > > > > On Mon, Feb 25, 2019 at 11:09:35PM +0200, Mike Rapoport wrote: > > > > > On Tue, Feb 12, 2019 at 10:56:29AM +0800, Peter Xu wrote: > > > > > > It does not make sense to try to wake up any waiting thread when > > > > > > we're > > > > > > write-protecting a memory region. Only wake up when resolving a > > > > > > write > > > > > > protected page fault. > > > > > > > > > > > > Signed-off-by: Peter Xu > > > > > > --- > > > > > > fs/userfaultfd.c | 13 - > > > > > > 1 file changed, 8 insertions(+), 5 deletions(-) > > > > > > > > > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > > > > > index 81962d62520c..f1f61a0278c2 100644 > > > > > > --- a/fs/userfaultfd.c > > > > > > +++ b/fs/userfaultfd.c > > > > > > @@ -1771,6 +1771,7 @@ static int userfaultfd_writeprotect(struct > > > > > > userfaultfd_ctx *ctx, > > > > > > struct uffdio_writeprotect uffdio_wp; > > > > > > struct uffdio_writeprotect __user *user_uffdio_wp; > > > > > > struct userfaultfd_wake_range range; > > > > > > + bool mode_wp, mode_dontwake; > > > > > > > > > > > > if (READ_ONCE(ctx->mmap_changing)) > > > > > > return -EAGAIN; > > > > > > @@ -1789,18 +1790,20 @@ static int userfaultfd_writeprotect(struct > > > > > > userfaultfd_ctx *ctx, > > > > > > if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE | > > > > > >UFFDIO_WRITEPROTECT_MODE_WP)) > > > > > > return -EINVAL; > > > > > > - if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) && > > > > > > -(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) > > > > > > + > > > > > > + mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP; > > > > > > + mode_dontwake = uffdio_wp.mode & > > > > > > UFFDIO_WRITEPROTECT_MODE_DONTWAKE; > > > > > > + > > > > > > + if (mode_wp && mode_dontwake) > > > > > > return -EINVAL; > > > > > > > > > > This actually means the opposite of the commit message text ;-) > > > > > > > > > > Is any dependency of _WP and _DONTWAKE needed at all? > > > > > > > > So this is indeed confusing at least, because both you and Jerome have > > > > asked the same question... :) > > > > > > > > My understanding is that we don't have any reason to wake up any > > > > thread when we are write-protecting a range, in that sense the flag > > > > UFFDIO_WRITEPROTECT_MODE_DONTWAKE is already meaningless in the > > > > UFFDIO_WRITEPROTECT ioctl context. So before everything here's how > > > > these flags are defined: > > > > > > > > struct uffdio_writeprotect { > > > > struct uffdio_range range; > > > > /* !WP means undo writeprotect. DONTWAKE is valid only with !WP > > > > */ > > > > #define UFFDIO_WRITEPROTECT_MODE_WP ((__u64)1<<0) > > > > #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE ((__u64)1<<1) > > > > __u64 mode; > > > > }; > > > > > > > > To make it clear, we simply define it as "DONTWAKE is valid only with > > > > !WP". When with that, "mode_wp && mode_dontwake" is indeed a > > > > meaningless flag combination. Though please note that it does not > > > > mean that the operation ("don't wake up the thread") is meaningless - > > > > that's what we'll do no matter what when WP==1. IMHO it's only about > > > > the interface not the behavior. > > > > > > > > I don't have a good way to make this clearer because firstly we'll > > > > need the WP flag to mark whether we're protecting or unprotecting the > > > > pages. Later on, we need DONTWAKE for page fault handling case to > > > > mark that we don't want to wake up the waiting thread now. So both > > > > the flags have their reason to stay so far. Then with all these in > > > > mind what I can think of is only to forbid using DONTWAKE in WP case, > > > > and that's how above definition comes (I believe, because it was > > > > defined that way even before I started to work on it and I think it > > > > makes sense). > > > > > > There's no argument how DONTWAKE can be used with !WP. The > > > userfaultfd_writeprotect() is called in response of the uffd monitor to WP > > > page fault, it asks to clear write protection to some range, but it does > > > not want to wake the faulting thread yet but rather it will use > > > uffd_wake() > > > later. > > > > > > Still, I can't grok the usage of DONTWAKE with WP=1. In my understanding, > > > in this case userfaultfd_writeprotect() is called unrelated to page > > > faults, > > > and the monitored thread runs freely, so why it should be waked at all? > > > > Exactly this is how I understand it. And that's why I wrote this > > patch to remove the extra wakeup() since I think it's unecessary. > > > > > > > > And what happens, if the thread is waiting on
Re: [PATCH -next] drm/amdgpu: remove set but not used variables 'vm, bo'
Applied. thanks! On Wed, Feb 20, 2019 at 3:05 AM YueHaibing wrote: > > Fixes gcc '-Wunused-but-set-variable' warning: > > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c: In function > 'update_gpuvm_pte': > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:840:20: warning: > variable 'bo' set but not used [-Wunused-but-set-variable] > > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c:838:20: warning: > variable 'vm' set but not used [-Wunused-but-set-variable] > > They're never used since introduction in a46a2cd103a8 ("drm/amdgpu: Add GPUVM > memory management functions for KFD") > > Signed-off-by: YueHaibing > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 8 +--- > 1 file changed, 1 insertion(+), 7 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > index d7b10d79f1de..63daf758cd03 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > @@ -837,13 +837,7 @@ static int update_gpuvm_pte(struct amdgpu_device *adev, > struct amdgpu_sync *sync) > { > int ret; > - struct amdgpu_vm *vm; > - struct amdgpu_bo_va *bo_va; > - struct amdgpu_bo *bo; > - > - bo_va = entry->bo_va; > - vm = bo_va->base.vm; > - bo = bo_va->base.bo; > + struct amdgpu_bo_va *bo_va = entry->bo_va; > > /* Update the page tables */ > ret = amdgpu_vm_bo_update(adev, bo_va, false); > > > > > > ___ > dri-devel mailing list > dri-de...@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH v3 5/5] tracing/probe: Support user-space dereference
On Wed, 27 Feb 2019 23:44:42 +0900 Masami Hiramatsu wrote: > diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c > index a7012de37a00..0efef172db17 100644 > --- a/kernel/trace/trace_probe.c > +++ b/kernel/trace/trace_probe.c > @@ -239,6 +239,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type, > { > struct fetch_insn *code = *pcode; > unsigned long param; > + int deref = FETCH_OP_DEREF; > long offset = 0; > char *tmp; > int ret = 0; > @@ -301,8 +302,17 @@ parse_probe_arg(char *arg, const struct fetch_type *type, > break; > > case '+': /* deref memory */ > - arg++; /* Skip '+', because kstrtol() rejects it. */ > case '-': > + if (arg[0] == '+') { > + arg++; /* Skip '+', because kstrtol() rejects it. */ > + if (arg[0] == 'u') { > + deref = FETCH_OP_UDEREF; > + arg++; > + } > + } else if (arg[1] == 'u') { /* Start with "-u" */ > + deref = FETCH_OP_UDEREF; > + *(++arg) = '-'; > + } What about: if (arg[1] == 'u') { deref = FETCH_OP_UDEREF; arg[1] = arg[0]; arg++; } if (arg[0] == '+') arg++; /* Skip '+', because kstrtol() rejects it. */ A bit less messy. > tmp = strchr(arg, '('); > if (!tmp) > return -EINVAL; > @@ -328,7 +338,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type, > return -E2BIG; > *pcode = code; > > - code->op = FETCH_OP_DEREF; > + code->op = deref; > code->offset = offset; > } > break; > @@ -444,13 +454,14 @@ static int traceprobe_parse_probe_arg_body(char *arg, > ssize_t *size, > /* Store operation */ > if (!strcmp(parg->type->name, "string") || > !strcmp(parg->type->name, "ustring")) { > - if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_IMM && > - code->op != FETCH_OP_COMM) { > + if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_UDEREF > + && code->op != FETCH_OP_IMM && code->op != FETCH_OP_COMM) { > pr_info("string only accepts memory or address.\n"); > ret = -EINVAL; > goto fail; > } > - if (code->op != FETCH_OP_DEREF || parg->count) { > + if ((code->op == FETCH_OP_IMM && code->op == FETCH_OP_COMM) > + || parg->count) { How would "code->op == FETCH_OP_IMM && code->op == FETCH_OP_COMM" ever be true? Did you mean || ? -- Steve > /* >* IMM and COMM is pointing actual address, those must >* be kept, and if parg->count != 0, this is an array > @@ -463,7 +474,8 @@ static int traceprobe_parse_probe_arg_body(char *arg, > ssize_t *size, > } > } > /* If op == DEREF, replace it with STRING */ > - if (!strcmp(parg->type->name, "ustring")) > + if (!strcmp(parg->type->name, "ustring") || > + code->op == FETCH_OP_UDEREF) > code->op = FETCH_OP_ST_USTRING; > else > code->op = FETCH_OP_ST_STRING;
Re: [PATCH -next] drm/ttm: remove set but not used variable 'bdev'
Applied. Thanks! On Tue, Feb 19, 2019 at 3:22 AM YueHaibing wrote: > > Fixes gcc '-Wunused-but-set-variable' warning: > > drivers/gpu/drm/ttm/ttm_execbuf_util.c: In function > 'ttm_eu_fence_buffer_objects': > drivers/gpu/drm/ttm/ttm_execbuf_util.c:191:24: warning: > variable 'bdev' set but not used [-Wunused-but-set-variable] > > It's not used any more and can be removed. > > Signed-off-by: YueHaibing > --- > drivers/gpu/drm/ttm/ttm_execbuf_util.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/drivers/gpu/drm/ttm/ttm_execbuf_util.c > b/drivers/gpu/drm/ttm/ttm_execbuf_util.c > index 93860346c426..0075eb9a0b52 100644 > --- a/drivers/gpu/drm/ttm/ttm_execbuf_util.c > +++ b/drivers/gpu/drm/ttm/ttm_execbuf_util.c > @@ -188,13 +188,11 @@ void ttm_eu_fence_buffer_objects(struct ww_acquire_ctx > *ticket, > struct ttm_validate_buffer *entry; > struct ttm_buffer_object *bo; > struct ttm_bo_global *glob; > - struct ttm_bo_device *bdev; > > if (list_empty(list)) > return; > > bo = list_first_entry(list, struct ttm_validate_buffer, head)->bo; > - bdev = bo->bdev; > glob = bo->bdev->glob; > > spin_lock(&glob->lru_lock); > > > > ___ > dri-devel mailing list > dri-de...@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [LKP] [page cache] eb797a8ee0: vm-scalability.throughput -16.5% regression
On 02/27/2019 08:18 PM, Huang, Ying wrote: > Waiman Long writes: > >> On 02/26/2019 12:30 PM, Linus Torvalds wrote: >>> On Tue, Feb 26, 2019 at 12:17 AM Huang, Ying wrote: As for fixing. Should we care about the cache line alignment of struct inode? Or its size is considered more important because there may be a huge number of struct inode in the system? >>> Thanks for the great analysis. >>> >>> I suspect we _would_ like to make sure inodes are as small as >>> possible, since they are everywhere. Also, they are usually embedded >>> in other structures (ie "struct inode" is embedded into "struct >>> ext4_inode_info"), and unless we force alignment (and thus possibly >>> lots of padding), the actual alignment of 'struct inode' will vary >>> depending on filesystem. >>> >>> So I would suggest we *not* do cacheline alignment, because it will >>> result in random padding. >>> >>> But it sounds like maybe the solution is to make sure that the >>> different fields of the inode can and should be packed differently? >>> >>> So one thing to look at is to see what fields in 'struct inode' might >>> be best moved together, to minimize cache accesses. >>> >>> And in particular, if this is *only* an issue of "struct >>> rw_semaphore", maybe we should look at the layout of *that*. In >>> particular, I'm getting the feeling that we should put the "owner" >>> field right next to the "count" field, because the normal >>> non-contended path only touches those two fields. >> That is true. Putting the two next to each other reduces the chance of >> needing to touch 2 cachelines to acquire a rwsem. >> >>> Right now those two fields are pretty far from each other in 'struct >>> rw_semaphore', which then makes the "oops they got allocated in >>> different cachelines" much more likely. >>> >>> So even if 'struct inode' layout itself isn't changed, maybe just >>> optimizing the layout of 'struct rw_semaphore' a bit for the common >>> case might fix it all up. >>> >>> Waiman, I didn't check if your rewrite already possibly fixes this? >> My current patch doesn't move the owner field, but I will add one to do >> it. That change alone probably won't solve the regression we see here. >> The optimistic spinner is spinning on the on_cpu flag of the task >> structure as well as the rwsem->owner value (looking for change). The >> lock holder only need to touch the count/owner values once at unlock. >> However, if other hot data variables are in the same cacheline as >> rwsem->owner, we will have cacaheline bouncing problem. So we need to >> pad some rarely touched variables right before the rwsem in order to >> reduce the chance of false cacheline sharing. > Yes. And if my understanding were correct, if the rwsem is locked, the > new rw_sem users (which calls down_write()) will write rwsem->count and > some other fields of rwsem. This will cause cache ping-pong between > lock holder and the new users too if the memory accessed by lock holder > shares the same cache line with rwsem->count, thus hurt the system > performance. For the regression reported, the rwsem holder will change > address_space->i_mmap, if I put i_mmap and rwsem->count in the same > cache line and rwsem->owner in a different cache line, the performance > can improve ~8.3%. While if I put i_mmap in one cache line and all > fields of rwsem in another different cache line, the performance can > improve ~12.9% (in another machine, where the regression is ~14%). So it is better to have i_mmap and the rwsem in separate cachelines. Right? > So I think in the heavily contended situation, we should put the fields > accessed by rwsem holder in a different cache line of rwsem. But in > un-contended situation, we should put the fields accessed in rwsem > holder and rwsem in the same cache line to reduce the cache footprint. > The requirement of un-contended and heavily contended situation is > contradicted. Write to the rwsem's count mostly happens at lock and unlock times. It is the constant spinning on owner by the optimistic waiter that is likely to cause the most problem when its cacheline is shared with another piece of data outside of the rwsem that is rewritten to fairly frequently. Perhaps moving i_mmap further away from i_mmap_rwsem may help. Cheers, Longman
[PATCH v4 2/2] PCI: hv: Refactor hv_irq_unmask() to use cpumask_to_vpset()
Remove the duplicate implementation of cpumask_to_vpset() and use the shared implementation. Export hv_max_vp_index, which is required by cpumask_to_vpset(). Apply changes to hv_irq_unmask() based on feedback. Based on Vitaly's finding, use GFP_ATOMIC instead of GFP_KERNEL for alloc_cpumask_var() because hv_irq_unmask() runs while a spinlock is held. Signed-off-by: Maya Nakamura --- Changes in v4: - Replace GFP_KERNEL with GFP_ATOMIC for alloc_cpumask_var(). - Update the commit message. Changes in v3: - Modify to catch all failures from cpumask_to_vpset(). - Correct the v2 change log about the commit message. Changes in v2: - Remove unnecessary nr_bank initialization. - Delete two unnecessary dev_err()'s. - Unlock before returning. - Update the commit message. arch/x86/hyperv/hv_init.c | 1 + drivers/pci/controller/pci-hyperv.c | 38 + 2 files changed, 18 insertions(+), 21 deletions(-) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index 7abb09e2eeb8..7f2eed1fc81b 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -96,6 +96,7 @@ void __percpu **hyperv_pcpu_input_arg; EXPORT_SYMBOL_GPL(hyperv_pcpu_input_arg); u32 hv_max_vp_index; +EXPORT_SYMBOL_GPL(hv_max_vp_index); static int hv_cpu_init(unsigned int cpu) { diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index d71695db1ba0..95441a35eceb 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -391,8 +391,6 @@ struct hv_interrupt_entry { u32 data; }; -#define HV_VP_SET_BANK_COUNT_MAX 5 /* current implementation limit */ - /* * flags for hv_device_interrupt_target.flags */ @@ -908,12 +906,12 @@ static void hv_irq_unmask(struct irq_data *data) struct retarget_msi_interrupt *params; struct hv_pcibus_device *hbus; struct cpumask *dest; + cpumask_var_t tmp; struct pci_bus *pbus; struct pci_dev *pdev; unsigned long flags; u32 var_size = 0; - int cpu_vmbus; - int cpu; + int cpu, nr_bank; u64 res; dest = irq_data_get_effective_affinity_mask(data); @@ -953,29 +951,27 @@ static void hv_irq_unmask(struct irq_data *data) */ params->int_target.flags |= HV_DEVICE_INTERRUPT_TARGET_PROCESSOR_SET; - params->int_target.vp_set.valid_bank_mask = - (1ull << HV_VP_SET_BANK_COUNT_MAX) - 1; + + if (!alloc_cpumask_var(&tmp, GFP_ATOMIC)) { + res = 1; + goto exit_unlock; + } + + cpumask_and(tmp, dest, cpu_online_mask); + nr_bank = cpumask_to_vpset(¶ms->int_target.vp_set, tmp); + free_cpumask_var(tmp); + + if (nr_bank <= 0) { + res = 1; + goto exit_unlock; + } /* * var-sized hypercall, var-size starts after vp_mask (thus * vp_set.format does not count, but vp_set.valid_bank_mask * does). */ - var_size = 1 + HV_VP_SET_BANK_COUNT_MAX; - - for_each_cpu_and(cpu, dest, cpu_online_mask) { - cpu_vmbus = hv_cpu_number_to_vp_number(cpu); - - if (cpu_vmbus >= HV_VP_SET_BANK_COUNT_MAX * 64) { - dev_err(&hbus->hdev->device, - "too high CPU %d", cpu_vmbus); - res = 1; - goto exit_unlock; - } - - params->int_target.vp_set.bank_contents[cpu_vmbus / 64] |= - (1ULL << (cpu_vmbus & 63)); - } + var_size = 1 + nr_bank; } else { for_each_cpu_and(cpu, dest, cpu_online_mask) { params->int_target.vp_mask |= -- 2.17.1
Re: [PATCH v4 07/10] dt-bindings: serial: Add Milbeaut serial driver description
Hi, On 2019/02/28 0:04, Greg Kroah-Hartman wrote: On Wed, Feb 27, 2019 at 01:53:30PM +0900, Sugaya Taichi wrote: Add DT bindings document for Milbeaut serial driver. Signed-off-by: Sugaya Taichi Reviewed-by: Rob Herring --- .../devicetree/bindings/serial/milbeaut-uart.txt| 21 + 1 file changed, 21 insertions(+) create mode 100644 Documentation/devicetree/bindings/serial/milbeaut-uart.txt I have already queued this up in my tty tree for merging in 5.1-rc1. thanks, greg k-h I see. Thank you!
[PATCH v4 1/2] PCI: hv: Replace hv_vp_set with hv_vpset
Remove a duplicate definition of VP set (hv_vp_set) and use the common definition (hv_vpset) that is used in other places. Change the order of the members in struct hv_pcibus_device so that the declaration of retarget_msi_interrupt_params is the last member. Struct hv_vpset, which contains a flexible array, is nested two levels deep in struct hv_pcibus_device via retarget_msi_interrupt_params. Add a comment that retarget_msi_interrupt_params should be the last member of struct hv_pcibus_device. Based on Vitaly's finding, add __aligned(8) to struct retarget_msi_interrupt because Hyper-V requires that hypercall arguments be aligned on an 8 byte boundary. Signed-off-by: Maya Nakamura --- Changes in v4: - Add __aligned(8) to struct retarget_msi_interrupt. - Update the commit message. Change in v3: - Correct the v2 change log. Change in v2: - Update the commit message. drivers/pci/controller/pci-hyperv.c | 27 +-- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index 9ba4d12c179c..d71695db1ba0 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -393,12 +393,6 @@ struct hv_interrupt_entry { #define HV_VP_SET_BANK_COUNT_MAX 5 /* current implementation limit */ -struct hv_vp_set { - u64 format; /* 0 (HvGenericSetSparse4k) */ - u64 valid_banks; - u64 masks[HV_VP_SET_BANK_COUNT_MAX]; -}; - /* * flags for hv_device_interrupt_target.flags */ @@ -410,7 +404,7 @@ struct hv_device_interrupt_target { u32 flags; union { u64 vp_mask; - struct hv_vp_set vp_set; + struct hv_vpset vp_set; }; }; @@ -420,7 +414,7 @@ struct retarget_msi_interrupt { struct hv_interrupt_entry int_entry; u64 reserved2; struct hv_device_interrupt_target int_target; -} __packed; +} __packed __aligned(8); /* * Driver specific state. @@ -460,12 +454,16 @@ struct hv_pcibus_device { struct msi_controller msi_chip; struct irq_domain *irq_domain; - /* hypercall arg, must not cross page boundary */ - struct retarget_msi_interrupt retarget_msi_interrupt_params; - spinlock_t retarget_msi_interrupt_lock; struct workqueue_struct *wq; + + /* hypercall arg, must not cross page boundary */ + struct retarget_msi_interrupt retarget_msi_interrupt_params; + + /* +* Don't put anything here: retarget_msi_interrupt_params must be last +*/ }; /* @@ -955,12 +953,13 @@ static void hv_irq_unmask(struct irq_data *data) */ params->int_target.flags |= HV_DEVICE_INTERRUPT_TARGET_PROCESSOR_SET; - params->int_target.vp_set.valid_banks = + params->int_target.vp_set.valid_bank_mask = (1ull << HV_VP_SET_BANK_COUNT_MAX) - 1; /* * var-sized hypercall, var-size starts after vp_mask (thus -* vp_set.format does not count, but vp_set.valid_banks does). +* vp_set.format does not count, but vp_set.valid_bank_mask +* does). */ var_size = 1 + HV_VP_SET_BANK_COUNT_MAX; @@ -974,7 +973,7 @@ static void hv_irq_unmask(struct irq_data *data) goto exit_unlock; } - params->int_target.vp_set.masks[cpu_vmbus / 64] |= + params->int_target.vp_set.bank_contents[cpu_vmbus / 64] |= (1ULL << (cpu_vmbus & 63)); } } else { -- 2.17.1
[PATCH] powerpc: fix "sz" set but not used
arch/powerpc/mm/hugetlbpage-hash64.c: In function '__hash_page_huge': arch/powerpc/mm/hugetlbpage-hash64.c:29:28: warning: variable 'sz' set but not used [-Wunused-but-set-variable] Signed-off-by: Qian Cai --- arch/powerpc/mm/hugetlbpage-hash64.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c index 2e6a8f9345d3..f6b09edc5e6e 100644 --- a/arch/powerpc/mm/hugetlbpage-hash64.c +++ b/arch/powerpc/mm/hugetlbpage-hash64.c @@ -26,7 +26,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid, real_pte_t rpte; unsigned long vpn; unsigned long old_pte, new_pte; - unsigned long rflags, pa, sz; + unsigned long rflags, pa; long slot, offset; BUG_ON(shift != mmu_psize_defs[mmu_psize].shift); @@ -73,7 +73,6 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid, offset = PTRS_PER_PMD; rpte = __real_pte(__pte(old_pte), ptep, offset); - sz = ((1UL) << shift); if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) /* No CPU has hugepages but lacks no execute, so we * don't need to worry about that case */ -- 2.17.2 (Apple Git-113)
Re: [PATCH] drm/amd/display: Pass app_tf by value rather than by reference
Applied. Sorry for the delay. Alex On Fri, Feb 22, 2019 at 10:36 AM Nathan Chancellor wrote: > > On Mon, Dec 10, 2018 at 04:42:01PM -0700, Nathan Chancellor wrote: > > Clang warns when an expression that equals zero is used as a null > > pointer constant (in lieu of NULL): > > > > drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:4435:3: > > warning: expression which evaluates to zero treated as a null pointer > > constant of type 'const enum color_transfer_func *' > > [-Wnon-literal-null-conversion] > > TRANSFER_FUNC_UNKNOWN, > > ^ > > 1 warning generated. > > > > This warning is caused by commit bb47de736661 ("drm/amdgpu: Set FreeSync > > state using drm VRR properties") and it could be solved by using NULL > > instead of TRANSFER_FUNC_UNKNOWN or casting TRANSFER_FUNC_UNKNOWN as a > > pointer. However, after looking into it, there doesn't appear to be a > > good reason to pass app_tf by reference as it is never mutated along the > > way. This is the only code path in which app_tf is used: > > > > mod_freesync_build_vrr_infopacket -> > > build_vrr_infopacket_v2 -> > > build_vrr_infopacket_fs2_data > > > > Neither mod_freesync_build_vrr_infopacket or build_vrr_infopacket_v2 > > modify app_tf's value and build_vrr_infopacket_fs2_data expects just > > the value so we can avoid dereferencing anything by just passing in > > app_tf's value to mod_freesync_build_vrr_infopacket and > > build_vrr_infopacket_v2. > > > > There is no functional change because build_vrr_infopacket_fs2_data > > doesn't do anything if TRANSFER_FUNC_UNKNOWN is passed to it, the same > > as not calling build_vrr_infopacket_fs2_data at all like before this > > change when NULL was used for app_tf. > > > > Signed-off-by: Nathan Chancellor > > --- > > drivers/gpu/drm/amd/display/modules/freesync/freesync.c | 7 +++ > > drivers/gpu/drm/amd/display/modules/inc/mod_freesync.h | 2 +- > > 2 files changed, 4 insertions(+), 5 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/display/modules/freesync/freesync.c > > b/drivers/gpu/drm/amd/display/modules/freesync/freesync.c > > index 620a171620ee..520665a9d81a 100644 > > --- a/drivers/gpu/drm/amd/display/modules/freesync/freesync.c > > +++ b/drivers/gpu/drm/amd/display/modules/freesync/freesync.c > > @@ -656,7 +656,7 @@ static void build_vrr_infopacket_v1(enum signal_type > > signal, > > > > static void build_vrr_infopacket_v2(enum signal_type signal, > > const struct mod_vrr_params *vrr, > > - const enum color_transfer_func *app_tf, > > + enum color_transfer_func app_tf, > > struct dc_info_packet *infopacket) > > { > > unsigned int payload_size = 0; > > @@ -664,8 +664,7 @@ static void build_vrr_infopacket_v2(enum signal_type > > signal, > > build_vrr_infopacket_header_v2(signal, infopacket, &payload_size); > > build_vrr_infopacket_data(vrr, infopacket); > > > > - if (app_tf != NULL) > > - build_vrr_infopacket_fs2_data(*app_tf, infopacket); > > + build_vrr_infopacket_fs2_data(app_tf, infopacket); > > > > build_vrr_infopacket_checksum(&payload_size, infopacket); > > > > @@ -676,7 +675,7 @@ void mod_freesync_build_vrr_infopacket(struct > > mod_freesync *mod_freesync, > > const struct dc_stream_state *stream, > > const struct mod_vrr_params *vrr, > > enum vrr_packet_type packet_type, > > - const enum color_transfer_func *app_tf, > > + enum color_transfer_func app_tf, > > struct dc_info_packet *infopacket) > > { > > /* SPD info packet for FreeSync */ > > diff --git a/drivers/gpu/drm/amd/display/modules/inc/mod_freesync.h > > b/drivers/gpu/drm/amd/display/modules/inc/mod_freesync.h > > index 949a8b62aa98..063af6258fd9 100644 > > --- a/drivers/gpu/drm/amd/display/modules/inc/mod_freesync.h > > +++ b/drivers/gpu/drm/amd/display/modules/inc/mod_freesync.h > > @@ -145,7 +145,7 @@ void mod_freesync_build_vrr_infopacket(struct > > mod_freesync *mod_freesync, > > const struct dc_stream_state *stream, > > const struct mod_vrr_params *vrr, > > enum vrr_packet_type packet_type, > > - const enum color_transfer_func *app_tf, > > + enum color_transfer_func app_tf, > > struct dc_info_packet *infopacket); > > > > void mod_freesync_build_vrr_params(struct mod_freesync *mod_freesync, > > -- > > 2.20.0 > > > > Gentle ping on this patch, it would be nice to get this fixed and into > 5.1! > > Thanks, > Nathan > ___ > amd-gfx mailing list > amd-...@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH 0/4] mwifiex PCI/wake-up interrupt fixes
Hi Rafael, On Wed, Feb 27, 2019 at 3:04 PM Rafael J. Wysocki wrote: > On Wed, Feb 27, 2019 at 9:58 PM Brian Norris wrote: > > On Wed, Feb 27, 2019 at 11:16:12AM +0100, Ard Biesheuvel wrote: > > > So I'd argue that we should add an optional 'wake-gpio' DT property > > > instead to the generic PCI device binding, and leave the interrupt > > > binding and discovery alone. > > > > So I think Mark Rutland already shot that one down; it's conceptually an > > interrupt from the device's perspective. Perhaps I shouldn't speak for Mark, but I am basically quoting him off IRC. > Which device are you talking about? The one that signals wakeup? If > so, then I beg to differ. Yes, the endpoint device. > On ACPI platforms WAKE# is represented as an ACPI GPE that is signaled > through SCI and handled at a different level (on HW-reduced ACPI it > actually can be a GPIO interrupt, but it still is handled with the > help of AML). The driver of the device signaling wakeup need not even > be aware that WAKE# has been asserted. Frankly, ACPI is not relevant to how we represent WAKE# in DT, IMO. Also, we're talking about the *device*, not the driver. When talking about Device Tree, that distinction is relevant. So while the driver need not be aware (and I agree! it only needs to care about enabling/disabling wake), *something* should be aware, and the signal that "something" should be receiving is simply "did WAKE happen"? That sounds basically like the device is signalling an interrupt to me. Maybe this goes back to some confusion we had elsewhere: what is the meaning of "interrupt" in device tree? > > We just need to figure out a good way of representing it that doesn't stomp > > on the existing INTx > > definitions. > > WAKE# is a signal that is converted into an interrupt, but that > interrupt may arrive at some place your driver has nothing to do with. I could agree with that, perhaps. But that's also what Device Tree is all about, really. We describe the relation between devices. So some other handles events that are triggered by , so we use a phandle to relate to . > It generally doesn't make sense to represent it as an interrupt for > the target device. What would you suggest then? I'm not clearly understanding how you think we should (a) describe (in DT) and (b) implement this WAKE# handling. Brian
Re: [PATCH v4 03/10] dt-bindings: Add documentation for Milbeaut SoCs
Hi, On 2019/02/27 23:47, Rob Herring wrote: On Tue, Feb 26, 2019 at 10:51 PM Sugaya Taichi wrote: This adds a DT binding documentation for the M10V and its evaluation board. Signed-off-by: Sugaya Taichi --- .../bindings/arm/socionext/milbeaut.yaml | 22 ++ 1 file changed, 22 insertions(+) create mode 100644 Documentation/devicetree/bindings/arm/socionext/milbeaut.yaml Reviewed-by: Rob Herring Thank you!
[PATCH v4 0/2] PCI: hv: Refactor hv_irq_unmask() to use hv_vpset and cpumask_to_vpset()
This patchset removes a duplicate definition of VP set (hv_vp_set) and uses the common definition (hv_vpset) that is used in other places. It changes the order of the members in struct hv_pcibus_device due to flexible array in hv_vpset. It also removes the duplicate implementation of cpumask_to_vpset(), uses the shared implementation, and exports hv_max_vp_index, which is required by cpumask_to_vpset(). Based on Vitaly's findings, two changes were applied: replace GFP_KERNEL with GFP_ATOMIC for alloc_cpumask_var() because hv_irq_unmask() runs while a spinlock is held, and add __aligned(8) to struct retarget_msi_interrupt because Hyper-V requires that hypercall arguments be aligned on an 8 byte boundary. Vitaly, thank you for finding the issues, and Lorenzo and Michael, thank you for your guidance and support! Maya Nakamura (2): PCI: hv: Replace hv_vp_set with hv_vpset PCI: hv: Refactor hv_irq_unmask() to use cpumask_to_vpset() arch/x86/hyperv/hv_init.c | 1 + drivers/pci/controller/pci-hyperv.c | 61 + 2 files changed, 29 insertions(+), 33 deletions(-) -- 2.17.1
Re: [PATCH v3 5/5] tracing/probe: Support user-space dereference
On Wed, 27 Feb 2019 23:44:42 +0900 Masami Hiramatsu wrote: > > +.. _user_mem_access: > +User Memory Access > +-- > +Kprobe events supports user-space memory access. For that purpose, you can > use > +either user-space dereference syntax or 'ustring' type. > + > +user-space dereference syntax allows you to access a field of a data > structure "The user-space" > +n user-space. This is done by "u" prefix with dereference syntax. For > example, in user-space? "This is done by adding the "u" prefix to the dereference syntax" > ++u4(%si) means read a user memory from the user-space address %si+4. You can "means it will read memory from the address in the register %si offset by 4, and that memory is expected to be in user-space." > +use this for string too, e.g. +u0(%si):string means that the read a user > space "for strings too" > +string from the address where %si register points. 'ustring' is a kind of > +short-cut. You can use +0(%si):ustring instead of that. "+u0(%si):string will read a string from the address in the register %si that is expected to be in user-space. 'ustring' is a shortcut way off performing the same task. That is, +0(%si):ustring is equivalent to +u0(%si):string." > + > +Note that kprobe-event provides user-memory access syntax, but it > doesn't +use it transparently. This means if you use normal > dereference or string type +for user memory, it might fail, and > always fails on some arch. So user has to +check if the targe data is > in kernel or in user space carefully. > Per-Probe Event Filtering > - > diff --git a/Documentation/trace/uprobetracer.rst > b/Documentation/trace/uprobetracer.rst index > 4c3bfde2ba47..6144423b2368 100644 --- > a/Documentation/trace/uprobetracer.rst +++ > b/Documentation/trace/uprobetracer.rst @@ -42,16 +42,17 @@ Synopsis > of uprobe_tracer @+OFFSET : Fetch memory at OFFSET (OFFSET > from same file as PATH) $stackN : Fetch Nth entry of stack (N > >= 0) $stack : Fetch stack address. > - $retval : Fetch return value.(*) > + $retval : Fetch return value.(\*1) > $comm : Fetch current task comm. > - +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**) > + +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS > address.(\*2)(\*3) NAME=FETCHARG : Set NAME as the argument name > of FETCHARG. FETCHARG:TYPE : Set TYPE as the type of FETCHARG. > Currently, basic types (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal > types (x8/x16/x32/x64), "string" and bitfield are supported. > > - (*) only for return probe. > - (**) this is useful for fetching a field of data structures. > + (\*1) only for return probe. > + (\*2) this is useful for fetching a field of data structures. > + (\*3) Unlike kprobe event, "u" prefix will be just ignored. "will just be ignored." > > Types > - > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > index 4cacbb0e1538..5408a82a015d 100644 -- Steve
[PATCH v2] appletalk: Fix use-after-free in atalk_proc_exit
From: YueHaibing KASAN report this: BUG: KASAN: use-after-free in pde_subdir_find+0x12d/0x150 fs/proc/generic.c:71 Read of size 8 at addr 8881f41fe5b0 by task syz-executor.0/2806 CPU: 0 PID: 2806 Comm: syz-executor.0 Not tainted 5.0.0-rc7+ #45 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xfa/0x1ce lib/dump_stack.c:113 print_address_description+0x65/0x270 mm/kasan/report.c:187 kasan_report+0x149/0x18d mm/kasan/report.c:317 pde_subdir_find+0x12d/0x150 fs/proc/generic.c:71 remove_proc_entry+0xe8/0x420 fs/proc/generic.c:667 atalk_proc_exit+0x18/0x820 [appletalk] atalk_exit+0xf/0x5a [appletalk] __do_sys_delete_module kernel/module.c:1018 [inline] __se_sys_delete_module kernel/module.c:961 [inline] __x64_sys_delete_module+0x3dc/0x5e0 kernel/module.c:961 do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x462e99 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:7fb2de6b9c58 EFLAGS: 0246 ORIG_RAX: 00b0 RAX: ffda RBX: 0073bf00 RCX: 00462e99 RDX: RSI: RDI: 21c0 RBP: 0002 R08: R09: R10: R11: 0246 R12: 7fb2de6ba6bc R13: 004bccaa R14: 006f6bc8 R15: Allocated by task 2806: set_track mm/kasan/common.c:85 [inline] __kasan_kmalloc.constprop.3+0xa0/0xd0 mm/kasan/common.c:496 slab_post_alloc_hook mm/slab.h:444 [inline] slab_alloc_node mm/slub.c:2739 [inline] slab_alloc mm/slub.c:2747 [inline] kmem_cache_alloc+0xcf/0x250 mm/slub.c:2752 kmem_cache_zalloc include/linux/slab.h:730 [inline] __proc_create+0x30f/0xa20 fs/proc/generic.c:408 proc_mkdir_data+0x47/0x190 fs/proc/generic.c:469 0xc10c01bb 0xc10c0166 do_one_initcall+0xfa/0x5ca init/main.c:887 do_init_module+0x204/0x5f6 kernel/module.c:3460 load_module+0x66b2/0x8570 kernel/module.c:3808 __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902 do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 2806: set_track mm/kasan/common.c:85 [inline] __kasan_slab_free+0x130/0x180 mm/kasan/common.c:458 slab_free_hook mm/slub.c:1409 [inline] slab_free_freelist_hook mm/slub.c:1436 [inline] slab_free mm/slub.c:2986 [inline] kmem_cache_free+0xa6/0x2a0 mm/slub.c:3002 pde_put+0x6e/0x80 fs/proc/generic.c:647 remove_proc_entry+0x1d3/0x420 fs/proc/generic.c:684 0xc10c031c 0xc10c0166 do_one_initcall+0xfa/0x5ca init/main.c:887 do_init_module+0x204/0x5f6 kernel/module.c:3460 load_module+0x66b2/0x8570 kernel/module.c:3808 __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902 do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at 8881f41fe500 which belongs to the cache proc_dir_entry of size 256 The buggy address is located 176 bytes inside of 256-byte region [8881f41fe500, 8881f41fe600) The buggy address belongs to the page: page:ea0007d07f80 count:1 mapcount:0 mapping:8881f6e69a00 index:0x0 flags: 0x2fffc000200(slab) raw: 02fffc000200 dead0100 dead0200 8881f6e69a00 raw: 800c000c 0001 page dumped because: kasan: bad access detected Memory state around the buggy address: 8881f41fe480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc 8881f41fe500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >8881f41fe580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ 8881f41fe600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb 8881f41fe680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb It should check the return value of atalk_proc_init,otherwise atalk_exit will trgger use-after-free in pde_subdir_find while unload the module.This patch fix error cleanup path of atalk_init Reported-by: Hulk Robot Signed-off-by: YueHaibing --- v2: fix error cleanup path of atalk_init --- include/linux/atalk.h| 2 +- net/appletalk/ddp.c | 37 +++-- net/appletalk/sysctl_net_atalk.c | 5 - 3 files changed, 36 insertions(+), 8 deletions(-) diff --git a/include/linux/atalk.h b/include/linux/atalk.h index 23f8055..5a90f28 100644 --- a/include/linux/atalk.h +++ b/include/linux/atalk.h @@ -158,7 +158,7 @@ extern int sysctl_aarp_retransmit_limit; extern int sysctl_aarp_resolve_time; #ifdef CONFIG_SYSCTL -extern void atalk_register_sysctl(void); +extern int atalk_register_sysctl(void); extern void atalk_unregister_sysctl(void); #else #define atalk_regis
Re: [PATCH v3 4/5] tracing/probe: Add ustring type for user-space string
On Wed, 27 Feb 2019 23:44:13 +0900 Masami Hiramatsu wrote: > Add "ustring" type for fetching user-space string from kprobe event. > User can specify ustring type at uprobe event, and it is same as > "string" for uprobe. > > Note that probe-event provides this option but it doesn't choose the > correct type automatically since we have not way to decide the address > is in user-space or not on some arch (and on some other arch, you can > fetch the string by "string" type). So user must carefully check the > target code (e.g. if you see __user on the target variable) and > use this new type. > > Signed-off-by: Masami Hiramatsu > --- > Changes in v2: > - Use strnlen_user() instead of open code for fetch_store_strlen_user(). Acked-by: Steven Rostedt (VMware) -- Steve
Re: [PATCH v3 1/2] Provide in-kernel headers for making it easy to extend the kernel
Hi Joel, On Thu, Feb 28, 2019 at 4:40 AM Joel Fernandes (Google) wrote: > > Introduce in-kernel headers and other artifacts which are made available > as an archive through proc (/proc/kheaders.tar.xz file). This archive makes > it possible to build kernel modules, run eBPF programs, and other > tracing programs that need to extend the kernel for tracing purposes > without any dependency on the file system having headers and build > artifacts. > > On Android and embedded systems, it is common to switch kernels but not > have kernel headers available on the file system. Raw kernel headers > also cannot be copied into the filesystem like they can be on other > distros, due to licensing and other issues. There's no linux-headers > package on Android. Further once a different kernel is booted, any > headers stored on the file system will no longer be useful. By storing > the headers as a compressed archive within the kernel, we can avoid these > issues that have been a hindrance for a long time. > > The feature is also buildable as a module just in case the user desires > it not being part of the kernel image. This makes it possible to load > and unload the headers on demand. A tracing program, or a kernel module > builder can load the module, do its operations, and then unload the > module to save kernel memory. The total memory needed is 3.8MB. > > The code to read the headers is based on /proc/config.gz code and uses > the same technique to embed the headers. Please let me ask a question about the actual use-case. To build embedded systems including Android, I use an x86 build machine. In other words, I cross-compile vmlinux and in-tree modules. So, target-arch: arm64 host-arch: x86 > To build a module, the below steps have been tested on an x86 machine: > modprobe kheaders > rm -rf $HOME/headers > mkdir -p $HOME/headers > tar -xvf /proc/kheaders.tar.xz -C $HOME/headers >/dev/null > cd my-kernel-module > make -C $HOME/headers M=$(pwd) modules > rmmod kheaders I am guessing the user will run these commands on the target system. In other words, external modules are native-compiled. So, target-arch: arm64 host-arch: arm64 Is this correct? If I understood the assumed use-case correctly, kheaders.tar.xw will contain host-programs compiled for x86, which will not work on the target system. Masahiro > Additional notes: > (1) > A limitation of module building with this is, since Module.symvers is > not available in the archive due to a cyclic dependency with building of > the archive into the kernel or module binaries, the modules built using > the archive will not contain symbol versioning (modversion). This is > usually not an issue since the idea of this patch is to build a kernel > module on the fly and load it into the same kernel. An appropriate > warning is already printed by the kernel to alert the user of modules > not having modversions when built using the archive. For building with > modversions, the user can use traditional header packages. For our > tracing usecases, we build modules on the fly with this so it is not a > concern. > > (2) I have left IKHD_ST and IKHD_ED markers as is to facilitate > future patches that would extract the headers from a kernel or module > image. > > Signed-off-by: Joel Fernandes (Google) > --- -- Best Regards Masahiro Yamada
Re: linux-next: build failure after merge of the bpf-next tree
On Wed, Feb 27, 2019 at 5:31 PM Stephen Rothwell wrote: > > Hi all, > > After merging the bpf-next tree, today's linux-next build (arm > multi_v7_defconfig) failed like this: > > kernel/sysctl.c:1238:13: error: 'sysctl_bpf_stats_enabled' undeclared here > (not in a function); did you mean 'sysctl_base_table'? >.data = &sysctl_bpf_stats_enabled, > ^~~~ > sysctl_base_table > > Caused by commit > > 492ecee892c2 ("bpf: enable program stats") > > CONFIG_BPF=y > # CONFIG_BPF_SYSCALL is not set > > I applied the following patch for today, but it should be done > properly. Also this patch leaves proc_dointvec_minmax_bpf_stats() as an > unused function. thanks for the report. working on the fix.
[PATCH 07/12] percpu: add block level scan_hint
Fragmentation can cause both blocks and chunks to have an early first_firee bit available, but only able to satisfy allocations much later on. This patch introduces a scan_hint to help mitigate some unnecessary scanning. The scan_hint remembers the largest area prior to the contig_hint. If the contig_hint == scan_hint, then scan_hint_start > contig_hint_start. This is necessary for scan_hint discovery when refreshing a block. Signed-off-by: Dennis Zhou --- mm/percpu-internal.h | 9 mm/percpu.c | 101 --- 2 files changed, 103 insertions(+), 7 deletions(-) diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index b1739dc06b73..ec58b244545d 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -9,8 +9,17 @@ * pcpu_block_md is the metadata block struct. * Each chunk's bitmap is split into a number of full blocks. * All units are in terms of bits. + * + * The scan hint is the largest known contiguous area before the contig hint. + * It is not necessarily the actual largest contig hint though. There is an + * invariant that the scan_hint_start > contig_hint_start iff + * scan_hint == contig_hint. This is necessary because when scanning forward, + * we don't know if a new contig hint would be better than the current one. */ struct pcpu_block_md { + int scan_hint; /* scan hint for block */ + int scan_hint_start; /* block relative starting + position of the scan hint */ int contig_hint;/* contig hint for block */ int contig_hint_start; /* block relative starting position of the contig hint */ diff --git a/mm/percpu.c b/mm/percpu.c index 967c9cc3a928..df1aacf58ac8 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -320,6 +320,34 @@ static unsigned long pcpu_block_off_to_off(int index, int off) return index * PCPU_BITMAP_BLOCK_BITS + off; } +/* + * pcpu_next_hint - determine which hint to use + * @block: block of interest + * @alloc_bits: size of allocation + * + * This determines if we should scan based on the scan_hint or first_free. + * In general, we want to scan from first_free to fulfill allocations by + * first fit. However, if we know a scan_hint at position scan_hint_start + * cannot fulfill an allocation, we can begin scanning from there knowing + * the contig_hint will be our fallback. + */ +static int pcpu_next_hint(struct pcpu_block_md *block, int alloc_bits) +{ + /* +* The three conditions below determine if we can skip past the +* scan_hint. First, does the scan hint exist. Second, is the +* contig_hint after the scan_hint (possibly not true iff +* contig_hint == scan_hint). Third, is the allocation request +* larger than the scan_hint. +*/ + if (block->scan_hint && + block->contig_hint_start > block->scan_hint_start && + alloc_bits > block->scan_hint) + return block->scan_hint_start + block->scan_hint; + + return block->first_free; +} + /** * pcpu_next_md_free_region - finds the next hint free area * @chunk: chunk of interest @@ -415,9 +443,11 @@ static void pcpu_next_fit_region(struct pcpu_chunk *chunk, int alloc_bits, if (block->contig_hint && block->contig_hint_start >= block_off && block->contig_hint >= *bits + alloc_bits) { + int start = pcpu_next_hint(block, alloc_bits); + *bits += alloc_bits + block->contig_hint_start - -block->first_free; - *bit_off = pcpu_block_off_to_off(i, block->first_free); +start; + *bit_off = pcpu_block_off_to_off(i, start); return; } /* reset to satisfy the second predicate above */ @@ -632,12 +662,57 @@ static void pcpu_block_update(struct pcpu_block_md *block, int start, int end) block->right_free = contig; if (contig > block->contig_hint) { + /* promote the old contig_hint to be the new scan_hint */ + if (start > block->contig_hint_start) { + if (block->contig_hint > block->scan_hint) { + block->scan_hint_start = + block->contig_hint_start; + block->scan_hint = block->contig_hint; + } else if (start < block->scan_hint_start) { + /* +* The old contig_hint == scan_hint. But, the +* new contig is larger so hold the invariant +* scan_hint_start < contig_hint_start. +
[PATCH 11/12] percpu: convert chunk hints to be based on pcpu_block_md
As mentioned in the last patch, a chunk's hints are no different than a block just responsible for more bits. This converts chunk level hints to use a pcpu_block_md to maintain them. This lets us reuse the same hint helper functions as a block. The left_free and right_free are unused by the chunk's pcpu_block_md. Signed-off-by: Dennis Zhou --- mm/percpu-internal.h | 5 +- mm/percpu-stats.c| 5 +- mm/percpu.c | 120 +++ 3 files changed, 57 insertions(+), 73 deletions(-) diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index 119bd1119aa7..0468ba500bd4 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -39,9 +39,7 @@ struct pcpu_chunk { struct list_headlist; /* linked to pcpu_slot lists */ int free_bytes; /* free bytes in the chunk */ - int contig_bits;/* max contiguous size hint */ - int contig_bits_start; /* contig_bits starting - offset */ + struct pcpu_block_mdchunk_md; void*base_addr; /* base address of this chunk */ unsigned long *alloc_map; /* allocation map */ @@ -49,7 +47,6 @@ struct pcpu_chunk { struct pcpu_block_md*md_blocks; /* metadata blocks */ void*data; /* chunk data */ - int first_bit; /* no free below this */ boolimmutable; /* no [de]population allowed */ int start_offset; /* the overlap with the previous region to have a page aligned diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c index b5fdd43b60c9..ef5034a0464e 100644 --- a/mm/percpu-stats.c +++ b/mm/percpu-stats.c @@ -53,6 +53,7 @@ static int find_max_nr_alloc(void) static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, int *buffer) { + struct pcpu_block_md *chunk_md = &chunk->chunk_md; int i, last_alloc, as_len, start, end; int *alloc_sizes, *p; /* statistics */ @@ -121,9 +122,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, P("nr_alloc", chunk->nr_alloc); P("max_alloc_size", chunk->max_alloc_size); P("empty_pop_pages", chunk->nr_empty_pop_pages); - P("first_bit", chunk->first_bit); + P("first_bit", chunk_md->first_free); P("free_bytes", chunk->free_bytes); - P("contig_bytes", chunk->contig_bits * PCPU_MIN_ALLOC_SIZE); + P("contig_bytes", chunk_md->contig_hint * PCPU_MIN_ALLOC_SIZE); P("sum_frag", sum_frag); P("max_frag", max_frag); P("cur_min_alloc", cur_min_alloc); diff --git a/mm/percpu.c b/mm/percpu.c index 7cdf14c242de..197479f2c489 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -233,10 +233,13 @@ static int pcpu_size_to_slot(int size) static int pcpu_chunk_slot(const struct pcpu_chunk *chunk) { - if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || chunk->contig_bits == 0) + const struct pcpu_block_md *chunk_md = &chunk->chunk_md; + + if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || + chunk_md->contig_hint == 0) return 0; - return pcpu_size_to_slot(chunk->contig_bits * PCPU_MIN_ALLOC_SIZE); + return pcpu_size_to_slot(chunk_md->contig_hint * PCPU_MIN_ALLOC_SIZE); } /* set the pointer to a chunk in a page struct */ @@ -592,54 +595,6 @@ static inline bool pcpu_region_overlap(int a, int b, int x, int y) return false; } -/** - * pcpu_chunk_update - updates the chunk metadata given a free area - * @chunk: chunk of interest - * @bit_off: chunk offset - * @bits: size of free area - * - * This updates the chunk's contig hint and starting offset given a free area. - * Choose the best starting offset if the contig hint is equal. - */ -static void pcpu_chunk_update(struct pcpu_chunk *chunk, int bit_off, int bits) -{ - if (bits > chunk->contig_bits) { - chunk->contig_bits_start = bit_off; - chunk->contig_bits = bits; - } else if (bits == chunk->contig_bits && chunk->contig_bits_start && - (!bit_off || - __ffs(bit_off) > __ffs(chunk->contig_bits_start))) { - /* use the start with the best alignment */ - chunk->contig_bits_start = bit_off; - } -} - -/** - * pcpu_chunk_refresh_hint - updates metadata about a chunk - * @chunk: chunk of interest - * - * Iterates over the metadata blocks to find the largest contig area. - * It also counts the populated pages and uses the delta to update the - * global count. - * - * Updates: - * chunk->contig_bits - * chunk->contig_bits_start - */ -static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk) -{ - int bit_off, bit
[PATCH 04/12] percpu: manage chunks based on contig_bits instead of free_bytes
When a chunk becomes fragmented, it can end up having a large number of small allocation areas free. The free_bytes sorting of chunks leads to unnecessary checking of chunks that cannot satisfy the allocation. Switch to contig_bits sorting to prevent scanning chunks that may not be able to service the allocation request. Signed-off-by: Dennis Zhou --- mm/percpu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/percpu.c b/mm/percpu.c index b40112b2fc59..c996bcffbb2a 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -234,7 +234,7 @@ static int pcpu_chunk_slot(const struct pcpu_chunk *chunk) if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || chunk->contig_bits == 0) return 0; - return pcpu_size_to_slot(chunk->free_bytes); + return pcpu_size_to_slot(chunk->contig_bits * PCPU_MIN_ALLOC_SIZE); } /* set the pointer to a chunk in a page struct */ -- 2.17.1
[PATCH 12/12] percpu: use chunk scan_hint to skip some scanning
Just like blocks, chunks now maintain a scan_hint. This can be used to skip some scanning by promoting the scan_hint to be the contig_hint. The chunk's scan_hint is primarily updated on the backside and relies on full scanning when a block becomes free or the free region spans across blocks. Signed-off-by: Dennis Zhou --- mm/percpu.c | 36 +++- 1 file changed, 27 insertions(+), 9 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index 197479f2c489..40d49d7fb286 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -711,20 +711,31 @@ static void pcpu_block_update_scan(struct pcpu_chunk *chunk, int bit_off, /** * pcpu_chunk_refresh_hint - updates metadata about a chunk * @chunk: chunk of interest + * @full_scan: if we should scan from the beginning * * Iterates over the metadata blocks to find the largest contig area. - * It also counts the populated pages and uses the delta to update the - * global count. + * A full scan can be avoided on the allocation path as this is triggered + * if we broke the contig_hint. In doing so, the scan_hint will be before + * the contig_hint or after if the scan_hint == contig_hint. This cannot + * be prevented on freeing as we want to find the largest area possibly + * spanning blocks. */ -static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk) +static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk, bool full_scan) { struct pcpu_block_md *chunk_md = &chunk->chunk_md; int bit_off, bits; - /* clear metadata */ - chunk_md->contig_hint = 0; + /* promote scan_hint to contig_hint */ + if (!full_scan && chunk_md->scan_hint) { + bit_off = chunk_md->scan_hint_start + chunk_md->scan_hint; + chunk_md->contig_hint_start = chunk_md->scan_hint_start; + chunk_md->contig_hint = chunk_md->scan_hint; + chunk_md->scan_hint = 0; + } else { + bit_off = chunk_md->first_free; + chunk_md->contig_hint = 0; + } - bit_off = chunk_md->first_free; bits = 0; pcpu_for_each_md_free_region(chunk, bit_off, bits) { pcpu_block_update(chunk_md, bit_off, bit_off + bits); @@ -884,6 +895,13 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, if (nr_empty_pages) pcpu_update_empty_pages(chunk, -1 * nr_empty_pages); + if (pcpu_region_overlap(chunk_md->scan_hint_start, + chunk_md->scan_hint_start + + chunk_md->scan_hint, + bit_off, + bit_off + bits)) + chunk_md->scan_hint = 0; + /* * The only time a full chunk scan is required is if the chunk * contig hint is broken. Otherwise, it means a smaller space @@ -894,7 +912,7 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, chunk_md->contig_hint, bit_off, bit_off + bits)) - pcpu_chunk_refresh_hint(chunk); + pcpu_chunk_refresh_hint(chunk, false); } /** @@ -1005,7 +1023,7 @@ static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off, * the else condition below. */ if (((end - start) >= PCPU_BITMAP_BLOCK_BITS) || s_index != e_index) - pcpu_chunk_refresh_hint(chunk); + pcpu_chunk_refresh_hint(chunk, true); else pcpu_block_update(&chunk->chunk_md, pcpu_block_off_to_off(s_index, start), @@ -1078,7 +1096,7 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits, if (bit_off + alloc_bits > chunk_md->contig_hint) return -1; - bit_off = chunk_md->first_free; + bit_off = pcpu_next_hint(chunk_md, alloc_bits); bits = 0; pcpu_for_each_fit_region(chunk, alloc_bits, align, bit_off, bits) { if (!pop_only || pcpu_is_populated(chunk, bit_off, bits, -- 2.17.1
[PATCH 06/12] percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
Previously, block size was flexible based on the constraint that the GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the overhead that keeping a floating number of populated free pages required scanning over the free regions of a chunk. Setting the block size to be fixed at PAGE_SIZE lets us know when an empty page becomes used as we will break a full contig_hint of a block. This means we no longer have to scan the whole chunk upon breaking a contig_hint which empty page management piggybacks off. A later patch takes advantage of this to optimize the allocation path by only scanning forward using the scan_hint introduced later too. Signed-off-by: Dennis Zhou --- include/linux/percpu.h | 12 ++--- mm/percpu-km.c | 2 +- mm/percpu.c| 111 + 3 files changed, 49 insertions(+), 76 deletions(-) diff --git a/include/linux/percpu.h b/include/linux/percpu.h index 70b7123f38c7..9909dc0e273a 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -26,16 +26,10 @@ #define PCPU_MIN_ALLOC_SHIFT 2 #define PCPU_MIN_ALLOC_SIZE(1 << PCPU_MIN_ALLOC_SHIFT) -/* number of bits per page, used to trigger a scan if blocks are > PAGE_SIZE */ -#define PCPU_BITS_PER_PAGE (PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT) - /* - * This determines the size of each metadata block. There are several subtle - * constraints around this constant. The reserved region must be a multiple of - * PCPU_BITMAP_BLOCK_SIZE. Additionally, PCPU_BITMAP_BLOCK_SIZE must be a - * multiple of PAGE_SIZE or PAGE_SIZE must be a multiple of - * PCPU_BITMAP_BLOCK_SIZE to align with the populated page map. The unit_size - * also has to be a multiple of PCPU_BITMAP_BLOCK_SIZE to ensure full blocks. + * The PCPU_BITMAP_BLOCK_SIZE must be the same size as PAGE_SIZE as the + * updating of hints is used to manage the nr_empty_pop_pages in both + * the chunk and globally. */ #define PCPU_BITMAP_BLOCK_SIZE PAGE_SIZE #define PCPU_BITMAP_BLOCK_BITS (PCPU_BITMAP_BLOCK_SIZE >> \ diff --git a/mm/percpu-km.c b/mm/percpu-km.c index 0f643dc2dc65..c10bf7466596 100644 --- a/mm/percpu-km.c +++ b/mm/percpu-km.c @@ -70,7 +70,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) chunk->base_addr = page_address(pages) - pcpu_group_offsets[0]; spin_lock_irqsave(&pcpu_lock, flags); - pcpu_chunk_populated(chunk, 0, nr_pages, false); + pcpu_chunk_populated(chunk, 0, nr_pages); spin_unlock_irqrestore(&pcpu_lock, flags); pcpu_stats_chunk_alloc(); diff --git a/mm/percpu.c b/mm/percpu.c index 3d7deece9556..967c9cc3a928 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -527,37 +527,21 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot) __pcpu_chunk_move(chunk, nslot, oslot < nslot); } -/** - * pcpu_cnt_pop_pages- counts populated backing pages in range +/* + * pcpu_update_empty_pages - update empty page counters * @chunk: chunk of interest - * @bit_off: start offset - * @bits: size of area to check + * @nr: nr of empty pages * - * Calculates the number of populated pages in the region - * [page_start, page_end). This keeps track of how many empty populated - * pages are available and decide if async work should be scheduled. - * - * RETURNS: - * The nr of populated pages. + * This is used to keep track of the empty pages now based on the premise + * a pcpu_block_md covers a page. The hint update functions recognize if + * a block is made full or broken to calculate deltas for keeping track of + * free pages. */ -static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk, int bit_off, -int bits) +static inline void pcpu_update_empty_pages(struct pcpu_chunk *chunk, int nr) { - int page_start = PFN_UP(bit_off * PCPU_MIN_ALLOC_SIZE); - int page_end = PFN_DOWN((bit_off + bits) * PCPU_MIN_ALLOC_SIZE); - - if (page_start >= page_end) - return 0; - - /* -* bitmap_weight counts the number of bits set in a bitmap up to -* the specified number of bits. This is counting the populated -* pages up to page_end and then subtracting the populated pages -* up to page_start to count the populated pages in -* [page_start, page_end). -*/ - return bitmap_weight(chunk->populated, page_end) - - bitmap_weight(chunk->populated, page_start); + chunk->nr_empty_pop_pages += nr; + if (chunk != pcpu_reserved_chunk) + pcpu_nr_empty_pop_pages += nr; } /* @@ -611,36 +595,19 @@ static void pcpu_chunk_update(struct pcpu_chunk *chunk, int bit_off, int bits) * Updates: * chunk->contig_bits * chunk->contig_bits_start - * nr_empty_pop_pages (chunk and global) */ static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk) { - int bit_off, bits, nr_empty_pop_pages; + int bit_off
[PATCH 09/12] percpu: use block scan_hint to only scan forward
Blocks now remember the latest scan_hint. This can be used on the allocation path as when a contig_hint is broken, we can promote the scan_hint to the contig_hint and scan forward from there. This works because pcpu_block_refresh_hint() is only called on the allocation path while block free regions are updated manually in pcpu_block_update_hint_free(). Signed-off-by: Dennis Zhou --- mm/percpu.c | 23 +-- 1 file changed, 17 insertions(+), 6 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index dac18968d79f..e51c151ed692 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -765,14 +765,23 @@ static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index) { struct pcpu_block_md *block = chunk->md_blocks + index; unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index); - int rs, re; /* region start, region end */ + int rs, re, start; /* region start, region end */ + + /* promote scan_hint to contig_hint */ + if (block->scan_hint) { + start = block->scan_hint_start + block->scan_hint; + block->contig_hint_start = block->scan_hint_start; + block->contig_hint = block->scan_hint; + block->scan_hint = 0; + } else { + start = block->first_free; + block->contig_hint = 0; + } - /* clear hints */ - block->contig_hint = block->scan_hint = 0; - block->left_free = block->right_free = 0; + block->right_free = 0; /* iterate over free areas and update the contig hints */ - pcpu_for_each_unpop_region(alloc_map, rs, re, block->first_free, + pcpu_for_each_unpop_region(alloc_map, rs, re, start, PCPU_BITMAP_BLOCK_BITS) { pcpu_block_update(block, rs, re); } @@ -837,6 +846,8 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, s_off, s_off + bits)) { /* block contig hint is broken - scan to fix it */ + if (!s_off) + s_block->left_free = 0; pcpu_block_refresh_hint(chunk, s_index); } else { /* update left and right contig manually */ @@ -870,11 +881,11 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, if (e_off > e_block->scan_hint_start) e_block->scan_hint = 0; + e_block->left_free = 0; if (e_off > e_block->contig_hint_start) { /* contig hint is broken - scan to fix it */ pcpu_block_refresh_hint(chunk, e_index); } else { - e_block->left_free = 0; e_block->right_free = min_t(int, e_block->right_free, PCPU_BITMAP_BLOCK_BITS - e_off); -- 2.17.1
[PATCH 10/12] percpu: make pcpu_block_md generic
In reality, a chunk is just a block covering a larger number of bits. The hints themselves are one in the same. Rather than maintaining the hints separately, first introduce nr_bits to genericize pcpu_block_update() to correctly maintain block->right_free. The next patch will convert chunk hints to be managed as a pcpu_block_md. Signed-off-by: Dennis Zhou --- mm/percpu-internal.h | 1 + mm/percpu.c | 20 +--- 2 files changed, 14 insertions(+), 7 deletions(-) diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index ec58b244545d..119bd1119aa7 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -28,6 +28,7 @@ struct pcpu_block_md { int right_free; /* size of free space along the right side of the block */ int first_free; /* block position of first free */ + int nr_bits;/* total bits responsible for */ }; struct pcpu_chunk { diff --git a/mm/percpu.c b/mm/percpu.c index e51c151ed692..7cdf14c242de 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -658,7 +658,7 @@ static void pcpu_block_update(struct pcpu_block_md *block, int start, int end) if (start == 0) block->left_free = contig; - if (end == PCPU_BITMAP_BLOCK_BITS) + if (end == block->nr_bits) block->right_free = contig; if (contig > block->contig_hint) { @@ -1271,18 +1271,24 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off) pcpu_chunk_relocate(chunk, oslot); } +static void pcpu_init_md_block(struct pcpu_block_md *block, int nr_bits) +{ + block->scan_hint = 0; + block->contig_hint = nr_bits; + block->left_free = nr_bits; + block->right_free = nr_bits; + block->first_free = 0; + block->nr_bits = nr_bits; +} + static void pcpu_init_md_blocks(struct pcpu_chunk *chunk) { struct pcpu_block_md *md_block; for (md_block = chunk->md_blocks; md_block != chunk->md_blocks + pcpu_chunk_nr_blocks(chunk); -md_block++) { - md_block->scan_hint = 0; - md_block->contig_hint = PCPU_BITMAP_BLOCK_BITS; - md_block->left_free = PCPU_BITMAP_BLOCK_BITS; - md_block->right_free = PCPU_BITMAP_BLOCK_BITS; - } +md_block++) + pcpu_init_md_block(md_block, PCPU_BITMAP_BLOCK_BITS); } /** -- 2.17.1
[PATCH 02/12] percpu: do not search past bitmap when allocating an area
pcpu_find_block_fit() guarantees that a fit is found within PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as it compares against the block's contig_hint. This can lead to incorrectly scanning past the end of the bitmap. The behavior was okay given the check after for bit_off >= end and the correctness of the hints from pcpu_find_block_fit(). This patch fixes this by bounding the end offset by the number of bits in a chunk. Signed-off-by: Dennis Zhou --- mm/percpu.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/percpu.c b/mm/percpu.c index 53bd79a617b1..69ca51d238b5 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -988,7 +988,8 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits, /* * Search to find a fit. */ - end = start + alloc_bits + PCPU_BITMAP_BLOCK_BITS; + end = min_t(int, start + alloc_bits + PCPU_BITMAP_BLOCK_BITS, + pcpu_chunk_map_bits(chunk)); bit_off = bitmap_find_next_zero_area(chunk->alloc_map, end, start, alloc_bits, align_mask); if (bit_off >= end) -- 2.17.1
[PATCH 08/12] percpu: remember largest area skipped during allocation
Percpu allocations attempt to do first fit by scanning forward from the first_free of a block. However, fragmentation from allocation requests can cause holes not seen by block hint update functions. To address this, create a local version of bitmap_find_next_zero_area_off() that remembers the largest area skipped over. The caveat is that it only sees regions skipped over due to not fitting, not regions skipped due to alignment. Prior to updating the scan_hint, a scan backwards is done to try and recover free bits skipped due to alignment. While this can cause scanning to miss earlier possible free areas, smaller allocations will eventually fill those holes. Signed-off-by: Dennis Zhou --- mm/percpu.c | 101 ++-- 1 file changed, 99 insertions(+), 2 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index df1aacf58ac8..dac18968d79f 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -716,6 +716,43 @@ static void pcpu_block_update(struct pcpu_block_md *block, int start, int end) } } +/* + * pcpu_block_update_scan - update a block given a free area from a scan + * @chunk: chunk of interest + * @bit_off: chunk offset + * @bits: size of free area + * + * Finding the final allocation spot first goes through pcpu_find_block_fit() + * to find a block that can hold the allocation and then pcpu_alloc_area() + * where a scan is used. When allocations require specific alignments, + * we can inadvertently create holes which will not be seen in the alloc + * or free paths. + * + * This takes a given free area hole and updates a block as it may change the + * scan_hint. We need to scan backwards to ensure we don't miss free bits + * from alignment. + */ +static void pcpu_block_update_scan(struct pcpu_chunk *chunk, int bit_off, + int bits) +{ + int s_off = pcpu_off_to_block_off(bit_off); + int e_off = s_off + bits; + int s_index, l_bit; + struct pcpu_block_md *block; + + if (e_off > PCPU_BITMAP_BLOCK_BITS) + return; + + s_index = pcpu_off_to_block_index(bit_off); + block = chunk->md_blocks + s_index; + + /* scan backwards in case of alignment skipping free bits */ + l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index), s_off); + s_off = (s_off == l_bit) ? 0 : l_bit + 1; + + pcpu_block_update(block, s_off, e_off); +} + /** * pcpu_block_refresh_hint * @chunk: chunk of interest @@ -1064,6 +1101,62 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits, return bit_off; } +/* + * pcpu_find_zero_area - modified from bitmap_find_next_zero_area_off + * @map: the address to base the search on + * @size: the bitmap size in bits + * @start: the bitnumber to start searching at + * @nr: the number of zeroed bits we're looking for + * @align_mask: alignment mask for zero area + * @largest_off: offset of the largest area skipped + * @largest_bits: size of the largest area skipped + * + * The @align_mask should be one less than a power of 2. + * + * This is a modified version of bitmap_find_next_zero_area_off() to remember + * the largest area that was skipped. This is imperfect, but in general is + * good enough. The largest remembered region is the largest failed region + * seen. This does not include anything we possibly skipped due to alignment. + * pcpu_block_update_scan() does scan backwards to try and recover what was + * lost to alignment. While this can cause scanning to miss earlier possible + * free areas, smaller allocations will eventually fill those holes. + */ +static unsigned long pcpu_find_zero_area(unsigned long *map, +unsigned long size, +unsigned long start, +unsigned long nr, +unsigned long align_mask, +unsigned long *largest_off, +unsigned long *largest_bits) +{ + unsigned long index, end, i, area_off, area_bits; +again: + index = find_next_zero_bit(map, size, start); + + /* Align allocation */ + index = __ALIGN_MASK(index, align_mask); + area_off = index; + + end = index + nr; + if (end > size) + return end; + i = find_next_bit(map, end, index); + if (i < end) { + area_bits = i - area_off; + /* remember largest unused area with best alignment */ + if (area_bits > *largest_bits || + (area_bits == *largest_bits && *largest_off && +(!area_off || __ffs(area_off) > __ffs(*largest_off { + *largest_off = area_off; + *largest_bits = area_bits; + } + + start = i + 1; + goto again; + } + return index; +} + /** * pcpu_alloc_ar
[PATCH 05/12] percpu: relegate chunks unusable when failing small allocations
In certain cases, requestors of percpu memory may want specific alignments. However, it is possible to end up in situations where the contig_hint matches, but the alignment does not. This causes excess scanning of chunks that will fail. To prevent this, if a small allocation fails (< 32B), the chunk is moved to the empty list. Once an allocation is freed from that chunk, it is placed back into rotation. Signed-off-by: Dennis Zhou --- mm/percpu.c | 35 ++- 1 file changed, 26 insertions(+), 9 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index c996bcffbb2a..3d7deece9556 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -94,6 +94,8 @@ /* the slots are sorted by free bytes left, 1-31 bytes share the same slot */ #define PCPU_SLOT_BASE_SHIFT 5 +/* chunks in slots below this are subject to being sidelined on failed alloc */ +#define PCPU_SLOT_FAIL_THRESHOLD 3 #define PCPU_EMPTY_POP_PAGES_LOW 2 #define PCPU_EMPTY_POP_PAGES_HIGH 4 @@ -488,6 +490,22 @@ static void pcpu_mem_free(void *ptr) kvfree(ptr); } +static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot, + bool move_front) +{ + if (chunk != pcpu_reserved_chunk) { + if (move_front) + list_move(&chunk->list, &pcpu_slot[slot]); + else + list_move_tail(&chunk->list, &pcpu_slot[slot]); + } +} + +static void pcpu_chunk_move(struct pcpu_chunk *chunk, int slot) +{ + __pcpu_chunk_move(chunk, slot, true); +} + /** * pcpu_chunk_relocate - put chunk in the appropriate chunk slot * @chunk: chunk of interest @@ -505,12 +523,8 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot) { int nslot = pcpu_chunk_slot(chunk); - if (chunk != pcpu_reserved_chunk && oslot != nslot) { - if (oslot < nslot) - list_move(&chunk->list, &pcpu_slot[nslot]); - else - list_move_tail(&chunk->list, &pcpu_slot[nslot]); - } + if (oslot != nslot) + __pcpu_chunk_move(chunk, nslot, oslot < nslot); } /** @@ -1381,7 +1395,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL; bool do_warn = !(gfp & __GFP_NOWARN); static int warn_limit = 10; - struct pcpu_chunk *chunk; + struct pcpu_chunk *chunk, *next; const char *err; int slot, off, cpu, ret; unsigned long flags; @@ -1443,11 +1457,14 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, restart: /* search through normal chunks */ for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) { - list_for_each_entry(chunk, &pcpu_slot[slot], list) { + list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) { off = pcpu_find_block_fit(chunk, bits, bit_align, is_atomic); - if (off < 0) + if (off < 0) { + if (slot < PCPU_SLOT_FAIL_THRESHOLD) + pcpu_chunk_move(chunk, 0); continue; + } off = pcpu_alloc_area(chunk, bits, bit_align, off); if (off >= 0) -- 2.17.1
[PATCH 03/12] percpu: introduce helper to determine if two regions overlap
While block hints were always accurate, it's possible when spanning across blocks that we miss updating the chunk's contig_hint. Rather than rely on correctness of the boundaries of hints, do a full overlap comparison. Signed-off-by: Dennis Zhou --- mm/percpu.c | 31 +++ 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index 69ca51d238b5..b40112b2fc59 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -546,6 +546,24 @@ static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk, int bit_off, bitmap_weight(chunk->populated, page_start); } +/* + * pcpu_region_overlap - determines if two regions overlap + * @a: start of first region, inclusive + * @b: end of first region, exclusive + * @x: start of second region, inclusive + * @y: end of second region, exclusive + * + * This is used to determine if the hint region [a, b) overlaps with the + * allocated region [x, y). + */ +static inline bool pcpu_region_overlap(int a, int b, int x, int y) +{ + if ((x >= a && x < b) || (y > a && y <= b) || + (x <= a && y >= b)) + return true; + return false; +} + /** * pcpu_chunk_update - updates the chunk metadata given a free area * @chunk: chunk of interest @@ -710,8 +728,11 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, PCPU_BITMAP_BLOCK_BITS, s_off + bits); - if (s_off >= s_block->contig_hint_start && - s_off < s_block->contig_hint_start + s_block->contig_hint) { + if (pcpu_region_overlap(s_block->contig_hint_start, + s_block->contig_hint_start + + s_block->contig_hint, + s_off, + s_off + bits)) { /* block contig hint is broken - scan to fix it */ pcpu_block_refresh_hint(chunk, s_index); } else { @@ -764,8 +785,10 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, * contig hint is broken. Otherwise, it means a smaller space * was used and therefore the chunk contig hint is still correct. */ - if (bit_off >= chunk->contig_bits_start && - bit_off < chunk->contig_bits_start + chunk->contig_bits) + if (pcpu_region_overlap(chunk->contig_bits_start, + chunk->contig_bits_start + chunk->contig_bits, + bit_off, + bit_off + bits)) pcpu_chunk_refresh_hint(chunk); } -- 2.17.1