Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup
On Tue, Apr 6, 2021 at 6:39 PM Pavan Kondeti wrote:
>
> On Tue, Apr 06, 2021 at 12:15:24PM -0400, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Apr 06, 2021 at 08:57:15PM +0530, Pavan Kondeti wrote:
> > > Yeah. The workqueue attrs comes in handy to reduce the nice/prio of a
> > > background workqueue if we identify that it is cpu intensive. However,
> > > this needs case by case analysis, tweaking etc. If there is no other
> > > alternative, we might end up chasing the background workers and reduce
> > > their nice value.
> >
> > There shouldn't be that many workqueues that consume a lot of cpu cycles.
> > The usual culprit is kswapd, IO related stuff (writeback, encryption), so it
> > shouldn't be a long list and we want them identified anyway.
>
> Sure. I have not done a complete study on which workers in our system can
> compete with important tasks in other cgroups. We will have to do that to
> adjust the workqueue priority so that the impact can be minimized.
>

kswapd0 is actually migratable to a subgroup. But echoing what Pavan said, the
real world is not ideal, and the problematic drivers are inactive when the use
case is not activated in Android, e.g. connectivity, camera, etc. It is
sometimes tricky to track all of those.

> > > The problem at our hand (which you might be knowing already) is that,
> > > lets say we have 2 cgroups in our setup and we want to prioritize UX
> > > over background. IOW, reduce the cpu.shares of the background cgroup.
> > > This helps in prioritizing Task-A and Task-B over Task-X and Task-Y.
> > > However, each individual kworker can potentially have CPU share equal
> > > to the entire UX cgroup.
> > >
> > > -kworker/0:0
> > > -kworker/1:0
> > > - UX
> > >     Task-A
> > >     Task-B
> > > - background
> > >     Task-X
> > >     Task-Y
> > >
> > > Hence we want to move all kernel threads to another cgroup so that this
> > > cgroup will have CPU share equal to UX.
> > >
> > > The patch presented here allows us to create the above setup. Any other
> > > alternative approaches to achieve the same without impacting any future
> > > designs/requirements?
> >
> > Not quite the same but we already have
> > /sys/devices/virtual/workqueue/cpumask which affects all unbound
> > workqueues, so maybe a global default priority knob would help here?
>
> yeah, not exactly what we are looking for. It gives us the ability to
> restrict the priority of all unbound workqueues at the expense of not being
> able to prioritize one workqueue over another workqueue.
>

Same here. Android used to have its cgroup setup like this, where the BG group
can potentially be starved for too long (and sometimes priority inheritance is
not possible, of course; that is the reason why I removed the BG cgroup in
Android: https://android-review.googlesource.com/q/topic:remove_bg_cgroup).

-Task-A
-Task-B
-kworker/0:0
- background
    Task-X
    Task-Y

So having things left in the root will pose the same risk.

> Thanks,
> Pavan
> --
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
> Foundation Collaborative Project.
[PATCH v5 1/5] dt-bindings: aspeed-lpc: Remove LPC partitioning
The LPC controller has no concept of the BMC and the Host partitions. This patch fixes the documentation by removing the description on LPC partitions. The register offsets illustrated in the DTS node examples are also fixed to adapt to the LPC DTS change. Signed-off-by: Chia-Wei, Wang --- .../devicetree/bindings/mfd/aspeed-lpc.txt| 100 +- 1 file changed, 25 insertions(+), 75 deletions(-) diff --git a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt index d0a38ba8b9ce..936aa108eab4 100644 --- a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt +++ b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt @@ -9,13 +9,7 @@ primary use case of the Aspeed LPC controller is as a slave on the bus conditions it can also take the role of bus master. The LPC controller is represented as a multi-function device to account for the -mix of functionality it provides. The principle split is between the register -layout at the start of the I/O space which is, to quote the Aspeed datasheet, -"basically compatible with the [LPC registers from the] popular BMC controller -H8S/2168[1]", and everything else, where everything else is an eclectic -collection of functions with a esoteric register layout. "Everything else", -here labeled the "host" portion of the controller, includes, but is not limited -to: +mix of functionality, which includes, but is not limited to: * An IPMI Block Transfer[2] Controller @@ -44,80 +38,36 @@ Required properties === - compatible: One of: - "aspeed,ast2400-lpc", "simple-mfd" - "aspeed,ast2500-lpc", "simple-mfd" - "aspeed,ast2600-lpc", "simple-mfd" + "aspeed,ast2400-lpc-v2", "simple-mfd", "syscon" + "aspeed,ast2500-lpc-v2", "simple-mfd", "syscon" + "aspeed,ast2600-lpc-v2", "simple-mfd", "syscon" - reg: contains the physical address and length values of the Aspeed LPC memory region. 
- #address-cells: <1> - #size-cells: <1> -- ranges: Maps 0 to the physical address and length of the LPC memory -region - -Required LPC Child nodes - - -BMC Node - - -- compatible: One of: - "aspeed,ast2400-lpc-bmc" - "aspeed,ast2500-lpc-bmc" - "aspeed,ast2600-lpc-bmc" - -- reg: contains the physical address and length values of the -H8S/2168-compatible LPC controller memory region - -Host Node -- - -- compatible: One of: - "aspeed,ast2400-lpc-host", "simple-mfd", "syscon" - "aspeed,ast2500-lpc-host", "simple-mfd", "syscon" - "aspeed,ast2600-lpc-host", "simple-mfd", "syscon" - -- reg: contains the address and length values of the host-related -register space for the Aspeed LPC controller - -- #address-cells: <1> -- #size-cells: <1> -- ranges: Maps 0 to the address and length of the host-related LPC memory +- ranges: Maps 0 to the physical address and length of the LPC memory region Example: lpc: lpc@1e789000 { - compatible = "aspeed,ast2500-lpc", "simple-mfd"; + compatible = "aspeed,ast2500-lpc-v2", "simple-mfd", "syscon"; reg = <0x1e789000 0x1000>; #address-cells = <1>; #size-cells = <1>; ranges = <0x0 0x1e789000 0x1000>; - lpc_bmc: lpc-bmc@0 { - compatible = "aspeed,ast2500-lpc-bmc"; + lpc_snoop: lpc-snoop@0 { + compatible = "aspeed,ast2600-lpc-snoop"; reg = <0x0 0x80>; - }; - - lpc_host: lpc-host@80 { - compatible = "aspeed,ast2500-lpc-host", "simple-mfd", "syscon"; - reg = <0x80 0x1e0>; - reg-io-width = <4>; - - #address-cells = <1>; - #size-cells = <1>; - ranges = <0x0 0x80 0x1e0>; + interrupts = ; + snoop-ports = <0x80>; }; }; -BMC Node Children -== - - -Host Node Children -== LPC Host Interface Controller --- @@ -149,14 +99,12 @@ Optional properties: Example: -lpc-host@80 { - lpc_ctrl: lpc-ctrl@0 { - compatible = "aspeed,ast2500-lpc-ctrl"; - reg = <0x0 0x80>; - clocks = <&syscon ASPEED_CLK_GATE_LCLK>; -
[PATCH v5 4/5] pinctrl: aspeed-g5: Adapt to new LPC device tree layout
Add check against LPC device v2 compatible string to ensure that the fixed device tree layout is adopted. The LPC register offsets are also fixed accordingly. Signed-off-by: Chia-Wei, Wang --- drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c index 0cab4c2576e2..996ebcba4d38 100644 --- a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c +++ b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c @@ -60,7 +60,7 @@ #define COND2 { ASPEED_IP_SCU, SCU94, GENMASK(1, 0), 0, 0 } /* LHCR0 is offset from the end of the H8S/2168-compatible registers */ -#define LHCR0 0x20 +#define LHCR0 0xa0 #define GFX064 0x64 #define B14 0 @@ -2648,14 +2648,19 @@ static struct regmap *aspeed_g5_acquire_regmap(struct aspeed_pinmux_data *ctx, } if (ip == ASPEED_IP_LPC) { - struct device_node *node; + struct device_node *np; struct regmap *map; - node = of_parse_phandle(ctx->dev->of_node, + np = of_parse_phandle(ctx->dev->of_node, "aspeed,external-nodes", 1); - if (node) { - map = syscon_node_to_regmap(node->parent); - of_node_put(node); + if (np) { + if (!of_device_is_compatible(np->parent, "aspeed,ast2400-lpc-v2") && + !of_device_is_compatible(np->parent, "aspeed,ast2500-lpc-v2") && + !of_device_is_compatible(np->parent, "aspeed,ast2600-lpc-v2")) + return ERR_PTR(-ENODEV); + + map = syscon_node_to_regmap(np->parent); + of_node_put(np); if (IS_ERR(map)) return map; } else -- 2.17.1
[PATCH v5 5/5] soc: aspeed: Adapt to new LPC device tree layout
Add check against LPC device v2 compatible string to ensure that the fixed device tree layout is adopted. The LPC register offsets are also fixed accordingly. Signed-off-by: Chia-Wei, Wang --- drivers/soc/aspeed/aspeed-lpc-ctrl.c | 20 ++-- drivers/soc/aspeed/aspeed-lpc-snoop.c | 23 +++ 2 files changed, 29 insertions(+), 14 deletions(-) diff --git a/drivers/soc/aspeed/aspeed-lpc-ctrl.c b/drivers/soc/aspeed/aspeed-lpc-ctrl.c index 439bcd6b8c4a..c557ffd0992c 100644 --- a/drivers/soc/aspeed/aspeed-lpc-ctrl.c +++ b/drivers/soc/aspeed/aspeed-lpc-ctrl.c @@ -18,15 +18,15 @@ #define DEVICE_NAME"aspeed-lpc-ctrl" -#define HICR5 0x0 +#define HICR5 0x80 #define HICR5_ENL2HBIT(8) #define HICR5_ENFWHBIT(10) -#define HICR6 0x4 +#define HICR6 0x84 #define SW_FWH2AHB BIT(17) -#define HICR7 0x8 -#define HICR8 0xc +#define HICR7 0x88 +#define HICR8 0x8c struct aspeed_lpc_ctrl { struct miscdevice miscdev; @@ -215,6 +215,7 @@ static int aspeed_lpc_ctrl_probe(struct platform_device *pdev) struct device_node *node; struct resource resm; struct device *dev; + struct device_node *np; int rc; dev = &pdev->dev; @@ -270,8 +271,15 @@ static int aspeed_lpc_ctrl_probe(struct platform_device *pdev) } } - lpc_ctrl->regmap = syscon_node_to_regmap( - pdev->dev.parent->of_node); + np = pdev->dev.parent->of_node; + if (!of_device_is_compatible(np, "aspeed,ast2400-lpc-v2") && + !of_device_is_compatible(np, "aspeed,ast2500-lpc-v2") && + !of_device_is_compatible(np, "aspeed,ast2600-lpc-v2")) { + dev_err(dev, "unsupported LPC device binding\n"); + return -ENODEV; + } + + lpc_ctrl->regmap = syscon_node_to_regmap(np); if (IS_ERR(lpc_ctrl->regmap)) { dev_err(dev, "Couldn't get regmap\n"); return -ENODEV; diff --git a/drivers/soc/aspeed/aspeed-lpc-snoop.c b/drivers/soc/aspeed/aspeed-lpc-snoop.c index 682ba0eb4eba..ab0f0a54fea6 100644 --- a/drivers/soc/aspeed/aspeed-lpc-snoop.c +++ b/drivers/soc/aspeed/aspeed-lpc-snoop.c @@ -28,26 +28,25 @@ #define NUM_SNOOP_CHANNELS 2 #define SNOOP_FIFO_SIZE 2048 -#define 
HICR5 0x0 +#define HICR5 0x80 #define HICR5_EN_SNP0W BIT(0) #define HICR5_ENINT_SNP0W BIT(1) #define HICR5_EN_SNP1W BIT(2) #define HICR5_ENINT_SNP1W BIT(3) - -#define HICR6 0x4 +#define HICR6 0x84 #define HICR6_STR_SNP0WBIT(0) #define HICR6_STR_SNP1WBIT(1) -#define SNPWADR0x10 +#define SNPWADR0x90 #define SNPWADR_CH0_MASK GENMASK(15, 0) #define SNPWADR_CH0_SHIFT 0 #define SNPWADR_CH1_MASK GENMASK(31, 16) #define SNPWADR_CH1_SHIFT 16 -#define SNPWDR 0x14 +#define SNPWDR 0x94 #define SNPWDR_CH0_MASKGENMASK(7, 0) #define SNPWDR_CH0_SHIFT 0 #define SNPWDR_CH1_MASKGENMASK(15, 8) #define SNPWDR_CH1_SHIFT 8 -#define HICRB 0x80 +#define HICRB 0x100 #define HICRB_ENSNP0D BIT(14) #define HICRB_ENSNP1D BIT(15) @@ -258,6 +257,7 @@ static int aspeed_lpc_snoop_probe(struct platform_device *pdev) { struct aspeed_lpc_snoop *lpc_snoop; struct device *dev; + struct device_node *np; u32 port; int rc; @@ -267,8 +267,15 @@ static int aspeed_lpc_snoop_probe(struct platform_device *pdev) if (!lpc_snoop) return -ENOMEM; - lpc_snoop->regmap = syscon_node_to_regmap( - pdev->dev.parent->of_node); + np = pdev->dev.parent->of_node; + if (!of_device_is_compatible(np, "aspeed,ast2400-lpc-v2") && + !of_device_is_compatible(np, "aspeed,ast2500-lpc-v2") && + !of_device_is_compatible(np, "aspeed,ast2600-lpc-v2")) { + dev_err(dev, "unsupported LPC device binding\n"); + return -ENODEV; + } + + lpc_snoop->regmap = syscon_node_to_regmap(np); if (IS_ERR(lpc_snoop->regmap)) { dev_err(dev, "Couldn't get regmap\n"); return -ENODEV; -- 2.17.1
[PATCH v5 3/5] ipmi: kcs: aspeed: Adapt to new LPC DTS layout
Add check against LPC device v2 compatible string to ensure that the fixed device tree layout is adopted. The LPC register offsets are also fixed accordingly. Signed-off-by: Chia-Wei, Wang Acked-by: Haiyue Wang --- drivers/char/ipmi/kcs_bmc_aspeed.c | 27 --- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c b/drivers/char/ipmi/kcs_bmc_aspeed.c index a140203c079b..eefe362f65f0 100644 --- a/drivers/char/ipmi/kcs_bmc_aspeed.c +++ b/drivers/char/ipmi/kcs_bmc_aspeed.c @@ -27,7 +27,6 @@ #define KCS_CHANNEL_MAX 4 -/* mapped to lpc-bmc@0 IO space */ #define LPC_HICR00x000 #define LPC_HICR0_LPC3E BIT(7) #define LPC_HICR0_LPC2E BIT(6) @@ -52,15 +51,13 @@ #define LPC_STR1 0x03C #define LPC_STR2 0x040 #define LPC_STR3 0x044 - -/* mapped to lpc-host@80 IO space */ -#define LPC_HICRB0x080 +#define LPC_HICRB0x100 #define LPC_HICRB_IBFIF4 BIT(1) #define LPC_HICRB_LPC4E BIT(0) -#define LPC_LADR40x090 -#define LPC_IDR4 0x094 -#define LPC_ODR4 0x098 -#define LPC_STR4 0x09C +#define LPC_LADR40x110 +#define LPC_IDR4 0x114 +#define LPC_ODR4 0x118 +#define LPC_STR4 0x11C struct aspeed_kcs_bmc { struct regmap *map; @@ -348,12 +345,20 @@ static int aspeed_kcs_probe(struct platform_device *pdev) struct device_node *np; int rc; - np = pdev->dev.of_node; + np = dev->of_node->parent; + if (!of_device_is_compatible(np, "aspeed,ast2400-lpc-v2") && + !of_device_is_compatible(np, "aspeed,ast2500-lpc-v2") && + !of_device_is_compatible(np, "aspeed,ast2600-lpc-v2")) { + dev_err(dev, "unsupported LPC device binding\n"); + return -ENODEV; + } + + np = dev->of_node; if (of_device_is_compatible(np, "aspeed,ast2400-kcs-bmc") || - of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc")) + of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc")) kcs_bmc = aspeed_kcs_probe_of_v1(pdev); else if (of_device_is_compatible(np, "aspeed,ast2400-kcs-bmc-v2") || - of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc-v2")) +of_device_is_compatible(np, 
"aspeed,ast2500-kcs-bmc-v2")) kcs_bmc = aspeed_kcs_probe_of_v2(pdev); else return -EINVAL; -- 2.17.1
[PATCH v5 0/5] Remove LPC register partitioning
The LPC controller has no concept of the BMC and the Host partitions. The
incorrect partitioning can impose unnecessary range restrictions on register
access through the syscon regmap interface. For instance, HICRB contains the
I/O port address configuration of KCS channel 1/2, yet the KCS#1/#2 drivers
cannot access HICRB as it is located in the other LPC partition. In addition,
to remain backward compatible, newly added HW control bits could be located at
any reserved bits over the LPC addressing space.

This patch series therefore removes the LPC partitioning for better driver
development and maintenance. This requires changes to both the device tree and
the driver implementation. To ensure both sides are updated in sync, a v2
binding check is added.

Changes since v4:
- Add child node example in dt-bindings.

Changes since v3:
- Revise binding check as suggested by Haiyue Wang.

Changes since v2:
- Add v2 binding check to ensure the synchronization between the device tree
  change and the driver register offset fix.

Changes since v1:
- Add the fix to the aspeed-lpc binding documentation.

Chia-Wei, Wang (5):
  dt-bindings: aspeed-lpc: Remove LPC partitioning
  ARM: dts: Remove LPC BMC and Host partitions
  ipmi: kcs: aspeed: Adapt to new LPC DTS layout
  pinctrl: aspeed-g5: Adapt to new LPC device tree layout
  soc: aspeed: Adapt to new LPC device tree layout

 .../devicetree/bindings/mfd/aspeed-lpc.txt | 100 -
 arch/arm/boot/dts/aspeed-g4.dtsi           |  74 --
 arch/arm/boot/dts/aspeed-g5.dtsi           | 135 --
 arch/arm/boot/dts/aspeed-g6.dtsi           | 135 --
 drivers/char/ipmi/kcs_bmc_aspeed.c         |  27 ++--
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c |  17 ++-
 drivers/soc/aspeed/aspeed-lpc-ctrl.c       |  20 ++-
 drivers/soc/aspeed/aspeed-lpc-snoop.c      |  23 +--
 8 files changed, 229 insertions(+), 302 deletions(-)

-- 
2.17.1
[PATCH v5 2/5] ARM: dts: Remove LPC BMC and Host partitions
The LPC controller has no concept of the BMC and the Host partitions. A
concrete instance is that HICRB[5:4] holds the I/O port address configuration
of KCS channel 1/2. However, the KCS driver cannot access HICRB for channel
1/2 initialization via the syscon regmap interface due to the partition
boundary (i.e. offset 80h). In addition, for HW design backward compatibility,
a newly added HW control bit could be located at any reserved one over the LPC
addressing space.

Thereby, this patch removes the lpc-bmc and lpc-host child nodes and thus the
LPC partitioning. Note that this change requires synchronization between the
device tree change and the driver change. To prevent the misuse of old device
trees with new drivers, or vice versa, the v2 compatible strings are adopted
for the LPC device as listed:

  "aspeed,ast2400-lpc-v2"
  "aspeed,ast2500-lpc-v2"
  "aspeed,ast2600-lpc-v2"

Signed-off-by: Chia-Wei, Wang
---
 arch/arm/boot/dts/aspeed-g4.dtsi |  74 +++--
 arch/arm/boot/dts/aspeed-g5.dtsi | 135 ++-
 arch/arm/boot/dts/aspeed-g6.dtsi | 135 ++-
 3 files changed, 148 insertions(+), 196 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi b/arch/arm/boot/dts/aspeed-g4.dtsi
index b3dafbc8caca..ee22bc036440 100644
--- a/arch/arm/boot/dts/aspeed-g4.dtsi
+++ b/arch/arm/boot/dts/aspeed-g4.dtsi
@@ -343,58 +343,44 @@
 	};

 	lpc: lpc@1e789000 {
-		compatible = "aspeed,ast2400-lpc", "simple-mfd";
+		compatible = "aspeed,ast2400-lpc-v2", "simple-mfd", "syscon";
 		reg = <0x1e789000 0x1000>;
+		reg-io-width = <4>;

 		#address-cells = <1>;
 		#size-cells = <1>;
 		ranges = <0x0 0x1e789000 0x1000>;

-		lpc_bmc: lpc-bmc@0 {
-			compatible = "aspeed,ast2400-lpc-bmc";
-			reg = <0x0 0x80>;
+		lpc_ctrl: lpc-ctrl@80 {
+			compatible = "aspeed,ast2400-lpc-ctrl";
+			reg = <0x80 0x10>;
+			clocks = <&syscon ASPEED_CLK_GATE_LCLK>;
+			status = "disabled";
 		};

-		lpc_host: lpc-host@80 {
-			compatible = "aspeed,ast2400-lpc-host", "simple-mfd", "syscon";
-			reg = <0x80 0x1e0>;
-			reg-io-width = <4>;
-
-			#address-cells = <1>;
-			#size-cells = <1>;
-			ranges = <0x0 0x80 0x1e0>;
-
-			lpc_ctrl: lpc-ctrl@0 {
-				compatible = "aspeed,ast2400-lpc-ctrl";
-				reg = <0x0 0x10>;
-				clocks = <&syscon ASPEED_CLK_GATE_LCLK>;
-				status = "disabled";
-			};
-
-			lpc_snoop: lpc-snoop@10 {
-				compatible = "aspeed,ast2400-lpc-snoop";
-				reg = <0x10 0x8>;
-				interrupts = <8>;
-				status = "disabled";
-			};
-
-			lhc: lhc@20 {
-				compatible = "aspeed,ast2400-lhc";
-				reg = <0x20 0x24 0x48 0x8>;
-			};
-
-			lpc_reset: reset-controller@18 {
-				compatible = "aspeed,ast2400-lpc-reset";
-				reg = <0x18 0x4>;
-				#reset-cells = <1>;
-			};
-
-			ibt: ibt@c0 {
-				compatible = "aspeed,ast2400-ibt-bmc";
-				reg = <0xc0 0x18>;
-
[PATCH 3/6] clk: ast2600: Add eSPI reset bit
Add bit field definition for the eSPI reset control. Signed-off-by: Chia-Wei, Wang --- include/dt-bindings/clock/ast2600-clock.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/dt-bindings/clock/ast2600-clock.h b/include/dt-bindings/clock/ast2600-clock.h index 62b9520a00fd..964934b1caef 100644 --- a/include/dt-bindings/clock/ast2600-clock.h +++ b/include/dt-bindings/clock/ast2600-clock.h @@ -89,6 +89,7 @@ #define ASPEED_CLK_MAC4RCLK70 /* Only list resets here that are not part of a gate */ +#define ASPEED_RESET_ESPI 57 #define ASPEED_RESET_ADC 55 #define ASPEED_RESET_JTAG_MASTER2 54 #define ASPEED_RESET_I3C_DMA 39 -- 2.17.1
[PATCH 0/6] arm: aspeed: Add eSPI support
This patch series adds driver support for the eSPI controller of Aspeed 6th
generation SoCs. The controller is a slave device communicating with a master
over the Enhanced Serial Peripheral Interface (eSPI). It supports all four
eSPI channels, namely peripheral, virtual wire, out-of-band, and flash, and
operates at a maximum frequency of 66 MHz.

Chia-Wei, Wang (6):
  dt-bindings: aspeed: Add eSPI controller
  MAINTAINER: Add ASPEED eSPI driver entry
  clk: ast2600: Add eSPI reset bit
  irqchip/aspeed: Add Aspeed eSPI interrupt controller
  soc: aspeed: Add eSPI driver
  ARM: dts: aspeed: Add AST2600 eSPI nodes

 .../devicetree/bindings/soc/aspeed/espi.yaml | 252 +++
 MAINTAINERS | 14 +
 arch/arm/boot/dts/aspeed-g6.dtsi | 57 ++
 drivers/irqchip/Makefile | 2 +-
 drivers/irqchip/irq-aspeed-espi-ic.c | 251 +++
 drivers/soc/aspeed/Kconfig | 49 ++
 drivers/soc/aspeed/Makefile | 5 +
 drivers/soc/aspeed/aspeed-espi-ctrl.c | 197 +
 drivers/soc/aspeed/aspeed-espi-flash.c | 490 
 drivers/soc/aspeed/aspeed-espi-oob.c | 706 ++
 drivers/soc/aspeed/aspeed-espi-peripheral.c | 613 +++
 drivers/soc/aspeed/aspeed-espi-vw.c | 211 ++
 include/dt-bindings/clock/ast2600-clock.h | 1 +
 .../interrupt-controller/aspeed-espi-ic.h | 15 +
 include/soc/aspeed/espi.h | 279 +++
 include/uapi/linux/aspeed-espi.h | 160 
 16 files changed, 3301 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/devicetree/bindings/soc/aspeed/espi.yaml
 create mode 100644 drivers/irqchip/irq-aspeed-espi-ic.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-ctrl.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-flash.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-oob.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-peripheral.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-vw.c
 create mode 100644 include/dt-bindings/interrupt-controller/aspeed-espi-ic.h
 create mode 100644 include/soc/aspeed/espi.h
 create mode 100644 include/uapi/linux/aspeed-espi.h

-- 
2.17.1
[PATCH 6/6] ARM: dts: aspeed: Add AST2600 eSPI nodes
Add eSPI nodes for the device tree of Aspeed 6th generation SoCs. Signed-off-by: Chia-Wei, Wang --- arch/arm/boot/dts/aspeed-g6.dtsi | 57 1 file changed, 57 insertions(+) diff --git a/arch/arm/boot/dts/aspeed-g6.dtsi b/arch/arm/boot/dts/aspeed-g6.dtsi index 810b0676ab03..d457baf11e37 100644 --- a/arch/arm/boot/dts/aspeed-g6.dtsi +++ b/arch/arm/boot/dts/aspeed-g6.dtsi @@ -3,7 +3,9 @@ #include #include +#include #include +#include / { model = "Aspeed BMC"; @@ -75,6 +77,61 @@ interrupts = ; }; + espi: espi@1e6ee000 { + compatible = "aspeed,ast2600-espi", "simple-mfd", "syscon"; + reg = <0x1e6ee000 0x1000>; + + #address-cells = <1>; + #size-cells = <1>; + ranges = <0x0 0x1e6ee000 0x1000>; + + espi_ic: espi-ic { + #interrupt-cells = <1>; + compatible = "aspeed,ast2600-espi-ic"; + interrupts-extended = <&gic GIC_SPI 42 IRQ_TYPE_LEVEL_HIGH>, + <&gpio0 ASPEED_GPIO(W, 7) IRQ_TYPE_EDGE_FALLING>; + interrupt-controller; + status = "disabled"; + }; + + espi_ctrl: espi-ctrl { + compatible = "aspeed,ast2600-espi-ctrl"; + interrupts-extended = <&espi_ic ASPEED_ESPI_IC_CTRL_EVENT>, + <&espi_ic ASPEED_ESPI_IC_CTRL_RESET>; + clocks = <&syscon ASPEED_CLK_GATE_ESPICLK>; + resets = <&syscon ASPEED_RESET_ESPI>; + status = "disabled"; + }; + + espi_peripheral: espi-peripheral-channel { + compatible = "aspeed,ast2600-espi-peripheral"; + interrupts-extended = <&espi_ic ASPEED_ESPI_IC_PERIF_EVENT>, + <&espi_ic ASPEED_ESPI_IC_CHAN_RESET>; + status = "disabled"; + }; + + espi_vw: espi-vw-channel { + compatible = "aspeed,ast2600-espi-vw"; + interrupts-extended = <&espi_ic ASPEED_ESPI_IC_VW_EVENT>, + <&espi_ic ASPEED_ESPI_IC_CHAN_RESET>; + status = "disabled"; + }; + + espi_oob: espi-oob-channel { + compatible = "aspeed,ast2600-espi-oob"; + interrupts-extended = <&espi_ic ASPEED_ESPI_IC_OOB_EVENT>, + <&espi_ic ASPEED_ESPI_IC_CHAN_RESET>; + status = "disabled"; + }; + + espi_flash: espi-flash-channel { + compatible = "aspeed,ast2600-espi-flash"; + interrupts-extended = <&espi_ic 
ASPEED_ESPI_IC_FLASH_EVENT>, + <&espi_ic ASPEED_ESPI_IC_CHAN_RESET>; + status = "disabled"; + }; + }; + ahb { compatible = "simple-bus"; #address-cells = <1>; -- 2.17.1
[PATCH 5/6] soc: aspeed: Add eSPI driver
The Aspeed eSPI controller is a slave device that communicates with the master
through the Enhanced Serial Peripheral Interface (eSPI). All of the four eSPI
channels, namely peripheral, virtual wire, out-of-band, and flash, are
supported.

Signed-off-by: Chia-Wei, Wang
---
 drivers/soc/aspeed/Kconfig | 49 ++
 drivers/soc/aspeed/Makefile | 5 +
 drivers/soc/aspeed/aspeed-espi-ctrl.c | 197 ++
 drivers/soc/aspeed/aspeed-espi-flash.c | 490 ++
 drivers/soc/aspeed/aspeed-espi-oob.c | 706 
 drivers/soc/aspeed/aspeed-espi-peripheral.c | 613 +
 drivers/soc/aspeed/aspeed-espi-vw.c | 211 ++
 include/uapi/linux/aspeed-espi.h | 160 +
 8 files changed, 2431 insertions(+)
 create mode 100644 drivers/soc/aspeed/aspeed-espi-ctrl.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-flash.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-oob.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-peripheral.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-vw.c
 create mode 100644 include/uapi/linux/aspeed-espi.h

diff --git a/drivers/soc/aspeed/Kconfig b/drivers/soc/aspeed/Kconfig
index 243ca196e6ad..e4408e97023d 100644
--- a/drivers/soc/aspeed/Kconfig
+++ b/drivers/soc/aspeed/Kconfig
@@ -42,6 +42,55 @@ config ASPEED_SOCINFO
 	help
 	  Say yes to support decoding of ASPEED BMC information.
 
+config ASPEED_ESPI
+	tristate "Aspeed eSPI Engine Driver"
+	select REGMAP
+	select MFD_SYSCON
+	select ASPEED_ESPI_PERIPHERAL
+	select ASPEED_ESPI_VW
+	select ASPEED_ESPI_OOB
+	select ASPEED_ESPI_FLASH
+	default n
+	help
+	  Enable support for the Aspeed eSPI engine. The eSPI engine
+	  acts as a slave device in the BMC to communicate with the
+	  host-side master over the eSPI bus interface.
+
+	  eSPI and LPC are mutually exclusive features on Aspeed SoCs.
+	  If unsure, say N.
+
+config ASPEED_ESPI_PERIPHERAL
+	tristate "Aspeed eSPI peripheral channel driver"
+	depends on ASPEED_ESPI
+	help
+	  Control Aspeed eSPI peripheral channel driver. The driver
+	  also provides an eSPI packet put/get interface to communicate
+	  with the eSPI host.
+
+config ASPEED_ESPI_VW
+	tristate "Aspeed eSPI virtual wire channel driver"
+	depends on ASPEED_ESPI
+	help
+	  Control Aspeed eSPI virtual wire channel driver. The driver
+	  also provides an eSPI packet put/get interface to communicate
+	  with the eSPI host.
+
+config ASPEED_ESPI_OOB
+	tristate "Aspeed eSPI out-of-band channel driver"
+	depends on ASPEED_ESPI
+	help
+	  Control Aspeed eSPI out-of-band channel driver. The driver
+	  also provides an eSPI packet put/get interface to communicate
+	  with the eSPI host.
+
+config ASPEED_ESPI_FLASH
+	tristate "Aspeed eSPI flash channel driver"
+	depends on ASPEED_ESPI
+	help
+	  Control Aspeed eSPI flash channel driver. The driver
+	  also provides an eSPI packet put/get interface to communicate
+	  with the eSPI host.
+
 endmenu
 
 endif
diff --git a/drivers/soc/aspeed/Makefile b/drivers/soc/aspeed/Makefile
index fcab7192e1a4..ac41ce82bb78 100644
--- a/drivers/soc/aspeed/Makefile
+++ b/drivers/soc/aspeed/Makefile
@@ -3,3 +3,8 @@ obj-$(CONFIG_ASPEED_LPC_CTRL) += aspeed-lpc-ctrl.o
 obj-$(CONFIG_ASPEED_LPC_SNOOP) += aspeed-lpc-snoop.o
 obj-$(CONFIG_ASPEED_P2A_CTRL) += aspeed-p2a-ctrl.o
 obj-$(CONFIG_ASPEED_SOCINFO) += aspeed-socinfo.o
+obj-$(CONFIG_ASPEED_ESPI) += aspeed-espi-ctrl.o \
+			     aspeed-espi-peripheral.o \
+			     aspeed-espi-vw.o \
+			     aspeed-espi-oob.o \
+			     aspeed-espi-flash.o
diff --git a/drivers/soc/aspeed/aspeed-espi-ctrl.c b/drivers/soc/aspeed/aspeed-espi-ctrl.c
new file mode 100644
index ..e4329f5f8ed3
--- /dev/null
+++ b/drivers/soc/aspeed/aspeed-espi-ctrl.c
@@ -0,0 +1,197 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2020 Aspeed Technology Inc.
+ */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#define DEVICE_NAME "aspeed-espi-ctrl" + +struct aspeed_espi_ctrl { + struct regmap *map; + struct clk *clk; + struct reset_control *rst; + + int irq; + int irq_reset; +}; + +static irqreturn_t aspeed_espi_ctrl_isr(int irq, void *arg) +{ + uint32_t sts; + struct aspeed_espi_ctrl *espi_ctrl = (struct aspeed_espi_ctrl *)arg; + + regmap_read(espi_ctrl->map, ESPI_INT_STS, &sts); + + if (!(sts & ESPI_INT_STS_HW_RST_DEASSERT)) + return IRQ_NONE; + + regmap_update_bits(espi_ctrl
[PATCH 1/6] dt-bindings: aspeed: Add eSPI controller
Add dt-bindings and the inclusion header for the Aspeed eSPI controller.

Signed-off-by: Chia-Wei, Wang
---
 .../devicetree/bindings/soc/aspeed/espi.yaml | 252 ++
 .../interrupt-controller/aspeed-espi-ic.h | 15 ++
 2 files changed, 267 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/soc/aspeed/espi.yaml
 create mode 100644 include/dt-bindings/interrupt-controller/aspeed-espi-ic.h

diff --git a/Documentation/devicetree/bindings/soc/aspeed/espi.yaml b/Documentation/devicetree/bindings/soc/aspeed/espi.yaml
new file mode 100644
index ..ee92b66fe15b
--- /dev/null
+++ b/Documentation/devicetree/bindings/soc/aspeed/espi.yaml
@@ -0,0 +1,252 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+# Copyright (c) 2020 Aspeed Technology Inc.
+%YAML 1.2
+---
+$id: "http://devicetree.org/schemas/soc/aspeed/espi.yaml#"
+$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+
+title: Aspeed eSPI Controller
+
+maintainers:
+  - Chia-Wei Wang
+  - Ryan Chen
+
+description:
+  Aspeed eSPI controller implements a slave side eSPI endpoint device
+  supporting the four eSPI channels, namely peripheral, virtual wire,
+  out-of-band, and flash.
+ +properties: + compatible: +items: + - enum: + - aspeed,ast2600-espi + - const: simple-mfd + - const: syscon + + reg: +maxItems: 1 + + "#address-cells": +const: 1 + + "#size-cells": +const: 1 + + ranges: true + + espi-ic: +type: object + +properties: + compatible: +const: aspeed,ast2600-espi-ic + + interrupts: +maxItems: 2 + + interrupt-controller: true + + "#interrupt-cells": +const: 1 +description: + The cell is a SW-encoded number for IRQ dispatching to eSPI channels + +required: + - compatible + - interrupts + - interrupt-controller + - "#interrupt-cells" + + espi-ctrl: +type: object + +properties: + compatible: +const: aspeed,ast2600-espi-ctrl + + interrupts: +maxItems: 2 + + clocks: +maxItems: 1 + + resets: +maxItems: 1 + +required: + - compatible + - interrupts + - clocks + - resets + + espi-peripheral-channel: +type: object + +properties: + compatible: +const: aspeed,ast2600-espi-peripheral + + interrupts: +maxItems: 2 + + dma-mode: +type: boolean + + memcyc,map-src-addr: +$ref: "/schemas/types.yaml#/definitions/uint32" +description: The host side address to be decoded into the memory cycle over eSPI peripheral channel + + memcyc,map-size: +$ref: "/schemas/types.yaml#/definitions/uint32" +description: The size of the memory region allocated for the memory cycle over eSPI peripheral channel +minimum: 65536 + +required: + - compatible + - interrupts + - memcyc,map-src-addr + - memcyc,map-size + + espi-vw-channel: +type: object + +properties: + compatible: +const: aspeed,ast2600-espi-vw + + interrupts: +maxItems: 2 + +required: + - compatible + - interrupts + + espi-oob-channel: +type: object + +properties: + compatible: +const: aspeed,ast2600-espi-oob + + interrupts: +maxItems: 2 + + dma-mode: +type: boolean + + dma-tx-desc-num: +$ref: "/schemas/types.yaml#/definitions/uint32" +minimum: 2 +maximum: 1023 + + dma-rx-desc-num: +$ref: "/schemas/types.yaml#/definitions/uint32" +minimum: 2 +maximum: 1023 + +required: + - compatible + - interrupts + + 
espi-flash-channel: +type: object + +properties: + compatible: +const: aspeed,ast2600-espi-flash + + interrupts: +maxItems: 2 + + dma-mode: +type: boolean + + safs-mode: +description: Slave-Attached-Sharing-Flash mode, 0->Mix, 1->SW, 2->HW +enum: [ 0, 1, 2 ] + +required: + - compatible + - interrupts + +required: + - compatible + - reg + - "#address-cells" + - "#size-cells" + - ranges + - espi-ic + - espi-ctrl + - espi-peripheral-channel + - espi-vw-channel + - espi-oob-channel + - espi-flash-channel + +additionalProperties: false + +examples: + - | +#include +#include +#include +#include + +espi: espi@1e6ee000 { +compatible = "aspeed,ast2600-espi", "simple-mfd", "syscon"; +reg = <0x1e6ee000 0x1000>; + +#address-cells = <1>; +#size-cells = <1>; +ranges = <0x0 0x1e6ee000 0x1000>; + +espi_ic: espi-ic { +#interrupt-cells = <1>; +compatible = "aspeed,ast2600-espi-ic&q
[PATCH 4/6] irqchip/aspeed: Add Aspeed eSPI interrupt controller
The eSPI interrupt controller acts as a SW IRQ number decoder to correctly control/dispatch interrupts of the eSPI peripheral, virtual wire, out-of-band, and flash channels. Signed-off-by: Chia-Wei, Wang --- drivers/irqchip/Makefile | 2 +- drivers/irqchip/irq-aspeed-espi-ic.c | 251 include/soc/aspeed/espi.h| 279 +++ 3 files changed, 531 insertions(+), 1 deletion(-) create mode 100644 drivers/irqchip/irq-aspeed-espi-ic.c create mode 100644 include/soc/aspeed/espi.h diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile index 0ac93bfaec61..56da4a3123f8 100644 --- a/drivers/irqchip/Makefile +++ b/drivers/irqchip/Makefile @@ -86,7 +86,7 @@ obj-$(CONFIG_MVEBU_PIC) += irq-mvebu-pic.o obj-$(CONFIG_MVEBU_SEI)+= irq-mvebu-sei.o obj-$(CONFIG_LS_EXTIRQ)+= irq-ls-extirq.o obj-$(CONFIG_LS_SCFG_MSI) += irq-ls-scfg-msi.o -obj-$(CONFIG_ARCH_ASPEED) += irq-aspeed-vic.o irq-aspeed-i2c-ic.o irq-aspeed-scu-ic.o +obj-$(CONFIG_ARCH_ASPEED) += irq-aspeed-vic.o irq-aspeed-i2c-ic.o irq-aspeed-scu-ic.o irq-aspeed-espi-ic.o obj-$(CONFIG_STM32_EXTI) += irq-stm32-exti.o obj-$(CONFIG_QCOM_IRQ_COMBINER)+= qcom-irq-combiner.o obj-$(CONFIG_IRQ_UNIPHIER_AIDET) += irq-uniphier-aidet.o diff --git a/drivers/irqchip/irq-aspeed-espi-ic.c b/drivers/irqchip/irq-aspeed-espi-ic.c new file mode 100644 index ..8a5cc8fe3f0c --- /dev/null +++ b/drivers/irqchip/irq-aspeed-espi-ic.c @@ -0,0 +1,251 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2020 Aspeed Technology Inc. 
+ */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#define DEVICE_NAME"aspeed-espi-ic" +#define IRQCHIP_NAME "eSPI-IC" + +#define ESPI_IC_IRQ_NUM7 + +struct aspeed_espi_ic { + struct regmap *map; + int irq; + int gpio_irq; + struct irq_domain *irq_domain; +}; + +static void aspeed_espi_ic_gpio_isr(struct irq_desc *desc) +{ + unsigned int irq; + struct aspeed_espi_ic *espi_ic = irq_desc_get_handler_data(desc); + struct irq_chip *chip = irq_desc_get_chip(desc); + + chained_irq_enter(chip, desc); + + irq = irq_find_mapping(espi_ic->irq_domain, + ASPEED_ESPI_IC_CTRL_RESET); + generic_handle_irq(irq); + + irq = irq_find_mapping(espi_ic->irq_domain, + ASPEED_ESPI_IC_CHAN_RESET); + generic_handle_irq(irq); + + chained_irq_exit(chip, desc); +} + +static void aspeed_espi_ic_isr(struct irq_desc *desc) +{ + unsigned int sts; + unsigned int irq; + struct aspeed_espi_ic *espi_ic = irq_desc_get_handler_data(desc); + struct irq_chip *chip = irq_desc_get_chip(desc); + + chained_irq_enter(chip, desc); + + regmap_read(espi_ic->map, ESPI_INT_STS, &sts); + + if (sts & ESPI_INT_STS_PERIF_BITS) { + irq = irq_find_mapping(espi_ic->irq_domain, + ASPEED_ESPI_IC_PERIF_EVENT); + generic_handle_irq(irq); + } + + if (sts & ESPI_INT_STS_VW_BITS) { + irq = irq_find_mapping(espi_ic->irq_domain, + ASPEED_ESPI_IC_VW_EVENT); + generic_handle_irq(irq); + } + + if (sts & ESPI_INT_STS_OOB_BITS) { + irq = irq_find_mapping(espi_ic->irq_domain, + ASPEED_ESPI_IC_OOB_EVENT); + generic_handle_irq(irq); + } + + if (sts & ESPI_INT_STS_FLASH_BITS) { + irq = irq_find_mapping(espi_ic->irq_domain, + ASPEED_ESPI_IC_FLASH_EVENT); + generic_handle_irq(irq); + } + + if (sts & ESPI_INT_STS_HW_RST_DEASSERT) { + irq = irq_find_mapping(espi_ic->irq_domain, + ASPEED_ESPI_IC_CTRL_EVENT); + generic_handle_irq(irq); + } + + chained_irq_exit(chip, desc); +} + +static void aspeed_espi_ic_irq_disable(struct irq_data *data) +{ + struct 
aspeed_espi_ic *espi_ic = irq_data_get_irq_chip_data(data); + + switch (data->hwirq) { + case ASPEED_ESPI_IC_CTRL_EVENT: + regmap_update_bits(espi_ic->map, ESPI_INT_EN, + ESPI_INT_EN_HW_RST_DEASSERT, + 0); + break; + case ASPEED_ESPI_IC_PERIF_EVENT: + regmap_update_bits(espi_ic->map, ESPI_INT_EN, + ESPI_INT_EN_PERIF_BITS, 0); + break; + case ASPEED_ESPI_IC_VW_EVENT: + regmap_update_bits(espi_ic
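The fan-out performed by aspeed_espi_ic_isr() above — read one shared status register, then raise one SW-encoded hwirq per channel whose bits are set — can be modeled as a small table-driven lookup. The sketch below is self-contained userspace C with placeholder bit masks (the real ESPI_INT_STS layout lives in include/soc/aspeed/espi.h, and the hwirq names here only echo the ASPEED_ESPI_IC_* events):

```c
#include <stdint.h>

/* Hypothetical, condensed model of the ISR dispatch: each group of
 * status bits maps to one SW-encoded hwirq.  The masks are
 * illustrative placeholders, not the real ESPI_INT_STS layout. */
enum espi_ic_hwirq { IC_PERIF, IC_VW, IC_OOB, IC_FLASH, IC_CTRL, IC_NUM };

static const uint32_t group_mask[IC_NUM] = {
	[IC_PERIF] = 0x000000ffu,
	[IC_VW]    = 0x0000ff00u,
	[IC_OOB]   = 0x00ff0000u,
	[IC_FLASH] = 0x7f000000u,
	[IC_CTRL]  = 0x80000000u,
};

/* Return a bitmap of hwirqs that would be raised for a status word,
 * mirroring the chain of "if (sts & ...) generic_handle_irq(...)". */
static unsigned int espi_ic_dispatch(uint32_t sts)
{
	unsigned int fired = 0;

	for (int i = 0; i < IC_NUM; i++)
		if (sts & group_mask[i])
			fired |= 1u << i;
	return fired;
}
```

A status word with bits in several groups set simply raises several hwirqs, which is why the real ISR tests every group rather than returning after the first match.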
[PATCH 2/6] MAINTAINERS: Add ASPEED eSPI driver entry
Add myself and Ryan Chen as maintainers of the Aspeed eSPI driver and the associated eSPI interrupt controller. Joel Stanley is also added as a reviewer.

Signed-off-by: Chia-Wei, Wang
---
 MAINTAINERS | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 7c1e45c416b1..d5f9205a5439 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1639,6 +1639,20 @@
 F:	drivers/crypto/axis
 F:	drivers/mmc/host/usdhi6rol0.c
 F:	drivers/pinctrl/pinctrl-artpec*
 
+ARM/ASPEED ESPI DRIVER
+M:	Chia-Wei Wang
+M:	Ryan Chen
+R:	Joel Stanley
+L:	linux-asp...@lists.ozlabs.org (moderated for non-subscribers)
+L:	open...@lists.ozlabs.org (moderated for non-subscribers)
+S:	Maintained
+F:	Documentation/devicetree/bindings/soc/aspeed/espi.yaml
+F:	drivers/irqchip/irq-aspeed-espi-ic.c
+F:	drivers/soc/aspeed/aspeed-espi*
+F:	include/dt-bindings/interrupt-controller/aspeed-espi-ic.h
+F:	include/soc/aspeed/espi.h
+F:	include/uapi/linux/aspeed-espi.h
+
 ARM/ASPEED I2C DRIVER
 M:	Brendan Higgins
 R:	Benjamin Herrenschmidt
-- 
2.17.1
[PATCH v4 4/5] pinctrl: aspeed-g5: Adapt to new LPC device tree layout
Add a check against the LPC device v2 compatible strings to ensure that the fixed device tree layout is adopted. The LPC register offsets are also fixed accordingly.

Signed-off-by: Chia-Wei, Wang
---
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
index 0cab4c2576e2..996ebcba4d38 100644
--- a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
+++ b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
@@ -60,7 +60,7 @@
 #define COND2		{ ASPEED_IP_SCU, SCU94, GENMASK(1, 0), 0, 0 }
 
 /* LHCR0 is offset from the end of the H8S/2168-compatible registers */
-#define LHCR0		0x20
+#define LHCR0		0xa0
 #define GFX064		0x64
 
 #define B14 0
@@ -2648,14 +2648,19 @@ static struct regmap *aspeed_g5_acquire_regmap(struct aspeed_pinmux_data *ctx,
 	}
 
 	if (ip == ASPEED_IP_LPC) {
-		struct device_node *node;
+		struct device_node *np;
 		struct regmap *map;
 
-		node = of_parse_phandle(ctx->dev->of_node,
+		np = of_parse_phandle(ctx->dev->of_node,
 					"aspeed,external-nodes", 1);
-		if (node) {
-			map = syscon_node_to_regmap(node->parent);
-			of_node_put(node);
+		if (np) {
+			if (!of_device_is_compatible(np->parent, "aspeed,ast2400-lpc-v2") &&
+			    !of_device_is_compatible(np->parent, "aspeed,ast2500-lpc-v2") &&
+			    !of_device_is_compatible(np->parent, "aspeed,ast2600-lpc-v2"))
+				return ERR_PTR(-ENODEV);
+
+			map = syscon_node_to_regmap(np->parent);
+			of_node_put(np);
 			if (IS_ERR(map))
 				return map;
 		} else
-- 
2.17.1
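The guard repeated across this series — reject any LPC parent node that does not carry one of the v2 compatible strings — reduces to matching one string against a small allow-list. The sketch below is an illustrative userspace stand-in, not kernel code: a plain C string replaces the OF node so the logic of the chained of_device_is_compatible() calls can run anywhere.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace model of the v2 binding check: the kernel walks the node's
 * "compatible" property via of_device_is_compatible(); here a plain
 * string stands in for the node. */
static bool lpc_binding_is_v2(const char *compatible)
{
	static const char *const v2_compat[] = {
		"aspeed,ast2400-lpc-v2",
		"aspeed,ast2500-lpc-v2",
		"aspeed,ast2600-lpc-v2",
	};

	for (size_t i = 0; i < sizeof(v2_compat) / sizeof(v2_compat[0]); i++)
		if (strcmp(compatible, v2_compat[i]) == 0)
			return true;

	/* Old (partitioned) binding: the drivers bail out with -ENODEV. */
	return false;
}
```

Because old devicetrees still carry "aspeed,ast2500-lpc" (without "-v2"), a new driver probed against an old tree fails this check and returns -ENODEV instead of reading registers at the wrong offsets.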
[PATCH v4 1/5] dt-bindings: aspeed-lpc: Remove LPC partitioning
The LPC controller has no concept of the BMC and the Host partitions. This patch fixes the documentation by removing the description on LPC partitions. The register offsets illustrated in the DTS node examples are also fixed to adapt to the LPC DTS change. Signed-off-by: Chia-Wei, Wang --- .../devicetree/bindings/mfd/aspeed-lpc.txt| 99 --- 1 file changed, 21 insertions(+), 78 deletions(-) diff --git a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt index d0a38ba8b9ce..90eb0ecc95d1 100644 --- a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt +++ b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt @@ -9,13 +9,7 @@ primary use case of the Aspeed LPC controller is as a slave on the bus conditions it can also take the role of bus master. The LPC controller is represented as a multi-function device to account for the -mix of functionality it provides. The principle split is between the register -layout at the start of the I/O space which is, to quote the Aspeed datasheet, -"basically compatible with the [LPC registers from the] popular BMC controller -H8S/2168[1]", and everything else, where everything else is an eclectic -collection of functions with a esoteric register layout. "Everything else", -here labeled the "host" portion of the controller, includes, but is not limited -to: +mix of functionality, which includes, but is not limited to: * An IPMI Block Transfer[2] Controller @@ -44,80 +38,29 @@ Required properties === - compatible: One of: - "aspeed,ast2400-lpc", "simple-mfd" - "aspeed,ast2500-lpc", "simple-mfd" - "aspeed,ast2600-lpc", "simple-mfd" + "aspeed,ast2400-lpc-v2", "simple-mfd", "syscon" + "aspeed,ast2500-lpc-v2", "simple-mfd", "syscon" + "aspeed,ast2600-lpc-v2", "simple-mfd", "syscon" - reg: contains the physical address and length values of the Aspeed LPC memory region. 
- #address-cells: <1> - #size-cells: <1> -- ranges: Maps 0 to the physical address and length of the LPC memory -region - -Required LPC Child nodes - - -BMC Node - - -- compatible: One of: - "aspeed,ast2400-lpc-bmc" - "aspeed,ast2500-lpc-bmc" - "aspeed,ast2600-lpc-bmc" - -- reg: contains the physical address and length values of the -H8S/2168-compatible LPC controller memory region - -Host Node -- - -- compatible: One of: - "aspeed,ast2400-lpc-host", "simple-mfd", "syscon" - "aspeed,ast2500-lpc-host", "simple-mfd", "syscon" - "aspeed,ast2600-lpc-host", "simple-mfd", "syscon" - -- reg: contains the address and length values of the host-related -register space for the Aspeed LPC controller - -- #address-cells: <1> -- #size-cells: <1> -- ranges: Maps 0 to the address and length of the host-related LPC memory +- ranges: Maps 0 to the physical address and length of the LPC memory region Example: lpc: lpc@1e789000 { - compatible = "aspeed,ast2500-lpc", "simple-mfd"; + compatible = "aspeed,ast2500-lpc-v2", "simple-mfd", "syscon"; reg = <0x1e789000 0x1000>; #address-cells = <1>; #size-cells = <1>; ranges = <0x0 0x1e789000 0x1000>; - - lpc_bmc: lpc-bmc@0 { - compatible = "aspeed,ast2500-lpc-bmc"; - reg = <0x0 0x80>; - }; - - lpc_host: lpc-host@80 { - compatible = "aspeed,ast2500-lpc-host", "simple-mfd", "syscon"; - reg = <0x80 0x1e0>; - reg-io-width = <4>; - - #address-cells = <1>; - #size-cells = <1>; - ranges = <0x0 0x80 0x1e0>; - }; }; -BMC Node Children -== - - -Host Node Children -== LPC Host Interface Controller --- @@ -149,14 +92,12 @@ Optional properties: Example: -lpc-host@80 { - lpc_ctrl: lpc-ctrl@0 { - compatible = "aspeed,ast2500-lpc-ctrl"; - reg = <0x0 0x80>; - clocks = <&syscon ASPEED_CLK_GATE_LCLK>; - memory-region = <&flash_memory>; - flash = <&spi>; - }; +lpc_ctrl: lpc-ctrl@80 { + compatible = "aspeed,ast2500
[PATCH v4 5/5] soc: aspeed: Adapt to new LPC device tree layout
Add check against LPC device v2 compatible string to ensure that the fixed device tree layout is adopted. The LPC register offsets are also fixed accordingly. Signed-off-by: Chia-Wei, Wang --- drivers/soc/aspeed/aspeed-lpc-ctrl.c | 20 ++-- drivers/soc/aspeed/aspeed-lpc-snoop.c | 23 +++ 2 files changed, 29 insertions(+), 14 deletions(-) diff --git a/drivers/soc/aspeed/aspeed-lpc-ctrl.c b/drivers/soc/aspeed/aspeed-lpc-ctrl.c index 439bcd6b8c4a..c557ffd0992c 100644 --- a/drivers/soc/aspeed/aspeed-lpc-ctrl.c +++ b/drivers/soc/aspeed/aspeed-lpc-ctrl.c @@ -18,15 +18,15 @@ #define DEVICE_NAME"aspeed-lpc-ctrl" -#define HICR5 0x0 +#define HICR5 0x80 #define HICR5_ENL2HBIT(8) #define HICR5_ENFWHBIT(10) -#define HICR6 0x4 +#define HICR6 0x84 #define SW_FWH2AHB BIT(17) -#define HICR7 0x8 -#define HICR8 0xc +#define HICR7 0x88 +#define HICR8 0x8c struct aspeed_lpc_ctrl { struct miscdevice miscdev; @@ -215,6 +215,7 @@ static int aspeed_lpc_ctrl_probe(struct platform_device *pdev) struct device_node *node; struct resource resm; struct device *dev; + struct device_node *np; int rc; dev = &pdev->dev; @@ -270,8 +271,15 @@ static int aspeed_lpc_ctrl_probe(struct platform_device *pdev) } } - lpc_ctrl->regmap = syscon_node_to_regmap( - pdev->dev.parent->of_node); + np = pdev->dev.parent->of_node; + if (!of_device_is_compatible(np, "aspeed,ast2400-lpc-v2") && + !of_device_is_compatible(np, "aspeed,ast2500-lpc-v2") && + !of_device_is_compatible(np, "aspeed,ast2600-lpc-v2")) { + dev_err(dev, "unsupported LPC device binding\n"); + return -ENODEV; + } + + lpc_ctrl->regmap = syscon_node_to_regmap(np); if (IS_ERR(lpc_ctrl->regmap)) { dev_err(dev, "Couldn't get regmap\n"); return -ENODEV; diff --git a/drivers/soc/aspeed/aspeed-lpc-snoop.c b/drivers/soc/aspeed/aspeed-lpc-snoop.c index 682ba0eb4eba..ab0f0a54fea6 100644 --- a/drivers/soc/aspeed/aspeed-lpc-snoop.c +++ b/drivers/soc/aspeed/aspeed-lpc-snoop.c @@ -28,26 +28,25 @@ #define NUM_SNOOP_CHANNELS 2 #define SNOOP_FIFO_SIZE 2048 -#define 
HICR5 0x0 +#define HICR5 0x80 #define HICR5_EN_SNP0W BIT(0) #define HICR5_ENINT_SNP0W BIT(1) #define HICR5_EN_SNP1W BIT(2) #define HICR5_ENINT_SNP1W BIT(3) - -#define HICR6 0x4 +#define HICR6 0x84 #define HICR6_STR_SNP0WBIT(0) #define HICR6_STR_SNP1WBIT(1) -#define SNPWADR0x10 +#define SNPWADR0x90 #define SNPWADR_CH0_MASK GENMASK(15, 0) #define SNPWADR_CH0_SHIFT 0 #define SNPWADR_CH1_MASK GENMASK(31, 16) #define SNPWADR_CH1_SHIFT 16 -#define SNPWDR 0x14 +#define SNPWDR 0x94 #define SNPWDR_CH0_MASKGENMASK(7, 0) #define SNPWDR_CH0_SHIFT 0 #define SNPWDR_CH1_MASKGENMASK(15, 8) #define SNPWDR_CH1_SHIFT 8 -#define HICRB 0x80 +#define HICRB 0x100 #define HICRB_ENSNP0D BIT(14) #define HICRB_ENSNP1D BIT(15) @@ -258,6 +257,7 @@ static int aspeed_lpc_snoop_probe(struct platform_device *pdev) { struct aspeed_lpc_snoop *lpc_snoop; struct device *dev; + struct device_node *np; u32 port; int rc; @@ -267,8 +267,15 @@ static int aspeed_lpc_snoop_probe(struct platform_device *pdev) if (!lpc_snoop) return -ENOMEM; - lpc_snoop->regmap = syscon_node_to_regmap( - pdev->dev.parent->of_node); + np = pdev->dev.parent->of_node; + if (!of_device_is_compatible(np, "aspeed,ast2400-lpc-v2") && + !of_device_is_compatible(np, "aspeed,ast2500-lpc-v2") && + !of_device_is_compatible(np, "aspeed,ast2600-lpc-v2")) { + dev_err(dev, "unsupported LPC device binding\n"); + return -ENODEV; + } + + lpc_snoop->regmap = syscon_node_to_regmap(np); if (IS_ERR(lpc_snoop->regmap)) { dev_err(dev, "Couldn't get regmap\n"); return -ENODEV; -- 2.17.1
[PATCH v4 3/5] ipmi: kcs: aspeed: Adapt to new LPC DTS layout
Add check against LPC device v2 compatible string to ensure that the fixed device tree layout is adopted. The LPC register offsets are also fixed accordingly. Signed-off-by: Chia-Wei, Wang --- drivers/char/ipmi/kcs_bmc_aspeed.c | 27 --- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c b/drivers/char/ipmi/kcs_bmc_aspeed.c index a140203c079b..eefe362f65f0 100644 --- a/drivers/char/ipmi/kcs_bmc_aspeed.c +++ b/drivers/char/ipmi/kcs_bmc_aspeed.c @@ -27,7 +27,6 @@ #define KCS_CHANNEL_MAX 4 -/* mapped to lpc-bmc@0 IO space */ #define LPC_HICR00x000 #define LPC_HICR0_LPC3E BIT(7) #define LPC_HICR0_LPC2E BIT(6) @@ -52,15 +51,13 @@ #define LPC_STR1 0x03C #define LPC_STR2 0x040 #define LPC_STR3 0x044 - -/* mapped to lpc-host@80 IO space */ -#define LPC_HICRB0x080 +#define LPC_HICRB0x100 #define LPC_HICRB_IBFIF4 BIT(1) #define LPC_HICRB_LPC4E BIT(0) -#define LPC_LADR40x090 -#define LPC_IDR4 0x094 -#define LPC_ODR4 0x098 -#define LPC_STR4 0x09C +#define LPC_LADR40x110 +#define LPC_IDR4 0x114 +#define LPC_ODR4 0x118 +#define LPC_STR4 0x11C struct aspeed_kcs_bmc { struct regmap *map; @@ -348,12 +345,20 @@ static int aspeed_kcs_probe(struct platform_device *pdev) struct device_node *np; int rc; - np = pdev->dev.of_node; + np = dev->of_node->parent; + if (!of_device_is_compatible(np, "aspeed,ast2400-lpc-v2") && + !of_device_is_compatible(np, "aspeed,ast2500-lpc-v2") && + !of_device_is_compatible(np, "aspeed,ast2600-lpc-v2")) { + dev_err(dev, "unsupported LPC device binding\n"); + return -ENODEV; + } + + np = dev->of_node; if (of_device_is_compatible(np, "aspeed,ast2400-kcs-bmc") || - of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc")) + of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc")) kcs_bmc = aspeed_kcs_probe_of_v1(pdev); else if (of_device_is_compatible(np, "aspeed,ast2400-kcs-bmc-v2") || - of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc-v2")) +of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc-v2")) kcs_bmc = 
aspeed_kcs_probe_of_v2(pdev); else return -EINVAL; -- 2.17.1
[PATCH v4 2/5] ARM: dts: Remove LPC BMC and Host partitions
The LPC controller has no concept of the BMC and the Host partitions. A concrete instance is that HICRB[5:4] are for the I/O port address configuration of KCS channel 1/2. However, the KCS driver cannot access HICRB for channel 1/2 initialization via the syscon regmap interface due to the partition boundary (i.e. offset 80h).

In addition, for HW design backward compatibility, a newly added HW control bit could be located at any reserved bit over the LPC addressing space.

Thereby, this patch removes the lpc-bmc and lpc-host child nodes and thus the LPC partitioning. Note that this change requires synchronization between the device tree change and the driver change. To prevent the misuse of old devicetrees with new drivers, or vice versa, the v2 compatible strings are adopted for the LPC device as listed:
  "aspeed,ast2400-lpc-v2"
  "aspeed,ast2500-lpc-v2"
  "aspeed,ast2600-lpc-v2"

Signed-off-by: Chia-Wei, Wang
---
 arch/arm/boot/dts/aspeed-g4.dtsi |  74 +++--
 arch/arm/boot/dts/aspeed-g5.dtsi | 135 ++-
 arch/arm/boot/dts/aspeed-g6.dtsi | 135 ++-
 3 files changed, 148 insertions(+), 196 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi b/arch/arm/boot/dts/aspeed-g4.dtsi
index b3dafbc8caca..ee22bc036440 100644
--- a/arch/arm/boot/dts/aspeed-g4.dtsi
+++ b/arch/arm/boot/dts/aspeed-g4.dtsi
@@ -343,58 +343,44 @@
 	};
 
 	lpc: lpc@1e789000 {
-		compatible = "aspeed,ast2400-lpc", "simple-mfd";
+		compatible = "aspeed,ast2400-lpc-v2", "simple-mfd", "syscon";
 		reg = <0x1e789000 0x1000>;
+		reg-io-width = <4>;
 
 		#address-cells = <1>;
 		#size-cells = <1>;
 		ranges = <0x0 0x1e789000 0x1000>;
 
-		lpc_bmc: lpc-bmc@0 {
-			compatible = "aspeed,ast2400-lpc-bmc";
-			reg = <0x0 0x80>;
+		lpc_ctrl: lpc-ctrl@80 {
+			compatible = "aspeed,ast2400-lpc-ctrl";
+			reg = <0x80 0x10>;
+			clocks = <&syscon ASPEED_CLK_GATE_LCLK>;
+			status = "disabled";
 		};
 
-		lpc_host: lpc-host@80 {
-			compatible = "aspeed,ast2400-lpc-host", "simple-mfd", "syscon";
-			reg = <0x80 0x1e0>;
-			reg-io-width = <4>;
-
-			#address-cells = <1>;
-			#size-cells = <1>;
-			ranges = <0x0 0x80 0x1e0>;
-
-			lpc_ctrl: lpc-ctrl@0 {
-				compatible = "aspeed,ast2400-lpc-ctrl";
-				reg = <0x0 0x10>;
-				clocks = <&syscon ASPEED_CLK_GATE_LCLK>;
-				status = "disabled";
-			};
-
-			lpc_snoop: lpc-snoop@10 {
-				compatible = "aspeed,ast2400-lpc-snoop";
-				reg = <0x10 0x8>;
-				interrupts = <8>;
-				status = "disabled";
-			};
-
-			lhc: lhc@20 {
-				compatible = "aspeed,ast2400-lhc";
-				reg = <0x20 0x24 0x48 0x8>;
-			};
-
-			lpc_reset: reset-controller@18 {
-				compatible = "aspeed,ast2400-lpc-reset";
-				reg = <0x18 0x4>;
-				#reset-cells = <1>;
-			};
-
-			ibt: ibt@c0 {
-				compatible = "aspeed,ast2400-ibt-bmc";
-				reg = <0xc0 0x18>;
-
[PATCH v4 0/5] Remove LPC register partitioning
The LPC controller has no concept of the BMC and the Host partitions. The incorrect partitioning can impose unnecessary range restrictions on register access through the syscon regmap interface. For instance, HICRB contains the I/O port address configuration of KCS channel 1/2. However, the KCS#1/#2 drivers cannot access HICRB as it is located at the other LPC partition.

In addition, to be backward compatible, the newly added HW control bits could be located at any reserved bits over the LPC addressing space.

Thereby, this patch series aims to remove the LPC partitioning for better driver development and maintenance. This requires changes to both the device tree and the driver implementation. To ensure both sides are synchronously updated, a v2 binding check is added.

Changes since v3:
  - Revise binding check as suggested by Haiyue Wang

Changes since v2:
  - Add v2 binding check to ensure the synchronization between the device tree change and the driver register offset fix.

Changes since v1:
  - Add the fix to the aspeed-lpc binding documentation.

Chia-Wei, Wang (5):
  dt-bindings: aspeed-lpc: Remove LPC partitioning
  ARM: dts: Remove LPC BMC and Host partitions
  ipmi: kcs: aspeed: Adapt to new LPC DTS layout
  pinctrl: aspeed-g5: Adapt to new LPC device tree layout
  soc: aspeed: Adapt to new LPC device tree layout

 .../devicetree/bindings/mfd/aspeed-lpc.txt |  99 +++--
 arch/arm/boot/dts/aspeed-g4.dtsi           |  74 --
 arch/arm/boot/dts/aspeed-g5.dtsi           | 135 --
 arch/arm/boot/dts/aspeed-g6.dtsi           | 135 --
 drivers/char/ipmi/kcs_bmc_aspeed.c         |  27 ++--
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c |  17 ++-
 drivers/soc/aspeed/aspeed-lpc-ctrl.c       |  20 ++-
 drivers/soc/aspeed/aspeed-lpc-snoop.c      |  23 +--
 8 files changed, 225 insertions(+), 305 deletions(-)

-- 
2.17.1
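The register-offset fixes in patches 3-5 of this series all follow one rule: a register that used to be addressed relative to the removed "host" child node (which began at offset 0x80 inside the LPC block) now simply gains that base. A minimal sketch of the translation — the macro and function names are invented for illustration:

```c
#include <stdint.h>

/* The former "host" partition started at 0x80 inside the LPC register
 * block; with the partitioning removed, every host-relative offset is
 * rebased to a flat offset from the LPC block itself. */
#define LPC_HOST_PARTITION_BASE	0x80u

static uint32_t lpc_host_to_flat(uint32_t host_relative_off)
{
	return LPC_HOST_PARTITION_BASE + host_relative_off;
}
```

This is exactly the shift visible in the diffs: HICR5 moves from 0x0 to 0x80, SNPWADR from 0x10 to 0x90, and HICRB from 0x80 to 0x100.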
[PATCH v3 5/5] soc: aspeed: Adapt to new LPC device tree layout
Add check against LPC device v2 compatible string to ensure that the fixed device tree layout is adopted. The LPC register offsets are also fixed accordingly. Signed-off-by: Chia-Wei, Wang --- drivers/soc/aspeed/aspeed-lpc-ctrl.c | 20 ++-- drivers/soc/aspeed/aspeed-lpc-snoop.c | 23 +++ 2 files changed, 29 insertions(+), 14 deletions(-) diff --git a/drivers/soc/aspeed/aspeed-lpc-ctrl.c b/drivers/soc/aspeed/aspeed-lpc-ctrl.c index 439bcd6b8c4a..b04074949240 100644 --- a/drivers/soc/aspeed/aspeed-lpc-ctrl.c +++ b/drivers/soc/aspeed/aspeed-lpc-ctrl.c @@ -18,15 +18,15 @@ #define DEVICE_NAME"aspeed-lpc-ctrl" -#define HICR5 0x0 +#define HICR5 0x80 #define HICR5_ENL2HBIT(8) #define HICR5_ENFWHBIT(10) -#define HICR6 0x4 +#define HICR6 0x84 #define SW_FWH2AHB BIT(17) -#define HICR7 0x8 -#define HICR8 0xc +#define HICR7 0x88 +#define HICR8 0x8c struct aspeed_lpc_ctrl { struct miscdevice miscdev; @@ -215,6 +215,7 @@ static int aspeed_lpc_ctrl_probe(struct platform_device *pdev) struct device_node *node; struct resource resm; struct device *dev; + struct device_node *lpc_np; int rc; dev = &pdev->dev; @@ -270,8 +271,15 @@ static int aspeed_lpc_ctrl_probe(struct platform_device *pdev) } } - lpc_ctrl->regmap = syscon_node_to_regmap( - pdev->dev.parent->of_node); + lpc_np = pdev->dev.parent->of_node; + if (!of_device_is_compatible(lpc_np, "aspeed,ast2400-lpc-v2") && + !of_device_is_compatible(lpc_np, "aspeed,ast2500-lpc-v2") && + !of_device_is_compatible(lpc_np, "aspeed,ast2600-lpc-v2")) { + dev_err(dev, "unsupported LPC device binding\n"); + return -ENODEV; + } + + lpc_ctrl->regmap = syscon_node_to_regmap(lpc_np); if (IS_ERR(lpc_ctrl->regmap)) { dev_err(dev, "Couldn't get regmap\n"); return -ENODEV; diff --git a/drivers/soc/aspeed/aspeed-lpc-snoop.c b/drivers/soc/aspeed/aspeed-lpc-snoop.c index 682ba0eb4eba..63c3d9b8ba61 100644 --- a/drivers/soc/aspeed/aspeed-lpc-snoop.c +++ b/drivers/soc/aspeed/aspeed-lpc-snoop.c @@ -28,26 +28,25 @@ #define NUM_SNOOP_CHANNELS 2 #define 
SNOOP_FIFO_SIZE 2048 -#define HICR5 0x0 +#define HICR5 0x80 #define HICR5_EN_SNP0W BIT(0) #define HICR5_ENINT_SNP0W BIT(1) #define HICR5_EN_SNP1W BIT(2) #define HICR5_ENINT_SNP1W BIT(3) - -#define HICR6 0x4 +#define HICR6 0x84 #define HICR6_STR_SNP0WBIT(0) #define HICR6_STR_SNP1WBIT(1) -#define SNPWADR0x10 +#define SNPWADR0x90 #define SNPWADR_CH0_MASK GENMASK(15, 0) #define SNPWADR_CH0_SHIFT 0 #define SNPWADR_CH1_MASK GENMASK(31, 16) #define SNPWADR_CH1_SHIFT 16 -#define SNPWDR 0x14 +#define SNPWDR 0x94 #define SNPWDR_CH0_MASKGENMASK(7, 0) #define SNPWDR_CH0_SHIFT 0 #define SNPWDR_CH1_MASKGENMASK(15, 8) #define SNPWDR_CH1_SHIFT 8 -#define HICRB 0x80 +#define HICRB 0x100 #define HICRB_ENSNP0D BIT(14) #define HICRB_ENSNP1D BIT(15) @@ -258,6 +257,7 @@ static int aspeed_lpc_snoop_probe(struct platform_device *pdev) { struct aspeed_lpc_snoop *lpc_snoop; struct device *dev; + struct device_node *lpc_np; u32 port; int rc; @@ -267,8 +267,15 @@ static int aspeed_lpc_snoop_probe(struct platform_device *pdev) if (!lpc_snoop) return -ENOMEM; - lpc_snoop->regmap = syscon_node_to_regmap( - pdev->dev.parent->of_node); + lpc_np = pdev->dev.parent->of_node; + if (!of_device_is_compatible(lpc_np, "aspeed,ast2400-lpc-v2") && + !of_device_is_compatible(lpc_np, "aspeed,ast2500-lpc-v2") && + !of_device_is_compatible(lpc_np, "aspeed,ast2600-lpc-v2")) { + dev_err(dev, "unsupported LPC device binding\n"); + return -ENODEV; + } + + lpc_snoop->regmap = syscon_node_to_regmap(lpc_np); if (IS_ERR(lpc_snoop->regmap)) { dev_err(dev, "Couldn't get regmap\n"); return -ENODEV; -- 2.17.1
[PATCH v3 3/5] ipmi: kcs: aspeed: Adapt to new LPC DTS layout
Add check against LPC device v2 compatible string to ensure that the fixed device tree layout is adopted. The LPC register offsets are also fixed accordingly. Signed-off-by: Chia-Wei, Wang --- drivers/char/ipmi/kcs_bmc_aspeed.c | 35 ++ 1 file changed, 21 insertions(+), 14 deletions(-) diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c b/drivers/char/ipmi/kcs_bmc_aspeed.c index a140203c079b..6283bfef4ea7 100644 --- a/drivers/char/ipmi/kcs_bmc_aspeed.c +++ b/drivers/char/ipmi/kcs_bmc_aspeed.c @@ -27,7 +27,6 @@ #define KCS_CHANNEL_MAX 4 -/* mapped to lpc-bmc@0 IO space */ #define LPC_HICR00x000 #define LPC_HICR0_LPC3E BIT(7) #define LPC_HICR0_LPC2E BIT(6) @@ -52,15 +51,13 @@ #define LPC_STR1 0x03C #define LPC_STR2 0x040 #define LPC_STR3 0x044 - -/* mapped to lpc-host@80 IO space */ -#define LPC_HICRB0x080 +#define LPC_HICRB0x100 #define LPC_HICRB_IBFIF4 BIT(1) #define LPC_HICRB_LPC4E BIT(0) -#define LPC_LADR40x090 -#define LPC_IDR4 0x094 -#define LPC_ODR4 0x098 -#define LPC_STR4 0x09C +#define LPC_LADR40x110 +#define LPC_IDR4 0x114 +#define LPC_ODR4 0x118 +#define LPC_STR4 0x11C struct aspeed_kcs_bmc { struct regmap *map; @@ -345,15 +342,25 @@ static int aspeed_kcs_probe(struct platform_device *pdev) { struct device *dev = &pdev->dev; struct kcs_bmc *kcs_bmc; - struct device_node *np; + struct device_node *kcs_np; + struct device_node *lpc_np; int rc; - np = pdev->dev.of_node; - if (of_device_is_compatible(np, "aspeed,ast2400-kcs-bmc") || - of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc")) + kcs_np = dev->of_node; + lpc_np = kcs_np->parent; + + if (!of_device_is_compatible(lpc_np, "aspeed,ast2400-lpc-v2") && + !of_device_is_compatible(lpc_np, "aspeed,ast2500-lpc-v2") && + !of_device_is_compatible(lpc_np, "aspeed,ast2600-lpc-v2")) { + dev_err(dev, "unsupported LPC device binding\n"); + return -ENODEV; + } + + if (of_device_is_compatible(kcs_np, "aspeed,ast2400-kcs-bmc") || + of_device_is_compatible(kcs_np, "aspeed,ast2500-kcs-bmc")) kcs_bmc = 
aspeed_kcs_probe_of_v1(pdev); - else if (of_device_is_compatible(np, "aspeed,ast2400-kcs-bmc-v2") || - of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc-v2")) + else if (of_device_is_compatible(kcs_np, "aspeed,ast2400-kcs-bmc-v2") || + of_device_is_compatible(kcs_np, "aspeed,ast2500-kcs-bmc-v2")) kcs_bmc = aspeed_kcs_probe_of_v2(pdev); else return -EINVAL; -- 2.17.1
[PATCH v3 4/5] pinctrl: aspeed-g5: Adapt to new LPC device tree layout
Add check against LPC device v2 compatible string to ensure that the fixed device tree layout is adopted. The LPC register offsets are also fixed accordingly. Signed-off-by: Chia-Wei, Wang --- drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c index 0cab4c2576e2..6e0e5b64e677 100644 --- a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c +++ b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c @@ -60,7 +60,7 @@ #define COND2 { ASPEED_IP_SCU, SCU94, GENMASK(1, 0), 0, 0 } /* LHCR0 is offset from the end of the H8S/2168-compatible registers */ -#define LHCR0 0x20 +#define LHCR0 0xa0 #define GFX064 0x64 #define B14 0 @@ -2648,14 +2648,21 @@ static struct regmap *aspeed_g5_acquire_regmap(struct aspeed_pinmux_data *ctx, } if (ip == ASPEED_IP_LPC) { - struct device_node *node; + struct device_node *lhc_np; + struct device_node *lpc_np; struct regmap *map; - node = of_parse_phandle(ctx->dev->of_node, + lhc_np = of_parse_phandle(ctx->dev->of_node, "aspeed,external-nodes", 1); - if (node) { - map = syscon_node_to_regmap(node->parent); - of_node_put(node); + if (lhc_np) { + lpc_np = lhc_np->parent; + if (!of_device_is_compatible(lpc_np, "aspeed,ast2400-lpc-v2") && + !of_device_is_compatible(lpc_np, "aspeed,ast2500-lpc-v2") && + !of_device_is_compatible(lpc_np, "aspeed,ast2600-lpc-v2")) + return ERR_PTR(-ENODEV); + + map = syscon_node_to_regmap(lpc_np); + of_node_put(lhc_np); if (IS_ERR(map)) return map; } else -- 2.17.1
[PATCH v3 1/5] dt-bindings: aspeed-lpc: Remove LPC partitioning
The LPC controller has no concept of the BMC and the Host partitions. This patch fixes the documentation by removing the description on LPC partitions. The register offsets illustrated in the DTS node examples are also fixed to adapt to the LPC DTS change. Signed-off-by: Chia-Wei, Wang --- .../devicetree/bindings/mfd/aspeed-lpc.txt| 99 --- 1 file changed, 21 insertions(+), 78 deletions(-) diff --git a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt index d0a38ba8b9ce..90eb0ecc95d1 100644 --- a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt +++ b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt @@ -9,13 +9,7 @@ primary use case of the Aspeed LPC controller is as a slave on the bus conditions it can also take the role of bus master. The LPC controller is represented as a multi-function device to account for the -mix of functionality it provides. The principle split is between the register -layout at the start of the I/O space which is, to quote the Aspeed datasheet, -"basically compatible with the [LPC registers from the] popular BMC controller -H8S/2168[1]", and everything else, where everything else is an eclectic -collection of functions with a esoteric register layout. "Everything else", -here labeled the "host" portion of the controller, includes, but is not limited -to: +mix of functionality, which includes, but is not limited to: * An IPMI Block Transfer[2] Controller @@ -44,80 +38,29 @@ Required properties === - compatible: One of: - "aspeed,ast2400-lpc", "simple-mfd" - "aspeed,ast2500-lpc", "simple-mfd" - "aspeed,ast2600-lpc", "simple-mfd" + "aspeed,ast2400-lpc-v2", "simple-mfd", "syscon" + "aspeed,ast2500-lpc-v2", "simple-mfd", "syscon" + "aspeed,ast2600-lpc-v2", "simple-mfd", "syscon" - reg: contains the physical address and length values of the Aspeed LPC memory region. 
- #address-cells: <1> - #size-cells: <1> -- ranges: Maps 0 to the physical address and length of the LPC memory -region - -Required LPC Child nodes - - -BMC Node - - -- compatible: One of: - "aspeed,ast2400-lpc-bmc" - "aspeed,ast2500-lpc-bmc" - "aspeed,ast2600-lpc-bmc" - -- reg: contains the physical address and length values of the -H8S/2168-compatible LPC controller memory region - -Host Node -- - -- compatible: One of: - "aspeed,ast2400-lpc-host", "simple-mfd", "syscon" - "aspeed,ast2500-lpc-host", "simple-mfd", "syscon" - "aspeed,ast2600-lpc-host", "simple-mfd", "syscon" - -- reg: contains the address and length values of the host-related -register space for the Aspeed LPC controller - -- #address-cells: <1> -- #size-cells: <1> -- ranges: Maps 0 to the address and length of the host-related LPC memory +- ranges: Maps 0 to the physical address and length of the LPC memory region Example: lpc: lpc@1e789000 { - compatible = "aspeed,ast2500-lpc", "simple-mfd"; + compatible = "aspeed,ast2500-lpc-v2", "simple-mfd", "syscon"; reg = <0x1e789000 0x1000>; #address-cells = <1>; #size-cells = <1>; ranges = <0x0 0x1e789000 0x1000>; - - lpc_bmc: lpc-bmc@0 { - compatible = "aspeed,ast2500-lpc-bmc"; - reg = <0x0 0x80>; - }; - - lpc_host: lpc-host@80 { - compatible = "aspeed,ast2500-lpc-host", "simple-mfd", "syscon"; - reg = <0x80 0x1e0>; - reg-io-width = <4>; - - #address-cells = <1>; - #size-cells = <1>; - ranges = <0x0 0x80 0x1e0>; - }; }; -BMC Node Children -== - - -Host Node Children -== LPC Host Interface Controller --- @@ -149,14 +92,12 @@ Optional properties: Example: -lpc-host@80 { - lpc_ctrl: lpc-ctrl@0 { - compatible = "aspeed,ast2500-lpc-ctrl"; - reg = <0x0 0x80>; - clocks = <&syscon ASPEED_CLK_GATE_LCLK>; - memory-region = <&flash_memory>; - flash = <&spi>; - }; +lpc_ctrl: lpc-ctrl@80 { + compatible = "aspeed,ast2500
[PATCH v3 2/5] ARM: dts: Remove LPC BMC and Host partitions
The LPC controller has no concept of the BMC and the Host partitions. A concrete instance is that the HICRB[5:4] are for the I/O port address configuration of KCS channel 1/2. However, the KCS driver cannot access HICRB for channel 1/2 initialization via the syscon regmap interface due to the partition boundary (i.e. offset 80h). In addition, to keep the HW design backward compatible, a newly added HW control bit could be located at any reserved bit over the LPC addressing space. Thereby, this patch removes the lpc-bmc and lpc-host child nodes and thus the LPC partitioning. Note that this change requires the device tree change and the driver change to be synchronized. To prevent the misuse of old devicetrees with new drivers, or vice versa, the v2 compatible strings are adopted for the LPC device as listed: "aspeed,ast2400-lpc-v2" "aspeed,ast2500-lpc-v2" "aspeed,ast2600-lpc-v2" Signed-off-by: Chia-Wei, Wang --- arch/arm/boot/dts/aspeed-g4.dtsi | 74 +++-- arch/arm/boot/dts/aspeed-g5.dtsi | 135 ++- arch/arm/boot/dts/aspeed-g6.dtsi | 135 ++- 3 files changed, 148 insertions(+), 196 deletions(-) diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi b/arch/arm/boot/dts/aspeed-g4.dtsi index b3dafbc8caca..ee22bc036440 100644 --- a/arch/arm/boot/dts/aspeed-g4.dtsi +++ b/arch/arm/boot/dts/aspeed-g4.dtsi @@ -343,58 +343,44 @@ }; lpc: lpc@1e789000 { - compatible = "aspeed,ast2400-lpc", "simple-mfd"; + compatible = "aspeed,ast2400-lpc-v2", "simple-mfd", "syscon"; reg = <0x1e789000 0x1000>; + reg-io-width = <4>; #address-cells = <1>; #size-cells = <1>; ranges = <0x0 0x1e789000 0x1000>; - lpc_bmc: lpc-bmc@0 { - compatible = "aspeed,ast2400-lpc-bmc"; - reg = <0x0 0x80>; + lpc_ctrl: lpc-ctrl@80 { + compatible = "aspeed,ast2400-lpc-ctrl"; + reg = <0x80 0x10>; + clocks = <&syscon ASPEED_CLK_GATE_LCLK>; + status = "disabled"; }; - lpc_host: lpc-host@80 { - compatible = "aspeed,ast2400-lpc-host", "simple-mfd", "syscon"; - reg = <0x80 0x1e0>; - reg-io-width = <4>; - - #address-cells = <1>; -
#size-cells = <1>; - ranges = <0x0 0x80 0x1e0>; - - lpc_ctrl: lpc-ctrl@0 { - compatible = "aspeed,ast2400-lpc-ctrl"; - reg = <0x0 0x10>; - clocks = <&syscon ASPEED_CLK_GATE_LCLK>; - status = "disabled"; - }; - - lpc_snoop: lpc-snoop@10 { - compatible = "aspeed,ast2400-lpc-snoop"; - reg = <0x10 0x8>; - interrupts = <8>; - status = "disabled"; - }; - - lhc: lhc@20 { - compatible = "aspeed,ast2400-lhc"; - reg = <0x20 0x24 0x48 0x8>; - }; - - lpc_reset: reset-controller@18 { - compatible = "aspeed,ast2400-lpc-reset"; - reg = <0x18 0x4>; - #reset-cells = <1>; - }; - - ibt: ibt@c0 { - compatible = "aspeed,ast2400-ibt-bmc"; - reg = <0xc0 0x18>; -
[PATCH v3 0/5] Remove LPC register partitioning
The LPC controller has no concept of the BMC and the Host partitions. The incorrect partitioning can impose unnecessary range restrictions on register access through the syscon regmap interface. For instance, HICRB contains the I/O port address configuration of KCS channel 1/2. However, the KCS#1/#2 drivers cannot access HICRB as it is located at the other LPC partition. In addition, to be backward compatible, the newly added HW control bits could be located at any reserved bits over the LPC addressing space. Thereby, this patch series aims to remove the LPC partitioning for better driver development and maintenance. This requires the change to both the device tree and the driver implementation. To ensure both sides are synchronously updated, a v2 binding check is added. Changes since v2: - Add v2 binding check to ensure the synchronization between the device tree change and the driver register offset fix. Changes since v1: - Add the fix to the aspeed-lpc binding documentation. Chia-Wei, Wang (5): dt-bindings: aspeed-lpc: Remove LPC partitioning ARM: dts: Remove LPC BMC and Host partitions ipmi: kcs: aspeed: Adapt to new LPC DTS layout pinctrl: aspeed-g5: Adapt to new LPC device tree layout soc: aspeed: Adapt to new LPC device tree layout .../devicetree/bindings/mfd/aspeed-lpc.txt| 99 +++-- arch/arm/boot/dts/aspeed-g4.dtsi | 74 -- arch/arm/boot/dts/aspeed-g5.dtsi | 135 -- arch/arm/boot/dts/aspeed-g6.dtsi | 135 -- drivers/char/ipmi/kcs_bmc_aspeed.c| 35 +++-- drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c| 19 ++- drivers/soc/aspeed/aspeed-lpc-ctrl.c | 20 ++- drivers/soc/aspeed/aspeed-lpc-snoop.c | 23 +-- 8 files changed, 232 insertions(+), 308 deletions(-) -- 2.17.1
[PATCH] sched: cpufreq_schedutil: restore cached freq when next_f is not changed
We keep the raw cached freq to reduce the chance of calling into the cpufreq driver, which can be costly on some archs/SoCs. Currently, the raw cached freq is reset when next_f is changed for correctness. This patch maintains the cached value instead of dropping it, to honor the purpose of the cache. This is adapted from https://android-review.googlesource.com/1352810/ Signed-off-by: Wei Wang --- kernel/sched/cpufreq_schedutil.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 5ae7b4e6e8d6..e254745a82cb 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -441,6 +441,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time, unsigned long util, max; unsigned int next_f; bool busy; + unsigned int cached_freq = sg_policy->cached_raw_freq; sugov_iowait_boost(sg_cpu, time, flags); sg_cpu->last_update = time; @@ -464,8 +465,8 @@ static void sugov_update_single(struct update_util_data *hook, u64 time, if (busy && next_f < sg_policy->next_freq) { next_f = sg_policy->next_freq; - /* Reset cached freq as next_freq has changed */ - sg_policy->cached_raw_freq = 0; + /* Restore cached freq as next_freq has changed */ + sg_policy->cached_raw_freq = cached_freq; } /* -- 2.29.0.rc1.297.gfa9743e501-goog
Re: [PATCH] sched: cpufreq_schedutil: maintain raw cache when next_f is not changed
On Fri, Oct 16, 2020 at 10:36 AM Rafael J. Wysocki wrote: > > On Fri, Oct 16, 2020 at 7:18 PM Wei Wang wrote: > > > > On Fri, Oct 16, 2020 at 10:01 AM Rafael J. Wysocki > > wrote: > > > > > > On Fri, Oct 16, 2020 at 6:36 PM Wei Wang wrote: > > > > > > > > Currently, raw cache will be reset when next_f is changed after > > > > get_next_freq for correctness. However, it may introduce more > > > > cycles. This patch changes it to maintain the cached value instead of > > > > dropping it. > > > > > > IMV you need to be more specific about why this helps. > > > > > > > I think the idea of cached_raw_freq is to reduce the chance of calling > > cpufreq drivers (in some arch those may be costly) but sometimes the > > cache will be wiped for correctness. The purpose of this patch is to > > still keep the cached value instead of wiping them. > > Well, I see what the problem is and how the patch is attempting to > address it (which is not the best way to do that because of the extra > struct member that doesn't need to be added if I'm not mistaken), but > IMO the changelog is way too vague from the problem statement > perspective. Just want to bring this up in the mainline kernel. I think we can change the patch to use a local variable inside sugov_update_single(). This is adapted from the Android common kernel, where some out-of-tree functions make a single local variable impossible, but they also make the issue more obvious.
Re: [PATCH] sched: cpufreq_schedutil: maintain raw cache when next_f is not changed
On Fri, Oct 16, 2020 at 10:01 AM Rafael J. Wysocki wrote: > > On Fri, Oct 16, 2020 at 6:36 PM Wei Wang wrote: > > > > Currently, raw cache will be reset when next_f is changed after > > get_next_freq for correctness. However, it may introduce more > > cycles. This patch changes it to maintain the cached value instead of > > dropping it. > > IMV you need to be more specific about why this helps. > I think the idea of cached_raw_freq is to reduce the chance of calling cpufreq drivers (in some arch those may be costly) but sometimes the cache will be wiped for correctness. The purpose of this patch is to still keep the cached value instead of wiping them. thx wvw > > > This is adapted from https://android-review.googlesource.com/1352810/ > > > > Signed-off-by: Wei Wang > > --- > > kernel/sched/cpufreq_schedutil.c | 7 +-- > > 1 file changed, 5 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/sched/cpufreq_schedutil.c > > b/kernel/sched/cpufreq_schedutil.c > > index 5ae7b4e6e8d6..ae3ae7fcd027 100644 > > --- a/kernel/sched/cpufreq_schedutil.c > > +++ b/kernel/sched/cpufreq_schedutil.c > > @@ -31,6 +31,7 @@ struct sugov_policy { > > s64 freq_update_delay_ns; > > unsigned intnext_freq; > > unsigned intcached_raw_freq; > > + unsigned intprev_cached_raw_freq; > > > > /* The next fields are only needed if fast switch cannot be used: */ > > struct irq_work irq_work; > > @@ -165,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy > > *sg_policy, > > return sg_policy->next_freq; > > > > sg_policy->need_freq_update = false; > > + sg_policy->prev_cached_raw_freq = sg_policy->cached_raw_freq; > > sg_policy->cached_raw_freq = freq; > > return cpufreq_driver_resolve_freq(policy, freq); > > } > > @@ -464,8 +466,8 @@ static void sugov_update_single(struct update_util_data > > *hook, u64 time, > > if (busy && next_f < sg_policy->next_freq) { > > next_f = sg_policy->next_freq; > > > > - /* Reset cached freq as next_freq has changed */ > > - 
sg_policy->cached_raw_freq = 0; > > + /* Restore cached freq as next_freq has changed */ > > + sg_policy->cached_raw_freq = > > sg_policy->prev_cached_raw_freq; > > } > > > > /* > > @@ -828,6 +830,7 @@ static int sugov_start(struct cpufreq_policy *policy) > > sg_policy->limits_changed = false; > > sg_policy->need_freq_update = false; > > sg_policy->cached_raw_freq = 0; > > + sg_policy->prev_cached_raw_freq = 0; > > > > for_each_cpu(cpu, policy->cpus) { > > struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu); > > -- > > 2.29.0.rc1.297.gfa9743e501-goog > >
[PATCH] sched: cpufreq_schedutil: maintain raw cache when next_f is not changed
Currently, raw cache will be reset when next_f is changed after get_next_freq for correctness. However, it may introduce more cycles. This patch changes it to maintain the cached value instead of dropping it. This is adapted from https://android-review.googlesource.com/1352810/ Signed-off-by: Wei Wang --- kernel/sched/cpufreq_schedutil.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 5ae7b4e6e8d6..ae3ae7fcd027 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -31,6 +31,7 @@ struct sugov_policy { s64 freq_update_delay_ns; unsigned intnext_freq; unsigned intcached_raw_freq; + unsigned intprev_cached_raw_freq; /* The next fields are only needed if fast switch cannot be used: */ struct irq_work irq_work; @@ -165,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy, return sg_policy->next_freq; sg_policy->need_freq_update = false; + sg_policy->prev_cached_raw_freq = sg_policy->cached_raw_freq; sg_policy->cached_raw_freq = freq; return cpufreq_driver_resolve_freq(policy, freq); } @@ -464,8 +466,8 @@ static void sugov_update_single(struct update_util_data *hook, u64 time, if (busy && next_f < sg_policy->next_freq) { next_f = sg_policy->next_freq; - /* Reset cached freq as next_freq has changed */ - sg_policy->cached_raw_freq = 0; + /* Restore cached freq as next_freq has changed */ + sg_policy->cached_raw_freq = sg_policy->prev_cached_raw_freq; } /* @@ -828,6 +830,7 @@ static int sugov_start(struct cpufreq_policy *policy) sg_policy->limits_changed = false; sg_policy->need_freq_update = false; sg_policy->cached_raw_freq = 0; + sg_policy->prev_cached_raw_freq = 0; for_each_cpu(cpu, policy->cpus) { struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu); -- 2.29.0.rc1.297.gfa9743e501-goog
[PATCH v2 1/5] ARM: dts: Remove LPC BMC and Host partitions
The LPC controller has no concept of the BMC and the Host partitions. A concrete instance is that the HICRB[5:4] are for the I/O port address configuration of KCS channel 1/2. However, the KCS driver cannot access HICRB for channel 1/2 initialization via the syscon regmap interface due to the partition boundary (i.e. offset 80h). In addition, to be backward compatible, the newly added HW control bits could be located at any reserved bits over the LPC addressing space. Thereby, this patch removes the lpc-bmc and lpc-host child nodes and thus the LPC partitioning. Signed-off-by: Chia-Wei, Wang --- arch/arm/boot/dts/aspeed-g4.dtsi | 74 +++-- arch/arm/boot/dts/aspeed-g5.dtsi | 135 ++- arch/arm/boot/dts/aspeed-g6.dtsi | 135 ++- 3 files changed, 148 insertions(+), 196 deletions(-) diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi b/arch/arm/boot/dts/aspeed-g4.dtsi index 82f0213e3a3c..22996b3c4a00 100644 --- a/arch/arm/boot/dts/aspeed-g4.dtsi +++ b/arch/arm/boot/dts/aspeed-g4.dtsi @@ -338,58 +338,44 @@ }; lpc: lpc@1e789000 { - compatible = "aspeed,ast2400-lpc", "simple-mfd"; + compatible = "aspeed,ast2400-lpc", "simple-mfd", "syscon"; reg = <0x1e789000 0x1000>; + reg-io-width = <4>; #address-cells = <1>; #size-cells = <1>; ranges = <0x0 0x1e789000 0x1000>; - lpc_bmc: lpc-bmc@0 { - compatible = "aspeed,ast2400-lpc-bmc"; - reg = <0x0 0x80>; + lpc_ctrl: lpc-ctrl@80 { + compatible = "aspeed,ast2400-lpc-ctrl"; + reg = <0x80 0x10>; + clocks = <&syscon ASPEED_CLK_GATE_LCLK>; + status = "disabled"; }; - lpc_host: lpc-host@80 { - compatible = "aspeed,ast2400-lpc-host", "simple-mfd", "syscon"; - reg = <0x80 0x1e0>; - reg-io-width = <4>; - - #address-cells = <1>; - #size-cells = <1>; - ranges = <0x0 0x80 0x1e0>; - - lpc_ctrl: lpc-ctrl@0 { - compatible = "aspeed,ast2400-lpc-ctrl"; - reg = <0x0 0x10>; - clocks = <&syscon ASPEED_CLK_GATE_LCLK>; - status = "disabled"; - }; - - lpc_snoop: lpc-snoop@10 { - compatible = "aspeed,ast2400-lpc-snoop"; - reg = <0x10 0x8>; - interrupts = <8>; - status = 
"disabled"; - }; - - lhc: lhc@20 { - compatible = "aspeed,ast2400-lhc"; - reg = <0x20 0x24 0x48 0x8>; - }; - - lpc_reset: reset-controller@18 { - compatible = "aspeed,ast2400-lpc-reset"; - reg = <0x18 0x4>; - #reset-cells = <1>; - }; - - ibt: ibt@c0 { - compatible = "aspeed,ast2400-ibt-bmc"; - reg = <0xc0 0x18>; - interrupts = <8>; - status = "disabled"; - }; + lpc_snoop: lpc-snoop@90 { + compatible = "aspeed,ast2400-lpc-snoop"; + reg = <0x90 0
[PATCH v2 3/5] ipmi: kcs: aspeed: Fix LPC register offsets
The LPC register offsets are fixed to adapt to the LPC DTS change, where the LPC partitioning is removed. Signed-off-by: Chia-Wei, Wang --- drivers/char/ipmi/kcs_bmc_aspeed.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c b/drivers/char/ipmi/kcs_bmc_aspeed.c index a140203c079b..8843cf867a5d 100644 --- a/drivers/char/ipmi/kcs_bmc_aspeed.c +++ b/drivers/char/ipmi/kcs_bmc_aspeed.c @@ -27,7 +27,6 @@ #define KCS_CHANNEL_MAX 4 -/* mapped to lpc-bmc@0 IO space */ #define LPC_HICR00x000 #define LPC_HICR0_LPC3E BIT(7) #define LPC_HICR0_LPC2E BIT(6) @@ -52,15 +51,13 @@ #define LPC_STR1 0x03C #define LPC_STR2 0x040 #define LPC_STR3 0x044 - -/* mapped to lpc-host@80 IO space */ -#define LPC_HICRB0x080 +#define LPC_HICRB0x100 #define LPC_HICRB_IBFIF4 BIT(1) #define LPC_HICRB_LPC4E BIT(0) -#define LPC_LADR40x090 -#define LPC_IDR4 0x094 -#define LPC_ODR4 0x098 -#define LPC_STR4 0x09C +#define LPC_LADR40x110 +#define LPC_IDR4 0x114 +#define LPC_ODR4 0x118 +#define LPC_STR4 0x11C struct aspeed_kcs_bmc { struct regmap *map; -- 2.17.1
[PATCH v2 2/5] soc: aspeed: Fix LPC register offsets
The LPC register offsets are fixed to adapt to the LPC DTS change, where the LPC partitioning is removed. Signed-off-by: Chia-Wei, Wang --- drivers/soc/aspeed/aspeed-lpc-ctrl.c | 6 +++--- drivers/soc/aspeed/aspeed-lpc-snoop.c | 11 +-- 2 files changed, 8 insertions(+), 9 deletions(-) diff --git a/drivers/soc/aspeed/aspeed-lpc-ctrl.c b/drivers/soc/aspeed/aspeed-lpc-ctrl.c index 01ed21e8bfee..36faa0618ada 100644 --- a/drivers/soc/aspeed/aspeed-lpc-ctrl.c +++ b/drivers/soc/aspeed/aspeed-lpc-ctrl.c @@ -17,12 +17,12 @@ #define DEVICE_NAME"aspeed-lpc-ctrl" -#define HICR5 0x0 +#define HICR5 0x80 #define HICR5_ENL2HBIT(8) #define HICR5_ENFWHBIT(10) -#define HICR7 0x8 -#define HICR8 0xc +#define HICR7 0x88 +#define HICR8 0x8c struct aspeed_lpc_ctrl { struct miscdevice miscdev; diff --git a/drivers/soc/aspeed/aspeed-lpc-snoop.c b/drivers/soc/aspeed/aspeed-lpc-snoop.c index f3d8d53ab84d..7ce5c9fcc73c 100644 --- a/drivers/soc/aspeed/aspeed-lpc-snoop.c +++ b/drivers/soc/aspeed/aspeed-lpc-snoop.c @@ -28,26 +28,25 @@ #define NUM_SNOOP_CHANNELS 2 #define SNOOP_FIFO_SIZE 2048 -#define HICR5 0x0 +#define HICR5 0x80 #define HICR5_EN_SNP0W BIT(0) #define HICR5_ENINT_SNP0W BIT(1) #define HICR5_EN_SNP1W BIT(2) #define HICR5_ENINT_SNP1W BIT(3) - -#define HICR6 0x4 +#define HICR6 0x84 #define HICR6_STR_SNP0WBIT(0) #define HICR6_STR_SNP1WBIT(1) -#define SNPWADR0x10 +#define SNPWADR0x90 #define SNPWADR_CH0_MASK GENMASK(15, 0) #define SNPWADR_CH0_SHIFT 0 #define SNPWADR_CH1_MASK GENMASK(31, 16) #define SNPWADR_CH1_SHIFT 16 -#define SNPWDR 0x14 +#define SNPWDR 0x94 #define SNPWDR_CH0_MASKGENMASK(7, 0) #define SNPWDR_CH0_SHIFT 0 #define SNPWDR_CH1_MASKGENMASK(15, 8) #define SNPWDR_CH1_SHIFT 8 -#define HICRB 0x80 +#define HICRB 0x100 #define HICRB_ENSNP0D BIT(14) #define HICRB_ENSNP1D BIT(15) -- 2.17.1
[PATCH v2 5/5] dt-bindings: aspeed-lpc: Remove LPC partitioning
The LPC controller has no concept of the BMC and the Host partitions. This patch fixes the documentation by removing the description on LPC partitions. The register offsets illustrated in the DTS node examples are also fixed to adapt to the LPC DTS change. Signed-off-by: Chia-Wei, Wang --- .../devicetree/bindings/mfd/aspeed-lpc.txt| 85 +++ 1 file changed, 14 insertions(+), 71 deletions(-) diff --git a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt index a92acf1dd491..866f54a09e09 100644 --- a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt +++ b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt @@ -9,13 +9,7 @@ primary use case of the Aspeed LPC controller is as a slave on the bus conditions it can also take the role of bus master. The LPC controller is represented as a multi-function device to account for the -mix of functionality it provides. The principle split is between the register -layout at the start of the I/O space which is, to quote the Aspeed datasheet, -"basically compatible with the [LPC registers from the] popular BMC controller -H8S/2168[1]", and everything else, where everything else is an eclectic -collection of functions with a esoteric register layout. "Everything else", -here labeled the "host" portion of the controller, includes, but is not limited -to: +mix of functionality, which includes, but is not limited to: * An IPMI Block Transfer[2] Controller @@ -44,8 +38,8 @@ Required properties === - compatible: One of: - "aspeed,ast2400-lpc", "simple-mfd" - "aspeed,ast2500-lpc", "simple-mfd" + "aspeed,ast2400-lpc", "simple-mfd", "syscon" + "aspeed,ast2500-lpc", "simple-mfd", "syscon" - reg: contains the physical address and length values of the Aspeed LPC memory region. 
@@ -55,66 +49,17 @@ Required properties - ranges: Maps 0 to the physical address and length of the LPC memory region -Required LPC Child nodes - - -BMC Node - - -- compatible: One of: - "aspeed,ast2400-lpc-bmc" - "aspeed,ast2500-lpc-bmc" - -- reg: contains the physical address and length values of the -H8S/2168-compatible LPC controller memory region - -Host Node -- - -- compatible: One of: - "aspeed,ast2400-lpc-host", "simple-mfd", "syscon" - "aspeed,ast2500-lpc-host", "simple-mfd", "syscon" - -- reg: contains the address and length values of the host-related -register space for the Aspeed LPC controller - -- #address-cells: <1> -- #size-cells: <1> -- ranges: Maps 0 to the address and length of the host-related LPC memory -region - Example: lpc: lpc@1e789000 { - compatible = "aspeed,ast2500-lpc", "simple-mfd"; + compatible = "aspeed,ast2500-lpc", "simple-mfd", "syscon"; reg = <0x1e789000 0x1000>; #address-cells = <1>; #size-cells = <1>; ranges = <0x0 0x1e789000 0x1000>; - - lpc_bmc: lpc-bmc@0 { - compatible = "aspeed,ast2500-lpc-bmc"; - reg = <0x0 0x80>; - }; - - lpc_host: lpc-host@80 { - compatible = "aspeed,ast2500-lpc-host", "simple-mfd", "syscon"; - reg = <0x80 0x1e0>; - reg-io-width = <4>; - - #address-cells = <1>; - #size-cells = <1>; - ranges = <0x0 0x80 0x1e0>; - }; }; -BMC Node Children -== - - -Host Node Children -== LPC Host Interface Controller --- @@ -145,14 +90,12 @@ Optional properties: Example: -lpc-host@80 { - lpc_ctrl: lpc-ctrl@0 { - compatible = "aspeed,ast2500-lpc-ctrl"; - reg = <0x0 0x80>; - clocks = <&syscon ASPEED_CLK_GATE_LCLK>; - memory-region = <&flash_memory>; - flash = <&spi>; - }; +lpc_ctrl: lpc-ctrl@80 { + compatible = "aspeed,ast2500-lpc-ctrl"; + reg = <0x80 0x80>; + clocks = <&syscon ASPEED_CLK_GATE_LCLK>; + memory-region = <&flash_memory>; + flash = <&spi>; }; LPC Host Controller @@ -174,9 +117,9 @@ Required properties: Example: -lhc: lhc@20 { +lhc: lhc@a0 { compatible = "aspeed,ast2500-lhc"; - reg = <0x20 0x24 0x48 0x8>; + reg = 
<0xa0 0x24 0xc8 0x8>; }; LPC reset control @@ -194,8 +137,8 @@ Required properties: Example: -lpc_reset: reset-controller@18 { +lpc_reset: reset-controller@98 { compatible = "aspeed,ast2500-lpc-reset"; -reg = <0x18 0x4>; +reg = <0x98 0x4>; #reset-cells = <1>; }; -- 2.17.1
[PATCH v2 4/5] pinctrl: aspeed-g5: Fix LPC register offsets
The LPC register offsets are fixed to adapt to the LPC DTS change, where the LPC partitioning is removed. Signed-off-by: Chia-Wei, Wang --- drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c index 0cab4c2576e2..98e62333fa54 100644 --- a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c +++ b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c @@ -60,7 +60,7 @@ #define COND2 { ASPEED_IP_SCU, SCU94, GENMASK(1, 0), 0, 0 } /* LHCR0 is offset from the end of the H8S/2168-compatible registers */ -#define LHCR0 0x20 +#define LHCR0 0xa0 #define GFX064 0x64 #define B14 0 -- 2.17.1
[PATCH v2 0/5] Remove LPC register partitioning
The LPC controller has no concept of the BMC and the Host partitions. The incorrect partitioning can impose unnecessary range restrictions on register access through the syscon regmap interface. For instance, HICRB contains the I/O port address configuration of KCS channel 1/2. However, the KCS#1/#2 drivers cannot access HICRB as it is located at the other LPC partition. In addition, to be backward compatible, the newly added HW control bits could be located at any reserved bits over the LPC addressing space. Thereby, this patch series aims to remove the LPC partitioning for better driver development and maintenance. Changes since v1: - Add the fix to the aspeed-lpc binding documentation. Chia-Wei, Wang (5): ARM: dts: Remove LPC BMC and Host partitions soc: aspeed: Fix LPC register offsets ipmi: kcs: aspeed: Fix LPC register offsets pinctrl: aspeed-g5: Fix LPC register offsets dt-bindings: aspeed-lpc: Remove LPC partitioning .../devicetree/bindings/mfd/aspeed-lpc.txt| 85 ++- arch/arm/boot/dts/aspeed-g4.dtsi | 74 -- arch/arm/boot/dts/aspeed-g5.dtsi | 135 -- arch/arm/boot/dts/aspeed-g6.dtsi | 135 -- drivers/char/ipmi/kcs_bmc_aspeed.c| 13 +- drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c| 2 +- drivers/soc/aspeed/aspeed-lpc-ctrl.c | 6 +- drivers/soc/aspeed/aspeed-lpc-snoop.c | 11 +- 8 files changed, 176 insertions(+), 285 deletions(-) -- 2.17.1
Re: [PATCH] sched/rt: Disable RT_RUNTIME_SHARE by default
On Tue, Sep 22, 2020 at 12:04 PM Daniel Bristot de Oliveira wrote: > > On 9/22/20 7:14 PM, Wei Wang wrote: > > On Mon, Sep 21, 2020 at 7:40 AM Daniel Bristot de Oliveira > > wrote: > >> > >> The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime > >> between CPUs, allowing a CPU to run a real-time task up to 100% of the > >> time while leaving more space for non-real-time tasks to run on the CPU > >> that lend rt_runtime. > >> > >> The problem is that a CPU can easily borrow enough rt_runtime to allow > >> a spinning rt-task to run forever, starving per-cpu tasks like kworkers, > >> which are non-real-time by design. > >> > >> This patch disables RT_RUNTIME_SHARE by default, avoiding this problem. > >> The feature will still be present for users that want to enable it, > >> though. > >> > >> Signed-off-by: Daniel Bristot de Oliveira > >> Cc: Ingo Molnar > >> Cc: Peter Zijlstra > >> Cc: Juri Lelli > >> Cc: Vincent Guittot > >> Cc: Dietmar Eggemann > >> Cc: Steven Rostedt > >> Cc: Ben Segall > >> Cc: Mel Gorman > >> Cc: Daniel Bristot de Oliveira > >> Cc: Thomas Gleixner > >> Cc: linux-kernel@vger.kernel.org > >> --- > >> kernel/sched/features.h | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/kernel/sched/features.h b/kernel/sched/features.h > >> index 7481cd96f391..68d369cba9e4 100644 > >> --- a/kernel/sched/features.h > >> +++ b/kernel/sched/features.h > >> @@ -77,7 +77,7 @@ SCHED_FEAT(WARN_DOUBLE_CLOCK, false) > >> SCHED_FEAT(RT_PUSH_IPI, true) > >> #endif > >> > >> -SCHED_FEAT(RT_RUNTIME_SHARE, true) > >> +SCHED_FEAT(RT_RUNTIME_SHARE, false) > >> SCHED_FEAT(LB_MIN, false) > >> SCHED_FEAT(ATTACH_AGE_LOAD, true) > >> > >> -- > >> 2.26.2 > >> > > > > Tested on an Android device and can no longer see long running RT > > tasks (yes, Android have those for reasons). > > > > So: > > Tested-by: Wei Wang > > ? Yup, thanks for pushing this. > > Thanks! > -- Daniel >
Re: [PATCH] sched/rt: Disable RT_RUNTIME_SHARE by default
On Mon, Sep 21, 2020 at 7:40 AM Daniel Bristot de Oliveira wrote: > > The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime > between CPUs, allowing a CPU to run a real-time task up to 100% of the > time while leaving more space for non-real-time tasks to run on the CPU > that lend rt_runtime. > > The problem is that a CPU can easily borrow enough rt_runtime to allow > a spinning rt-task to run forever, starving per-cpu tasks like kworkers, > which are non-real-time by design. > > This patch disables RT_RUNTIME_SHARE by default, avoiding this problem. > The feature will still be present for users that want to enable it, > though. > > Signed-off-by: Daniel Bristot de Oliveira > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Juri Lelli > Cc: Vincent Guittot > Cc: Dietmar Eggemann > Cc: Steven Rostedt > Cc: Ben Segall > Cc: Mel Gorman > Cc: Daniel Bristot de Oliveira > Cc: Thomas Gleixner > Cc: linux-kernel@vger.kernel.org > --- > kernel/sched/features.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sched/features.h b/kernel/sched/features.h > index 7481cd96f391..68d369cba9e4 100644 > --- a/kernel/sched/features.h > +++ b/kernel/sched/features.h > @@ -77,7 +77,7 @@ SCHED_FEAT(WARN_DOUBLE_CLOCK, false) > SCHED_FEAT(RT_PUSH_IPI, true) > #endif > > -SCHED_FEAT(RT_RUNTIME_SHARE, true) > +SCHED_FEAT(RT_RUNTIME_SHARE, false) > SCHED_FEAT(LB_MIN, false) > SCHED_FEAT(ATTACH_AGE_LOAD, true) > > -- > 2.26.2 > Tested on an Android device and can no longer see long running RT tasks (yes, Android have those for reasons).
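For anyone wanting to reproduce this test before the default changes, scheduler features can be flipped at runtime through the sched_features debugfs file (assumptions: a kernel built with CONFIG_SCHED_DEBUG=y and debugfs mounted at /sys/kernel/debug):

```shell
# Show the current scheduler feature flags; an enabled feature appears as
# RT_RUNTIME_SHARE, a disabled one as NO_RT_RUNTIME_SHARE.
cat /sys/kernel/debug/sched_features

# Disable rt_runtime sharing at runtime, matching this patch's new default.
echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
```

Writing the plain feature name (without the NO_ prefix) re-enables it, so the old behavior stays available to users who want it, as the changelog notes.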
[PATCH 2/4] soc: aspeed: Fix LPC register offsets
The LPC register offsets are fixed to adapt to the LPC DTS change, where the LPC partitioning is removed. Signed-off-by: Chia-Wei, Wang --- drivers/soc/aspeed/aspeed-lpc-ctrl.c | 6 +++--- drivers/soc/aspeed/aspeed-lpc-snoop.c | 11 +-- 2 files changed, 8 insertions(+), 9 deletions(-) diff --git a/drivers/soc/aspeed/aspeed-lpc-ctrl.c b/drivers/soc/aspeed/aspeed-lpc-ctrl.c index 01ed21e8bfee..36faa0618ada 100644 --- a/drivers/soc/aspeed/aspeed-lpc-ctrl.c +++ b/drivers/soc/aspeed/aspeed-lpc-ctrl.c @@ -17,12 +17,12 @@ #define DEVICE_NAME"aspeed-lpc-ctrl" -#define HICR5 0x0 +#define HICR5 0x80 #define HICR5_ENL2HBIT(8) #define HICR5_ENFWHBIT(10) -#define HICR7 0x8 -#define HICR8 0xc +#define HICR7 0x88 +#define HICR8 0x8c struct aspeed_lpc_ctrl { struct miscdevice miscdev; diff --git a/drivers/soc/aspeed/aspeed-lpc-snoop.c b/drivers/soc/aspeed/aspeed-lpc-snoop.c index f3d8d53ab84d..7ce5c9fcc73c 100644 --- a/drivers/soc/aspeed/aspeed-lpc-snoop.c +++ b/drivers/soc/aspeed/aspeed-lpc-snoop.c @@ -28,26 +28,25 @@ #define NUM_SNOOP_CHANNELS 2 #define SNOOP_FIFO_SIZE 2048 -#define HICR5 0x0 +#define HICR5 0x80 #define HICR5_EN_SNP0W BIT(0) #define HICR5_ENINT_SNP0W BIT(1) #define HICR5_EN_SNP1W BIT(2) #define HICR5_ENINT_SNP1W BIT(3) - -#define HICR6 0x4 +#define HICR6 0x84 #define HICR6_STR_SNP0WBIT(0) #define HICR6_STR_SNP1WBIT(1) -#define SNPWADR0x10 +#define SNPWADR0x90 #define SNPWADR_CH0_MASK GENMASK(15, 0) #define SNPWADR_CH0_SHIFT 0 #define SNPWADR_CH1_MASK GENMASK(31, 16) #define SNPWADR_CH1_SHIFT 16 -#define SNPWDR 0x14 +#define SNPWDR 0x94 #define SNPWDR_CH0_MASKGENMASK(7, 0) #define SNPWDR_CH0_SHIFT 0 #define SNPWDR_CH1_MASKGENMASK(15, 8) #define SNPWDR_CH1_SHIFT 8 -#define HICRB 0x80 +#define HICRB 0x100 #define HICRB_ENSNP0D BIT(14) #define HICRB_ENSNP1D BIT(15) -- 2.17.1
[PATCH 0/4] Remove LPC register partitioning
The LPC controller has no concept of the BMC and the Host partitions. The incorrect partitioning can impose unnecessary range restrictions on register access through the syscon regmap interface. For instance, HICRB contains the I/O port address configuration of KCS channel 1/2. However, the KCS#1/#2 drivers cannot access HICRB as it is located at the other LPC partition. In addition, to be backward compatible, the newly added HW control bits could be added at any reserved bits over the LPC addressing space. Thereby, this patch series aims to remove the LPC partitioning for better driver development and maintenance. Chia-Wei, Wang (4): ARM: dts: Remove LPC BMC and Host partitions soc: aspeed: Fix LPC register offsets ipmi: kcs: aspeed: Fix LPC register offsets pinctrl: aspeed-g5: Fix LPC register offsets arch/arm/boot/dts/aspeed-g4.dtsi | 74 +-- arch/arm/boot/dts/aspeed-g5.dtsi | 135 + arch/arm/boot/dts/aspeed-g6.dtsi | 135 + drivers/char/ipmi/kcs_bmc_aspeed.c | 13 +- drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c | 2 +- drivers/soc/aspeed/aspeed-lpc-ctrl.c | 6 +- drivers/soc/aspeed/aspeed-lpc-snoop.c | 11 +- 7 files changed, 162 insertions(+), 214 deletions(-) -- 2.17.1
[PATCH 4/4] pinctrl: aspeed-g5: Fix LPC register offsets
The LPC register offsets are fixed to adapt to the LPC DTS change, where the LPC partitioning is removed. Signed-off-by: Chia-Wei, Wang --- drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c index 0cab4c2576e2..98e62333fa54 100644 --- a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c +++ b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c @@ -60,7 +60,7 @@ #define COND2 { ASPEED_IP_SCU, SCU94, GENMASK(1, 0), 0, 0 } /* LHCR0 is offset from the end of the H8S/2168-compatible registers */ -#define LHCR0 0x20 +#define LHCR0 0xa0 #define GFX064 0x64 #define B14 0 -- 2.17.1
[PATCH 3/4] ipmi: kcs: aspeed: Fix LPC register offsets
The LPC register offsets are fixed to adapt to the LPC DTS change, where the LPC partitioning is removed. Signed-off-by: Chia-Wei, Wang --- drivers/char/ipmi/kcs_bmc_aspeed.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c b/drivers/char/ipmi/kcs_bmc_aspeed.c index a140203c079b..8843cf867a5d 100644 --- a/drivers/char/ipmi/kcs_bmc_aspeed.c +++ b/drivers/char/ipmi/kcs_bmc_aspeed.c @@ -27,7 +27,6 @@ #define KCS_CHANNEL_MAX 4 -/* mapped to lpc-bmc@0 IO space */ #define LPC_HICR00x000 #define LPC_HICR0_LPC3E BIT(7) #define LPC_HICR0_LPC2E BIT(6) @@ -52,15 +51,13 @@ #define LPC_STR1 0x03C #define LPC_STR2 0x040 #define LPC_STR3 0x044 - -/* mapped to lpc-host@80 IO space */ -#define LPC_HICRB0x080 +#define LPC_HICRB0x100 #define LPC_HICRB_IBFIF4 BIT(1) #define LPC_HICRB_LPC4E BIT(0) -#define LPC_LADR40x090 -#define LPC_IDR4 0x094 -#define LPC_ODR4 0x098 -#define LPC_STR4 0x09C +#define LPC_LADR40x110 +#define LPC_IDR4 0x114 +#define LPC_ODR4 0x118 +#define LPC_STR4 0x11C struct aspeed_kcs_bmc { struct regmap *map; -- 2.17.1
[PATCH 1/4] ARM: dts: Remove LPC BMC and Host partitions
The LPC controller has no concept of the BMC and the Host partitions. A concrete instance is that the HICRB[5:4] are for the I/O port address configuration of KCS channel 1/2. However, the KCS driver cannot access HICRB for channel 1/2 initialization via the syscon regmap interface due to the partition boundary (i.e. offset 80h). In addition, to be backward compatible, the newly added HW control bits could be located at any reserved bits over the LPC addressing space. Therefore, this patch removes the lpc-bmc and lpc-host child nodes and thus the LPC partitioning for better driver development and maintenance. Signed-off-by: Chia-Wei, Wang --- arch/arm/boot/dts/aspeed-g4.dtsi | 74 +++-- arch/arm/boot/dts/aspeed-g5.dtsi | 135 ++- arch/arm/boot/dts/aspeed-g6.dtsi | 135 ++- 3 files changed, 148 insertions(+), 196 deletions(-) diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi b/arch/arm/boot/dts/aspeed-g4.dtsi index 82f0213e3a3c..22996b3c4a00 100644 --- a/arch/arm/boot/dts/aspeed-g4.dtsi +++ b/arch/arm/boot/dts/aspeed-g4.dtsi @@ -338,58 +338,44 @@ }; lpc: lpc@1e789000 { - compatible = "aspeed,ast2400-lpc", "simple-mfd"; + compatible = "aspeed,ast2400-lpc", "simple-mfd", "syscon"; reg = <0x1e789000 0x1000>; + reg-io-width = <4>; #address-cells = <1>; #size-cells = <1>; ranges = <0x0 0x1e789000 0x1000>; - lpc_bmc: lpc-bmc@0 { - compatible = "aspeed,ast2400-lpc-bmc"; - reg = <0x0 0x80>; + lpc_ctrl: lpc-ctrl@80 { - compatible = "aspeed,ast2400-lpc-ctrl"; + compatible = "aspeed,ast2400-lpc-ctrl"; + reg = <0x80 0x10>; + clocks = <&syscon ASPEED_CLK_GATE_LCLK>; + status = "disabled"; }; - lpc_host: lpc-host@80 { - compatible = "aspeed,ast2400-lpc-host", "simple-mfd", "syscon"; - reg = <0x80 0x1e0>; - reg-io-width = <4>; - - #address-cells = <1>; - #size-cells = <1>; - ranges = <0x0 0x80 0x1e0>; - - lpc_ctrl: lpc-ctrl@0 { - compatible = "aspeed,ast2400-lpc-ctrl"; - reg = <0x0 0x10>; - clocks = <&syscon ASPEED_CLK_GATE_LCLK>; - status = "disabled"; - }; - - lpc_snoop: lpc-snoop@10 { - compatible = "aspeed,ast2400-lpc-snoop"; - reg
= <0x10 0x8>; - interrupts = <8>; - status = "disabled"; - }; - - lhc: lhc@20 { - compatible = "aspeed,ast2400-lhc"; - reg = <0x20 0x24 0x48 0x8>; - }; - - lpc_reset: reset-controller@18 { - compatible = "aspeed,ast2400-lpc-reset"; - reg = <0x18 0x4>; - #reset-cells = <1>; - }; - - ibt: ibt@c0 { - compatible = "aspeed,ast2400-ibt-bmc"; - reg = <0xc0 0x18>; - interrupts = <8>; - status = "disabled"; - }; + lpc_snoop: lpc-snoop@90 { + compatible = "aspeed,ast2400-lpc-snoop
Re: [PATCH] virtio_balloon: fix up endian-ness for free cmd id
On 07/28/2020 12:03 AM, Michael S. Tsirkin wrote: free cmd id is read using virtio endian, spec says all fields in balloon are LE. Fix it up. Fixes: 86a559787e6f ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT") Cc: sta...@vger.kernel.org Signed-off-by: Michael S. Tsirkin --- drivers/virtio/virtio_balloon.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 774deb65a9bb..798ec304fe3e 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -578,10 +578,14 @@ static int init_vqs(struct virtio_balloon *vb) static u32 virtio_balloon_cmd_id_received(struct virtio_balloon *vb) { if (test_and_clear_bit(VIRTIO_BALLOON_CONFIG_READ_CMD_ID, - &vb->config_read_bitmap)) + &vb->config_read_bitmap)) { virtio_cread(vb->vdev, struct virtio_balloon_config, free_page_hint_cmd_id, &vb->cmd_id_received_cache); + /* Legacy balloon config space is LE, unlike all other devices. */ + if (!virtio_has_feature(vb->vdev, VIRTIO_F_VERSION_1)) + vb->cmd_id_received_cache = le32_to_cpu((__force __le32)vb->cmd_id_received_cache); Seems it exceeds the 80-character line limit. Reviewed-by: Wei Wang Best, Wei
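The fix above hinges on `le32_to_cpu()` being a no-op on little-endian hosts and a byte swap on big-endian ones. A portable way to see the same semantics, independent of host byte order, is to assemble the value byte by byte. This is a standalone model for illustration, not the kernel helper itself.

```c
/* Sketch of what le32_to_cpu() achieves: reassemble a 32-bit value from
 * little-endian byte order regardless of host endianness. The kernel's
 * helper compiles to a no-op on LE hosts and a swap on BE hosts; reading
 * byte-by-byte gives the identical result portably.
 */
#include <stdint.h>

static uint32_t le32_to_host(const uint8_t le[4])
{
	return (uint32_t)le[0] |
	       ((uint32_t)le[1] << 8) |
	       ((uint32_t)le[2] << 16) |
	       ((uint32_t)le[3] << 24);
}
```

Without the conversion, a big-endian guest would interpret the LE-encoded free page hint cmd id with its bytes reversed, which is exactly the bug the patch closes for legacy (pre-VERSION_1) devices.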
[tip: perf/core] perf/x86: Fix variable types for LBR registers
The following commit has been merged into the perf/core branch of tip: Commit-ID: 3cb9d5464c1ceea86f6225089b2f7965989cf316 Gitweb: https://git.kernel.org/tip/3cb9d5464c1ceea86f6225089b2f7965989cf316 Author:Wei Wang AuthorDate:Sat, 13 Jun 2020 16:09:46 +08:00 Committer: Peter Zijlstra CommitterDate: Thu, 02 Jul 2020 15:51:45 +02:00 perf/x86: Fix variable types for LBR registers The MSR variable type can be 'unsigned int', which uses less memory than the longer 'unsigned long'. Fix 'struct x86_pmu' for that. The lbr_nr won't be a negative number, so make it 'unsigned int' as well. Suggested-by: Peter Zijlstra (Intel) Signed-off-by: Wei Wang Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200613080958.132489-2-like...@linux.intel.com --- arch/x86/events/perf_event.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index e17a3d8..eb37f6c 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -673,8 +673,8 @@ struct x86_pmu { /* * Intel LBR */ - unsigned long lbr_tos, lbr_from, lbr_to; /* MSR base regs */ - int lbr_nr;/* hardware stack size */ + unsigned intlbr_tos, lbr_from, lbr_to, + lbr_nr;/* LBR base regs and size */ u64 lbr_sel_mask; /* LBR_SELECT valid bits */ const int *lbr_sel_map; /* lbr_select mappings */ boollbr_double_abort; /* duplicated lbr aborts */
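The space saving claimed in the commit message can be checked directly: on LP64 targets an `unsigned long` is 8 bytes, so three of them plus a padded `int` take more room than four `unsigned int` fields. The structs below are standalone models of just the changed fields, not the real `struct x86_pmu`.

```c
/* Sketch: size effect of narrowing the LBR fields, modeling only the
 * fields touched by the patch. On LP64, the old layout is 3*8 + 4 (+4
 * padding) = 32 bytes; the new one is 4*4 = 16 bytes.
 */
#include <stddef.h>

struct lbr_regs_old {
	unsigned long lbr_tos, lbr_from, lbr_to;  /* MSR base regs */
	int lbr_nr;                               /* hardware stack size */
};

struct lbr_regs_new {
	unsigned int lbr_tos, lbr_from, lbr_to,
		     lbr_nr;                      /* LBR base regs and size */
};
```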
Re: [PATCH] cpufreq: schedutil: force frequency update when limits change
On Thu, Jun 25, 2020 at 7:32 PM Viresh Kumar wrote: > > On 26-06-20, 07:44, Viresh Kumar wrote: > > On 25-06-20, 13:47, Wei Wang wrote: > > > On Thu, Jun 25, 2020 at 3:23 AM Viresh Kumar > > > wrote: > > > > I am sorry but I am not fully sure of what the problem is. Can you > > > > describe that by giving an example with some random frequency, and > > > > tell the expected and actual behavior ? > > > > > > > The problem is sugov thought next_freq already updated (but actually > > > skipped by the rate limit thing) and all following updates will be > > > skipped. > > The rate-limiting thing is specific to android and not present in > mainline. Even in android I see next_freq getting updated only after > rate-limiting is verified. > > I think you maybe trying to fix an android only problem in mainline, > which may not be required at all. And I am not sure if Android has a > problem as well :) > Yes, that is Android specific, I added you to the Gerrit already. Thanks! -Wei > > I am sorry, can you please give a detailed example with existing > > frequency and limits, then the limits changed to new values, then what > > exactly happens ? > > > > > Actually this is specifically for Android common kernel 4.19's issue > > > which has sugov_up_down_rate_limit in sugov_update_next_freq, let's > > > continue discussion there. > > > > If it is a mainline problem, we will surely get it fixed here. Just > > that I am not able to understand the problem yet. Sorry about that. > > -- > viresh
Re: [PATCH] cpufreq: schedutil: force frequency update when limits change
On Thu, Jun 25, 2020 at 3:23 AM Viresh Kumar wrote: > > On 24-06-20, 23:46, Wei Wang wrote: > > To avoid reducing the frequency of a CPU prematurely, we skip reducing > > the frequency if the CPU had been busy recently. > > > > This should not be done when the limits of the policy are changed, for > > example due to thermal throttling. We should always get the frequency > > within the new limits as soon as possible. > > > > There was a fix in > > commit 600f5badb78c ("cpufreq: schedutil: Don't skip freq update when > > limits change") upstream which introduced another flag. However, the > > fix didn't address the case when next_freq is the same as previously > > voted, which is then checked in sugov_update_next_freq. As a result, the > > frequency would be stuck at low until the high demanding workload quits. > > > > test trace: > > kworker/u19:0-1872 ( 1872) [002] 347.878871: > > cpu_frequency_limits: min=60 max=2348000 cpu_id=6 > > dhry64-11525 (11525) [007] d.h2 347.880012: > > sugov_should_update_freq: thermal limit on policy6 > > dhry64-11525 (11525) [007] d.h2 347.880012: > > sugov_deferred_update: policy6 skipped update > > dhry64-11525 (11525) [007] d.h2 347.884040: > > sugov_deferred_update: policy6 skipped update > > I am not sure these are helpful in the logs as the code which > generated them isn't there in the kernel. > Yes, those traceprintk were added to those particular functions to help debug. > > ... > > > > This patch fixes this by skipping the check and forcing an update in > > this case. The second flag was kept as the limits_change flag could be > > updated in thermal kworker from another CPU. > > I am sorry but I am not fully sure of what the problem is. Can you > describe that by giving an example with some random frequency, and > tell the expected and actual behavior ? > The problem is sugov thought next_freq already updated (but actually skipped by the rate limit thing) and all following updates will be skipped. 
Actually this is specifically for Android common kernel 4.19's issue which has sugov_up_down_rate_limit in sugov_update_next_freq, let's continue discussion there. Thanks! -Wei > > Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX") > > Signed-off-by: Wei Wang > > --- > > kernel/sched/cpufreq_schedutil.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/sched/cpufreq_schedutil.c > > b/kernel/sched/cpufreq_schedutil.c > > index 7fbaee24c824..dc2cd768022e 100644 > > --- a/kernel/sched/cpufreq_schedutil.c > > +++ b/kernel/sched/cpufreq_schedutil.c > > @@ -102,11 +102,12 @@ static bool sugov_should_update_freq(struct > > sugov_policy *sg_policy, u64 time) > > static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 > > time, > > unsigned int next_freq) > > { > > - if (sg_policy->next_freq == next_freq) > > + if (!sg_policy->need_freq_update && sg_policy->next_freq == next_freq) > > AFAIU, if the next freq is same as currently programmed one, there is > no need to force update it. > > > return false; > > > > sg_policy->next_freq = next_freq; > > sg_policy->last_freq_update_time = time; > > + sg_policy->need_freq_update = false; > > > > return true; > > } > > @@ -178,7 +179,6 @@ static unsigned int get_next_freq(struct sugov_policy > > *sg_policy, > > if (freq == sg_policy->cached_raw_freq && > > !sg_policy->need_freq_update) > > return sg_policy->next_freq; > > > > - sg_policy->need_freq_update = false; > > sg_policy->cached_raw_freq = freq; > > return cpufreq_driver_resolve_freq(policy, freq); > > } > > -- > > 2.27.0.212.ge8ba1cc988-goog > > -- > viresh
[PATCH] cpufreq: schedutil: force frequency update when limits change
To avoid reducing the frequency of a CPU prematurely, we skip reducing the frequency if the CPU had been busy recently. This should not be done when the limits of the policy are changed, for example due to thermal throttling. We should always get the frequency within the new limits as soon as possible. There was a fix in commit 600f5badb78c ("cpufreq: schedutil: Don't skip freq update when limits change") upstream which introduced another flag. However, the fix didn't address the case when next_freq is the same as previously voted, which is then checked in sugov_update_next_freq. As a result, the frequency would be stuck low until the highly demanding workload quits. test trace: kworker/u19:0-1872 ( 1872) [002] 347.878871: cpu_frequency_limits: min=60 max=2348000 cpu_id=6 dhry64-11525 (11525) [007] d.h2 347.880012: sugov_should_update_freq: thermal limit on policy6 dhry64-11525 (11525) [007] d.h2 347.880012: sugov_deferred_update: policy6 skipped update dhry64-11525 (11525) [007] d.h2 347.884040: sugov_deferred_update: policy6 skipped update ... This patch fixes this by skipping the check and forcing an update in this case. The second flag was kept as the limits_change flag could be updated in the thermal kworker from another CPU.
Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX") Signed-off-by: Wei Wang --- kernel/sched/cpufreq_schedutil.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 7fbaee24c824..dc2cd768022e 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -102,11 +102,12 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time, unsigned int next_freq) { - if (sg_policy->next_freq == next_freq) + if (!sg_policy->need_freq_update && sg_policy->next_freq == next_freq) return false; sg_policy->next_freq = next_freq; sg_policy->last_freq_update_time = time; + sg_policy->need_freq_update = false; return true; } @@ -178,7 +179,6 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy, if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update) return sg_policy->next_freq; - sg_policy->need_freq_update = false; sg_policy->cached_raw_freq = freq; return cpufreq_driver_resolve_freq(policy, freq); } -- 2.27.0.212.ge8ba1cc988-goog
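The behavioral change in the diff above is small enough to model standalone: with the patch, an update is no longer skipped merely because `next_freq` matches the last vote when a limits change has set `need_freq_update`. The struct and function below mirror the patched logic for illustration only; they are not kernel code.

```c
/* Minimal model of the patched sugov_update_next_freq() check. A limits
 * change sets need_freq_update, which now forces the update through even
 * when the requested frequency equals the previously voted one.
 */
#include <stdbool.h>

struct sg_policy_model {
	unsigned int next_freq;
	bool need_freq_update;
};

static bool update_next_freq(struct sg_policy_model *sg, unsigned int next_freq)
{
	if (!sg->need_freq_update && sg->next_freq == next_freq)
		return false;	/* nothing changed, skip the update */

	sg->next_freq = next_freq;
	sg->need_freq_update = false;
	return true;		/* caller goes on to program the hardware */
}
```

Before the patch, the `need_freq_update` term was absent, so a same-frequency request after a limits change returned false and the clamp to the new limits never happened.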
Re: [PATCH] thermal: create softlink by name for thermal_zone and cooling_device
On Wed, Oct 16, 2019 at 10:16 AM Amit Kucheria wrote: > > On Wed, Oct 16, 2019 at 10:20 PM Amit Kucheria > wrote: > > > > On Tue, Oct 15, 2019 at 11:43 AM Wei Wang wrote: > > > > > > The paths thermal_zone%d and cooling_device%d are not intuitive and the > > > numbers are subject to change due to device tree change. This usually > > > leads to tree traversal in userspace code. > > > The patch creates `tz-by-name' and `cdev-by-name' for thermal zone and > > > cooling_device respectively. > > > > I like this. > > > > > Signed-off-by: Wei Wang > > > --- > > > drivers/thermal/thermal_core.c | 23 +-- > > > 1 file changed, 21 insertions(+), 2 deletions(-) > > > > > > diff --git a/drivers/thermal/thermal_core.c > > > b/drivers/thermal/thermal_core.c > > > index d4481cc8958f..0ff8fb1d7b0a 100644 > > > --- a/drivers/thermal/thermal_core.c > > > +++ b/drivers/thermal/thermal_core.c > > > @@ -22,6 +22,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > > > > #define CREATE_TRACE_POINTS > > > #include > > > @@ -46,6 +47,8 @@ static DEFINE_MUTEX(poweroff_lock); > > > > > > static atomic_t in_suspend; > > > static bool power_off_triggered; > > > +static struct kobject *cdev_link_kobj; > > > +static struct kobject *tz_link_kobj; > > > > > > static struct thermal_governor *def_governor; > > > > > > @@ -954,7 +957,7 @@ __thermal_cooling_device_register(struct device_node > > > *np, > > > struct thermal_zone_device *pos = NULL; > > > int result; > > > > > > - if (type && strlen(type) >= THERMAL_NAME_LENGTH) > > > + if (!type || !type[0] || strlen(type) >= THERMAL_NAME_LENGTH) > > > return ERR_PTR(-EINVAL); > > > > This should be a separate fix, if needed. Agree, but the link now requires treating "" as an invalid _type_.
> > > if (!ops || !ops->get_max_state || !ops->get_cur_state || > > > @@ -989,9 +992,15 @@ __thermal_cooling_device_register(struct device_node > > > *np, > > > return ERR_PTR(result); > > > } > > > > > > - /* Add 'this' new cdev to the global cdev list */ > > > + /* Add 'this' new cdev to the global cdev list and create link*/ > > > mutex_lock(&thermal_list_lock); > > > list_add(&cdev->node, &thermal_cdev_list); > > > + if (!cdev_link_kobj) > > > + cdev_link_kobj = kobject_create_and_add("cdev-by-name", > > > + cdev->device.kobj.parent); > > > + if (!cdev_link_kobj || sysfs_create_link(cdev_link_kobj, > > > + &cdev->device.kobj, > > > cdev->type)) > > > + dev_err(&cdev->device, "Failed to create cdev-by-name > > > link\n"); > > > > Any reason not to use the following form instead? It seems easier to read. > > > > if (!cdev_link_kobj) { > >cdev_link_kobj = kobject_create_and_add("cdev-by-name", > >cdev->device.kobj.parent); > > ret = sysfs_create_link(cdev_link_kobj, > > &cdev->device.kobj, > > cdev->type)) > > if (ret) > >dev_err(&cdev->device, "Failed to create > > cdev-by-name link\n"); > > } > > I can now see why you had to do that - none of the other links would > get created after the first one. > > Perhaps create the directories in the __init functions and only create > the links here? > AFAICT, there is no such API except the private get_device_parent() under drivers/base/. Also, the lazy initialization makes sense in such a case, when there is no thermal device attached. Looks like the class dir is also lazy-initialized when the first device is registered: https://elixir.bootlin.com/linux/v5.3.5/source/drivers/base/core.c#L1790. > > > > mutex_unlock(&thermal_list_lock); > > > > > > /* Update binding information for 'this' new cdev */ > > > @@ -1157,6 +1166,8 @@ void thermal_cooling_device_unregister(struct > > > thermal_cooling_device *cdev
[PATCH] thermal: create softlink by name for thermal_zone and cooling_device
The paths thermal_zone%d and cooling_device%d are not intuitive and the numbers are subject to change due to device tree change. This usually leads to tree traversal in userspace code. The patch creates `tz-by-name' and `cdev-by-name' for thermal zone and cooling_device respectively. Signed-off-by: Wei Wang --- drivers/thermal/thermal_core.c | 23 +-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index d4481cc8958f..0ff8fb1d7b0a 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -22,6 +22,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS #include @@ -46,6 +47,8 @@ static DEFINE_MUTEX(poweroff_lock); static atomic_t in_suspend; static bool power_off_triggered; +static struct kobject *cdev_link_kobj; +static struct kobject *tz_link_kobj; static struct thermal_governor *def_governor; @@ -954,7 +957,7 @@ __thermal_cooling_device_register(struct device_node *np, struct thermal_zone_device *pos = NULL; int result; - if (type && strlen(type) >= THERMAL_NAME_LENGTH) + if (!type || !type[0] || strlen(type) >= THERMAL_NAME_LENGTH) return ERR_PTR(-EINVAL); if (!ops || !ops->get_max_state || !ops->get_cur_state || @@ -989,9 +992,15 @@ __thermal_cooling_device_register(struct device_node *np, return ERR_PTR(result); } - /* Add 'this' new cdev to the global cdev list */ + /* Add 'this' new cdev to the global cdev list and create link*/ mutex_lock(&thermal_list_lock); list_add(&cdev->node, &thermal_cdev_list); + if (!cdev_link_kobj) + cdev_link_kobj = kobject_create_and_add("cdev-by-name", + cdev->device.kobj.parent); + if (!cdev_link_kobj || sysfs_create_link(cdev_link_kobj, + &cdev->device.kobj, cdev->type)) + dev_err(&cdev->device, "Failed to create cdev-by-name link\n"); mutex_unlock(&thermal_list_lock); /* Update binding information for 'this' new cdev */ @@ -1157,6 +1166,8 @@ void thermal_cooling_device_unregister(struct thermal_cooling_device 
*cdev) } } } + if (cdev_link_kobj) + sysfs_remove_link(cdev_link_kobj, cdev->type); mutex_unlock(&thermal_list_lock); @@ -1340,6 +1351,12 @@ thermal_zone_device_register(const char *type, int trips, int mask, mutex_lock(&thermal_list_lock); list_add_tail(&tz->node, &thermal_tz_list); + if (!tz_link_kobj) + tz_link_kobj = kobject_create_and_add("tz-by-name", + tz->device.kobj.parent); + if (!tz_link_kobj || sysfs_create_link(tz_link_kobj, + &tz->device.kobj, tz->type)) + dev_err(&tz->device, "Failed to create tz-by-name link\n"); mutex_unlock(&thermal_list_lock); /* Bind cooling devices for this zone */ @@ -1411,6 +1428,8 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz) } } } + if (tz_link_kobj) + sysfs_remove_link(tz_link_kobj, tz->type); mutex_unlock(&thermal_list_lock); -- 2.23.0.700.g56cf767bdb-goog
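The tightened `type` check in this patch exists because an empty or oversized name cannot serve as a symlink name under cdev-by-name. The predicate can be pulled out and exercised on its own; this is a standalone model mirroring the patched condition, with `THERMAL_NAME_LENGTH` reproduced at its kernel value of 20.

```c
/* Sketch of the tightened cooling-device type validation from the patch:
 * NULL and "" are now rejected alongside over-length names, since each
 * type must be usable as a cdev-by-name symlink name.
 */
#include <stdbool.h>
#include <string.h>

#define THERMAL_NAME_LENGTH 20	/* as defined in include/linux/thermal.h */

static bool cdev_type_valid(const char *type)
{
	return type && type[0] && strlen(type) < THERMAL_NAME_LENGTH;
}
```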
[PATCH 2/2] dt-bindings: peci: aspeed: Add AST2600 compatible
Document the AST2600 PECI controller compatible string. Signed-off-by: Chia-Wei, Wang --- Documentation/devicetree/bindings/peci/peci-aspeed.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/Documentation/devicetree/bindings/peci/peci-aspeed.txt b/Documentation/devicetree/bindings/peci/peci-aspeed.txt index cdca73a3b7d8..cddd2d2aa58f 100644 --- a/Documentation/devicetree/bindings/peci/peci-aspeed.txt +++ b/Documentation/devicetree/bindings/peci/peci-aspeed.txt @@ -4,6 +4,7 @@ Required properties: - compatible: Should be one of: "aspeed,ast2400-peci" "aspeed,ast2500-peci" + "aspeed,ast2600-peci" - reg : Should contain PECI controller registers location and length. - #address-cells: Should be <1> required to define a client address. -- 2.17.1
[PATCH 0/2] peci: aspeed: Add AST2600 compatible
Update the Aspeed PECI driver with the AST2600 compatible string. A new compatible string is needed for the extended HW features of the AST2600. Chia-Wei, Wang (2): peci: aspeed: Add AST2600 compatible string dt-bindings: peci: aspeed: Add AST2600 compatible Documentation/devicetree/bindings/peci/peci-aspeed.txt | 1 + drivers/peci/peci-aspeed.c | 1 + 2 files changed, 2 insertions(+) -- 2.17.1
[PATCH 1/2] peci: aspeed: Add AST2600 compatible string
The AST2600 SoC contains the same register set as AST25xx. Signed-off-by: Chia-Wei, Wang --- drivers/peci/peci-aspeed.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/peci/peci-aspeed.c b/drivers/peci/peci-aspeed.c index 51cb2563ceb6..4eed119dc83b 100644 --- a/drivers/peci/peci-aspeed.c +++ b/drivers/peci/peci-aspeed.c @@ -485,6 +485,7 @@ static int aspeed_peci_remove(struct platform_device *pdev) static const struct of_device_id aspeed_peci_of_table[] = { { .compatible = "aspeed,ast2400-peci", }, { .compatible = "aspeed,ast2500-peci", }, + { .compatible = "aspeed,ast2600-peci", }, { } }; MODULE_DEVICE_TABLE(of, aspeed_peci_of_table); -- 2.17.1
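Why a one-line table addition is enough to enable the driver on a new SoC: the OF core binds a driver by comparing the device tree's `compatible` string against each entry of the driver's `of_device_id` table until it hits the empty terminator. The code below is a simplified standalone model of that matching, not the kernel implementation.

```c
/* Simplified model of of_device_id matching: the core walks the table,
 * comparing the DT "compatible" string against each entry until a match
 * or the empty sentinel. Adding the ast2600 entry lets the unchanged
 * driver bind on the new SoC.
 */
#include <stdbool.h>
#include <string.h>

struct of_device_id_model {
	const char *compatible;
};

static const struct of_device_id_model aspeed_peci_of_table[] = {
	{ .compatible = "aspeed,ast2400-peci" },
	{ .compatible = "aspeed,ast2500-peci" },
	{ .compatible = "aspeed,ast2600-peci" },	/* the added entry */
	{ }	/* sentinel */
};

static bool of_table_matches(const struct of_device_id_model *tbl,
			     const char *compat)
{
	for (; tbl->compatible; tbl++)
		if (strcmp(tbl->compatible, compat) == 0)
			return true;
	return false;
}
```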
Re: [PATCH v2] ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule
On Tue, Sep 24, 2019 at 7:01 AM Jason A. Donenfeld wrote: > > Commit 7d9e5f422150 removed references from certain dsts, but accounting > for this never translated down into the fib6 suppression code. This bug > was triggered by WireGuard users who use wg-quick(8), which uses the > "suppress-prefix" directive to ip-rule(8) for routing all of their > internet traffic without routing loops. The test case added here > causes the reference underflow by causing packets to evaluate a suppress > rule. > > Cc: sta...@vger.kernel.org > Fixes: 7d9e5f422150 ("ipv6: convert major tx path to use > RT6_LOOKUP_F_DST_NOREF") > Signed-off-by: Jason A. Donenfeld > --- Acked-by: Wei Wang Good catch. Thanks for the fix. > net/ipv6/fib6_rules.c| 3 ++- > tools/testing/selftests/net/fib_tests.sh | 17 - > 2 files changed, 18 insertions(+), 2 deletions(-) > > diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c > index d22b6c140f23..f9e8fe3ff0c5 100644 > --- a/net/ipv6/fib6_rules.c > +++ b/net/ipv6/fib6_rules.c > @@ -287,7 +287,8 @@ static bool fib6_rule_suppress(struct fib_rule *rule, > struct fib_lookup_arg *arg > return false; > > suppress_route: > - ip6_rt_put(rt); > + if (!(arg->flags & FIB_LOOKUP_NOREF)) > + ip6_rt_put(rt); > return true; > } > > diff --git a/tools/testing/selftests/net/fib_tests.sh > b/tools/testing/selftests/net/fib_tests.sh > index 4465fc2dae14..c2c5f2bf0f95 100755 > --- a/tools/testing/selftests/net/fib_tests.sh > +++ b/tools/testing/selftests/net/fib_tests.sh > @@ -9,7 +9,7 @@ ret=0 > ksft_skip=4 > > # all tests in this script. 
Can be overridden with -t option > -TESTS="unregister down carrier nexthop ipv6_rt ipv4_rt ipv6_addr_metric > ipv4_addr_metric ipv6_route_metrics ipv4_route_metrics ipv4_route_v6_gw > rp_filter" > +TESTS="unregister down carrier nexthop suppress ipv6_rt ipv4_rt > ipv6_addr_metric ipv4_addr_metric ipv6_route_metrics ipv4_route_metrics > ipv4_route_v6_gw rp_filter" > > VERBOSE=0 > PAUSE_ON_FAIL=no > @@ -614,6 +614,20 @@ fib_nexthop_test() > cleanup > } > > +fib_suppress_test() > +{ > + $IP link add dummy1 type dummy > + $IP link set dummy1 up > + $IP -6 route add default dev dummy1 > + $IP -6 rule add table main suppress_prefixlength 0 > + ping -f -c 1000 -W 1 1234::1 || true > + $IP -6 rule del table main suppress_prefixlength 0 > + $IP link del dummy1 > + > + # If we got here without crashing, we're good. > + return 0 > +} > + > > > # Tests on route add and replace > > @@ -1591,6 +1605,7 @@ do > fib_carrier_test|carrier) fib_carrier_test;; > fib_rp_filter_test|rp_filter) fib_rp_filter_test;; > fib_nexthop_test|nexthop) fib_nexthop_test;; > + fib_suppress_test|suppress) fib_suppress_test;; > ipv6_route_test|ipv6_rt)ipv6_route_test;; > ipv4_route_test|ipv4_rt)ipv4_route_test;; > ipv6_addr_metric) ipv6_addr_metric_test;; > -- > 2.21.0 >
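The reference underflow fixed above follows a simple pattern: a `FIB_LOOKUP_NOREF` lookup hands back the route without taking a reference, so an unconditional `ip6_rt_put()` on the suppress path drops a hold that was never taken. The sketch below models that invariant with a plain counter; it is an illustration of the pairing rule, not the networking code itself.

```c
/* Minimal model of the refcount pairing fixed in fib6_rule_suppress():
 * a NOREF lookup takes no reference, so the suppress path must only
 * drop a reference when the lookup actually took one.
 */
#define FIB_LOOKUP_NOREF 0x1u

struct rt_model {
	int refcnt;
};

static void rt_lookup(struct rt_model *rt, unsigned int flags)
{
	if (!(flags & FIB_LOOKUP_NOREF))
		rt->refcnt++;	/* referenced lookup takes a hold */
}

static void rt_suppress(struct rt_model *rt, unsigned int flags)
{
	if (!(flags & FIB_LOOKUP_NOREF))
		rt->refcnt--;	/* only drop what the lookup took */
}
```

Before the fix, the suppress path decremented unconditionally, so every suppressed NOREF lookup pushed the count one below where it started — the underflow the WireGuard `suppress-prefix` setup kept triggering.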
Re: [PATCH v8 13/14] KVM/x86/vPMU: check the lbr feature before entering guest
On 08/06/2019 03:16 PM, Wei Wang wrote: The guest can access the lbr related msrs only when the vcpu's lbr event has been assigned the lbr feature. A cpu pinned lbr event (though no such event usages in the current upstream kernel) could reclaim the lbr feature from the vcpu's lbr event (task pinned) via ipi calls. If the cpu is running in the non-root mode, this will cause the cpu to vm-exit to handle the host ipi and then vm-entry back to the guest. So on vm-entry (where interrupt has been disabled), we double confirm that the vcpu's lbr event is still assigned the lbr feature via checking event->oncpu. The pass-through of the lbr related msrs will be cancelled if the lbr is reclaimed, and the following guest accesses to the lbr related msrs will vm-exit to the related msr emulation handler in kvm, which will prevent the accesses. Signed-off-by: Wei Wang --- arch/x86/kvm/pmu.c | 6 ++ arch/x86/kvm/pmu.h | 3 +++ arch/x86/kvm/vmx/pmu_intel.c | 35 +++ arch/x86/kvm/x86.c | 13 + 4 files changed, 57 insertions(+) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index afad092..ed10a57 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -339,6 +339,12 @@ bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu) return false; } +void kvm_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu) +{ + if (kvm_x86_ops->pmu_ops->enabled_feature_confirm) + kvm_x86_ops->pmu_ops->enabled_feature_confirm(vcpu); +} + void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu) { if (lapic_in_kernel(vcpu)) diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index f875721..7467907 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -30,6 +30,7 @@ struct kvm_pmu_ops { int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx); bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr); bool (*lbr_enable)(struct kvm_vcpu *vcpu); + void (*enabled_feature_confirm)(struct kvm_vcpu *vcpu); int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); int (*set_msr)(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info); void (*sched_in)(struct kvm_vcpu *vcpu, int cpu); @@ -126,6 +127,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp); bool is_vmware_backdoor_pmc(u32 pmc_idx); +void kvm_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu); + extern struct kvm_pmu_ops intel_pmu_ops; extern struct kvm_pmu_ops amd_pmu_ops; #endif /* __KVM_X86_PMU_H */ diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 5580f1a..421051aa 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -781,6 +781,40 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu) intel_pmu_free_lbr_event(vcpu); } +void intel_pmu_lbr_confirm(struct kvm_vcpu *vcpu) +{ + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); + + /* +* Either lbr_event being NULL or lbr_used being false indicates that +* the lbr msrs haven't been passed through to the guest, so no need +* to cancel passthrough. +*/ + if (!pmu->lbr_event || !pmu->lbr_used) + return; + + /* +* The lbr feature gets reclaimed via IPI calls, so checking of +* lbr_event->oncpu needs to be in an atomic context. Just confirm +* that irq has been disabled already. +*/ + lockdep_assert_irqs_disabled(); + + /* +* Cancel the pass-through of the lbr msrs if lbr has been reclaimed +* by the host perf. +*/ + if (pmu->lbr_event->oncpu != -1) { A mistake here, should be "pmu->lbr_event->oncpu == -1". (It didn't seem to affect the profiling result, but generated more vm-exits due to mistakenly cancelling the passthrough) Best, Wei
[PATCH v8 07/14] perf/x86: support to create a perf event without counter allocation
Hypervisors may create an lbr event for a vcpu's lbr emulation, and the emulation doesn't need a counter fundamentally. This makes the emulation follow the x86 SDM's description about lbr, which doesn't include a counter, and also avoids wasting a counter. The perf scheduler is supported to not assign a counter for a perf event which doesn't need a counter. Define a macro, X86_PMC_IDX_NA, to replace "-1", which represents a never assigned counter id. Cc: Andi Kleen Cc: Peter Zijlstra Signed-off-by: Wei Wang https://lkml.kernel.org/r/20180920162407.ga24...@hirez.programming.kicks-ass.net --- arch/x86/events/core.c| 36 +++- arch/x86/events/intel/core.c | 3 +++ arch/x86/include/asm/perf_event.h | 1 + include/linux/perf_event.h| 11 +++ 4 files changed, 42 insertions(+), 9 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 81b005e..ffa27bb 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -73,7 +73,7 @@ u64 x86_perf_event_update(struct perf_event *event) int idx = hwc->idx; u64 delta; - if (idx == INTEL_PMC_IDX_FIXED_BTS) + if ((idx == INTEL_PMC_IDX_FIXED_BTS) || (idx == X86_PMC_IDX_NA)) return 0; /* @@ -595,7 +595,7 @@ static int __x86_pmu_event_init(struct perf_event *event) atomic_inc(&active_events); event->destroy = hw_perf_event_destroy; - event->hw.idx = -1; + event->hw.idx = X86_PMC_IDX_NA; event->hw.last_cpu = -1; event->hw.last_tag = ~0ULL; @@ -763,6 +763,8 @@ static bool perf_sched_restore_state(struct perf_sched *sched) static bool __perf_sched_find_counter(struct perf_sched *sched) { struct event_constraint *c; + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); + struct perf_event *e = cpuc->event_list[sched->state.event]; int idx; if (!sched->state.unassigned) @@ -772,6 +774,14 @@ static bool __perf_sched_find_counter(struct perf_sched *sched) return false; c = sched->constraints[sched->state.event]; + if (c == &emptyconstraint) + return false; + + if (is_no_counter_event(e)) { + idx = 
X86_PMC_IDX_NA; + goto done; + } + /* Prefer fixed purpose counters */ if (c->idxmsk64 & (~0ULL << INTEL_PMC_IDX_FIXED)) { idx = INTEL_PMC_IDX_FIXED; @@ -797,7 +807,7 @@ static bool __perf_sched_find_counter(struct perf_sched *sched) done: sched->state.counter = idx; - if (c->overlap) + if ((idx != X86_PMC_IDX_NA) && c->overlap) perf_sched_save_state(sched); return true; @@ -918,7 +928,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) c = cpuc->event_constraint[i]; /* never assigned */ - if (hwc->idx == -1) + if (hwc->idx == X86_PMC_IDX_NA) break; /* constraint still honored */ @@ -969,7 +979,8 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign) if (!unsched && assign) { for (i = 0; i < n; i++) { e = cpuc->event_list[i]; - if (x86_pmu.commit_scheduling) + if (x86_pmu.commit_scheduling && + (assign[i] != X86_PMC_IDX_NA)) x86_pmu.commit_scheduling(cpuc, i, assign[i]); } } else { @@ -1038,7 +1049,8 @@ static inline void x86_assign_hw_event(struct perf_event *event, hwc->last_cpu = smp_processor_id(); hwc->last_tag = ++cpuc->tags[i]; - if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) { + if ((hwc->idx == INTEL_PMC_IDX_FIXED_BTS) || + (hwc->idx == X86_PMC_IDX_NA)) { hwc->config_base = 0; hwc->event_base = 0; } else if (hwc->idx >= INTEL_PMC_IDX_FIXED) { @@ -1115,7 +1127,7 @@ static void x86_pmu_enable(struct pmu *pmu) * - running on same CPU as last time * - no other event has used the counter since */ - if (hwc->idx == -1 || + if (hwc->idx == X86_PMC_IDX_NA || match_prev_assignment(hwc, cpuc, i)) continue; @@ -1169,7 +1181,7 @@ int x86_perf_event_set_period(struct perf_event *event) s64 period = hwc->sample_period; int ret = 0, idx = hwc->idx; - if (idx == INTEL_PMC_IDX_FIXED_BTS) + if ((idx == INTEL_PMC_IDX_FIXED_BTS) || (idx == X86_PMC_IDX_NA)) return 0; /* @@ -
[PATCH v8 08/14] perf/core: set the event->owner before event_init
Kernel and user events can be distinguished by checking event->owner. Some pmu driver implementations may need to know event->owner in event_init. For example, intel_pmu_setup_lbr_filter treats a kernel event with exclude_host set as an lbr event created for guest lbr emulation, which doesn't need a pmu counter. So move the event->owner assignment into perf_event_alloc to have it set before event_init is called. Signed-off-by: Wei Wang --- kernel/events/core.c | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 0463c11..7663f85 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -10288,6 +10288,7 @@ static void account_event(struct perf_event *event) static struct perf_event * perf_event_alloc(struct perf_event_attr *attr, int cpu, struct task_struct *task, +struct task_struct *owner, struct perf_event *group_leader, struct perf_event *parent_event, perf_overflow_handler_t overflow_handler, @@ -10340,6 +10341,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, event->group_leader = group_leader; event->pmu = NULL; event->oncpu= -1; + event->owner= owner; event->parent = parent_event; @@ -10891,7 +10893,7 @@ SYSCALL_DEFINE5(perf_event_open, if (flags & PERF_FLAG_PID_CGROUP) cgroup_fd = pid; - event = perf_event_alloc(&attr, cpu, task, group_leader, NULL, + event = perf_event_alloc(&attr, cpu, task, current, group_leader, NULL, NULL, NULL, cgroup_fd); if (IS_ERR(event)) { err = PTR_ERR(event); @@ -11153,8 +11155,6 @@ SYSCALL_DEFINE5(perf_event_open, perf_event__header_size(event); perf_event__id_header_size(event); - event->owner = current; - perf_install_in_context(ctx, event, event->cpu); perf_unpin_context(ctx); @@ -11231,16 +11231,13 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu, * Get the target context (task or percpu): */ - event = perf_event_alloc(attr, cpu, task, NULL, NULL, + event = perf_event_alloc(attr, cpu, task, TASK_TOMBSTONE, NULL, 
NULL, overflow_handler, context, -1); if (IS_ERR(event)) { err = PTR_ERR(event); goto err; } - /* Mark owner so we could distinguish it from user events. */ - event->owner = TASK_TOMBSTONE; - ctx = find_get_context(event->pmu, task, event); if (IS_ERR(ctx)) { err = PTR_ERR(ctx); @@ -11677,6 +11674,7 @@ inherit_event(struct perf_event *parent_event, child_event = perf_event_alloc(&parent_event->attr, parent_event->cpu, + parent_event->owner, child, group_leader, parent_event, NULL, NULL, -1); -- 2.7.4
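The ordering fix in this patch — set the owner before the init callback runs, so init can inspect it — can be sketched outside the kernel. Everything below is illustrative (`toy_*`, `TOY_TOMBSTONE` standing in for `TASK_TOMBSTONE`); it only demonstrates why the assignment has to precede the callback.

```c
#include <stddef.h>
#include <stdbool.h>

/* Marker for kernel-owned events, like TASK_TOMBSTONE. */
#define TOY_TOMBSTONE ((void *)-1L)

struct toy_event {
    void *owner;
    bool  saw_kernel_owner;  /* what the init callback observed */
};

/* Init wants to know at init time whether this is a kernel event,
 * the way intel_pmu_setup_lbr_filter does in the series. */
static void toy_event_init(struct toy_event *e)
{
    e->saw_kernel_owner = (e->owner == TOY_TOMBSTONE);
}

/* After the patch, alloc sets the owner first and only then calls
 * init, so init's check above gives the right answer. */
struct toy_event toy_event_alloc(void *owner)
{
    struct toy_event e = { .owner = owner, .saw_kernel_owner = false };
    toy_event_init(&e);
    return e;
}
```

Before the patch, the equivalent of `toy_event_init` ran while `owner` was still unset, so the kernel-vs-user distinction was unavailable to the pmu driver.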
[PATCH v8 10/14] perf/x86/lbr: don't share lbr for the vcpu usage case
Perf event scheduling lets multiple lbr events share the lbr if they use the same config for LBR_SELECT. For the vcpu case, the vcpu's lbr event created on the host deliberately sets the config to the user callstack mode to have the host support to save/restore the lbr state on vcpu context switching, and the config won't be written to the LBR_SELECT, as the LBR_SELECT is configured by the guest, which might not be the same as the user callstack mode. So don't allow the vcpu's lbr event to share lbr with other host lbr events. Signed-off-by: Wei Wang --- arch/x86/events/intel/lbr.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 4f4bd18..a0f3686 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -45,6 +45,12 @@ static const enum { #define LBR_CALL_STACK_BIT 9 /* enable call stack */ /* + * Set this hardware reserved bit if the lbr perf event is for the vcpu lbr + * emulation. This makes the reg->config different from other regular lbr + * events' config, so that they will not share the lbr feature. + */ +#define LBR_VCPU_BIT 62 +/* * Following bit only exists in Linux; we mask it out before writing it to * the actual MSR. But it helps the constraint perf code to understand * that this is a separate configuration. @@ -62,6 +68,7 @@ static const enum { #define LBR_FAR(1 << LBR_FAR_BIT) #define LBR_CALL_STACK (1 << LBR_CALL_STACK_BIT) #define LBR_NO_INFO(1ULL << LBR_NO_INFO_BIT) +#define LBR_VCPU (1ULL << LBR_VCPU_BIT) #define LBR_PLM (LBR_KERNEL | LBR_USER) @@ -818,6 +825,26 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event) (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)) reg->config |= LBR_NO_INFO; + /* +* An lbr perf event without a counter indicates this is for the vcpu +* lbr emulation. 
The vcpu lbr emulation does not allow the lbr +* feature to be shared with other lbr events on the host, because the +* LBR_SELECT msr is configured by the guest itself. The reg->config +* is deliberately configured to be user call stack mode via the +* related attr fields to get the host perf's help to save/restore the +* lbr state on vcpu context switching. It doesn't represent what +* LBR_SELECT will be configured to. +* +* Set the reserved LBR_VCPU bit for the vcpu usage case, so that the +* vcpu's lbr perf event will not share the lbr feature with other perf +* events. (see __intel_shared_reg_get_constraints, failing to share +* makes it return the emptyconstraint, which finally makes +* x86_schedule_events fail to schedule the lower priority lbr event on +* the lbr feature). +*/ + if (is_no_counter_event(event)) + reg->config |= LBR_VCPU; + return 0; } -- 2.7.4
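The no-sharing trick above reduces to a config comparison: lbr events may share the hardware only when their LBR_SELECT configs are equal, so ORing a reserved high bit into the vcpu event's config guarantees inequality with any host event. A minimal sketch, with illustrative helper names (the bit position matches the patch, the rest is assumed):

```c
#include <stdint.h>
#include <stdbool.h>

/* Reserved bit set only on vcpu lbr events, like LBR_VCPU in the patch. */
#define TOY_LBR_VCPU (1ULL << 62)

/* The sharing rule: identical configs may share the lbr. */
static bool toy_can_share(uint64_t cfg_a, uint64_t cfg_b)
{
    return cfg_a == cfg_b;
}

/* A vcpu event's config is the user-callstack config plus the
 * reserved bit, so it can never match a regular host config. */
uint64_t toy_vcpu_config(uint64_t user_callstack_cfg)
{
    return user_callstack_cfg | TOY_LBR_VCPU;
}
```

With sharing refused, the constraint code returns the empty constraint and the lower-priority event simply fails to schedule, exactly the fallback the comment in the patch describes.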
[PATCH v8 09/14] KVM/x86/vPMU: APIs to create/free lbr perf event for a vcpu thread
VMX transition is much more frequent than vcpu switching, and saving/restoring tens of lbr msrs (e.g. 32 lbr stack entries) would add too much overhead to the frequent vmx transition, which is not necessary. So the vcpu's lbr state only gets saved/restored on the vcpu context switching. The main purposes of using the vcpu's lbr perf event are - follow the host perf scheduling rules to manage the vcpu's usage of lbr (e.g. a cpu pinned lbr event could reclaim lbr and thus stop the vcpu's use); - have the host perf do context switching of the lbr state on the vcpu thread context switching. Please see the comments in intel_pmu_create_lbr_event for more details. To achieve the pure lbr emulation, the perf event is created only to claim the lbr feature, and no perf counter is needed for it. The vcpu_lbr field is added to indicate to the host lbr driver that the lbr is currently assigned to a vcpu to use. The guest driver inside the vcpu has its own logic to use the lbr, thus the host side lbr driver doesn't need to enable and use the lbr feature in this case. Some design choice considerations: - Why use "is_kernel_event", instead of checking the PF_VCPU flag, to determine that it is a vcpu perf event for lbr emulation? This is because PF_VCPU is set right before vm-entry into the guest, and cleared after the guest vm-exits to the host. So that flag doesn't remain set when running the host code. 
Cc: Paolo Bonzini Cc: Andi Kleen Cc: Peter Zijlstra Co-developed-by: Like Xu Signed-off-by: Like Xu Signed-off-by: Wei Wang --- arch/x86/events/intel/lbr.c | 38 ++-- arch/x86/events/perf_event.h| 1 + arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/vmx/pmu_intel.c| 64 + include/linux/perf_event.h | 7 + kernel/events/core.c| 7 - 6 files changed, 108 insertions(+), 10 deletions(-) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 9b2d05c..4f4bd18 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -462,6 +462,14 @@ void intel_pmu_lbr_add(struct perf_event *event) if (!x86_pmu.lbr_nr) return; + /* +* An lbr event without a counter indicates this is for the vcpu lbr +* emulation, so set the vcpu_lbr flag when the vcpu lbr event +* gets scheduled on the lbr here. +*/ + if (is_no_counter_event(event)) + cpuc->vcpu_lbr = 1; + cpuc->br_sel = event->hw.branch_reg.reg; if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) { @@ -509,6 +517,14 @@ void intel_pmu_lbr_del(struct perf_event *event) task_ctx->lbr_callstack_users--; } + /* +* An lbr event without a counter indicates this is for the vcpu lbr +* emulation, so clear the vcpu_lbr flag when the vcpu's lbr event +* gets scheduled out from the lbr. +*/ + if (is_no_counter_event(event)) + cpuc->vcpu_lbr = 0; + if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0) cpuc->lbr_pebs_users--; cpuc->lbr_users--; @@ -521,7 +537,12 @@ void intel_pmu_lbr_enable_all(bool pmi) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); - if (cpuc->lbr_users) + /* +* The vcpu lbr emulation doesn't need host to enable lbr at this +* point, because the guest will set the enabling at a proper time +* itself. 
+*/ + if (cpuc->lbr_users && !cpuc->vcpu_lbr) __intel_pmu_lbr_enable(pmi); } @@ -529,7 +550,11 @@ void intel_pmu_lbr_disable_all(void) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); - if (cpuc->lbr_users) + /* +* Same as intel_pmu_lbr_enable_all, the guest is responsible for +* clearing the enabling itself. +*/ + if (cpuc->lbr_users && !cpuc->vcpu_lbr) __intel_pmu_lbr_disable(); } @@ -668,8 +693,12 @@ void intel_pmu_lbr_read(void) * * This could be smarter and actually check the event, * but this simple approach seems to work for now. +* +* And no need to read the lbr msrs here if the vcpu lbr event +* is using it, as the guest will read them itself. */ - if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users) + if (!cpuc->lbr_users || cpuc->vcpu_lbr || + cpuc->lbr_users == cpuc->lbr_pebs_users) return; if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32) @@ -802,6 +831,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event) if (!x86_pmu.lbr_nr) return -EOPNOTSUPP; + if (event
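The `cpuc->vcpu_lbr` gating added by this patch can be captured in a few lines: once a counter-less (vcpu) lbr event is scheduled in, the host driver stops touching the lbr enable bit, because the guest manages enabling itself. The sketch below uses made-up names (`toy_cpuc`, `hw_lbr_enabled` standing in for the DEBUGCTL enable bit) and is only a model of the control flow:

```c
#include <stdbool.h>

struct toy_cpuc {
    int  lbr_users;
    bool vcpu_lbr;        /* set when a vcpu lbr event is scheduled in */
    bool hw_lbr_enabled;  /* stands in for the DEBUGCTL lbr enable bit */
};

/* Like intel_pmu_lbr_add: a no-counter event marks the lbr as
 * guest-owned for this cpu. */
void toy_lbr_add(struct toy_cpuc *c, bool no_counter_event)
{
    if (no_counter_event)
        c->vcpu_lbr = true;
    c->lbr_users++;
}

/* Like intel_pmu_lbr_enable_all: the host enables lbr only when it,
 * not a guest, is the current user. */
void toy_lbr_enable_all(struct toy_cpuc *c)
{
    if (c->lbr_users && !c->vcpu_lbr)
        c->hw_lbr_enabled = true;
}
```

The same `!vcpu_lbr` guard applies symmetrically to the disable and read paths in the patch, for the same reason: the guest's own driver owns those operations while the vcpu holds the lbr.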
[PATCH v8 14/14] KVM/x86: remove the common handling of the debugctl msr
The debugctl msr is not completely identical on AMD and Intel CPUs, for example, FREEZE_LBRS_ON_PMI is supported by Intel CPUs only. Now, this msr is handled separately in svm.c and intel_pmu.c. So remove the common debugctl msr handling code in kvm_get/set_msr_common. Cc: Paolo Bonzini Cc: Andi Kleen Cc: Peter Zijlstra Signed-off-by: Wei Wang --- arch/x86/kvm/x86.c | 13 - 1 file changed, 13 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index efaf0e8..3839ebd 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2528,18 +2528,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 1; } break; - case MSR_IA32_DEBUGCTLMSR: - if (!data) { - /* We support the non-activated case already */ - break; - } else if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)) { - /* Values other than LBR and BTF are vendor-specific, - thus reserved and should throw a #GP */ - return 1; - } - vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n", - __func__, data); - break; case 0x200 ... 0x2ff: return kvm_mtrr_set_msr(vcpu, msr, data); case MSR_IA32_APICBASE: @@ -2800,7 +2788,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) switch (msr_info->index) { case MSR_IA32_PLATFORM_ID: case MSR_IA32_EBL_CR_POWERON: - case MSR_IA32_DEBUGCTLMSR: case MSR_IA32_LASTBRANCHFROMIP: case MSR_IA32_LASTBRANCHTOIP: case MSR_IA32_LASTINTFROMIP: -- 2.7.4
[PATCH v8 11/14] perf/x86: save/restore LBR_SELECT on vcpu switching
The regular host lbr perf event doesn't save/restore the LBR_SELECT msr during thread context switching, because the LBR_SELECT value is generated from attr.branch_sample_type and already stored in event->hw.branch_reg (please see intel_pmu_setup_hw_lbr_filter), which doesn't get lost during thread context switching. The attr.branch_sample_type for the vcpu lbr event is deliberately set to the user call stack mode to enable the perf core to save/restore the lbr related msrs on vcpu switching. So the attr.branch_sample_type essentially doesn't represent what the guest pmu driver will write to LBR_SELECT. Meanwhile, the host lbr driver doesn't configure the lbr msrs, including the LBR_SELECT msr, for the vcpu thread case, as the pmu driver inside the vcpu will do that. So for the vcpu case, add the LBR_SELECT save/restore to ensure what the guest writes to the LBR_SELECT msr doesn't get lost during the vcpu context switching. Cc: Peter Zijlstra Cc: Andi Kleen Cc: Kan Liang Signed-off-by: Wei Wang --- arch/x86/events/intel/lbr.c | 7 +++ arch/x86/events/perf_event.h | 1 + 2 files changed, 8 insertions(+) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index a0f3686..236f8352 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -390,6 +390,9 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx) wrmsrl(x86_pmu.lbr_tos, tos); task_ctx->lbr_stack_state = LBR_NONE; + + if (cpuc->vcpu_lbr) + wrmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel); } static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx) @@ -416,6 +419,10 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx) if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO) rdmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]); } + + if (cpuc->vcpu_lbr) + rdmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel); + task_ctx->valid_lbrs = i; task_ctx->tos = tos; task_ctx->lbr_stack_state = LBR_VALID; diff --git a/arch/x86/events/perf_event.h 
b/arch/x86/events/perf_event.h index 8b90a25..0b2f660 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -699,6 +699,7 @@ struct x86_perf_task_context { u64 lbr_from[MAX_LBR_ENTRIES]; u64 lbr_to[MAX_LBR_ENTRIES]; u64 lbr_info[MAX_LBR_ENTRIES]; + u64 lbr_sel; int tos; int valid_lbrs; int lbr_callstack_users; -- 2.7.4
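The save/restore path added here is small enough to model directly: on save, the guest-written LBR_SELECT value is captured into the task context; on restore, it is written back, so another task clobbering the msr in between does no harm. In this sketch a plain variable stands in for the msr and all names are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

/* Stand-in for the MSR_LBR_SELECT register (rdmsrl/wrmsrl target). */
static uint64_t toy_msr_lbr_select;

struct toy_task_ctx {
    uint64_t lbr_sel;  /* like the lbr_sel field the patch adds */
};

/* Like __intel_pmu_lbr_save: only the vcpu case needs LBR_SELECT
 * preserved, because only there the msr holds a guest-chosen value. */
void toy_lbr_save(struct toy_task_ctx *ctx, bool vcpu_lbr)
{
    if (vcpu_lbr)
        ctx->lbr_sel = toy_msr_lbr_select;  /* rdmsrl(MSR_LBR_SELECT) */
}

/* Like __intel_pmu_lbr_restore. */
void toy_lbr_restore(const struct toy_task_ctx *ctx, bool vcpu_lbr)
{
    if (vcpu_lbr)
        toy_msr_lbr_select = ctx->lbr_sel;  /* wrmsrl(MSR_LBR_SELECT) */
}
```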
[PATCH v8 13/14] KVM/x86/vPMU: check the lbr feature before entering guest
The guest can access the lbr related msrs only when the vcpu's lbr event has been assigned the lbr feature. A cpu pinned lbr event (though no such event usage exists in the current upstream kernel) could reclaim the lbr feature from the vcpu's lbr event (task pinned) via ipi calls. If the cpu is running in non-root mode, this will cause the cpu to vm-exit to handle the host ipi and then vm-entry back to the guest. So on vm-entry (where interrupts have been disabled), we double confirm that the vcpu's lbr event is still assigned the lbr feature by checking event->oncpu. The pass-through of the lbr related msrs will be cancelled if the lbr is reclaimed, and the following guest accesses to the lbr related msrs will vm-exit to the related msr emulation handler in kvm, which will prevent the accesses. Signed-off-by: Wei Wang --- arch/x86/kvm/pmu.c | 6 ++ arch/x86/kvm/pmu.h | 3 +++ arch/x86/kvm/vmx/pmu_intel.c | 35 +++ arch/x86/kvm/x86.c | 13 + 4 files changed, 57 insertions(+) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index afad092..ed10a57 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -339,6 +339,12 @@ bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu) return false; } +void kvm_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu) +{ + if (kvm_x86_ops->pmu_ops->enabled_feature_confirm) + kvm_x86_ops->pmu_ops->enabled_feature_confirm(vcpu); +} + void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu) { if (lapic_in_kernel(vcpu)) diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index f875721..7467907 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -30,6 +30,7 @@ struct kvm_pmu_ops { int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx); bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr); bool (*lbr_enable)(struct kvm_vcpu *vcpu); + void (*enabled_feature_confirm)(struct kvm_vcpu *vcpu); int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); void (*sched_in)(struct 
kvm_vcpu *vcpu, int cpu); @@ -126,6 +127,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp); bool is_vmware_backdoor_pmc(u32 pmc_idx); +void kvm_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu); + extern struct kvm_pmu_ops intel_pmu_ops; extern struct kvm_pmu_ops amd_pmu_ops; #endif /* __KVM_X86_PMU_H */ diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 5580f1a..421051aa 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -781,6 +781,40 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu) intel_pmu_free_lbr_event(vcpu); } +void intel_pmu_lbr_confirm(struct kvm_vcpu *vcpu) +{ + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); + + /* +* Either lbr_event being NULL or lbr_used being false indicates that +* the lbr msrs haven't been passed through to the guest, so no need +* to cancel passthrough. +*/ + if (!pmu->lbr_event || !pmu->lbr_used) + return; + + /* +* The lbr feature gets reclaimed via IPI calls, so checking of +* lbr_event->oncpu needs to be in an atomic context. Just confirm +* that irq has been disabled already. +*/ + lockdep_assert_irqs_disabled(); + + /* +* Cancel the pass-through of the lbr msrs if lbr has been reclaimed +* by the host perf. 
+*/ + if (pmu->lbr_event->oncpu == -1) { + pmu->lbr_used = false; + intel_pmu_set_intercept_for_lbr_msrs(vcpu, true); + } +} + +void intel_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu) +{ + intel_pmu_lbr_confirm(vcpu); +} + struct kvm_pmu_ops intel_pmu_ops = { .find_arch_event = intel_find_arch_event, .find_fixed_event = intel_find_fixed_event, @@ -790,6 +824,7 @@ struct kvm_pmu_ops intel_pmu_ops = { .is_valid_msr_idx = intel_is_valid_msr_idx, .is_valid_msr = intel_is_valid_msr, .lbr_enable = intel_pmu_lbr_enable, + .enabled_feature_confirm = intel_pmu_enabled_feature_confirm, .get_msr = intel_pmu_get_msr, .set_msr = intel_pmu_set_msr, .sched_in = intel_pmu_sched_in, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b76f019..efaf0e8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7985,6 +7985,19 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) smp_mb__after_srcu_read_unlock(); /* +* Higher priority host perf events (e.g. cpu pinned) could reclaim the +* pmu resources (e.g. lbr) that were assigned to the vcpu. This is +* usually done via ipi calls (see perf_install_in_context for +* details). +* +* Before enteri
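The vm-entry confirmation logic boils down to: if the lbr was reclaimed while the msrs were passed through, revoke the pass-through so subsequent guest accesses trap. The sketch below abstracts the reclaim detection into a boolean (the kernel derives it from `event->oncpu`); all names are illustrative, not KVM's:

```c
#include <stdbool.h>

struct toy_vpmu {
    bool lbr_used;            /* guest touched lbr msrs this slice */
    bool msrs_passed_through; /* lbr msrs currently not intercepted */
};

/* Like intel_pmu_lbr_confirm, run with interrupts disabled right
 * before vm-entry: if host perf reclaimed the lbr, re-intercept the
 * msrs so guest accesses fall back to the emulation handlers. */
void toy_lbr_confirm(struct toy_vpmu *p, bool lbr_reclaimed)
{
    if (!p->lbr_used)
        return;  /* nothing passed through, nothing to cancel */

    if (lbr_reclaimed) {
        p->lbr_used = false;
        p->msrs_passed_through = false;  /* re-enable interception */
    }
}
```

Running this check with interrupts disabled matters: it closes the race where a reclaim IPI lands between the check and the actual vm-entry.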
[PATCH v8 12/14] KVM/x86/lbr: lbr emulation
In general, the lbr emulation works in this way: The guest's first access (since the vcpu was scheduled in) to an lbr related msr gets trapped to kvm, and the handler will do the following things: - create an lbr perf event to have the vcpu get the lbr feature from host perf following the perf scheduling rules; - pass the lbr related msrs through to the guest for direct accesses without vm-exits till the end of this vcpu time slice. The guest first access is made interceptible so that the kvm side lbr emulation can always tell whether the lbr feature has been used during the vcpu time slice. If the lbr feature isn't used during a time slice, the lbr event created for the vcpu will be freed. Some considerations: - Why not free the vcpu lbr event when the guest clears the lbr enable bit? The guest may frequently clear the lbr enable bit (in the debugctl msr) during its use of the lbr feature, e.g. in the PMI handler. This will cause the kvm emulation to frequently alloc/free the vcpu lbr event, which is unnecessary. Technically, we want to free the vcpu lbr event when the guest doesn't need to run lbr anymore. Heuristically, we free the vcpu lbr event when the guest doesn't touch any of the lbr msrs during an entire vcpu time slice. 
Cc: Paolo Bonzini Cc: Andi Kleen Cc: Peter Zijlstra Suggested-by: Andi Kleen Signed-off-by: Wei Wang --- arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/pmu.c | 6 ++ arch/x86/kvm/pmu.h | 2 + arch/x86/kvm/vmx/pmu_intel.c| 206 arch/x86/kvm/vmx/vmx.c | 4 +- arch/x86/kvm/vmx/vmx.h | 2 + arch/x86/kvm/x86.c | 2 + 7 files changed, 222 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 692a0c2..ecd22b5 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -469,6 +469,8 @@ struct kvm_pmu { u64 global_ctrl_mask; u64 global_ovf_ctrl_mask; u64 reserved_bits; + /* Indicate if the lbr msrs were accessed in this vcpu time slice */ + bool lbr_used; u8 version; struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC]; struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED]; diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 1a291ed..afad092 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -360,6 +360,12 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return kvm_x86_ops->pmu_ops->set_msr(vcpu, msr_info); } +void kvm_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ + if (kvm_x86_ops->pmu_ops->sched_in) + kvm_x86_ops->pmu_ops->sched_in(vcpu, cpu); +} + /* refresh PMU settings. This function generally is called when underlying * settings are changed (such as changes of PMU CPUID by guest VMs), which * should rarely happen. 
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index f61024e..f875721 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -32,6 +32,7 @@ struct kvm_pmu_ops { bool (*lbr_enable)(struct kvm_vcpu *vcpu); int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); + void (*sched_in)(struct kvm_vcpu *vcpu, int cpu); void (*refresh)(struct kvm_vcpu *vcpu); void (*init)(struct kvm_vcpu *vcpu); void (*reset)(struct kvm_vcpu *vcpu); @@ -116,6 +117,7 @@ int kvm_pmu_is_valid_msr_idx(struct kvm_vcpu *vcpu, unsigned idx); bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr); int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info); int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info); +void kvm_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu); void kvm_pmu_refresh(struct kvm_vcpu *vcpu); void kvm_pmu_reset(struct kvm_vcpu *vcpu); void kvm_pmu_init(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 89730f8..5580f1a 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -17,6 +17,7 @@ #include "cpuid.h" #include "lapic.h" #include "pmu.h" +#include "vmx.h" static struct kvm_event_hw_type_mapping intel_arch_events[] = { /* Index must match CPUID 0x0A.EBX bit vector */ @@ -141,6 +142,19 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu, return &counters[idx]; } +/* Return true if it is one of the lbr related msrs. */ +static inline bool is_lbr_msr(struct kvm_vcpu *vcpu, u32 index) +{ + struct x86_perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack; + int nr = stack->nr; + + return !!(index == MSR_LBR_SELECT || + index == stack->tos || + (index >= stack->from && index < stack->from + nr) || +
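The `is_lbr_msr` range check shown in the diff above is easy to lift out and test on its own. Here it is as a standalone function, filled in with Skylake-like msr indices purely for illustration (`MSR_LBR_SELECT` 0x1c8, TOS 0x1c9, 32 FROM msrs at 0x680, 32 TO msrs at 0x6c0 — assumptions about one specific model, not part of the patch):

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_lbr_stack {
    uint32_t nr;    /* number of lbr stack entries */
    uint32_t tos;   /* top-of-stack msr index */
    uint32_t from;  /* first FROM msr index */
    uint32_t to;    /* first TO msr index */
};

#define TOY_MSR_LBR_SELECT 0x1c8

/* True when the index names LBR_SELECT, TOS, or any entry of the
 * FROM/TO ranges — the same shape as the patch's is_lbr_msr(). */
bool toy_is_lbr_msr(const struct toy_lbr_stack *s, uint32_t index)
{
    return index == TOY_MSR_LBR_SELECT ||
           index == s->tos ||
           (index >= s->from && index < s->from + s->nr) ||
           (index >= s->to && index < s->to + s->nr);
}
```

This predicate is what routes a first-time lbr msr access into the lazy event-creation path described in the commit message.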
[PATCH v8 06/14] KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest
Bits [0, 5] of MSR_IA32_PERF_CAPABILITIES tell about the format of the addresses stored in the lbr stack. Expose those bits to the guest when the guest lbr feature is enabled. Cc: Paolo Bonzini Cc: Andi Kleen Signed-off-by: Wei Wang --- arch/x86/include/asm/perf_event.h | 2 ++ arch/x86/kvm/cpuid.c | 2 +- arch/x86/kvm/vmx/pmu_intel.c | 16 3 files changed, 19 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 2606100..aa77da2 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -95,6 +95,8 @@ #define PEBS_DATACFG_LBRS BIT_ULL(3) #define PEBS_DATACFG_LBR_SHIFT 24 +#define X86_PERF_CAP_MASK_LBR_FMT 0x3f + /* * Intel "Architectural Performance Monitoring" CPUID * detection/enumeration details: diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 22c2720..826b2dc 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -458,7 +458,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function, F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | 0 /* DS-CPL, VMX, SMX, EST */ | 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | - F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ | + F(FMA) | F(CX16) | 0 /* xTPR Update*/ | F(PDCM) | F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) | F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) | 0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) | diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 53bb95e..f0ad78f 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -151,6 +151,7 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_CORE_PERF_GLOBAL_STATUS: case MSR_CORE_PERF_GLOBAL_CTRL: case MSR_CORE_PERF_GLOBAL_OVF_CTRL: + case MSR_IA32_PERF_CAPABILITIES: ret = pmu->version > 1; break; default: @@ -316,6 +317,19 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case 
MSR_CORE_PERF_GLOBAL_OVF_CTRL: msr_info->data = pmu->global_ovf_ctrl; return 0; + case MSR_IA32_PERF_CAPABILITIES: { + u64 data; + + if (!boot_cpu_has(X86_FEATURE_PDCM) || + (!msr_info->host_initiated && +!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))) + return 1; + data = native_read_msr(MSR_IA32_PERF_CAPABILITIES); + msr_info->data = 0; + if (vcpu->kvm->arch.lbr_in_guest) + msr_info->data |= (data & X86_PERF_CAP_MASK_LBR_FMT); + return 0; + } default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) { u64 val = pmc_read_counter(pmc); @@ -374,6 +388,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 0; } break; + case MSR_IA32_PERF_CAPABILITIES: + return 1; /* RO MSR */ default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) { if (msr_info->host_initiated) -- 2.7.4
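The PERF_CAPABILITIES handling above exposes only the lbr-format field (bits [5:0]) of the host value, and only when the guest lbr feature is enabled. That filtering is just a mask, sketched below with a made-up host value (the 0x3f mask matches X86_PERF_CAP_MASK_LBR_FMT in the patch; the rest is illustrative):

```c
#include <stdint.h>

/* Bits [5:0] of MSR_IA32_PERF_CAPABILITIES: the lbr format field. */
#define TOY_PERF_CAP_LBR_FMT 0x3fULL

/* What the guest reads: zero unless lbr is enabled for this VM, in
 * which case only the format bits of the host msr leak through. */
uint64_t toy_guest_perf_capabilities(uint64_t host_val, int lbr_in_guest)
{
    uint64_t data = 0;

    if (lbr_in_guest)
        data |= host_val & TOY_PERF_CAP_LBR_FMT;
    return data;
}
```

Masking rather than forwarding the raw msr keeps the other capability bits (PEBS format, etc.) hidden from the guest, matching the commit message's "expose those bits" scope.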
[PATCH v8 04/14] KVM/x86: intel_pmu_lbr_enable
The lbr stack is model specific, for example, SKX has 32 lbr stack entries while HSW has 16 entries, so a HSW guest running on a SKX machine may not get accurate perf results. Currently, we forbid the guest lbr enabling when the guest and host see different lbr stack entries or the host and guest see different lbr stack msr indices. Cc: Paolo Bonzini Cc: Andi Kleen Cc: Peter Zijlstra Cc: Kan Liang Signed-off-by: Wei Wang --- arch/x86/kvm/pmu.c | 8 +++ arch/x86/kvm/pmu.h | 2 + arch/x86/kvm/vmx/pmu_intel.c | 136 +++ arch/x86/kvm/x86.c | 3 +- 4 files changed, 147 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 46875bb..26fac6c 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -331,6 +331,14 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data) return 0; } +bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu) +{ + if (kvm_x86_ops->pmu_ops->lbr_enable) + return kvm_x86_ops->pmu_ops->lbr_enable(vcpu); + + return false; +} + void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu) { if (lapic_in_kernel(vcpu)) diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 58265f7..d9eec9a 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -29,6 +29,7 @@ struct kvm_pmu_ops { u64 *mask); int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx); bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr); + bool (*lbr_enable)(struct kvm_vcpu *vcpu); int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data); int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); void (*refresh)(struct kvm_vcpu *vcpu); @@ -107,6 +108,7 @@ void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel); void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx); void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx); +bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu); void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu); void kvm_pmu_handle_event(struct kvm_vcpu *vcpu); int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, 
u64 *data); diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 4dea0e0..6294a86 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -12,6 +12,7 @@ #include #include #include +#include #include "x86.h" #include "cpuid.h" #include "lapic.h" @@ -162,6 +163,140 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) return ret; } +static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu) +{ + struct kvm *kvm = vcpu->kvm; + u8 vcpu_model = guest_cpuid_model(vcpu); + unsigned int vcpu_lbr_from, vcpu_lbr_nr; + + if (x86_perf_get_lbr_stack(&kvm->arch.lbr_stack)) + return false; + + if (guest_cpuid_family(vcpu) != boot_cpu_data.x86) + return false; + + /* +* It could be possible that people have vcpus of old model run on +* physical cpus of newer model, for example a BDW guest on a SKX +* machine (but not possible to be the other way around). +* The BDW guest may not get accurate results on a SKX machine as it +* only reads 16 entries of the lbr stack while there are 32 entries +* of recordings. We currently forbid the lbr enabling when the vcpu +* and physical cpu see different lbr stack entries or the guest lbr +* msr indices are not compatible with the host. 
+*/ + switch (vcpu_model) { + case INTEL_FAM6_CORE2_MEROM: + case INTEL_FAM6_CORE2_MEROM_L: + case INTEL_FAM6_CORE2_PENRYN: + case INTEL_FAM6_CORE2_DUNNINGTON: + /* intel_pmu_lbr_init_core() */ + vcpu_lbr_nr = 4; + vcpu_lbr_from = MSR_LBR_CORE_FROM; + break; + case INTEL_FAM6_NEHALEM: + case INTEL_FAM6_NEHALEM_EP: + case INTEL_FAM6_NEHALEM_EX: + /* intel_pmu_lbr_init_nhm() */ + vcpu_lbr_nr = 16; + vcpu_lbr_from = MSR_LBR_NHM_FROM; + break; + case INTEL_FAM6_ATOM_BONNELL: + case INTEL_FAM6_ATOM_BONNELL_MID: + case INTEL_FAM6_ATOM_SALTWELL: + case INTEL_FAM6_ATOM_SALTWELL_MID: + case INTEL_FAM6_ATOM_SALTWELL_TABLET: + /* intel_pmu_lbr_init_atom() */ + vcpu_lbr_nr = 8; + vcpu_lbr_from = MSR_LBR_CORE_FROM; + break; + case INTEL_FAM6_ATOM_SILVERMONT: + case INTEL_FAM6_ATOM_SILVERMONT_X: + case INTEL_FAM6_ATOM_SILVERMONT_MID: + case INTEL_FAM6_ATOM_AIRMONT: + case INTEL_FAM6_ATOM_AIRMONT_MID: + /* intel_pmu_lbr_init_slm() */ + vcpu_lbr_nr = 8; + vcpu_lbr_from = MSR_LBR_CORE_FROM; +
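The long per-model switch above ultimately feeds a simple compatibility rule: guest lbr is allowed only when the vcpu model's lbr stack matches the host's (same entry count, same FROM msr base). A hedged sketch of just that rule, with example Skylake/Haswell values (32 vs 16 entries; the 0x680 base is illustrative of the NHM-and-later layout):

```c
#include <stdbool.h>
#include <stdint.h>

/* The per-model facts the switch statement looks up. */
struct toy_lbr_desc {
    unsigned int nr;    /* lbr stack entries for this model */
    uint32_t     from;  /* FROM msr base for this model */
};

/* Guest lbr is permitted only when the descriptions agree, matching
 * the "forbid on mismatch" policy in the commit message. */
bool toy_lbr_compatible(struct toy_lbr_desc guest, struct toy_lbr_desc host)
{
    return guest.nr == host.nr && guest.from == host.from;
}
```

So a SKX guest on a SKX host passes, while an HSW guest (16 entries) on a SKX host (32 entries) is refused rather than allowed to read a truncated stack.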
[PATCH v8 05/14] KVM/x86/vPMU: tweak kvm_pmu_get_msr
Change kvm_pmu_get_msr to get the msr_data struct, as the host_initiated field from the struct could be used by get_msr. This also makes this API consistent with kvm_pmu_set_msr. Cc: Paolo Bonzini Cc: Andi Kleen Signed-off-by: Wei Wang --- arch/x86/kvm/pmu.c | 4 ++-- arch/x86/kvm/pmu.h | 4 ++-- arch/x86/kvm/pmu_amd.c | 7 --- arch/x86/kvm/vmx/pmu_intel.c | 19 +++ arch/x86/kvm/x86.c | 4 ++-- 5 files changed, 21 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 26fac6c..1a291ed 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -350,9 +350,9 @@ bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) return kvm_x86_ops->pmu_ops->is_valid_msr(vcpu, msr); } -int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data) +int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { - return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr, data); + return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr_info); } int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index d9eec9a..f61024e 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -30,7 +30,7 @@ struct kvm_pmu_ops { int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx); bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr); bool (*lbr_enable)(struct kvm_vcpu *vcpu); - int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data); + int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); void (*refresh)(struct kvm_vcpu *vcpu); void (*init)(struct kvm_vcpu *vcpu); @@ -114,7 +114,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu); int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data); int kvm_pmu_is_valid_msr_idx(struct kvm_vcpu *vcpu, unsigned idx); bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr); -int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data); +int kvm_pmu_get_msr(struct 
kvm_vcpu *vcpu, struct msr_data *msr_info); int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info); void kvm_pmu_refresh(struct kvm_vcpu *vcpu); void kvm_pmu_reset(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c index c838838..4a64a3f 100644 --- a/arch/x86/kvm/pmu_amd.c +++ b/arch/x86/kvm/pmu_amd.c @@ -208,21 +208,22 @@ static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) return ret; } -static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data) +static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct kvm_pmc *pmc; + u32 msr = msr_info->index; /* MSR_PERFCTRn */ pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER); if (pmc) { - *data = pmc_read_counter(pmc); + msr_info->data = pmc_read_counter(pmc); return 0; } /* MSR_EVNTSELn */ pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL); if (pmc) { - *data = pmc->eventsel; + msr_info->data = pmc->eventsel; return 0; } diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 6294a86..53bb95e 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -297,35 +297,38 @@ static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu) return true; } -static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data) +static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); struct kvm_pmc *pmc; + u32 msr = msr_info->index; switch (msr) { case MSR_CORE_PERF_FIXED_CTR_CTRL: - *data = pmu->fixed_ctr_ctrl; + msr_info->data = pmu->fixed_ctr_ctrl; return 0; case MSR_CORE_PERF_GLOBAL_STATUS: - *data = pmu->global_status; + msr_info->data = pmu->global_status; return 0; case MSR_CORE_PERF_GLOBAL_CTRL: - *data = pmu->global_ctrl; + msr_info->data = pmu->global_ctrl; return 0; case MSR_CORE_PERF_GLOBAL_OVF_CTRL: - *data = pmu->global_ovf_ctrl; + msr_info->data = pmu->global_ovf_ctrl; return 0; default: 
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) { u64 val = pmc_read_counter(pmc); - *data = val & pmu->counter_bitmask[KVM_PMC_GP]; + msr_info->data = + val & pmu->counter_bitmask[KVM_PMC_GP];
[PATCH v8 01/14] perf/x86: fix the variable type of the lbr msrs
The msr variable type can be "unsigned int", which uses less memory than "unsigned long". The lbr nr won't be a negative number, so make it "unsigned int" as well. Cc: Peter Zijlstra Cc: Andi Kleen Suggested-by: Peter Zijlstra Signed-off-by: Wei Wang --- arch/x86/events/perf_event.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 8751008..27e4d32 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -660,8 +660,8 @@ struct x86_pmu { /* * Intel LBR */ - unsigned long lbr_tos, lbr_from, lbr_to; /* MSR base regs */ - int lbr_nr; /* hardware stack size */ + unsigned int lbr_tos, lbr_from, lbr_to, + lbr_nr; /* lbr stack and size */ u64 lbr_sel_mask; /* LBR_SELECT valid bits */ const int *lbr_sel_map; /* lbr_select mappings */ bool lbr_double_abort; /* duplicated lbr aborts */ -- 2.7.4
[PATCH v8 03/14] KVM/x86: KVM_CAP_X86_GUEST_LBR
Introduce KVM_CAP_X86_GUEST_LBR to allow per-VM enabling of the guest lbr feature. Signed-off-by: Wei Wang --- Documentation/virt/kvm/api.txt | 26 ++ arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/x86.c | 16 include/uapi/linux/kvm.h | 1 + 4 files changed, 45 insertions(+) diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt index 2d06776..64632a8 100644 --- a/Documentation/virt/kvm/api.txt +++ b/Documentation/virt/kvm/api.txt @@ -5046,6 +5046,32 @@ it hard or impossible to use it correctly. The availability of KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 signals that those bugs are fixed. Userspace should not try to use KVM_CAP_MANUAL_DIRTY_LOG_PROTECT. +7.19 KVM_CAP_X86_GUEST_LBR +Architectures: x86 +Parameters: args[0] whether feature should be enabled or not +args[1] pointer to the userspace memory to load the lbr stack info + +The lbr stack info is described by +struct x86_perf_lbr_stack { + unsigned int nr; + unsigned int tos; + unsigned int from; + unsigned int to; + unsigned int info; +}; + +@nr: number of lbr stack entries +@tos: index of the top of stack msr +@from: index of the msr that stores a branch source address +@to: index of the msr that stores a branch destination address +@info: index of the msr that stores lbr related flags + +Enabling this capability allows guest accesses to the lbr feature. Otherwise, +#GP will be injected to the guest when it accesses the lbr related msrs. + +After the feature is enabled, before exiting to userspace, kvm handlers should +fill the lbr stack info into the userspace memory pointed to by args[1]. + 8. Other capabilities.
-- diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 7b0a4ee..d29 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -875,6 +875,7 @@ struct kvm_arch { atomic_t vapics_in_nmi_mode; struct mutex apic_map_lock; struct kvm_apic_map *apic_map; + struct x86_perf_lbr_stack lbr_stack; bool apic_access_page_done; @@ -884,6 +885,7 @@ struct kvm_arch { bool hlt_in_guest; bool pause_in_guest; bool cstate_in_guest; + bool lbr_in_guest; unsigned long irq_sources_bitmap; s64 kvmclock_offset; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c6d951c..e1eb1be 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3129,6 +3129,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_EXCEPTION_PAYLOAD: r = 1; break; + case KVM_CAP_X86_GUEST_LBR: + r = sizeof(struct x86_perf_lbr_stack); + break; case KVM_CAP_SYNC_REGS: r = KVM_SYNC_X86_VALID_FIELDS; break; @@ -4670,6 +4673,19 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, kvm->arch.exception_payload_enabled = cap->args[0]; r = 0; break; + case KVM_CAP_X86_GUEST_LBR: + r = -EINVAL; + if (cap->args[0] && + x86_perf_get_lbr_stack(&kvm->arch.lbr_stack)) + break; + + if (copy_to_user((void __user *)cap->args[1], +&kvm->arch.lbr_stack, +sizeof(struct x86_perf_lbr_stack))) + break; + kvm->arch.lbr_in_guest = cap->args[0]; + r = 0; + break; default: r = -EINVAL; break; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 5e3f12d..dd53edc 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -996,6 +996,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_ARM_PTRAUTH_ADDRESS 171 #define KVM_CAP_ARM_PTRAUTH_GENERIC 172 #define KVM_CAP_PMU_EVENT_FILTER 173 +#define KVM_CAP_X86_GUEST_LBR 174 #ifdef KVM_CAP_IRQ_ROUTING -- 2.7.4
[PATCH v8 02/14] perf/x86: add a function to get the addresses of the lbr stack msrs
The lbr stack msrs are model specific. The perf subsystem has already assigned the abstracted msr address values based on the cpu model. So add a function to enable callers outside the perf subsystem to get the lbr stack addresses. This is useful for hypervisors to emulate the lbr feature for the guest. Cc: Paolo Bonzini Cc: Andi Kleen Cc: Peter Zijlstra Signed-off-by: Wei Wang --- arch/x86/events/intel/lbr.c | 23 +++ arch/x86/include/asm/perf_event.h | 14 ++ 2 files changed, 37 insertions(+) diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 6f814a2..9b2d05c 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -1311,3 +1311,26 @@ void intel_pmu_lbr_init_knl(void) if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP) x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS; } + +/** + * x86_perf_get_lbr_stack - get the lbr stack related msrs + * + * @stack: the caller's memory to get the lbr stack + * + * Returns: 0 indicates that the lbr stack has been successfully obtained. 
+ */ +int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack) +{ + stack->nr = x86_pmu.lbr_nr; + stack->tos = x86_pmu.lbr_tos; + stack->from = x86_pmu.lbr_from; + stack->to = x86_pmu.lbr_to; + + if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO) + stack->info = MSR_LBR_INFO_0; + else + stack->info = 0; + + return 0; +} +EXPORT_SYMBOL_GPL(x86_perf_get_lbr_stack); diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 1392d5e..2606100 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -318,7 +318,16 @@ struct perf_guest_switch_msr { u64 host, guest; }; +struct x86_perf_lbr_stack { + unsigned int nr; + unsigned int tos; + unsigned int from; + unsigned int to; + unsigned int info; +}; + extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr); +extern int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack); extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap); extern void perf_check_microcode(void); extern int x86_perf_rdpmc_index(struct perf_event *event); @@ -329,6 +338,11 @@ static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr) return NULL; } +static inline int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack) +{ + return -1; +} + static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap) { memset(cap, 0, sizeof(*cap)); -- 2.7.4
[PATCH v8 00/14] Guest LBR Enabling
t userspace do a compatibility check. - Patch 7: -- support perf scheduler to not assign a counter for the perf event that has PERF_EV_CAP_NO_COUNTER set (rather than skipping the perf scheduler). This allows the scheduler to detect lbr usage conflicts via get_event_constraints, and lower priority events will finally fail to use lbr. -- define X86_PMC_IDX_NA as "-1", which represents a never assigned counter id. There are other places that use "-1", but could be updated to use the new macro in another patch series. - Patch 8: -- move the event->owner assignment into perf_event_alloc to have it set before event_init is called. Please see this patch's commit for reasons. - Patch 9: -- use "exclude_host" and "is_kernel_event" to decide if the lbr event is used for the vcpu lbr emulation, which doesn't need a counter, and removes the usage of the previous new perf_event_create API. -- remove the unused attr fields. - Patch 10: -- set a hardware reserved bit (bit 62 of LBR_SELECT) to reg->config for the vcpu lbr emulation event. This makes the config different from other host lbr event, so that they don't share the lbr. Please see the comments in the patch for the reasons why they shouldn't share. - Patch 12: -- disable interrupt and check if the vcpu lbr event owns the lbr feature before kvm writing to the lbr related msr. This avoids kvm updating the lbr msrs after lbr has been reclaimed by other events via ipi. -- remove arch v4 related support. - Patch 13: -- double check if the vcpu lbr event owns the lbr feature before vm-entry into the guest. The lbr pass-through will be cancelled if lbr feature has been reclaimed by a cpu pinned lbr event. 
Previous: https://lkml.kernel.org/r/1562548999-37095-1-git-send-email-wei.w.w...@intel.com Wei Wang (14): perf/x86: fix the variable type of the lbr msrs perf/x86: add a function to get the addresses of the lbr stack msrs KVM/x86: KVM_CAP_X86_GUEST_LBR KVM/x86: intel_pmu_lbr_enable KVM/x86/vPMU: tweak kvm_pmu_get_msr KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest perf/x86: support to create a perf event without counter allocation perf/core: set the event->owner before event_init KVM/x86/vPMU: APIs to create/free lbr perf event for a vcpu thread perf/x86/lbr: don't share lbr for the vcpu usage case perf/x86: save/restore LBR_SELECT on vcpu switching KVM/x86/lbr: lbr emulation KVM/x86/vPMU: check the lbr feature before entering guest KVM/x86: remove the common handling of the debugctl msr Documentation/virt/kvm/api.txt| 26 +++ arch/x86/events/core.c| 36 ++- arch/x86/events/intel/core.c | 3 + arch/x86/events/intel/lbr.c | 95 +++- arch/x86/events/perf_event.h | 6 +- arch/x86/include/asm/kvm_host.h | 5 + arch/x86/include/asm/perf_event.h | 17 ++ arch/x86/kvm/cpuid.c | 2 +- arch/x86/kvm/pmu.c| 24 +- arch/x86/kvm/pmu.h| 11 +- arch/x86/kvm/pmu_amd.c| 7 +- arch/x86/kvm/vmx/pmu_intel.c | 476 +- arch/x86/kvm/vmx/vmx.c| 4 +- arch/x86/kvm/vmx/vmx.h| 2 + arch/x86/kvm/x86.c| 47 ++-- include/linux/perf_event.h| 18 ++ include/uapi/linux/kvm.h | 1 + kernel/events/core.c | 19 +- 18 files changed, 738 insertions(+), 61 deletions(-) -- 2.7.4
Re: [PATCH] KVM: x86: Add fixed counters to PMU filter
On 07/19/2019 02:38 AM, Eric Hankland wrote: From: ehankland Updates KVM_CAP_PMU_EVENT_FILTER so it can also whitelist or blacklist fixed counters. Signed-off-by: ehankland --- Documentation/virtual/kvm/api.txt | 13 - arch/x86/include/uapi/asm/kvm.h | 9 ++--- arch/x86/kvm/pmu.c| 30 +- 3 files changed, 39 insertions(+), 13 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 2cd6250b2896..96bcf1aa1931 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -4090,17 +4090,20 @@ Parameters: struct kvm_pmu_event_filter (in) Returns: 0 on success, -1 on error struct kvm_pmu_event_filter { - __u32 action; - __u32 nevents; - __u64 events[0]; + __u32 action; + __u32 nevents; + __u32 fixed_counter_bitmap; + __u32 flags; + __u32 pad[4]; + __u64 events[0]; }; This ioctl restricts the set of PMU events that the guest can program. The argument holds a list of events which will be allowed or denied. The eventsel+umask of each event the guest attempts to program is compared against the events field to determine whether the guest should have access. -This only affects general purpose counters; fixed purpose counters can -be disabled by changing the perfmon CPUID leaf. +The events field only controls general purpose counters; fixed purpose +counters are controlled by the fixed_counter_bitmap. 
Valid values for 'action': #define KVM_PMU_EVENT_ALLOW 0 diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index e901b0ab116f..503d3f42da16 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -435,9 +435,12 @@ struct kvm_nested_state { /* for KVM_CAP_PMU_EVENT_FILTER */ struct kvm_pmu_event_filter { - __u32 action; - __u32 nevents; - __u64 events[0]; + __u32 action; + __u32 nevents; + __u32 fixed_counter_bitmap; + __u32 flags; + __u32 pad[4]; + __u64 events[0]; }; #define KVM_PMU_EVENT_ALLOW 0 diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index aa5a2597305a..ae5cd1b02086 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -19,8 +19,8 @@ #include "lapic.h" #include "pmu.h" -/* This keeps the total size of the filter under 4k. */ -#define KVM_PMU_EVENT_FILTER_MAX_EVENTS 63 +/* This is enough to filter the vast majority of currently defined events. */ +#define KVM_PMU_EVENT_FILTER_MAX_EVENTS 300 /* NOTE: * - Each perf counter is defined as "struct kvm_pmc"; @@ -206,12 +206,25 @@ void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int idx) { unsigned en_field = ctrl & 0x3; bool pmi = ctrl & 0x8; + struct kvm_pmu_event_filter *filter; + struct kvm *kvm = pmc->vcpu->kvm; + unnecessary white space here, other part looks good to me. Reviewed-by: Wei Wang Best, Wei
[PATCH v2] mm/balloon_compaction: avoid duplicate page removal
Fixes: 418a3ab1e778 (mm/balloon_compaction: List interfaces) A #GP is reported in the guest when requesting balloon inflation via virtio-balloon. The reason is that the virtio-balloon driver has removed the page from its internal page list (via balloon_page_pop), but balloon_page_enqueue_one also calls "list_del" to do the removal. This is necessary when it's used from balloon_page_enqueue_list, but not from balloon_page_enqueue_one. So remove the list_del in balloon_page_enqueue_one, and update some comments as a reminder. Signed-off-by: Wei Wang --- ChangeLog: v1->v2: updated some comments mm/balloon_compaction.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c index 83a7b61..8639bfc 100644 --- a/mm/balloon_compaction.c +++ b/mm/balloon_compaction.c @@ -21,7 +21,6 @@ static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info, * memory corruption is possible and we should stop execution. */ BUG_ON(!trylock_page(page)); - list_del(&page->lru); balloon_page_insert(b_dev_info, page); unlock_page(page); __count_vm_event(BALLOON_INFLATE); @@ -33,7 +32,7 @@ static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info, * @b_dev_info: balloon device descriptor where we will insert a new page to * @pages: pages to enqueue - allocated using balloon_page_alloc. * - * Driver must call it to properly enqueue a balloon pages before definitively + * Driver must call it to properly enqueue balloon pages before definitively * removing it from the guest system. * * Return: number of pages that were enqueued.
@@ -47,6 +46,7 @@ size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info, spin_lock_irqsave(&b_dev_info->pages_lock, flags); list_for_each_entry_safe(page, tmp, pages, lru) { + list_del(&page->lru); balloon_page_enqueue_one(b_dev_info, page); n_pages++; } @@ -128,13 +128,19 @@ struct page *balloon_page_alloc(void) EXPORT_SYMBOL_GPL(balloon_page_alloc); /* - * balloon_page_enqueue - allocates a new page and inserts it into the balloon - * page list. + * balloon_page_enqueue - inserts a new page into the balloon page list. + * * @b_dev_info: balloon device descriptor where we will insert a new page to * @page: new page to enqueue - allocated using balloon_page_alloc. * * Driver must call it to properly enqueue a new allocated balloon page * before definitively removing it from the guest system. + * + * Drivers must not call balloon_page_enqueue on pages that have been + * pushed to a list with balloon_page_push before removing them with + * balloon_page_pop. To enqueue all pages on a list, use + * balloon_page_list_enqueue instead. + * * This function returns the page address for the recently enqueued page or * NULL in the case we fail to allocate a new page this turn. */ -- 2.7.4
Re: use of shrinker in virtio balloon free page hinting
On 07/18/2019 02:47 PM, Michael S. Tsirkin wrote: On Thu, Jul 18, 2019 at 02:30:01PM +0800, Wei Wang wrote: On 07/18/2019 01:58 PM, Michael S. Tsirkin wrote: what if it does not fail? Shrinker is called on system memory pressure. On memory pressure get_free_page_and_send will fail memory allocation, so it stops allocating more. Memory pressure could be triggered by an unrelated allocation e.g. from another driver. As memory pressure is system-wide (no matter who triggers it), free page hinting will fail on memory pressure, same as other drivers. That would be good. Except instead of failing it can hit a race condition where it will reallocate memory freed by shrinker. Not good. OK..I could see this when another module does allocation, which triggers kswapd to have balloon's shrinker release some memory, which could be eaten by balloon quickly again before that module takes it, and this could happen repeatedly in theory. So add a vb->stop_free_page_report boolean, set it in shrinker_count, and clear it in virtio_balloon_queue_free_page_work? Best, Wei
Re: [PATCH v1] mm/balloon_compaction: avoid duplicate page removal
On 07/18/2019 12:31 PM, Michael S. Tsirkin wrote: On Thu, Jul 18, 2019 at 10:23:30AM +0800, Wei Wang wrote: Fixes: 418a3ab1e778 (mm/balloon_compaction: List interfaces) A #GP is reported in the guest when requesting balloon inflation via virtio-balloon. The reason is that the virtio-balloon driver has removed the page from its internal page list (via balloon_page_pop), but balloon_page_enqueue_one also calls "list_del" to do the removal. I would add here "this is necessary when it's used from balloon_page_enqueue_list but not when it's called from balloon_page_enqueue". So remove the list_del in balloon_page_enqueue_one, and have the callers do the page removal from their own page lists. Signed-off-by: Wei Wang Patch is good but comments need some work. --- mm/balloon_compaction.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c index 83a7b61..1a5ddc4 100644 --- a/mm/balloon_compaction.c +++ b/mm/balloon_compaction.c @@ -11,6 +11,7 @@ #include #include +/* Callers ensure that @page has been removed from its original list. */ This comment does not make sense. E.g. balloon_page_enqueue does nothing to ensure this. And drivers are not supposed to care how the page lists are managed. Pls drop. Instead please add the following to balloon_page_enqueue: Note: drivers must not call balloon_page_list_enqueue on Probably, you meant balloon_page_enqueue here. The description for balloon_page_enqueue also seems incorrect: "allocates a new page and inserts it into the balloon page list." This function doesn't do any allocation itself. Plan to reword it: inserts a new page into the balloon page list." Best, Wei
Re: use of shrinker in virtio balloon free page hinting
On 07/18/2019 01:58 PM, Michael S. Tsirkin wrote: what if it does not fail? Shrinker is called on system memory pressure. On memory pressure get_free_page_and_send will fail memory allocation, so it stops allocating more. Memory pressure could be triggered by an unrelated allocation e.g. from another driver. As memory pressure is system-wide (no matter who triggers it), free page hinting will fail on memory pressure, same as other drivers. As long as the page allocation succeeds, we could just take it that the system is not under memory pressure, and things can go on normally. Also, the VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG includes NORETRY and NOMEMALLOC, which makes its allocations fail earlier than most other drivers'. Best, Wei
Re: use of shrinker in virtio balloon free page hinting
On 07/18/2019 12:13 PM, Michael S. Tsirkin wrote: It makes sense for pages in the balloon (requested by hypervisor). However free page hinting can freeze up lots of memory for its own internal reasons. It does not make sense to ask hypervisor to set flags in order to fix internal guest issues. Sounds reasonable to me. Probably we could move the flag check to shrinker_count and shrinker_scan as a reclaiming condition for ballooning pages only? Right. But that does not include the pages in the hint vq, which could be a significant amount of memory. I think it includes, as vb->num_free_page_blocks records the total number of free page blocks that balloon has taken from mm. For shrink_free_pages, it calls return_free_pages_to_mm, which pops pages from vb->free_page_list (this is the list where pages get enlisted after they are put to the hint vq, see get_free_page_and_send). - if free pages are being reported, pages freed by shrinker will just get re-allocated again fill_balloon will re-try the allocation after sleeping 200ms once allocation fails. Even if ballon was never inflated, if shrinker frees some memory while we are hinting, hint vq will keep going and allocate it back without sleeping. Still see get_free_page_and_send. -EINTR is returned when page allocation fails, and reporting ends then. Shrinker is called on system memory pressure. On memory pressure get_free_page_and_send will fail memory allocation, so it stops allocating more. Best, Wei
[PATCH v1] mm/balloon_compaction: avoid duplicate page removal
Fixes: 418a3ab1e778 (mm/balloon_compaction: List interfaces) A #GP is reported in the guest when requesting balloon inflation via virtio-balloon. The reason is that the virtio-balloon driver has removed the page from its internal page list (via balloon_page_pop), but balloon_page_enqueue_one also calls "list_del" to do the removal. So remove the list_del in balloon_page_enqueue_one, and have the callers do the page removal from their own page lists. Signed-off-by: Wei Wang --- mm/balloon_compaction.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c index 83a7b61..1a5ddc4 100644 --- a/mm/balloon_compaction.c +++ b/mm/balloon_compaction.c @@ -11,6 +11,7 @@ #include #include +/* Callers ensure that @page has been removed from its original list. */ static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info, struct page *page) { @@ -21,7 +22,6 @@ static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info, * memory corruption is possible and we should stop execution. */ BUG_ON(!trylock_page(page)); - list_del(&page->lru); balloon_page_insert(b_dev_info, page); unlock_page(page); __count_vm_event(BALLOON_INFLATE); @@ -47,6 +47,7 @@ size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info, spin_lock_irqsave(&b_dev_info->pages_lock, flags); list_for_each_entry_safe(page, tmp, pages, lru) { + list_del(&page->lru); balloon_page_enqueue_one(b_dev_info, page); n_pages++; } -- 2.7.4
Re: [PATCH v2] KVM: x86: PMU Event Filter
On 07/16/2019 08:10 AM, Eric Hankland wrote: I think just disabling guest cpuid might not be enough, since guest could write to the msr without checking the cpuid. Why not just add a bitmap for fixed counter? e.g. fixed_counter_reject_bitmap At the beginning of reprogram_fixed_counter, we could add the check: if (test_bit(idx, &kvm->arch.fixed_counter_reject_bitmap)) return -EACCES; (Please test with your old guest and see if they have issues if we inject #GP when they try to set the fixed_ctrl msr. If there are, we could drop -EACCES above) The bitmap could be set at kvm_vm_ioctl_set_pmu_event_filter. intel_pmu_refresh() checks the guest cpuid and sets the number of fixed counters according to that: pmu->nr_arch_fixed_counters = min_t(int, edx.split.num_counters_fixed, INTEL_PMC_MAX_FIXED); and reprogram_fixed_counters()/get_fixed_pmc() respect this so the guest can't just ignore the cpuid. Yes, but as you noticed, we couldn't disable fixed counter 2 while keeping counter 3 running via that. Adding a bitmap does let you do things like disable the first counter but keep the second and third, but given that there are only three and the events are likely to be on a whitelist anyway, it seemed like adding the bitmap wasn't worth it. If you still feel the same way even though we can disable them via the cpuid, I can add this in. We need the design to be architecturally clean. For example, if the hardware later comes up with fixed counters 5 and 6, and 5 is something we really want to expose to the guest to use while 6 isn't, can our design here (further) support that? We don't want to re-design this at that time. However, extending what we have would be acceptable. So, if you hesitate to add the bitmap method that I described, please add GP tags to the ACTIONs defined, e.g.
enum kvm_pmu_action_type { KVM_PMU_EVENT_ACTION_GP_NONE = 0, KVM_PMU_EVENT_ACTION_GP_ACCEPT = 1, KVM_PMU_EVENT_ACTION_GP_REJECT = 2, KVM_PMU_EVENT_ACTION_MAX }; and add comments to explain something like below: Those GP actions are for the filtering of guest events running on the virtual general purpose counters. The actions to filter guest events running on the virtual fixed function counters are not added currently as they all seem fine to be used by the guest so far, but that could be supported on demand in the future via adding new actions. I think it would be better to add more, please see below: enum kvm_pmu_action_type { KVM_PMU_EVENT_ACTION_NONE = 0, KVM_PMU_EVENT_ACTION_ACCEPT = 1, KVM_PMU_EVENT_ACTION_REJECT = 2, KVM_PMU_EVENT_ACTION_MAX }; and do a check in kvm_vm_ioctl_set_pmu_event_filter() if (filter->action >= KVM_PMU_EVENT_ACTION_MAX) return -EINVAL; This is for detecting the case that we add a new action in userspace, while the kvm hasn't been updated to support that. KVM_PMU_EVENT_ACTION_NONE is for userspace to remove the filter after they set it. We can achieve the same result by using a reject action with an empty set of events - is there some advantage to "none" over that? I can add that check for valid actions. Yes, we could also make it work via passing nevents=0. I slightly prefer the use of the NONE action here. The advantage is simpler (less) userspace command. For QEMU, people could disable the filter list via qmp command, e.g. pmu-filter=none, instead of pmu-filter=accept,nevents=0. It also seems more straightforward when people read the usage manual - a NONE command to cancel the filtering, instead of "first setting this, then setting that.." I don't see advantages of using nevents over NONE action. Anyway, this one isn't a critical issue, and it's up to you here. +#define KVM_PMU_EVENT_FILTER_MAX_EVENTS 63 Why is this limit needed? 
Serves to keep the filters on the smaller side and ensures the size calculation can't overflow if users attempt to. Keeping the filter under 4k is nicer for allocation - also, if we want really large filters we might want to do something smarter than a linear traversal of the filter when guests program counters. I think 63 is too small, and it looks like an arbitrary number to put here. From the SDM table 19-3, it seems there are roughly 300 events. Functionally, the design should support filtering most of them. (optimization with smarter traversal is another story that could be done later) Maybe #define KVM_PMU_EVENT_FILTER_MAX_EVENTS (PAGE_SIZE - sizeof(struct kvm_pmu_event_filter)) / sizeof(__u64) ? and please add some comments above about the consideration that we set this number. + kvfree(filter); Probably better to have it conditionally? if (filter) { synchronize_srcu(); kvfree(filter); } You may want to factor it out, so that kvm_pmu_destroy could reuse it. Do you mean kvm_arch_destroy_vm? It looks like that's where kvm_arch members are freed. I can do that. Sounds good. Best, Wei
Re: [PATCH v2] KVM: x86: PMU Event Filter
On 07/11/2019 09:25 AM, Eric Hankland wrote: - Add a VM ioctl that can control which events the guest can monitor. Signed-off-by: ehankland --- Changes since v1: -Moved to a vm ioctl rather than a vcpu one -Changed from a whitelist to a configurable filter which can either be white or black -Only restrict GP counters since fixed counters require extra handling and they can be disabled by setting the guest cpuid (though only by setting the number - they can't be disabled individually) I think just disabling guest cpuid might not be enough, since guest could write to the msr without checking the cpuid. Why not just add a bitmap for fixed counter? e.g. fixed_counter_reject_bitmap At the beginning of reprogram_fixed_counter, we could add the check: if (test_bit(idx, &kvm->arch.fixed_counter_reject_bitmap)) return -EACCES; (Please test with your old guest and see if they have issues if we inject #GP when they try to set the fixed_ctrl msr. If there are, we could drop -EACCES above) The bitmap could be set at kvm_vm_ioctl_set_pmu_event_filter. +/* for KVM_CAP_PMU_EVENT_FILTER */ +struct kvm_pmu_event_filter { + __u32 type; + __u32 nevents; + __u64 events[0]; +}; + +#define KVM_PMU_EVENT_WHITELIST 0 +#define KVM_PMU_EVENT_BLACKLIST 1 I think it would be better to add more, please see below: enum kvm_pmu_action_type { KVM_PMU_EVENT_ACTION_NONE = 0, KVM_PMU_EVENT_ACTION_ACCEPT = 1, KVM_PMU_EVENT_ACTION_REJECT = 2, KVM_PMU_EVENT_ACTION_MAX }; and do a check in kvm_vm_ioctl_set_pmu_event_filter() if (filter->action >= KVM_PMU_EVENT_ACTION_MAX) return -EINVAL; This is for detecting the case that we add a new action in userspace, while the kvm hasn't been updated to support that. KVM_PMU_EVENT_ACTION_NONE is for userspace to remove the filter after they set it.
+ #endif /* _ASM_X86_KVM_H */ diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index dd745b58ffd8..d674b79ff8da 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -22,6 +22,9 @@ #include "lapic.h" #include "pmu.h" +/* This keeps the total size of the filter under 4k. */ +#define KVM_PMU_EVENT_FILTER_MAX_EVENTS 63 + Why is this limit needed? /* NOTE: * - Each perf counter is defined as "struct kvm_pmc"; * - There are two types of perf counters: general purpose (gp) and fixed. @@ -144,6 +147,10 @@ void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel) { unsigned config, type = PERF_TYPE_RAW; u8 event_select, unit_mask; + struct kvm_arch *arch = &pmc->vcpu->kvm->arch; + struct kvm_pmu_event_filter *filter; + int i; + bool allow_event = true; if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL) printk_once("kvm pmu: pin control bit is ignored\n"); @@ -155,6 +162,24 @@ void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel) if (!(eventsel & ARCH_PERFMON_EVENTSEL_ENABLE) || !pmc_is_enabled(pmc)) return; + rcu_read_lock(); + filter = rcu_dereference(arch->pmu_event_filter); + if (filter) { + for (i = 0; i < filter->nevents; i++) + if (filter->events[i] == + (eventsel & AMD64_RAW_EVENT_MASK_NB)) + break; + if (filter->type == KVM_PMU_EVENT_WHITELIST && + i == filter->nevents) + allow_event = false; + if (filter->type == KVM_PMU_EVENT_BLACKLIST && + i < filter->nevents) + allow_event = false; + } + rcu_read_unlock(); + if (!allow_event) + return; + I think it looks tidier to wrap the changes above into a function: if (kvm_pmu_filter_event(kvm, eventsel & AMD64_RAW_EVENT_MASK_NB)) return; event_select = eventsel & ARCH_PERFMON_EVENTSEL_EVENT; unit_mask = (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8; @@ -351,3 +376,39 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu) { kvm_pmu_reset(vcpu); } + +int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp) +{ + struct kvm_pmu_event_filter tmp, *filter; + size_t size; + int r; + + if 
(copy_from_user(&tmp, argp, sizeof(tmp))) + return -EFAULT; + + if (tmp.nevents > KVM_PMU_EVENT_FILTER_MAX_EVENTS) + return -E2BIG; + + size = sizeof(tmp) + sizeof(tmp.events[0]) * tmp.nevents; + filter = vmalloc(size); + if (!filter) + return -ENOMEM; + + r = -EFAULT; + if (copy_from_user(filter, argp, size)) Though the above functions correctly, I would just move "r = -EFAULT" here to have it executed conditionally. + goto cleanup; + + /* Ensure nevents can't be changed between the user copies. */ + *filter = tmp; + + mutex_lock(&kvm->lock); + rcu_swap_protected(kvm->arch.pmu_event_filter,
Re: [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN
On 07/09/2019 07:35 PM, Peter Zijlstra wrote:

Yeah; although I'm not sure if it's an implementation or specification problem. But as it exists it is of very limited use.

Fundamentally our events (with the exception of event groups) are independent. Events should always count, except when the PMI is running -- so as to not include the measurement overhead in the measurement itself.

But this (mis)feature stops the entire PMU as soon as a single counter overflows, inhibiting all other counters from running (as they should) until the PMI has happened and reset the state.

(Note that, strictly speaking, we even expect the overflowing counter to continue counting until the PMI happens. Having an overflow should not mean we lose events. A sampling and !sampling event should produce the same event count.)

So even when there's only a single event (group) scheduled, it isn't strictly right. And when there are multiple events scheduled it is definitely wrong.

And while I understand the purpose of the current semantics; it makes a single event group's sample count more coherent, the fact that it loses events just bugs me something fierce -- and as shown, it breaks tools.

Thanks for sharing the finding. If I understand this correctly, you observed the counter getting frozen earlier than expected (it is expected to freeze at the time the PMI gets generated). Have you talked to anyone about a possible freeze adjustment in the hardware?

Best, Wei
Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
On 07/09/2019 07:45 PM, Peter Zijlstra wrote:

+ * config: Actually this field won't be used by the perf core
+ *         as this event doesn't have a perf counter.
+ * sample_period: Same as above.

If it's unused; why do we need to set it at all?

OK, we'll remove the unused fields.

+ * sample_type: tells the perf core that it is an lbr event.
+ * branch_sample_type: tells the perf core that the lbr event works in
+ *         the user callstack mode so that the lbr stack will be
+ *         saved/restored on vCPU switching.

Again; doesn't make sense. What does the user part have to do with save/restore? What happens when this vcpu thread drops to userspace for an assist?

This is a fake event which doesn't run the lbr on the host. Returning to userspace doesn't need an lbr save/restore, because userspace wouldn't change those lbr msrs. The event is created to save/restore the lbr msrs on vCPU switching. Host perf only does this save/restore for a "user callstack mode" lbr event, so we construct the event as "user callstack mode" to reuse the host lbr save/restore. An alternative method is to have KVM vCPU sched_in/out callbacks save/restore those msrs, which doesn't require creating this fake event. Please let me know if you prefer the second method.

Best, Wei
Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
On 07/09/2019 08:19 PM, Peter Zijlstra wrote:

For the lbr feature, could we think of it as first come, first served? For example, if we have 2 host threads that want to use the lbr at the same time, I think one of them would simply fail to use it. So if the guest gets the lbr first, the host wouldn't take it over unless a userspace command (we added to QEMU) is executed to have the vCPU actively stop using the lbr.

Doesn't work that way. Say you start KVM with LBR emulation, it creates this task event, it gets the LBR (nobody else wants it) and the guest works and starts using the LBR. Then the host creates a CPU LBR event and the vCPU suddenly gets denied the LBR and the guest no longer functions correctly. Or you should fail to VMENTER, in which case you starve the guest, but at least it doesn't malfunction.

Ok, I think we could go with the first option for now, like this: when other host threads start to use the lbr, host perf notifies kvm to disable the guest's use of the lbr (add a notifier_block), which includes:
- cancel passing through the lbr msrs to the guest;
- upon guest access to an lbr msr (which will now trap to KVM), return #GP (or 0).

Best, Wei
Re: [PATCH v7 07/12] perf/x86: no counter allocation support
On 07/09/2019 05:43 PM, Peter Zijlstra wrote: That's almost a year ago; I really can't remember that and you didn't put any of that in your Changelog to help me remember. (also please use: https://lkml.kernel.org/r/$msgid style links) OK, I'll put this link in the cover letter or commit log for a reminder. In the previous version, we added a "no_counter" bit to perf_event_attr, and that will be exposed to user ABI, which seems not good. (https://lkml.org/lkml/2019/2/14/791) So we wrap a new kernel API above to support this. Do you have a different suggestion to do this? (exclude host/guest just clears the enable bit when on VM-exit/entry, still consumes the counter) Just add an argument to perf_event_create_kernel_counter() ? Yes. I didn't find a proper place to add this "no_counter" indicator, so added a wrapper to avoid changing existing callers of perf_event_create_kernel_counter. Best, Wei
Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
On 07/09/2019 05:39 PM, Peter Zijlstra wrote: On Tue, Jul 09, 2019 at 11:04:21AM +0800, Wei Wang wrote: On 07/08/2019 10:48 PM, Peter Zijlstra wrote:

*WHY* does the host need to save/restore? Why not make VMENTER/VMEXIT do this?

Because the VMX transition is much more frequent than the vCPU switching. On SKL, saving 32 LBR entries could add 3000~4000 cycles of overhead, which would be too large for the frequent VMX transitions. LBR state is saved when the vCPU is scheduled out to ensure that this vCPU's LBR data doesn't get lost (as another vCPU or host thread that is scheduled in may use the LBR).

But VMENTER/VMEXIT still have to enable/disable the LBR, right? Otherwise the host will pollute the LBR contents. And you then rely on this 'fake' event to ensure the host doesn't use the LBR when the VCPU is running.

Yes, only the debugctl msr is saved/restored on VMX transitions.

But what about the counter scheduling rules;

The counter is emulated independently of the lbr emulation. Here is the background reason: the direction we are going is architectural emulation, where the features are emulated based on the hardware behavior described in the spec. So the lbr emulation path only offers the lbr feature to the guest (no counters associated, as the lbr feature essentially doesn't have a counter).

If the above isn't clear, please see this example: the guest could run any software to use the lbr feature (non-perf or non-linux, or even a testing kernel module trying the lbr for its own purpose), and it could choose to use a regular timer to do sampling. If the lbr emulation took a counter to generate a PMI to the guest to do sampling, that PMI wouldn't be expected from the guest's perspective.

So the counter scheduling isn't considered by the lbr emulation here; it is considered by the counter emulation. If the guest needs a counter, it configures the related msr, which traps to KVM, and the counter emulation has its own emulation path (e.g.
reprogram_gp_counter, which is called when the guest writes to the emulated eventsel msr).

what happens when a CPU event claims the LBR before the task event can claim it? CPU events have precedence over task events.

I think that precedence (cpu pinned and task pinned) is for the counter multiplexing, right? For the lbr feature, could we think of it as first come, first served? For example, if we have 2 host threads that want to use the lbr at the same time, I think one of them would simply fail to use it. So if the guest gets the lbr first, the host wouldn't take it over unless a userspace command (we added to QEMU) is executed to have the vCPU actively stop using the lbr.

I'm missing all these details in the Changelogs. Please describe the whole setup and explain why this approach.

OK, I just shared some important background above. I'll see if any more important details are missing.

Best, Wei
Re: [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN
On 07/08/2019 11:09 PM, Peter Zijlstra wrote: On Mon, Jul 08, 2019 at 09:23:19AM +0800, Wei Wang wrote:

This patch enables the LBR related features of Arch v4 in advance, though the current vPMU only has v2 support. Other Arch v4 related support will be enabled later in another series. Arch v4 supports streamlined Freeze_LBR_on_PMI. According to the SDM, the LBR_FRZ bit is set in the global status when debugctl.freeze_lbr_on_pmi has been set and a PMI is generated. The CTR_FRZ bit is set when debugctl.freeze_perfmon_on_pmi is set and a PMI is generated.

(that's still a misnomer; it is: freeze_perfmon_on_overflow)

OK. (but that was directly copied from SDM section 18.2.4.1)

Why? Who uses that v4 crud?

I saw the native perf driver has been updated to v4. After the vPMU gets updated to v4, the guest perf would use that. If you prefer to hold this patch until vPMU v4 support, we could do that as well.

It's broken. It loses events between the overflow and the PMI.

Do you mean it's a v4 hardware issue?

Best, Wei
Re: [PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack
On 07/08/2019 10:53 PM, Peter Zijlstra wrote: On Mon, Jul 08, 2019 at 09:23:17AM +0800, Wei Wang wrote:

When the vCPU is scheduled in:
- if the lbr feature was used in the last vCPU time slice, set the lbr stack to be interceptible, so that the host can capture whether the lbr feature will be used in this time slice;
- if the lbr feature wasn't used in the last vCPU time slice, disable the vCPU support of the guest lbr switching.

Upon the first access to one of the lbr related MSRs (since the vCPU was scheduled in):
- record that the guest has used the lbr;
- create a host perf event to help save/restore the guest lbr stack;
- pass the stack through to the guest.

I don't understand a word of that. Who cares if the LBR MSRs are touched; the guest expects up-to-date values when it does read them.

Another host thread that shares the same pCPU with this vCPU thread may use the lbr stack, so the host needs to save/restore the vCPU's lbr state. Otherwise the guest perf inside the vCPU wouldn't read the correct data from the lbr msrs (as the msrs would already have been changed by another host thread). As Andi also replied, if the vCPU isn't using the lbr anymore, the host doesn't need to save the lbr msrs then.

Best, Wei
Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
On 07/08/2019 10:48 PM, Peter Zijlstra wrote: On Mon, Jul 08, 2019 at 09:23:15AM +0800, Wei Wang wrote:

From: Like Xu

This patch adds support to enable/disable the host side save/restore

This patch should be disqualified on Changelog alone... Documentation/process/submitting-patches.rst: instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy

OK, we will discard "This patch" in the description: "To enable/disable the host side save/restore for the guest lbr stack on vCPU switching, the host creates a perf event for the vCPU..."

for the guest lbr stack on vCPU switching. To enable that, the host creates a perf event for the vCPU, and the event attributes are set to the user callstack mode lbr so that all the conditions are met in the host perf subsystem to save the lbr stack on task switching. The host side lbr perf event is created only for the purpose of saving and restoring the lbr stack. There is no need to enable the lbr functionality for this perf event, because the feature is essentially used by the vCPU. So perf_event_create is invoked with need_counter=false so that no counter is assigned to the perf event. The vcpu_lbr field is added to cpuc to indicate whether the lbr perf event is used by the vCPU only for context switching. When the perf subsystem handles this event (e.g. lbr enable or read lbr stack on PMI) and finds it non-zero, it simply returns.

*WHY* does the host need to save/restore? Why not make VMENTER/VMEXIT do this?

Because the VMX transition is much more frequent than the vCPU switching. On SKL, saving 32 LBR entries could add 3000~4000 cycles of overhead, which would be too large for the frequent VMX transitions. LBR state is saved when the vCPU is scheduled out to ensure that this vCPU's LBR data doesn't get lost (as another vCPU or host thread that is scheduled in may use the LBR).

Many of these patches don't explain why things are done; that's a problem.

We'll improve; please point out anything that isn't clear to you, thanks.

Best, Wei
Re: [PATCH v7 07/12] perf/x86: no counter allocation support
On 07/08/2019 10:29 PM, Peter Zijlstra wrote:

Thanks for the comments.

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0ab99c7..19e6593 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -528,6 +528,7 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
  */
 #define PERF_EV_CAP_SOFTWARE		BIT(0)
 #define PERF_EV_CAP_READ_ACTIVE_PKG	BIT(1)
+#define PERF_EV_CAP_NO_COUNTER		BIT(2)

 #define SWEVENT_HLIST_BITS		8
 #define SWEVENT_HLIST_SIZE		(1 << SWEVENT_HLIST_BITS)
@@ -895,6 +896,13 @@ extern int perf_event_refresh(struct perf_event *event, int refresh);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
+perf_event_create(struct perf_event_attr *attr,
+		  int cpu,
+		  struct task_struct *task,
+		  perf_overflow_handler_t overflow_handler,
+		  void *context,
+		  bool counter_assignment);
+extern struct perf_event *
 perf_event_create_kernel_counter(struct perf_event_attr *attr,
 				 int cpu,
 				 struct task_struct *task,

Why the heck are you creating this wrapper nonsense?

(please see the earlier discussion: https://lkml.org/lkml/2018/9/20/868) I thought we agreed that the perf event created here doesn't need to consume an extra counter. In the previous version, we added a "no_counter" bit to perf_event_attr, but that would be exposed in the user ABI, which seems not good. (https://lkml.org/lkml/2019/2/14/791) So we wrapped a new kernel API above to support this. Do you have a different suggestion for doing this? (exclude host/guest just clears the enable bit on VM-exit/entry, and still consumes the counter)

Best, Wei
[PATCH v7 04/12] KVM/x86: intel_pmu_lbr_enable
The lbr stack is architecturally specific, for example, SKX has 32 lbr stack entries while HSW has 16 entries, so a HSW guest running on a SKX machine may not get accurate perf results. Currently, we forbid the guest lbr enabling when the guest and host see different lbr stack entries or the host and guest see different lbr stack msr indices. Signed-off-by: Wei Wang Cc: Paolo Bonzini Cc: Andi Kleen Cc: Peter Zijlstra --- arch/x86/kvm/pmu.c | 8 +++ arch/x86/kvm/pmu.h | 2 + arch/x86/kvm/vmx/pmu_intel.c | 136 +++ arch/x86/kvm/x86.c | 3 +- 4 files changed, 147 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 132d149..7d7ac18 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -296,6 +296,14 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data) return 0; } +bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu) +{ + if (kvm_x86_ops->pmu_ops->lbr_enable) + return kvm_x86_ops->pmu_ops->lbr_enable(vcpu); + + return false; +} + void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu) { if (lapic_in_kernel(vcpu)) diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 22dff66..c099b4b 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -29,6 +29,7 @@ struct kvm_pmu_ops { u64 *mask); int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx); bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr); + bool (*lbr_enable)(struct kvm_vcpu *vcpu); int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data); int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); void (*refresh)(struct kvm_vcpu *vcpu); @@ -107,6 +108,7 @@ void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel); void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx); void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx); +bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu); void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu); void kvm_pmu_handle_event(struct kvm_vcpu *vcpu); int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 
*data);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 68d231d..ef8ebd4 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include
 #include "x86.h"
 #include "cpuid.h"
 #include "lapic.h"
@@ -162,6 +163,140 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 	return ret;
 }

+static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+	u8 vcpu_model = guest_cpuid_model(vcpu);
+	unsigned int vcpu_lbr_from, vcpu_lbr_nr;
+
+	if (x86_perf_get_lbr_stack(&kvm->arch.lbr_stack))
+		return false;
+
+	if (guest_cpuid_family(vcpu) != boot_cpu_data.x86)
+		return false;
+
+	/*
+	 * It could be possible that people have vcpus of an old model running
+	 * on physical cpus of a newer model, for example a BDW guest on a SKX
+	 * machine (but not possible the other way around). The BDW guest may
+	 * not get accurate results on a SKX machine as it only reads 16
+	 * entries of the lbr stack while there are 32 entries of recordings.
+	 * We currently forbid the lbr enabling when the vcpu and physical cpu
+	 * see different lbr stack entries or the guest lbr msr indices are
+	 * not compatible with the host.
+	 */
+	switch (vcpu_model) {
+	case INTEL_FAM6_CORE2_MEROM:
+	case INTEL_FAM6_CORE2_MEROM_L:
+	case INTEL_FAM6_CORE2_PENRYN:
+	case INTEL_FAM6_CORE2_DUNNINGTON:
+		/* intel_pmu_lbr_init_core() */
+		vcpu_lbr_nr = 4;
+		vcpu_lbr_from = MSR_LBR_CORE_FROM;
+		break;
+	case INTEL_FAM6_NEHALEM:
+	case INTEL_FAM6_NEHALEM_EP:
+	case INTEL_FAM6_NEHALEM_EX:
+		/* intel_pmu_lbr_init_nhm() */
+		vcpu_lbr_nr = 16;
+		vcpu_lbr_from = MSR_LBR_NHM_FROM;
+		break;
+	case INTEL_FAM6_ATOM_BONNELL:
+	case INTEL_FAM6_ATOM_BONNELL_MID:
+	case INTEL_FAM6_ATOM_SALTWELL:
+	case INTEL_FAM6_ATOM_SALTWELL_MID:
+	case INTEL_FAM6_ATOM_SALTWELL_TABLET:
+		/* intel_pmu_lbr_init_atom() */
+		vcpu_lbr_nr = 8;
+		vcpu_lbr_from = MSR_LBR_CORE_FROM;
+		break;
+	case INTEL_FAM6_ATOM_SILVERMONT:
+	case INTEL_FAM6_ATOM_SILVERMONT_X:
+	case INTEL_FAM6_ATOM_SILVERMONT_MID:
+	case INTEL_FAM6_ATOM_AIRMONT:
+	case INTEL_FAM6_ATOM_AIRMONT_MID:
+		/* intel_pmu_lbr_init_slm() */
+		vcpu_lbr_nr = 8;
+		vcpu_lbr_from = MSR_LBR_CORE_FROM;
+		break;
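The per-model switch above fills in the lbr stack geometry the guest expects; the enable decision (truncated in this excerpt) then compares it against what the host perf subsystem reports. Below is an illustrative userspace model of that comparison only — the struct and function names are not the patch's, and the rule is the one stated in the patch comment (entry count and FROM msr base must both match).

```c
#include <stdbool.h>

/* Minimal geometry descriptor, an illustrative subset of x86_perf_lbr_stack. */
struct lbr_geometry {
	unsigned int nr;	/* number of lbr stack entries */
	unsigned int from;	/* base msr index of the FROM registers */
};

/* Guest lbr may be enabled only when the vcpu model's expected geometry
 * matches the host's: same entry count and same msr layout. */
bool lbr_geometry_compatible(const struct lbr_geometry *vcpu_lbr,
			     const struct lbr_geometry *host_lbr)
{
	return vcpu_lbr->nr == host_lbr->nr &&
	       vcpu_lbr->from == host_lbr->from;
}
```

For example, a 16-entry guest model on a 32-entry host is rejected, matching the BDW-on-SKX case described in the comment.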
[PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN
This patch enables the LBR related features in Arch v4 in advance, though the current vPMU only has v2 support. Other arch v4 related support will be enabled later in another series. Arch v4 supports streamlined Freeze_LBR_on_PMI. According to the SDM, the LBR_FRZ bit is set to global status when debugctl.freeze_lbr_on_pmi has been set and a PMI is generated. The CTR_FRZ bit is set when debugctl.freeze_perfmon_on_pmi is set and a PMI is generated. Signed-off-by: Wei Wang Cc: Andi Kleen Cc: Paolo Bonzini Cc: Kan Liang --- arch/x86/kvm/pmu.c | 11 +-- arch/x86/kvm/pmu.h | 1 + arch/x86/kvm/vmx/pmu_intel.c | 20 3 files changed, 30 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index 323bb45..89bff8f 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -52,6 +52,13 @@ static void kvm_pmi_trigger_fn(struct irq_work *irq_work) kvm_pmu_deliver_pmi(vcpu); } +static void kvm_perf_set_global_status(struct kvm_pmu *pmu, u8 idx) +{ + __set_bit(idx, (unsigned long *)&pmu->global_status); + if (kvm_x86_ops->pmu_ops->set_global_status) + kvm_x86_ops->pmu_ops->set_global_status(pmu, idx); +} + static void kvm_perf_overflow(struct perf_event *perf_event, struct perf_sample_data *data, struct pt_regs *regs) @@ -61,7 +68,7 @@ static void kvm_perf_overflow(struct perf_event *perf_event, if (!test_and_set_bit(pmc->idx, (unsigned long *)&pmu->reprogram_pmi)) { - __set_bit(pmc->idx, (unsigned long *)&pmu->global_status); + kvm_perf_set_global_status(pmu, pmc->idx); kvm_make_request(KVM_REQ_PMU, pmc->vcpu); } } @@ -75,7 +82,7 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event, if (!test_and_set_bit(pmc->idx, (unsigned long *)&pmu->reprogram_pmi)) { - __set_bit(pmc->idx, (unsigned long *)&pmu->global_status); + kvm_perf_set_global_status(pmu, pmc->idx); kvm_make_request(KVM_REQ_PMU, pmc->vcpu); /* diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index cadf91a..408ddc2 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ 
-24,6 +24,7 @@ struct kvm_pmu_ops { u8 unit_mask); unsigned (*find_fixed_event)(int idx); bool (*pmc_is_enabled)(struct kvm_pmc *pmc); + void (*set_global_status)(struct kvm_pmu *pmu, u8 idx); struct kvm_pmc *(*pmc_idx_to_pmc)(struct kvm_pmu *pmu, int pmc_idx); struct kvm_pmc *(*msr_idx_to_pmc)(struct kvm_vcpu *vcpu, unsigned idx, u64 *mask); diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index fd09777..6f74b69 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -413,6 +413,22 @@ static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu, return ret; } +static void intel_pmu_set_global_status(struct kvm_pmu *pmu, u8 idx) +{ + u64 guest_debugctl; + + if (pmu->version >= 4) { + guest_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); + + if (guest_debugctl & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) + __set_bit(GLOBAL_STATUS_LBRS_FROZEN, + (unsigned long *)&pmu->global_status); + if (guest_debugctl & DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI) + __set_bit(GLOBAL_STATUS_COUNTERS_FROZEN, + (unsigned long *)&pmu->global_status); + } +} + static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); @@ -597,6 +613,9 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu) pmu->global_ovf_ctrl_mask = pmu->global_ctrl_mask & ~(MSR_CORE_PERF_GLOBAL_OVF_CTRL_OVF_BUF | MSR_CORE_PERF_GLOBAL_OVF_CTRL_COND_CHGD); + if (pmu->version >= 4) + pmu->global_ovf_ctrl_mask &= ~(GLOBAL_STATUS_LBRS_FROZEN | + GLOBAL_STATUS_COUNTERS_FROZEN); if (kvm_x86_ops->pt_supported()) pmu->global_ovf_ctrl_mask &= ~MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI; @@ -711,6 +730,7 @@ void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu) struct kvm_pmu_ops intel_pmu_ops = { .find_arch_event = intel_find_arch_event, .find_fixed_event = intel_find_fixed_event, + .set_global_status = intel_pmu_set_global_status, .pmc_is_enabled = intel_pmc_is_enabled, .pmc_idx_to_pmc = intel_pmc_idx_to_pmc, .msr_idx_to_pmc = 
intel_msr_idx_to_pmc, -- 2.7.4
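The new set_global_status callback above reduces to a pure function over the global status word: on overflow, Arch v4 additionally latches the FROZEN bits when the guest enabled the corresponding freeze-on-PMI controls in IA32_DEBUGCTL. A userspace sketch follows; the bit positions are taken from the SDM (DEBUGCTL bits 11/12 for the freeze controls, GLOBAL_STATUS bits 58/59 for the FROZEN flags) and should be checked against your headers.

```c
#include <stdint.h>

#define DEBUGCTL_FREEZE_LBRS_ON_PMI	(1ULL << 11)
#define DEBUGCTL_FREEZE_PERFMON_ON_PMI	(1ULL << 12)
#define GLOBAL_STATUS_LBRS_FROZEN_BIT		58
#define GLOBAL_STATUS_COUNTERS_FROZEN_BIT	59

/* Model of the v4 status update: set the overflow bit for the counter, and,
 * for PMU version >= 4, latch the FROZEN bits per the guest's debugctl. */
uint64_t set_global_status(uint64_t global_status, int pmu_version,
			   uint64_t guest_debugctl, int overflow_idx)
{
	global_status |= 1ULL << overflow_idx;
	if (pmu_version >= 4) {
		if (guest_debugctl & DEBUGCTL_FREEZE_LBRS_ON_PMI)
			global_status |= 1ULL << GLOBAL_STATUS_LBRS_FROZEN_BIT;
		if (guest_debugctl & DEBUGCTL_FREEZE_PERFMON_ON_PMI)
			global_status |= 1ULL << GLOBAL_STATUS_COUNTERS_FROZEN_BIT;
	}
	return global_status;
}
```

Note how the version gate matches the patch: a v2 vPMU never reports the FROZEN bits, regardless of the guest's debugctl setting.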
[PATCH v7 02/12] perf/x86: add a function to get the lbr stack
The LBR stack MSRs are architecturally specific. The perf subsystem has already assigned the abstracted MSR values based on the CPU architecture. This patch enables a caller outside the perf subsystem to get the LBR stack info. This is useful for hypervisors to prepare the lbr feature for the guest.

Signed-off-by: Wei Wang
Cc: Paolo Bonzini
Cc: Andi Kleen
Cc: Peter Zijlstra
---
 arch/x86/events/intel/lbr.c       | 23 +++
 arch/x86/include/asm/perf_event.h | 14 ++
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 6f814a2..784642a 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1311,3 +1311,26 @@ void intel_pmu_lbr_init_knl(void)
 	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP)
 		x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;
 }
+
+/**
+ * x86_perf_get_lbr_stack - get the lbr stack related MSRs
+ *
+ * @stack: the caller's memory to get the lbr stack
+ *
+ * Returns: 0 indicates that the lbr stack has been successfully obtained.
+ */
+int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack)
+{
+	stack->nr = x86_pmu.lbr_nr;
+	stack->tos = x86_pmu.lbr_tos;
+	stack->from = x86_pmu.lbr_from;
+	stack->to = x86_pmu.lbr_to;
+
+	if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+		stack->info = MSR_LBR_INFO_0;
+	else
+		stack->info = 0;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(x86_perf_get_lbr_stack);

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 1392d5e..2606100 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -318,7 +318,16 @@ struct perf_guest_switch_msr {
 	u64 host, guest;
 };

+struct x86_perf_lbr_stack {
+	unsigned int	nr;
+	unsigned int	tos;
+	unsigned int	from;
+	unsigned int	to;
+	unsigned int	info;
+};
+
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 extern void perf_check_microcode(void);
 extern int x86_perf_rdpmc_index(struct perf_event *event);
@@ -329,6 +338,11 @@ static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 	return NULL;
 }

+static inline int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack)
+{
+	return -1;
+}
+
 static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 {
 	memset(cap, 0, sizeof(*cap));
--
2.7.4
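For callers, the contract of x86_perf_get_lbr_stack() is simply "copy the host's abstracted lbr msr layout into the caller's struct". The userspace model below captures that: the host_pmu struct is a stand-in for the global x86_pmu, and the constants (LBR_FORMAT_INFO = 0x5, MSR_LBR_INFO_0 = 0xdc0) are restated here for self-containment — verify them against the upstream headers.

```c
#include <stdint.h>

#define LBR_FORMAT_INFO	0x5
#define MSR_LBR_INFO_0	0xdc0

struct x86_perf_lbr_stack {
	unsigned int nr, tos, from, to, info;
};

struct host_pmu {	/* stand-in for the global x86_pmu */
	unsigned int lbr_nr, lbr_tos, lbr_from, lbr_to;
	unsigned int lbr_format;
};

int perf_get_lbr_stack(const struct host_pmu *pmu,
		       struct x86_perf_lbr_stack *stack)
{
	stack->nr = pmu->lbr_nr;
	stack->tos = pmu->lbr_tos;
	stack->from = pmu->lbr_from;
	stack->to = pmu->lbr_to;
	/* The LBR_INFO msrs exist only for the INFO lbr format (SKL+). */
	stack->info = (pmu->lbr_format == LBR_FORMAT_INFO) ? MSR_LBR_INFO_0 : 0;
	return 0;
}
```

A zero info field thus signals to the hypervisor that the LBR_INFO msr range is absent on this host.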
[PATCH v7 06/12] KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest
Bits [0, 5] of MSR_IA32_PERF_CAPABILITIES tell about the format of the addresses stored in the LBR stack. Expose those bits to the guest when the guest lbr feature is enabled. Signed-off-by: Wei Wang Cc: Paolo Bonzini Cc: Andi Kleen --- arch/x86/include/asm/perf_event.h | 2 ++ arch/x86/kvm/cpuid.c | 2 +- arch/x86/kvm/vmx/pmu_intel.c | 16 3 files changed, 19 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 2606100..aa77da2 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -95,6 +95,8 @@ #define PEBS_DATACFG_LBRS BIT_ULL(3) #define PEBS_DATACFG_LBR_SHIFT 24 +#define X86_PERF_CAP_MASK_LBR_FMT 0x3f + /* * Intel "Architectural Performance Monitoring" CPUID * detection/enumeration details: diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 4992e7c..4b9e713 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -361,7 +361,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | 0 /* DS-CPL, VMX, SMX, EST */ | 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | - F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ | + F(FMA) | F(CX16) | 0 /* xTPR Update*/ | F(PDCM) | F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) | F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) | 0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) | diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 1e19b01..09ae6ff 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -151,6 +151,7 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_CORE_PERF_GLOBAL_STATUS: case MSR_CORE_PERF_GLOBAL_CTRL: case MSR_CORE_PERF_GLOBAL_OVF_CTRL: + case MSR_IA32_PERF_CAPABILITIES: ret = pmu->version > 1; break; default: @@ -316,6 +317,19 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case 
MSR_CORE_PERF_GLOBAL_OVF_CTRL: msr_info->data = pmu->global_ovf_ctrl; return 0; + case MSR_IA32_PERF_CAPABILITIES: { + u64 data; + + if (!boot_cpu_has(X86_FEATURE_PDCM) || + (!msr_info->host_initiated && +!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))) + return 1; + data = native_read_msr(MSR_IA32_PERF_CAPABILITIES); + msr_info->data = 0; + if (vcpu->kvm->arch.lbr_in_guest) + msr_info->data |= (data & X86_PERF_CAP_MASK_LBR_FMT); + return 0; + } default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) { u64 val = pmc_read_counter(pmc); @@ -374,6 +388,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return 0; } break; + case MSR_IA32_PERF_CAPABILITIES: + return 1; /* RO MSR */ default: if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) { if (msr_info->host_initiated) -- 2.7.4
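The guest-visible MSR_IA32_PERF_CAPABILITIES value computed above reduces to a single masking expression: only the LBR format field (bits 5:0) passes through, and only when lbr is enabled for the guest. A minimal sketch, with the mask value taken from the patch's X86_PERF_CAP_MASK_LBR_FMT:

```c
#include <stdint.h>

#define X86_PERF_CAP_MASK_LBR_FMT 0x3fULL

/* Guest read of PERF_CAPABILITIES: LBR format bits pass through when the
 * lbr is enabled for the guest; everything else reads as zero. */
uint64_t guest_perf_capabilities(uint64_t host_caps, int lbr_in_guest)
{
	return lbr_in_guest ? (host_caps & X86_PERF_CAP_MASK_LBR_FMT) : 0;
}
```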
[PATCH v7 11/12] KVM/x86: remove the common handling of the debugctl msr
The debugctl msr is not completely identical on AMD and Intel CPUs; for example, FREEZE_LBRS_ON_PMI is supported by Intel CPUs only. Now, this msr is handled separately in svm.c and intel_pmu.c. So remove the common debugctl msr handling code in kvm_get/set_msr_common.

Signed-off-by: Wei Wang
Cc: Paolo Bonzini
Cc: Andi Kleen
Cc: Peter Zijlstra
---
 arch/x86/kvm/x86.c | 13 -
 1 file changed, 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3bc7f2..08aa34b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2513,18 +2513,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 1;
 		}
 		break;
-	case MSR_IA32_DEBUGCTLMSR:
-		if (!data) {
-			/* We support the non-activated case already */
-			break;
-		} else if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)) {
-			/* Values other than LBR and BTF are vendor-specific,
-			   thus reserved and should throw a #GP */
-			return 1;
-		}
-		vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
-			    __func__, data);
-		break;
 	case 0x200 ... 0x2ff:
 		return kvm_mtrr_set_msr(vcpu, msr, data);
 	case MSR_IA32_APICBASE:
@@ -2766,7 +2754,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	switch (msr_info->index) {
 	case MSR_IA32_PLATFORM_ID:
 	case MSR_IA32_EBL_CR_POWERON:
-	case MSR_IA32_DEBUGCTLMSR:
 	case MSR_IA32_LASTBRANCHFROMIP:
 	case MSR_IA32_LASTBRANCHTOIP:
 	case MSR_IA32_LASTINTFROMIP:
--
2.7.4
[PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack
When the vCPU is scheduled in: - if the lbr feature was used in the last vCPU time slice, set the lbr stack to be interceptible, so that the host can capture whether the lbr feature will be used in this time slice; - if the lbr feature wasn't used in the last vCPU time slice, disable the vCPU support of the guest lbr switching. Upon the first access to one of the lbr related MSRs (since the vCPU was scheduled in): - record that the guest has used the lbr; - create a host perf event to help save/restore the guest lbr stack; - pass the stack through to the guest. Suggested-by: Andi Kleen Signed-off-by: Wei Wang Cc: Paolo Bonzini Cc: Andi Kleen Cc: Peter Zijlstra --- arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/pmu.c | 6 ++ arch/x86/kvm/pmu.h | 2 + arch/x86/kvm/vmx/pmu_intel.c| 141 arch/x86/kvm/vmx/vmx.c | 4 +- arch/x86/kvm/vmx/vmx.h | 2 + arch/x86/kvm/x86.c | 2 + 7 files changed, 157 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 79e9c92..cf8996e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -469,6 +469,8 @@ struct kvm_pmu { u64 global_ctrl_mask; u64 global_ovf_ctrl_mask; u64 reserved_bits; + /* Indicate if the lbr msrs were accessed in this vCPU time slice */ + bool lbr_used; u8 version; struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC]; struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED]; diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index ee6ed47..323bb45 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -325,6 +325,12 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) return kvm_x86_ops->pmu_ops->set_msr(vcpu, msr_info); } +void kvm_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ + if (kvm_x86_ops->pmu_ops->sched_in) + kvm_x86_ops->pmu_ops->sched_in(vcpu, cpu); +} + /* refresh PMU settings. 
This function generally is called when underlying * settings are changed (such as changes of PMU CPUID by guest VMs), which * should rarely happen. diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 384a0b7..cadf91a 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -32,6 +32,7 @@ struct kvm_pmu_ops { bool (*lbr_enable)(struct kvm_vcpu *vcpu); int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info); + void (*sched_in)(struct kvm_vcpu *vcpu, int cpu); void (*refresh)(struct kvm_vcpu *vcpu); void (*init)(struct kvm_vcpu *vcpu); void (*reset)(struct kvm_vcpu *vcpu); @@ -116,6 +117,7 @@ int kvm_pmu_is_valid_msr_idx(struct kvm_vcpu *vcpu, unsigned idx); bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr); int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info); int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info); +void kvm_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu); void kvm_pmu_refresh(struct kvm_vcpu *vcpu); void kvm_pmu_reset(struct kvm_vcpu *vcpu); void kvm_pmu_init(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 24a544e..fd09777 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -13,10 +13,12 @@ #include #include #include +#include #include "x86.h" #include "cpuid.h" #include "lapic.h" #include "pmu.h" +#include "vmx.h" static struct kvm_event_hw_type_mapping intel_arch_events[] = { /* Index must match CPUID 0x0A.EBX bit vector */ @@ -141,6 +143,18 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu, return &counters[idx]; } +static inline bool is_lbr_msr(struct kvm_vcpu *vcpu, u32 index) +{ + struct x86_perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack; + int nr = stack->nr; + + return !!(index == MSR_LBR_SELECT || + index == stack->tos || + (index >= stack->from && index < stack->from + nr) || + (index >= stack->to && index < 
stack->to + nr) || + (index >= stack->info && index < stack->info + nr)); +} + static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) { struct kvm_pmu *pmu = vcpu_to_pmu(vcpu); @@ -152,9 +166,12 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr) case MSR_CORE_PERF_GLOBAL_CTRL: case MSR_CORE_PERF_GLOBAL_OVF_CTRL: case MSR_IA32_PERF_CAPABILITIES: + case MSR_IA32_DEBUGCTLMSR: ret = pmu->version > 1; break; default: + if (is_lbr_msr(vcpu, msr)
[PATCH v7 09/12] perf/x86: save/restore LBR_SELECT on vCPU switching
The vCPU lbr event relies on the host to save/restore all the lbr related MSRs. So add the LBR_SELECT save/restore to the related functions for the vCPU case.

Signed-off-by: Wei Wang
Cc: Peter Zijlstra
Cc: Andi Kleen
---
 arch/x86/events/intel/lbr.c  | 7 +++
 arch/x86/events/perf_event.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 118764b..4861a9d 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -383,6 +383,9 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
 	wrmsrl(x86_pmu.lbr_tos, tos);
 	task_ctx->lbr_stack_state = LBR_NONE;
+
+	if (cpuc->vcpu_lbr)
+		wrmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
 }

 static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
@@ -409,6 +412,10 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
 		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
 			rdmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
 	}
+
+	if (cpuc->vcpu_lbr)
+		rdmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
+
 	task_ctx->valid_lbrs = i;
 	task_ctx->tos = tos;
 	task_ctx->lbr_stack_state = LBR_VALID;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 86605d1..e37ff82 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -721,6 +721,7 @@ struct x86_perf_task_context {
 	u64 lbr_from[MAX_LBR_ENTRIES];
 	u64 lbr_to[MAX_LBR_ENTRIES];
 	u64 lbr_info[MAX_LBR_ENTRIES];
+	u64 lbr_sel;
 	int tos;
 	int valid_lbrs;
 	int lbr_callstack_users;
--
2.7.4
[PATCH v7 07/12] perf/x86: no counter allocation support
In some cases, an event may be created without needing a counter allocation. For example, an lbr event may be created by the host only to help save/restore the lbr stack on the vCPU context switching. This patch adds a new interface to allow users to create a perf event without the need of counter assignment. Signed-off-by: Wei Wang Cc: Andi Kleen Cc: Peter Zijlstra --- arch/x86/events/core.c | 12 include/linux/perf_event.h | 13 + kernel/events/core.c | 37 + 3 files changed, 50 insertions(+), 12 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index f315425..eebbd65 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -410,6 +410,9 @@ int x86_setup_perfctr(struct perf_event *event) struct hw_perf_event *hwc = &event->hw; u64 config; + if (is_no_counter_event(event)) + return 0; + if (!is_sampling_event(event)) { hwc->sample_period = x86_pmu.max_period; hwc->last_period = hwc->sample_period; @@ -1248,6 +1251,12 @@ static int x86_pmu_add(struct perf_event *event, int flags) hwc = &event->hw; n0 = cpuc->n_events; + + if (is_no_counter_event(event)) { + n = n0; + goto done_collect; + } + ret = n = collect_events(cpuc, event, false); if (ret < 0) goto out; @@ -1422,6 +1431,9 @@ static void x86_pmu_del(struct perf_event *event, int flags) if (cpuc->txn_flags & PERF_PMU_TXN_ADD) goto do_del; + if (is_no_counter_event(event)) + goto do_del; + /* * Not a TXN, therefore cleanup properly. 
*/ diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 0ab99c7..19e6593 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -528,6 +528,7 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *, */ #define PERF_EV_CAP_SOFTWARE BIT(0) #define PERF_EV_CAP_READ_ACTIVE_PKGBIT(1) +#define PERF_EV_CAP_NO_COUNTER BIT(2) #define SWEVENT_HLIST_BITS 8 #define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS) @@ -895,6 +896,13 @@ extern int perf_event_refresh(struct perf_event *event, int refresh); extern void perf_event_update_userpage(struct perf_event *event); extern int perf_event_release_kernel(struct perf_event *event); extern struct perf_event * +perf_event_create(struct perf_event_attr *attr, + int cpu, + struct task_struct *task, + perf_overflow_handler_t overflow_handler, + void *context, + bool counter_assignment); +extern struct perf_event * perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu, struct task_struct *task, @@ -1032,6 +1040,11 @@ static inline bool is_sampling_event(struct perf_event *event) return event->attr.sample_period != 0; } +static inline bool is_no_counter_event(struct perf_event *event) +{ + return !!(event->event_caps & PERF_EV_CAP_NO_COUNTER); +} + /* * Return 1 for a software event, 0 for a hardware event */ diff --git a/kernel/events/core.c b/kernel/events/core.c index abbd4b3..70884df 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -11162,18 +11162,10 @@ SYSCALL_DEFINE5(perf_event_open, return err; } -/** - * perf_event_create_kernel_counter - * - * @attr: attributes of the counter to create - * @cpu: cpu in which the counter is bound - * @task: task to profile (NULL for percpu) - */ -struct perf_event * -perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu, -struct task_struct *task, -perf_overflow_handler_t overflow_handler, -void *context) +struct perf_event *perf_event_create(struct perf_event_attr *attr, int cpu, +struct task_struct 
*task, +perf_overflow_handler_t overflow_handler, +void *context, bool need_counter) { struct perf_event_context *ctx; struct perf_event *event; @@ -11193,6 +11185,9 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu, /* Mark owner so we could distinguish it from user events. */ event->owner = TASK_TOMBSTONE; + if (!need_counter) + event->event_caps |= PERF_EV_CAP_NO_COUNTER; + ctx = find_get_context(event->pmu, task, event); if (IS_ERR(ctx)) { err = PTR_ERR(ctx); @@ -11241,6 +11236,24 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu, err:
[PATCH v7 01/12] perf/x86: fix the variable type of the LBR MSRs
The MSR variable type can be "unsigned int", which uses less memory than the longer unsigned long. The lbr nr won't be a negative number, so make it "unsigned int" as well.

Suggested-by: Peter Zijlstra
Signed-off-by: Wei Wang
Cc: Peter Zijlstra
Cc: Andi Kleen
---
 arch/x86/events/perf_event.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index a6ac2f4..186c1c7 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -682,8 +682,8 @@ struct x86_pmu {
 	/*
 	 * Intel LBR
 	 */
-	unsigned long	lbr_tos, lbr_from, lbr_to;	/* MSR base regs */
-	int		lbr_nr;				/* hardware stack size */
+	unsigned int	lbr_tos, lbr_from, lbr_to,
+			lbr_nr;				/* lbr stack and size */
 	u64		lbr_sel_mask;			/* LBR_SELECT valid bits */
 	const int	*lbr_sel_map;			/* lbr_select mappings */
 	bool		lbr_double_abort;		/* duplicated lbr aborts */
--
2.7.4
[PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack
From: Like Xu

This patch adds support to enable/disable the host side save/restore for the guest lbr stack on vCPU switching. To enable that, the host creates a perf event for the vCPU, and the event attributes are set to the user callstack mode lbr so that all the conditions are met in the host perf subsystem to save the lbr stack on task switching.

The host side lbr perf event is created only for the purpose of saving and restoring the lbr stack. There is no need to enable the lbr functionality for this perf event, because the feature is essentially used in the vCPU. So perf_event_create is invoked with need_counter=false so that no counter is assigned for the perf event.

The vcpu_lbr field is added to cpuc, to indicate if the lbr perf event is used by the vCPU only for context switching. When the perf subsystem handles this event (e.g. lbr enable or read lbr stack on PMI) and finds the vcpu_lbr field non-zero, it simply returns.

Signed-off-by: Like Xu
Signed-off-by: Wei Wang
Cc: Paolo Bonzini
Cc: Andi Kleen
Cc: Peter Zijlstra
---
 arch/x86/events/intel/lbr.c     | 13 +++--
 arch/x86/events/perf_event.h    |  1 +
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/pmu.h              |  3 ++
 arch/x86/kvm/vmx/pmu_intel.c    | 61 +
 5 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 784642a..118764b 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -462,6 +462,9 @@ void intel_pmu_lbr_add(struct perf_event *event)
 	if (!x86_pmu.lbr_nr)
 		return;

+	if (event->attr.exclude_guest && is_no_counter_event(event))
+		cpuc->vcpu_lbr = 1;
+
 	cpuc->br_sel = event->hw.branch_reg.reg;

 	if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -509,6 +512,9 @@ void intel_pmu_lbr_del(struct perf_event *event)
 		task_ctx->lbr_callstack_users--;
 	}

+	if (event->attr.exclude_guest && is_no_counter_event(event))
+		cpuc->vcpu_lbr = 0;
+
 	if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
cpuc->lbr_pebs_users--; cpuc->lbr_users--; @@ -521,7 +527,7 @@ void intel_pmu_lbr_enable_all(bool pmi) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); - if (cpuc->lbr_users) + if (cpuc->lbr_users && !cpuc->vcpu_lbr) __intel_pmu_lbr_enable(pmi); } @@ -529,7 +535,7 @@ void intel_pmu_lbr_disable_all(void) { struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events); - if (cpuc->lbr_users) + if (cpuc->lbr_users && !cpuc->vcpu_lbr) __intel_pmu_lbr_disable(); } @@ -669,7 +675,8 @@ void intel_pmu_lbr_read(void) * This could be smarter and actually check the event, * but this simple approach seems to work for now. */ - if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users) + if (!cpuc->lbr_users || cpuc->vcpu_lbr || + cpuc->lbr_users == cpuc->lbr_pebs_users) return; if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32) diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h index 186c1c7..86605d1 100644 --- a/arch/x86/events/perf_event.h +++ b/arch/x86/events/perf_event.h @@ -238,6 +238,7 @@ struct cpu_hw_events { /* * Intel LBR bits */ + u8 vcpu_lbr; int lbr_users; int lbr_pebs_users; struct perf_branch_stacklbr_stack; diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 8d80925..79e9c92 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -474,6 +474,7 @@ struct kvm_pmu { struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED]; struct irq_work irq_work; u64 reprogram_pmi; + struct perf_event *vcpu_lbr_event; }; struct kvm_pmu_ops; diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h index 7926b65..384a0b7 100644 --- a/arch/x86/kvm/pmu.h +++ b/arch/x86/kvm/pmu.h @@ -123,6 +123,9 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu); bool is_vmware_backdoor_pmc(u32 pmc_idx); +extern int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu); +extern void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu); + extern struct kvm_pmu_ops intel_pmu_ops; extern struct kvm_pmu_ops 
amd_pmu_ops; #endif /* __KVM_X86_PMU_H */ diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 09ae6ff..24a544e 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -507,6 +507,67 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
[PATCH v7 03/12] KVM/x86: KVM_CAP_X86_GUEST_LBR
Introduce KVM_CAP_X86_GUEST_LBR to allow per-VM enabling of the guest lbr feature. Signed-off-by: Wei Wang Cc: Paolo Bonzini Cc: Andi Kleen Cc: Peter Zijlstra --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/x86.c | 14 ++ include/uapi/linux/kvm.h| 1 + 3 files changed, 17 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 26d1eb8..8d80925 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -871,6 +871,7 @@ struct kvm_arch { atomic_t vapics_in_nmi_mode; struct mutex apic_map_lock; struct kvm_apic_map *apic_map; + struct x86_perf_lbr_stack lbr_stack; bool apic_access_page_done; @@ -879,6 +880,7 @@ struct kvm_arch { bool mwait_in_guest; bool hlt_in_guest; bool pause_in_guest; + bool lbr_in_guest; unsigned long irq_sources_bitmap; s64 kvmclock_offset; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9857992..b35a118 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3086,6 +3086,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_GET_MSR_FEATURES: case KVM_CAP_MSR_PLATFORM_INFO: case KVM_CAP_EXCEPTION_PAYLOAD: + case KVM_CAP_X86_GUEST_LBR: r = 1; break; case KVM_CAP_SYNC_REGS: @@ -4622,6 +4623,19 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, kvm->arch.exception_payload_enabled = cap->args[0]; r = 0; break; + case KVM_CAP_X86_GUEST_LBR: + r = -EINVAL; + if (cap->args[0] && + x86_perf_get_lbr_stack(&kvm->arch.lbr_stack)) + break; + + if (copy_to_user((void __user *)cap->args[1], +&kvm->arch.lbr_stack, +sizeof(struct x86_perf_lbr_stack))) + break; + kvm->arch.lbr_in_guest = cap->args[0]; + r = 0; + break; default: r = -EINVAL; break; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 2fe12b4..5391cbc 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -993,6 +993,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_ARM_SVE 170 #define KVM_CAP_ARM_PTRAUTH_ADDRESS 171 #define KVM_CAP_ARM_PTRAUTH_GENERIC 
172 +#define KVM_CAP_X86_GUEST_LBR 173 #ifdef KVM_CAP_IRQ_ROUTING -- 2.7.4
[PATCH v7 00/12] Guest LBR Enabling
Last Branch Recording (LBR) is a performance monitor unit (PMU) feature on Intel CPUs that captures branch related info. This patch series enables this feature for KVM guests.

Here is a summary of the fundamental methods that we use:
1) the LBR feature is enabled per guest via QEMU setting of KVM_CAP_X86_GUEST_LBR;
2) the LBR stack is passed through to the guest for direct accesses after the guest's first access to any of the lbr related MSRs;
3) the host will help save/restore the LBR stack when the vCPU is scheduled out/in.

ChangeLog:
pmu.c:
- patch 4: remove guest_cpuid_is_intel, which can be avoided by directly checking pmu_ops->lbr_enable.
pmu_intel.c:
- patch 10: include MSR_LBR_SELECT into the is_lbr_msr check, and pass this msr through to the guest, instead of trapping accesses.
- patch 12: fix the condition check of setting LBRS_FROZEN and COUNTERS_FROZEN; update the global_ovf_ctrl_mask for LBRS_FROZEN and COUNTERS_FROZEN.

previous: https://lkml.org/lkml/2019/6/6/133

Like Xu (1):
  KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack

Wei Wang (11):
  perf/x86: fix the variable type of the LBR MSRs
  perf/x86: add a function to get the lbr stack
  KVM/x86: KVM_CAP_X86_GUEST_LBR
  KVM/x86: intel_pmu_lbr_enable
  KVM/x86/vPMU: tweak kvm_pmu_get_msr
  KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest
  perf/x86: no counter allocation support
  perf/x86: save/restore LBR_SELECT on vCPU switching
  KVM/x86/lbr: lazy save the guest lbr stack
  KVM/x86: remove the common handling of the debugctl msr
  KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN

 arch/x86/events/core.c            |  12 ++
 arch/x86/events/intel/lbr.c       |  43 -
 arch/x86/events/perf_event.h      |   6 +-
 arch/x86/include/asm/kvm_host.h   |   5 +
 arch/x86/include/asm/perf_event.h |  16 ++
 arch/x86/kvm/cpuid.c              |   2 +-
 arch/x86/kvm/pmu.c                |  29 ++-
 arch/x86/kvm/pmu.h                |  12 +-
 arch/x86/kvm/pmu_amd.c            |   7 +-
 arch/x86/kvm/vmx/pmu_intel.c      | 393 +-
 arch/x86/kvm/vmx/vmx.c            |   4 +-
 arch/x86/kvm/vmx/vmx.h            |   2 +
 arch/x86/kvm/x86.c                |  32 ++--
 include/linux/perf_event.h        |  13 ++
 include/uapi/linux/kvm.h          |   1 +
 kernel/events/core.c              |  37 ++--
 16 files changed, 562 insertions(+), 52 deletions(-)

--
2.7.4