Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-14 Thread Wei Wang
On Tue, Apr 6, 2021 at 6:39 PM Pavan Kondeti  wrote:
>
> On Tue, Apr 06, 2021 at 12:15:24PM -0400, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Apr 06, 2021 at 08:57:15PM +0530, Pavan Kondeti wrote:
> > > Yeah. The workqueue attrs come in handy to reduce the nice/prio of a
> > > background workqueue if we identify that it is cpu intensive. However,
> > > this needs case-by-case analysis, tweaking, etc. If there is no other
> > > alternative, we might end up chasing the background workers and reducing
> > > their nice value.
> >
> > There shouldn't be that many workqueues that consume a lot of cpu cycles.
> > The usual culprit is kswapd, IO related stuff (writeback, encryption), so it
> > shouldn't be a long list and we want them identified anyway.
> >
> Sure. I have not done a complete study on which workers in our system can
> compete with important tasks in other cgroups. We will have to do that to
> adjust the workqueue priority so that the impact can be minimized.
>

kswapd0 is actually migratable to a subgroup.

But to echo what Pavan said, the real world is not ideal: the
problematic drivers are inactive when their use case is not activated
in Android (e.g. connectivity, camera, etc.), so it is sometimes tricky
to track all of them.

> > > The problem at our hand (which you might be knowing already) is that,
> > > let's say we have 2 cgroups in our setup and we want to prioritize UX
> > > over background. IOW, reduce the cpu.shares of the background cgroup.
> > > This helps in prioritizing Task-A and Task-B over Task-X and Task-Y.
> > > However, each individual kworker can potentially have CPU share equal
> > > to the entire UX cgroup.
> > >
> > > -kworker/0:0
> > > -kworker/1:0
> > > - UX
> > > Task-A
> > > Task-B
> > > - background
> > > Task-X
> > > Task-Y
> > > Hence we want to move all kernel threads to another cgroup so that
> > > this cgroup will have CPU share equal to UX.
> > >
> > > The patch presented here allows us to create the above setup. Any other
> > > alternative approaches to achieve the same without impacting any future
> > > designs/requirements?
> >
> > Not quite the same but we already have
> > /sys/devices/virtual/workqueue/cpumask which affects all unbound workqueues,
> > so maybe a global default priority knob would help here?
> >
>
> yeah, not exactly what we are looking for. It gives us the ability to
> restrict the priority of all unbound workqueues at the expense of not
> being able to prioritize one workqueue over another.
>

Same here. Android used to have its cgroup setup like this, where the
BG group can potentially be starved for too long (and sometimes
priority inversion is unavoidable, of course; that's the reason why I
removed the BG cgroup in Android
(https://android-review.googlesource.com/q/topic:remove_bg_cgroup)).

-Task-A
-Task-B
-kworker/0:0
- background
Task-X
Task-Y

So having things left in the root will pose the same risk.

> Thanks,
> Pavan
> --
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
> Foundation Collaborative Project.
>
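For reference, the global unbound-workqueue cpumask knob Tejun mentions already exists in sysfs. A minimal sketch of using it (run as root; the mask value is only an example, and note there is currently no analogous priority knob, which is exactly what this thread is discussing):

```shell
# Restrict all unbound workqueues to CPUs 0-3 (example mask).
echo 0f > /sys/devices/virtual/workqueue/cpumask

# Read back the effective mask.
cat /sys/devices/virtual/workqueue/cpumask
```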


[PATCH v5 1/5] dt-bindings: aspeed-lpc: Remove LPC partitioning

2021-01-14 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.
This patch fixes the documentation by removing the description on LPC
partitions. The register offsets illustrated in the DTS node examples
are also fixed to adapt to the LPC DTS change.

Signed-off-by: Chia-Wei, Wang 
---
 .../devicetree/bindings/mfd/aspeed-lpc.txt| 100 +-
 1 file changed, 25 insertions(+), 75 deletions(-)

diff --git a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt
index d0a38ba8b9ce..936aa108eab4 100644
--- a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt
+++ b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt
@@ -9,13 +9,7 @@ primary use case of the Aspeed LPC controller is as a slave on the bus
 conditions it can also take the role of bus master.
 
 The LPC controller is represented as a multi-function device to account for the
-mix of functionality it provides. The principle split is between the register
-layout at the start of the I/O space which is, to quote the Aspeed datasheet,
-"basically compatible with the [LPC registers from the] popular BMC controller
-H8S/2168[1]", and everything else, where everything else is an eclectic
-collection of functions with a esoteric register layout. "Everything else",
-here labeled the "host" portion of the controller, includes, but is not limited
-to:
+mix of functionality, which includes, but is not limited to:
 
 * An IPMI Block Transfer[2] Controller
 
@@ -44,80 +38,36 @@ Required properties
 ===
 
 - compatible:  One of:
-   "aspeed,ast2400-lpc", "simple-mfd"
-   "aspeed,ast2500-lpc", "simple-mfd"
-   "aspeed,ast2600-lpc", "simple-mfd"
+   "aspeed,ast2400-lpc-v2", "simple-mfd", "syscon"
+   "aspeed,ast2500-lpc-v2", "simple-mfd", "syscon"
+   "aspeed,ast2600-lpc-v2", "simple-mfd", "syscon"
 
 - reg: contains the physical address and length values of the Aspeed
 LPC memory region.
 
 - #address-cells: <1>
 - #size-cells: <1>
-- ranges:  Maps 0 to the physical address and length of the LPC memory
-region
-
-Required LPC Child nodes
-
-
-BMC Node
-
-
-- compatible:  One of:
-   "aspeed,ast2400-lpc-bmc"
-   "aspeed,ast2500-lpc-bmc"
-   "aspeed,ast2600-lpc-bmc"
-
-- reg: contains the physical address and length values of the
-H8S/2168-compatible LPC controller memory region
-
-Host Node
--
-
-- compatible:   One of:
-   "aspeed,ast2400-lpc-host", "simple-mfd", "syscon"
-   "aspeed,ast2500-lpc-host", "simple-mfd", "syscon"
-   "aspeed,ast2600-lpc-host", "simple-mfd", "syscon"
-
-- reg: contains the address and length values of the host-related
-register space for the Aspeed LPC controller
-
-- #address-cells: <1>
-- #size-cells: <1>
-- ranges:  Maps 0 to the address and length of the host-related LPC memory
+- ranges:  Maps 0 to the physical address and length of the LPC memory
 region
 
 Example:
 
 lpc: lpc@1e789000 {
-   compatible = "aspeed,ast2500-lpc", "simple-mfd";
+   compatible = "aspeed,ast2500-lpc-v2", "simple-mfd", "syscon";
reg = <0x1e789000 0x1000>;
 
#address-cells = <1>;
#size-cells = <1>;
ranges = <0x0 0x1e789000 0x1000>;
 
-   lpc_bmc: lpc-bmc@0 {
-   compatible = "aspeed,ast2500-lpc-bmc";
+   lpc_snoop: lpc-snoop@0 {
+   compatible = "aspeed,ast2600-lpc-snoop";
reg = <0x0 0x80>;
-   };
-
-   lpc_host: lpc-host@80 {
-   compatible = "aspeed,ast2500-lpc-host", "simple-mfd", "syscon";
-   reg = <0x80 0x1e0>;
-   reg-io-width = <4>;
-
-   #address-cells = <1>;
-   #size-cells = <1>;
-   ranges = <0x0 0x80 0x1e0>;
+   interrupts = ;
+   snoop-ports = <0x80>;
};
 };
 
-BMC Node Children
-==
-
-
-Host Node Children
-==
 
 LPC Host Interface Controller
 ---
@@ -149,14 +99,12 @@ Optional properties:
 
 Example:
 
-lpc-host@80 {
-   lpc_ctrl: lpc-ctrl@0 {
-   compatible = "aspeed,ast2500-lpc-ctrl";
-   reg = <0x0 0x80>;
-   clocks = <&syscon ASPEED_CLK_GATE_LCLK>;
-   me

[PATCH v5 4/5] pinctrl: aspeed-g5: Adapt to new LPC device tree layout

2021-01-14 Thread Chia-Wei, Wang
Add check against LPC device v2 compatible string to
ensure that the fixed device tree layout is adopted.
The LPC register offsets are also fixed accordingly.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
index 0cab4c2576e2..996ebcba4d38 100644
--- a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
+++ b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
@@ -60,7 +60,7 @@
 #define COND2  { ASPEED_IP_SCU, SCU94, GENMASK(1, 0), 0, 0 }
 
 /* LHCR0 is offset from the end of the H8S/2168-compatible registers */
-#define LHCR0  0x20
+#define LHCR0  0xa0
 #define GFX064 0x64
 
 #define B14 0
@@ -2648,14 +2648,19 @@ static struct regmap *aspeed_g5_acquire_regmap(struct aspeed_pinmux_data *ctx,
}
 
if (ip == ASPEED_IP_LPC) {
-   struct device_node *node;
+   struct device_node *np;
struct regmap *map;
 
-   node = of_parse_phandle(ctx->dev->of_node,
+   np = of_parse_phandle(ctx->dev->of_node,
"aspeed,external-nodes", 1);
-   if (node) {
-   map = syscon_node_to_regmap(node->parent);
-   of_node_put(node);
+   if (np) {
+   if (!of_device_is_compatible(np->parent, "aspeed,ast2400-lpc-v2") &&
+   !of_device_is_compatible(np->parent, "aspeed,ast2500-lpc-v2") &&
+   !of_device_is_compatible(np->parent, "aspeed,ast2600-lpc-v2"))
+   return ERR_PTR(-ENODEV);
+
+   map = syscon_node_to_regmap(np->parent);
+   of_node_put(np);
if (IS_ERR(map))
return map;
} else
-- 
2.17.1



[PATCH v5 5/5] soc: aspeed: Adapt to new LPC device tree layout

2021-01-14 Thread Chia-Wei, Wang
Add check against LPC device v2 compatible string to
ensure that the fixed device tree layout is adopted.
The LPC register offsets are also fixed accordingly.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/soc/aspeed/aspeed-lpc-ctrl.c  | 20 ++--
 drivers/soc/aspeed/aspeed-lpc-snoop.c | 23 +++
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/drivers/soc/aspeed/aspeed-lpc-ctrl.c b/drivers/soc/aspeed/aspeed-lpc-ctrl.c
index 439bcd6b8c4a..c557ffd0992c 100644
--- a/drivers/soc/aspeed/aspeed-lpc-ctrl.c
+++ b/drivers/soc/aspeed/aspeed-lpc-ctrl.c
@@ -18,15 +18,15 @@
 
 #define DEVICE_NAME"aspeed-lpc-ctrl"
 
-#define HICR5 0x0
+#define HICR5 0x80
 #define HICR5_ENL2HBIT(8)
 #define HICR5_ENFWHBIT(10)
 
-#define HICR6 0x4
+#define HICR6 0x84
 #define SW_FWH2AHB BIT(17)
 
-#define HICR7 0x8
-#define HICR8 0xc
+#define HICR7 0x88
+#define HICR8 0x8c
 
 struct aspeed_lpc_ctrl {
struct miscdevice   miscdev;
@@ -215,6 +215,7 @@ static int aspeed_lpc_ctrl_probe(struct platform_device *pdev)
struct device_node *node;
struct resource resm;
struct device *dev;
+   struct device_node *np;
int rc;
 
	dev = &pdev->dev;
@@ -270,8 +271,15 @@ static int aspeed_lpc_ctrl_probe(struct platform_device *pdev)
}
}
 
-   lpc_ctrl->regmap = syscon_node_to_regmap(
-   pdev->dev.parent->of_node);
+   np = pdev->dev.parent->of_node;
+   if (!of_device_is_compatible(np, "aspeed,ast2400-lpc-v2") &&
+   !of_device_is_compatible(np, "aspeed,ast2500-lpc-v2") &&
+   !of_device_is_compatible(np, "aspeed,ast2600-lpc-v2")) {
+   dev_err(dev, "unsupported LPC device binding\n");
+   return -ENODEV;
+   }
+
+   lpc_ctrl->regmap = syscon_node_to_regmap(np);
if (IS_ERR(lpc_ctrl->regmap)) {
dev_err(dev, "Couldn't get regmap\n");
return -ENODEV;
diff --git a/drivers/soc/aspeed/aspeed-lpc-snoop.c b/drivers/soc/aspeed/aspeed-lpc-snoop.c
index 682ba0eb4eba..ab0f0a54fea6 100644
--- a/drivers/soc/aspeed/aspeed-lpc-snoop.c
+++ b/drivers/soc/aspeed/aspeed-lpc-snoop.c
@@ -28,26 +28,25 @@
 #define NUM_SNOOP_CHANNELS 2
 #define SNOOP_FIFO_SIZE 2048
 
-#define HICR5  0x0
+#define HICR5  0x80
 #define HICR5_EN_SNP0W BIT(0)
 #define HICR5_ENINT_SNP0W  BIT(1)
 #define HICR5_EN_SNP1W BIT(2)
 #define HICR5_ENINT_SNP1W  BIT(3)
-
-#define HICR6  0x4
+#define HICR6  0x84
 #define HICR6_STR_SNP0WBIT(0)
 #define HICR6_STR_SNP1WBIT(1)
-#define SNPWADR0x10
+#define SNPWADR0x90
 #define SNPWADR_CH0_MASK   GENMASK(15, 0)
 #define SNPWADR_CH0_SHIFT  0
 #define SNPWADR_CH1_MASK   GENMASK(31, 16)
 #define SNPWADR_CH1_SHIFT  16
-#define SNPWDR 0x14
+#define SNPWDR 0x94
 #define SNPWDR_CH0_MASKGENMASK(7, 0)
 #define SNPWDR_CH0_SHIFT   0
 #define SNPWDR_CH1_MASKGENMASK(15, 8)
 #define SNPWDR_CH1_SHIFT   8
-#define HICRB  0x80
+#define HICRB  0x100
 #define HICRB_ENSNP0D  BIT(14)
 #define HICRB_ENSNP1D  BIT(15)
 
@@ -258,6 +257,7 @@ static int aspeed_lpc_snoop_probe(struct platform_device *pdev)
 {
struct aspeed_lpc_snoop *lpc_snoop;
struct device *dev;
+   struct device_node *np;
u32 port;
int rc;
 
@@ -267,8 +267,15 @@ static int aspeed_lpc_snoop_probe(struct platform_device *pdev)
if (!lpc_snoop)
return -ENOMEM;
 
-   lpc_snoop->regmap = syscon_node_to_regmap(
-   pdev->dev.parent->of_node);
+   np = pdev->dev.parent->of_node;
+   if (!of_device_is_compatible(np, "aspeed,ast2400-lpc-v2") &&
+   !of_device_is_compatible(np, "aspeed,ast2500-lpc-v2") &&
+   !of_device_is_compatible(np, "aspeed,ast2600-lpc-v2")) {
+   dev_err(dev, "unsupported LPC device binding\n");
+   return -ENODEV;
+   }
+
+   lpc_snoop->regmap = syscon_node_to_regmap(np);
if (IS_ERR(lpc_snoop->regmap)) {
dev_err(dev, "Couldn't get regmap\n");
return -ENODEV;
-- 
2.17.1



[PATCH v5 3/5] ipmi: kcs: aspeed: Adapt to new LPC DTS layout

2021-01-14 Thread Chia-Wei, Wang
Add check against LPC device v2 compatible string to
ensure that the fixed device tree layout is adopted.
The LPC register offsets are also fixed accordingly.

Signed-off-by: Chia-Wei, Wang 
Acked-by: Haiyue Wang 
---
 drivers/char/ipmi/kcs_bmc_aspeed.c | 27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c b/drivers/char/ipmi/kcs_bmc_aspeed.c
index a140203c079b..eefe362f65f0 100644
--- a/drivers/char/ipmi/kcs_bmc_aspeed.c
+++ b/drivers/char/ipmi/kcs_bmc_aspeed.c
@@ -27,7 +27,6 @@
 
 #define KCS_CHANNEL_MAX 4
 
-/* mapped to lpc-bmc@0 IO space */
 #define LPC_HICR00x000
 #define LPC_HICR0_LPC3E  BIT(7)
 #define LPC_HICR0_LPC2E  BIT(6)
@@ -52,15 +51,13 @@
 #define LPC_STR1 0x03C
 #define LPC_STR2 0x040
 #define LPC_STR3 0x044
-
-/* mapped to lpc-host@80 IO space */
-#define LPC_HICRB0x080
+#define LPC_HICRB0x100
 #define LPC_HICRB_IBFIF4 BIT(1)
 #define LPC_HICRB_LPC4E  BIT(0)
-#define LPC_LADR40x090
-#define LPC_IDR4 0x094
-#define LPC_ODR4 0x098
-#define LPC_STR4 0x09C
+#define LPC_LADR40x110
+#define LPC_IDR4 0x114
+#define LPC_ODR4 0x118
+#define LPC_STR4 0x11C
 
 struct aspeed_kcs_bmc {
struct regmap *map;
@@ -348,12 +345,20 @@ static int aspeed_kcs_probe(struct platform_device *pdev)
struct device_node *np;
int rc;
 
-   np = pdev->dev.of_node;
+   np = dev->of_node->parent;
+   if (!of_device_is_compatible(np, "aspeed,ast2400-lpc-v2") &&
+   !of_device_is_compatible(np, "aspeed,ast2500-lpc-v2") &&
+   !of_device_is_compatible(np, "aspeed,ast2600-lpc-v2")) {
+   dev_err(dev, "unsupported LPC device binding\n");
+   return -ENODEV;
+   }
+
+   np = dev->of_node;
if (of_device_is_compatible(np, "aspeed,ast2400-kcs-bmc") ||
-   of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc"))
+   of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc"))
kcs_bmc = aspeed_kcs_probe_of_v1(pdev);
else if (of_device_is_compatible(np, "aspeed,ast2400-kcs-bmc-v2") ||
-   of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc-v2"))
+of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc-v2"))
kcs_bmc = aspeed_kcs_probe_of_v2(pdev);
else
return -EINVAL;
-- 
2.17.1



[PATCH v5 0/5] Remove LPC register partitioning

2021-01-14 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.
The incorrect partitioning can impose unnecessary range restrictions
on register access through the syscon regmap interface.

For instance, HICRB contains the I/O port address configuration
of KCS channel 1/2. However, the KCS#1/#2 drivers cannot access
HICRB as it is located at the other LPC partition.

In addition, to remain backward compatible, newly added HW control
bits could be located at any reserved bits over the LPC addressing
space.

This patch series therefore removes the LPC partitioning for better
driver development and maintenance. This requires changes to both the
device tree and the driver implementation. To ensure both sides are
updated in sync, a v2 binding check is added.

Changes since v4:
- Add child node example in dt-bindings.

Changes since v3:
- Revise binding check as suggested by Haiyue Wang.

Changes since v2:
- Add v2 binding check to ensure the synchronization between the
  device tree change and the driver register offset fix.

Changes since v1:
- Add the fix to the aspeed-lpc binding documentation.

Chia-Wei, Wang (5):
  dt-bindings: aspeed-lpc: Remove LPC partitioning
  ARM: dts: Remove LPC BMC and Host partitions
  ipmi: kcs: aspeed: Adapt to new LPC DTS layout
  pinctrl: aspeed-g5: Adapt to new LPC device tree layout
  soc: aspeed: Adapt to new LPC device tree layout

 .../devicetree/bindings/mfd/aspeed-lpc.txt| 100 -
 arch/arm/boot/dts/aspeed-g4.dtsi  |  74 --
 arch/arm/boot/dts/aspeed-g5.dtsi  | 135 --
 arch/arm/boot/dts/aspeed-g6.dtsi  | 135 --
 drivers/char/ipmi/kcs_bmc_aspeed.c|  27 ++--
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c|  17 ++-
 drivers/soc/aspeed/aspeed-lpc-ctrl.c  |  20 ++-
 drivers/soc/aspeed/aspeed-lpc-snoop.c |  23 +--
 8 files changed, 229 insertions(+), 302 deletions(-)

-- 
2.17.1



[PATCH v5 2/5] ARM: dts: Remove LPC BMC and Host partitions

2021-01-14 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.

A concrete instance is that HICRB[5:4] are for the I/O port address
configuration of KCS channel 1/2. However, the KCS driver cannot access
HICRB for channel 1/2 initialization via the syscon regmap interface due
to the partition boundary (i.e. offset 80h).

In addition, for backward compatibility of the HW design, a newly added
HW control bit could be located at any reserved bit over the LPC
addressing space. This patch therefore removes the lpc-bmc and lpc-host
child nodes and thus the LPC partitioning.

Note that this change requires the synchronization between device tree
change and the driver change. To prevent the misuse of old devicetrees
with new drivers, or vice versa, the v2 compatible strings are adopted
for the LPC device as listed:

"aspeed,ast2400-lpc-v2"
"aspeed,ast2500-lpc-v2"
"aspeed,ast2600-lpc-v2"

Signed-off-by: Chia-Wei, Wang 
---
 arch/arm/boot/dts/aspeed-g4.dtsi |  74 +++--
 arch/arm/boot/dts/aspeed-g5.dtsi | 135 ++-
 arch/arm/boot/dts/aspeed-g6.dtsi | 135 ++-
 3 files changed, 148 insertions(+), 196 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi b/arch/arm/boot/dts/aspeed-g4.dtsi
index b3dafbc8caca..ee22bc036440 100644
--- a/arch/arm/boot/dts/aspeed-g4.dtsi
+++ b/arch/arm/boot/dts/aspeed-g4.dtsi
@@ -343,58 +343,44 @@
};
 
lpc: lpc@1e789000 {
-   compatible = "aspeed,ast2400-lpc", "simple-mfd";
+   compatible = "aspeed,ast2400-lpc-v2", "simple-mfd", "syscon";
reg = <0x1e789000 0x1000>;
+   reg-io-width = <4>;
 
#address-cells = <1>;
#size-cells = <1>;
ranges = <0x0 0x1e789000 0x1000>;
 
-   lpc_bmc: lpc-bmc@0 {
-   compatible = "aspeed,ast2400-lpc-bmc";
-   reg = <0x0 0x80>;
+   lpc_ctrl: lpc-ctrl@80 {
+   compatible = "aspeed,ast2400-lpc-ctrl";
+   reg = <0x80 0x10>;
+   clocks = <&syscon ASPEED_CLK_GATE_LCLK>;
+   status = "disabled";
};
 
-   lpc_host: lpc-host@80 {
-   compatible = "aspeed,ast2400-lpc-host", "simple-mfd", "syscon";
-   reg = <0x80 0x1e0>;
-   reg-io-width = <4>;
-
-   #address-cells = <1>;
-   #size-cells = <1>;
-   ranges = <0x0 0x80 0x1e0>;
-
-   lpc_ctrl: lpc-ctrl@0 {
-   compatible = "aspeed,ast2400-lpc-ctrl";
-   reg = <0x0 0x10>;
-   clocks = <&syscon ASPEED_CLK_GATE_LCLK>;
-   status = "disabled";
-   };
-
-   lpc_snoop: lpc-snoop@10 {
-   compatible = "aspeed,ast2400-lpc-snoop";
-   reg = <0x10 0x8>;
-   interrupts = <8>;
-   status = "disabled";
-   };
-
-   lhc: lhc@20 {
-   compatible = "aspeed,ast2400-lhc";
-   reg = <0x20 0x24 0x48 0x8>;
-   };
-
-   lpc_reset: reset-controller@18 {
-   compatible = "aspeed,ast2400-lpc-reset";
-   reg = <0x18 0x4>;
-   #reset-cells = <1>;
-   };
-
-   ibt: ibt@c0  {
-   compatible = "aspeed,ast2400-ibt-bmc";
-   reg = <0xc0 0x18>;
-   

[PATCH 3/6] clk: ast2600: Add eSPI reset bit

2021-01-05 Thread Chia-Wei, Wang
Add bit field definition for the eSPI reset control.

Signed-off-by: Chia-Wei, Wang 
---
 include/dt-bindings/clock/ast2600-clock.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/dt-bindings/clock/ast2600-clock.h b/include/dt-bindings/clock/ast2600-clock.h
index 62b9520a00fd..964934b1caef 100644
--- a/include/dt-bindings/clock/ast2600-clock.h
+++ b/include/dt-bindings/clock/ast2600-clock.h
@@ -89,6 +89,7 @@
 #define ASPEED_CLK_MAC4RCLK70
 
 /* Only list resets here that are not part of a gate */
+#define ASPEED_RESET_ESPI  57
 #define ASPEED_RESET_ADC   55
 #define ASPEED_RESET_JTAG_MASTER2  54
 #define ASPEED_RESET_I3C_DMA   39
-- 
2.17.1



[PATCH 0/6] arm: aspeed: Add eSPI support

2021-01-05 Thread Chia-Wei, Wang
This patch series adds driver support for the eSPI controller of
Aspeed 6th generation SoCs. This controller is a slave device
communicating with a master over the Enhanced Serial Peripheral
Interface (eSPI). It supports all four eSPI channels, namely
peripheral, virtual wire, out-of-band, and flash, and operates at a
max frequency of 66 MHz.

Chia-Wei, Wang (6):
  dt-bindings: aspeed: Add eSPI controller
  MAINTAINER: Add ASPEED eSPI driver entry
  clk: ast2600: Add eSPI reset bit
  irqchip/aspeed: Add Aspeed eSPI interrupt controller
  soc: aspeed: Add eSPI driver
  ARM: dts: aspeed: Add AST2600 eSPI nodes

 .../devicetree/bindings/soc/aspeed/espi.yaml  | 252 +++
 MAINTAINERS   |  14 +
 arch/arm/boot/dts/aspeed-g6.dtsi  |  57 ++
 drivers/irqchip/Makefile  |   2 +-
 drivers/irqchip/irq-aspeed-espi-ic.c  | 251 +++
 drivers/soc/aspeed/Kconfig|  49 ++
 drivers/soc/aspeed/Makefile   |   5 +
 drivers/soc/aspeed/aspeed-espi-ctrl.c | 197 +
 drivers/soc/aspeed/aspeed-espi-flash.c| 490 
 drivers/soc/aspeed/aspeed-espi-oob.c  | 706 ++
 drivers/soc/aspeed/aspeed-espi-peripheral.c   | 613 +++
 drivers/soc/aspeed/aspeed-espi-vw.c   | 211 ++
 include/dt-bindings/clock/ast2600-clock.h |   1 +
 .../interrupt-controller/aspeed-espi-ic.h |  15 +
 include/soc/aspeed/espi.h | 279 +++
 include/uapi/linux/aspeed-espi.h  | 160 
 16 files changed, 3301 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/devicetree/bindings/soc/aspeed/espi.yaml
 create mode 100644 drivers/irqchip/irq-aspeed-espi-ic.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-ctrl.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-flash.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-oob.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-peripheral.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-vw.c
 create mode 100644 include/dt-bindings/interrupt-controller/aspeed-espi-ic.h
 create mode 100644 include/soc/aspeed/espi.h
 create mode 100644 include/uapi/linux/aspeed-espi.h

-- 
2.17.1



[PATCH 6/6] ARM: dts: aspeed: Add AST2600 eSPI nodes

2021-01-05 Thread Chia-Wei, Wang
Add eSPI nodes for the device tree of Aspeed 6th generation SoCs.

Signed-off-by: Chia-Wei, Wang 
---
 arch/arm/boot/dts/aspeed-g6.dtsi | 57 
 1 file changed, 57 insertions(+)

diff --git a/arch/arm/boot/dts/aspeed-g6.dtsi b/arch/arm/boot/dts/aspeed-g6.dtsi
index 810b0676ab03..d457baf11e37 100644
--- a/arch/arm/boot/dts/aspeed-g6.dtsi
+++ b/arch/arm/boot/dts/aspeed-g6.dtsi
@@ -3,7 +3,9 @@
 
 #include 
 #include 
+#include 
 #include 
+#include 
 
 / {
model = "Aspeed BMC";
@@ -75,6 +77,61 @@
interrupts = ;
};
 
+   espi: espi@1e6ee000 {
+   compatible = "aspeed,ast2600-espi", "simple-mfd", "syscon";
+   reg = <0x1e6ee000 0x1000>;
+
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges = <0x0 0x1e6ee000 0x1000>;
+
+   espi_ic: espi-ic {
+   #interrupt-cells = <1>;
+   compatible = "aspeed,ast2600-espi-ic";
+   interrupts-extended = <&gic GIC_SPI 42 IRQ_TYPE_LEVEL_HIGH>,
+ <&gpio0 ASPEED_GPIO(W, 7) IRQ_TYPE_EDGE_FALLING>;
+   interrupt-controller;
+   status = "disabled";
+   };
+
+   espi_ctrl: espi-ctrl {
+   compatible = "aspeed,ast2600-espi-ctrl";
+   interrupts-extended = <&espi_ic ASPEED_ESPI_IC_CTRL_EVENT>,
+ <&espi_ic ASPEED_ESPI_IC_CTRL_RESET>;
+   clocks = <&syscon ASPEED_CLK_GATE_ESPICLK>;
+   resets = <&syscon ASPEED_RESET_ESPI>;
+   status = "disabled";
+   };
+
+   espi_peripheral: espi-peripheral-channel {
+   compatible = "aspeed,ast2600-espi-peripheral";
+   interrupts-extended = <&espi_ic ASPEED_ESPI_IC_PERIF_EVENT>,
+ <&espi_ic ASPEED_ESPI_IC_CHAN_RESET>;
+   status = "disabled";
+   };
+
+   espi_vw: espi-vw-channel {
+   compatible = "aspeed,ast2600-espi-vw";
+   interrupts-extended = <&espi_ic ASPEED_ESPI_IC_VW_EVENT>,
+ <&espi_ic ASPEED_ESPI_IC_CHAN_RESET>;
+   status = "disabled";
+   };
+
+   espi_oob: espi-oob-channel {
+   compatible = "aspeed,ast2600-espi-oob";
+   interrupts-extended = <&espi_ic ASPEED_ESPI_IC_OOB_EVENT>,
+ <&espi_ic ASPEED_ESPI_IC_CHAN_RESET>;
+   status = "disabled";
+   };
+
+   espi_flash: espi-flash-channel {
+   compatible = "aspeed,ast2600-espi-flash";
+   interrupts-extended = <&espi_ic ASPEED_ESPI_IC_FLASH_EVENT>,
+ <&espi_ic ASPEED_ESPI_IC_CHAN_RESET>;
+   status = "disabled";
+   };
+   };
+
ahb {
compatible = "simple-bus";
#address-cells = <1>;
-- 
2.17.1



[PATCH 5/6] soc: aspeed: Add eSPI driver

2021-01-05 Thread Chia-Wei, Wang
The Aspeed eSPI controller is a slave device that communicates with
the master through the Enhanced Serial Peripheral Interface (eSPI).
All four eSPI channels, namely peripheral, virtual wire, out-of-band,
and flash, are supported.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/soc/aspeed/Kconfig  |  49 ++
 drivers/soc/aspeed/Makefile |   5 +
 drivers/soc/aspeed/aspeed-espi-ctrl.c   | 197 ++
 drivers/soc/aspeed/aspeed-espi-flash.c  | 490 ++
 drivers/soc/aspeed/aspeed-espi-oob.c| 706 
 drivers/soc/aspeed/aspeed-espi-peripheral.c | 613 +
 drivers/soc/aspeed/aspeed-espi-vw.c | 211 ++
 include/uapi/linux/aspeed-espi.h| 160 +
 8 files changed, 2431 insertions(+)
 create mode 100644 drivers/soc/aspeed/aspeed-espi-ctrl.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-flash.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-oob.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-peripheral.c
 create mode 100644 drivers/soc/aspeed/aspeed-espi-vw.c
 create mode 100644 include/uapi/linux/aspeed-espi.h

diff --git a/drivers/soc/aspeed/Kconfig b/drivers/soc/aspeed/Kconfig
index 243ca196e6ad..e4408e97023d 100644
--- a/drivers/soc/aspeed/Kconfig
+++ b/drivers/soc/aspeed/Kconfig
@@ -42,6 +42,55 @@ config ASPEED_SOCINFO
help
  Say yes to support decoding of ASPEED BMC information.
 
+config ASPEED_ESPI
+   tristate "Aspeed eSPI Engine Driver"
+   select REGMAP
+   select MFD_SYSCON
+   select ASPEED_ESPI_PERIPHERAL
+   select ASPEED_ESPI_VW
+   select ASPEED_ESPI_OOB
+   select ASPEED_ESPI_FLASH
+   default n
+   help
+ Enable support for the Aspeed eSPI engine. The eSPI engine
+ acts as a slave device in the BMC to communicate with the host
+ side master over the eSPI bus interface.
+
+ eSPI and LPC are mutually exclusive features on Aspeed SoCs.
+ If unsure, say N.
+
+config ASPEED_ESPI_PERIPHERAL
+   tristate "Aspeed eSPI peripheral channel driver"
+   depends on ASPEED_ESPI
+   help
+ Control Aspeed eSPI peripheral channel driver. The driver
+ also provides an eSPI packet put/get interface to communicate
+ with the eSPI host.
+
+config ASPEED_ESPI_VW
+   tristate "Aspeed eSPI virtual wire channel driver"
+   depends on ASPEED_ESPI
+   help
+ Control Aspeed eSPI virtual wire channel driver. The driver
+ also provides an eSPI packet put/get interface to communicate
+ with the eSPI host.
+
+config ASPEED_ESPI_OOB
+   tristate "Aspeed eSPI out-of-band channel driver"
+   depends on ASPEED_ESPI
+   help
+ Control Aspeed eSPI out-of-band channel driver. The driver
+ also provides an eSPI packet put/get interface to communicate
+ with the eSPI host.
+
+config ASPEED_ESPI_FLASH
+   tristate "Aspeed eSPI flash channel driver"
+   depends on ASPEED_ESPI
+   help
+ Control Aspeed eSPI flash channel driver. The driver
+ also provides an eSPI packet put/get interface to communicate
+ with the eSPI host.
+
 endmenu
 
 endif
diff --git a/drivers/soc/aspeed/Makefile b/drivers/soc/aspeed/Makefile
index fcab7192e1a4..ac41ce82bb78 100644
--- a/drivers/soc/aspeed/Makefile
+++ b/drivers/soc/aspeed/Makefile
@@ -3,3 +3,8 @@ obj-$(CONFIG_ASPEED_LPC_CTRL)   += aspeed-lpc-ctrl.o
 obj-$(CONFIG_ASPEED_LPC_SNOOP) += aspeed-lpc-snoop.o
 obj-$(CONFIG_ASPEED_P2A_CTRL)  += aspeed-p2a-ctrl.o
 obj-$(CONFIG_ASPEED_SOCINFO)   += aspeed-socinfo.o
+obj-$(CONFIG_ASPEED_ESPI)  += aspeed-espi-ctrl.o \
+  aspeed-espi-peripheral.o \
+  aspeed-espi-vw.o \
+  aspeed-espi-oob.o \
+  aspeed-espi-flash.o
diff --git a/drivers/soc/aspeed/aspeed-espi-ctrl.c b/drivers/soc/aspeed/aspeed-espi-ctrl.c
new file mode 100644
index ..e4329f5f8ed3
--- /dev/null
+++ b/drivers/soc/aspeed/aspeed-espi-ctrl.c
@@ -0,0 +1,197 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2020 Aspeed Technology Inc.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define DEVICE_NAME "aspeed-espi-ctrl"
+
+struct aspeed_espi_ctrl {
+   struct regmap *map;
+   struct clk *clk;
+   struct reset_control *rst;
+
+   int irq;
+   int irq_reset;
+};
+
+static irqreturn_t aspeed_espi_ctrl_isr(int irq, void *arg)
+{
+   uint32_t sts;
+   struct aspeed_espi_ctrl *espi_ctrl = (struct aspeed_espi_ctrl *)arg;
+
+   regmap_read(espi_ctrl->map, ESPI_INT_STS, &sts);
+
+   if (!(sts & ESPI_INT_STS_HW_RST_DEASSERT))
+   return IRQ_NONE;
+
+   regmap_update_bits(espi_ctrl->map, ESPI_SYSEVT,
+

[PATCH 1/6] dt-bindings: aspeed: Add eSPI controller

2021-01-05 Thread Chia-Wei, Wang
Add dt-bindings and the inclusion header for Aspeed eSPI controller.

Signed-off-by: Chia-Wei, Wang 
---
 .../devicetree/bindings/soc/aspeed/espi.yaml  | 252 ++
 .../interrupt-controller/aspeed-espi-ic.h |  15 ++
 2 files changed, 267 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/soc/aspeed/espi.yaml
 create mode 100644 include/dt-bindings/interrupt-controller/aspeed-espi-ic.h

diff --git a/Documentation/devicetree/bindings/soc/aspeed/espi.yaml b/Documentation/devicetree/bindings/soc/aspeed/espi.yaml
new file mode 100644
index ..ee92b66fe15b
--- /dev/null
+++ b/Documentation/devicetree/bindings/soc/aspeed/espi.yaml
@@ -0,0 +1,252 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+# Copyright (c) 2020 Aspeed Technology Inc.
+%YAML 1.2
+---
+$id: "http://devicetree.org/schemas/soc/aspeed/espi.yaml#"
+$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+
+title: Aspeed eSPI Controller
+
+maintainers:
+  - Chia-Wei Wang 
+  - Ryan Chen 
+
+description:
+  Aspeed eSPI controller implements a slave side eSPI endpoint device
+  supporting the four eSPI channels, namely peripheral, virtual wire,
+  out-of-band, and flash.
+
+properties:
+  compatible:
+items:
+  - enum:
+  - aspeed,ast2600-espi
+  - const: simple-mfd
+  - const: syscon
+
+  reg:
+maxItems: 1
+
+  "#address-cells":
+const: 1
+
+  "#size-cells":
+const: 1
+
+  ranges: true
+
+  espi-ic:
+type: object
+
+properties:
+  compatible:
+const: aspeed,ast2600-espi-ic
+
+  interrupts:
+maxItems: 2
+
+  interrupt-controller: true
+
+  "#interrupt-cells":
+const: 1
+description:
+  The cell is a SW-encoded number for IRQ dispatching to eSPI channels
+
+required:
+  - compatible
+  - interrupts
+  - interrupt-controller
+  - "#interrupt-cells"
+
+  espi-ctrl:
+type: object
+
+properties:
+  compatible:
+const: aspeed,ast2600-espi-ctrl
+
+  interrupts:
+maxItems: 2
+
+  clocks:
+maxItems: 1
+
+  resets:
+maxItems: 1
+
+required:
+  - compatible
+  - interrupts
+  - clocks
+  - resets
+
+  espi-peripheral-channel:
+type: object
+
+properties:
+  compatible:
+const: aspeed,ast2600-espi-peripheral
+
+  interrupts:
+maxItems: 2
+
+  dma-mode:
+type: boolean
+
+  memcyc,map-src-addr:
+$ref: "/schemas/types.yaml#/definitions/uint32"
+description: The host side address to be decoded into the memory cycle over eSPI peripheral channel
+
+  memcyc,map-size:
+$ref: "/schemas/types.yaml#/definitions/uint32"
+description: The size of the memory region allocated for the memory cycle over eSPI peripheral channel
+minimum: 65536
+
+required:
+  - compatible
+  - interrupts
+  - memcyc,map-src-addr
+  - memcyc,map-size
+
+  espi-vw-channel:
+    type: object
+
+    properties:
+      compatible:
+        const: aspeed,ast2600-espi-vw
+
+      interrupts:
+        maxItems: 2
+
+    required:
+      - compatible
+      - interrupts
+
+  espi-oob-channel:
+    type: object
+
+    properties:
+      compatible:
+        const: aspeed,ast2600-espi-oob
+
+      interrupts:
+        maxItems: 2
+
+      dma-mode:
+        type: boolean
+
+      dma-tx-desc-num:
+        $ref: "/schemas/types.yaml#/definitions/uint32"
+        minimum: 2
+        maximum: 1023
+
+      dma-rx-desc-num:
+        $ref: "/schemas/types.yaml#/definitions/uint32"
+        minimum: 2
+        maximum: 1023
+
+    required:
+      - compatible
+      - interrupts
+
+  espi-flash-channel:
+    type: object
+
+    properties:
+      compatible:
+        const: aspeed,ast2600-espi-flash
+
+      interrupts:
+        maxItems: 2
+
+      dma-mode:
+        type: boolean
+
+      safs-mode:
+        description: Slave-Attached-Sharing-Flash mode, 0->Mix, 1->SW, 2->HW
+        enum: [ 0, 1, 2 ]
+
+    required:
+      - compatible
+      - interrupts
+
+required:
+  - compatible
+  - reg
+  - "#address-cells"
+  - "#size-cells"
+  - ranges
+  - espi-ic
+  - espi-ctrl
+  - espi-peripheral-channel
+  - espi-vw-channel
+  - espi-oob-channel
+  - espi-flash-channel
+
+additionalProperties: false
+
+examples:
+  - |
+#include 
+#include 
+#include 
+#include 
+
+    espi: espi@1e6ee000 {
+        compatible = "aspeed,ast2600-espi", "simple-mfd", "syscon";
+        reg = <0x1e6ee000 0x1000>;
+
+        #address-cells = <1>;
+        #size-cells = <1>;
+        ranges = <0x0 0x1e6ee000 0x1000>;
+
+        espi_ic: espi-ic {
+            #interrupt-cells = <1>;
+            compatible = "aspeed,ast2600-espi-ic";
+            interrupts-exte

[PATCH 4/6] irqchip/aspeed: Add Aspeed eSPI interrupt controller

2021-01-05 Thread Chia-Wei, Wang
The eSPI interrupt controller acts as a SW IRQ number
decoder to correctly control/dispatch interrupts of
the eSPI peripheral, virtual wire, out-of-band, and
flash channels.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/irqchip/Makefile |   2 +-
 drivers/irqchip/irq-aspeed-espi-ic.c | 251 
 include/soc/aspeed/espi.h| 279 +++
 3 files changed, 531 insertions(+), 1 deletion(-)
 create mode 100644 drivers/irqchip/irq-aspeed-espi-ic.c
 create mode 100644 include/soc/aspeed/espi.h

diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile
index 0ac93bfaec61..56da4a3123f8 100644
--- a/drivers/irqchip/Makefile
+++ b/drivers/irqchip/Makefile
@@ -86,7 +86,7 @@ obj-$(CONFIG_MVEBU_PIC)			+= irq-mvebu-pic.o
 obj-$(CONFIG_MVEBU_SEI)			+= irq-mvebu-sei.o
 obj-$(CONFIG_LS_EXTIRQ)			+= irq-ls-extirq.o
 obj-$(CONFIG_LS_SCFG_MSI)		+= irq-ls-scfg-msi.o
-obj-$(CONFIG_ARCH_ASPEED)		+= irq-aspeed-vic.o irq-aspeed-i2c-ic.o irq-aspeed-scu-ic.o
+obj-$(CONFIG_ARCH_ASPEED)		+= irq-aspeed-vic.o irq-aspeed-i2c-ic.o irq-aspeed-scu-ic.o irq-aspeed-espi-ic.o
 obj-$(CONFIG_STM32_EXTI)		+= irq-stm32-exti.o
 obj-$(CONFIG_QCOM_IRQ_COMBINER)		+= qcom-irq-combiner.o
 obj-$(CONFIG_IRQ_UNIPHIER_AIDET)	+= irq-uniphier-aidet.o
diff --git a/drivers/irqchip/irq-aspeed-espi-ic.c 
b/drivers/irqchip/irq-aspeed-espi-ic.c
new file mode 100644
index ..8a5cc8fe3f0c
--- /dev/null
+++ b/drivers/irqchip/irq-aspeed-espi-ic.c
@@ -0,0 +1,251 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2020 Aspeed Technology Inc.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#define DEVICE_NAME	"aspeed-espi-ic"
+#define IRQCHIP_NAME	"eSPI-IC"
+
+#define ESPI_IC_IRQ_NUM	7
+
+struct aspeed_espi_ic {
+   struct regmap *map;
+   int irq;
+   int gpio_irq;
+   struct irq_domain *irq_domain;
+};
+
+static void aspeed_espi_ic_gpio_isr(struct irq_desc *desc)
+{
+   unsigned int irq;
+   struct aspeed_espi_ic *espi_ic = irq_desc_get_handler_data(desc);
+   struct irq_chip *chip = irq_desc_get_chip(desc);
+
+   chained_irq_enter(chip, desc);
+
+   irq = irq_find_mapping(espi_ic->irq_domain,
+  ASPEED_ESPI_IC_CTRL_RESET);
+   generic_handle_irq(irq);
+
+   irq = irq_find_mapping(espi_ic->irq_domain,
+  ASPEED_ESPI_IC_CHAN_RESET);
+   generic_handle_irq(irq);
+
+   chained_irq_exit(chip, desc);
+}
+
+static void aspeed_espi_ic_isr(struct irq_desc *desc)
+{
+   unsigned int sts;
+   unsigned int irq;
+   struct aspeed_espi_ic *espi_ic = irq_desc_get_handler_data(desc);
+   struct irq_chip *chip = irq_desc_get_chip(desc);
+
+   chained_irq_enter(chip, desc);
+
+	regmap_read(espi_ic->map, ESPI_INT_STS, &sts);
+
+   if (sts & ESPI_INT_STS_PERIF_BITS) {
+   irq = irq_find_mapping(espi_ic->irq_domain,
+  ASPEED_ESPI_IC_PERIF_EVENT);
+   generic_handle_irq(irq);
+   }
+
+   if (sts & ESPI_INT_STS_VW_BITS) {
+   irq = irq_find_mapping(espi_ic->irq_domain,
+  ASPEED_ESPI_IC_VW_EVENT);
+   generic_handle_irq(irq);
+   }
+
+   if (sts & ESPI_INT_STS_OOB_BITS) {
+   irq = irq_find_mapping(espi_ic->irq_domain,
+  ASPEED_ESPI_IC_OOB_EVENT);
+   generic_handle_irq(irq);
+   }
+
+   if (sts & ESPI_INT_STS_FLASH_BITS) {
+   irq = irq_find_mapping(espi_ic->irq_domain,
+  ASPEED_ESPI_IC_FLASH_EVENT);
+   generic_handle_irq(irq);
+   }
+
+   if (sts & ESPI_INT_STS_HW_RST_DEASSERT) {
+   irq = irq_find_mapping(espi_ic->irq_domain,
+  ASPEED_ESPI_IC_CTRL_EVENT);
+   generic_handle_irq(irq);
+   }
+
+   chained_irq_exit(chip, desc);
+}
+
+static void aspeed_espi_ic_irq_disable(struct irq_data *data)
+{
+   struct aspeed_espi_ic *espi_ic = irq_data_get_irq_chip_data(data);
+
+   switch (data->hwirq) {
+   case ASPEED_ESPI_IC_CTRL_EVENT:
+   regmap_update_bits(espi_ic->map, ESPI_INT_EN,
+  ESPI_INT_EN_HW_RST_DEASSERT,
+  0);
+   break;
+   case ASPEED_ESPI_IC_PERIF_EVENT:
+   regmap_update_bits(espi_ic->map, ESPI_INT_EN,
+  ESPI_INT_EN_PERIF_BITS, 0);
+   break;
+   case ASPEED_ESPI_IC_VW_EVENT:
+   regmap_update_bits(espi_ic->

[PATCH 2/6] MAINTAINER: Add ASPEED eSPI driver entry

2021-01-05 Thread Chia-Wei, Wang
Add myself and Ryan Chen as maintainer of the Aspeed eSPI
driver and the associated eSPI interrupt controller.
Joel Stanley is also added as the reviewer.

Signed-off-by: Chia-Wei, Wang 
---
 MAINTAINERS | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 7c1e45c416b1..d5f9205a5439 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1639,6 +1639,20 @@ F:   drivers/crypto/axis
 F: drivers/mmc/host/usdhi6rol0.c
 F: drivers/pinctrl/pinctrl-artpec*
 
+ARM/ASPEED ESPI DRIVER
+M: Chia-Wei Wang 
+M: Ryan Chen 
+R: Joel Stanley 
+L: linux-asp...@lists.ozlabs.org (moderated for non-subscribers)
+L: open...@lists.ozlabs.org (moderated for non-subscribers)
+S: Maintained
+F: Documentation/devicetree/bindings/soc/aspeed/espi.yaml
+F: drivers/irqchip/irq-aspeed-espi-ic.c
+F: drivers/soc/aspeed/aspeed-espi*
+F: include/dt-bindings/interrupt-controller/aspeed-espi-ic.h
+F: include/soc/aspeed/espi.h
+F: include/uapi/linux/aspeed-espi.h
+
 ARM/ASPEED I2C DRIVER
 M: Brendan Higgins 
 R: Benjamin Herrenschmidt 
-- 
2.17.1



[PATCH v4 4/5] pinctrl: aspeed-g5: Adapt to new LPC device tree layout

2020-12-28 Thread Chia-Wei, Wang
Add check against LPC device v2 compatible string to
ensure that the fixed device tree layout is adopted.
The LPC register offsets are also fixed accordingly.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c 
b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
index 0cab4c2576e2..996ebcba4d38 100644
--- a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
+++ b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
@@ -60,7 +60,7 @@
 #define COND2  { ASPEED_IP_SCU, SCU94, GENMASK(1, 0), 0, 0 }
 
 /* LHCR0 is offset from the end of the H8S/2168-compatible registers */
-#define LHCR0  0x20
+#define LHCR0  0xa0
 #define GFX064 0x64
 
 #define B14 0
@@ -2648,14 +2648,19 @@ static struct regmap *aspeed_g5_acquire_regmap(struct 
aspeed_pinmux_data *ctx,
}
 
if (ip == ASPEED_IP_LPC) {
-   struct device_node *node;
+   struct device_node *np;
struct regmap *map;
 
-   node = of_parse_phandle(ctx->dev->of_node,
+   np = of_parse_phandle(ctx->dev->of_node,
"aspeed,external-nodes", 1);
-   if (node) {
-   map = syscon_node_to_regmap(node->parent);
-   of_node_put(node);
+   if (np) {
+   if (!of_device_is_compatible(np->parent, 
"aspeed,ast2400-lpc-v2") &&
+   !of_device_is_compatible(np->parent, 
"aspeed,ast2500-lpc-v2") &&
+   !of_device_is_compatible(np->parent, 
"aspeed,ast2600-lpc-v2"))
+   return ERR_PTR(-ENODEV);
+
+   map = syscon_node_to_regmap(np->parent);
+   of_node_put(np);
if (IS_ERR(map))
return map;
} else
-- 
2.17.1



[PATCH v4 1/5] dt-bindings: aspeed-lpc: Remove LPC partitioning

2020-12-28 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.
This patch fixes the documentation by removing the description on LPC
partitions. The register offsets illustrated in the DTS node examples
are also fixed to adapt to the LPC DTS change.

Signed-off-by: Chia-Wei, Wang 
---
 .../devicetree/bindings/mfd/aspeed-lpc.txt| 99 ---
 1 file changed, 21 insertions(+), 78 deletions(-)

diff --git a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt 
b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt
index d0a38ba8b9ce..90eb0ecc95d1 100644
--- a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt
+++ b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt
@@ -9,13 +9,7 @@ primary use case of the Aspeed LPC controller is as a slave on 
the bus
 conditions it can also take the role of bus master.
 
 The LPC controller is represented as a multi-function device to account for the
-mix of functionality it provides. The principle split is between the register
-layout at the start of the I/O space which is, to quote the Aspeed datasheet,
-"basically compatible with the [LPC registers from the] popular BMC controller
-H8S/2168[1]", and everything else, where everything else is an eclectic
-collection of functions with a esoteric register layout. "Everything else",
-here labeled the "host" portion of the controller, includes, but is not limited
-to:
+mix of functionality, which includes, but is not limited to:
 
 * An IPMI Block Transfer[2] Controller
 
@@ -44,80 +38,29 @@ Required properties
 ===
 
 - compatible:  One of:
-   "aspeed,ast2400-lpc", "simple-mfd"
-   "aspeed,ast2500-lpc", "simple-mfd"
-   "aspeed,ast2600-lpc", "simple-mfd"
+   "aspeed,ast2400-lpc-v2", "simple-mfd", "syscon"
+   "aspeed,ast2500-lpc-v2", "simple-mfd", "syscon"
+   "aspeed,ast2600-lpc-v2", "simple-mfd", "syscon"
 
 - reg: contains the physical address and length values of the Aspeed
 LPC memory region.
 
 - #address-cells: <1>
 - #size-cells: <1>
-- ranges:  Maps 0 to the physical address and length of the LPC memory
-region
-
-Required LPC Child nodes
-
-
-BMC Node
-
-
-- compatible:  One of:
-   "aspeed,ast2400-lpc-bmc"
-   "aspeed,ast2500-lpc-bmc"
-   "aspeed,ast2600-lpc-bmc"
-
-- reg: contains the physical address and length values of the
-H8S/2168-compatible LPC controller memory region
-
-Host Node
--
-
-- compatible:   One of:
-   "aspeed,ast2400-lpc-host", "simple-mfd", "syscon"
-   "aspeed,ast2500-lpc-host", "simple-mfd", "syscon"
-   "aspeed,ast2600-lpc-host", "simple-mfd", "syscon"
-
-- reg: contains the address and length values of the host-related
-register space for the Aspeed LPC controller
-
-- #address-cells: <1>
-- #size-cells: <1>
-- ranges:  Maps 0 to the address and length of the host-related LPC memory
+- ranges:  Maps 0 to the physical address and length of the LPC memory
 region
 
 Example:
 
 lpc: lpc@1e789000 {
-   compatible = "aspeed,ast2500-lpc", "simple-mfd";
+   compatible = "aspeed,ast2500-lpc-v2", "simple-mfd", "syscon";
reg = <0x1e789000 0x1000>;
 
#address-cells = <1>;
#size-cells = <1>;
ranges = <0x0 0x1e789000 0x1000>;
-
-   lpc_bmc: lpc-bmc@0 {
-   compatible = "aspeed,ast2500-lpc-bmc";
-   reg = <0x0 0x80>;
-   };
-
-   lpc_host: lpc-host@80 {
-   compatible = "aspeed,ast2500-lpc-host", "simple-mfd", "syscon";
-   reg = <0x80 0x1e0>;
-   reg-io-width = <4>;
-
-   #address-cells = <1>;
-   #size-cells = <1>;
-   ranges = <0x0 0x80 0x1e0>;
-   };
 };
 
-BMC Node Children
-==
-
-
-Host Node Children
-==
 
 LPC Host Interface Controller
 ---
@@ -149,14 +92,12 @@ Optional properties:
 
 Example:
 
-lpc-host@80 {
-   lpc_ctrl: lpc-ctrl@0 {
-   compatible = "aspeed,ast2500-lpc-ctrl";
-   reg = <0x0 0x80>;
-	clocks = <&syscon ASPEED_CLK_GATE_LCLK>;
-	memory-region = <&flash_memory>;
-	flash = <&spi>;
-   };
+lpc_ctrl: lpc-ctrl@80 {
+   compatible = "aspeed,ast2500-lpc-ctrl";
+   reg = <0x8

[PATCH v4 5/5] soc: aspeed: Adapt to new LPC device tree layout

2020-12-28 Thread Chia-Wei, Wang
Add check against LPC device v2 compatible string to
ensure that the fixed device tree layout is adopted.
The LPC register offsets are also fixed accordingly.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/soc/aspeed/aspeed-lpc-ctrl.c  | 20 ++--
 drivers/soc/aspeed/aspeed-lpc-snoop.c | 23 +++
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/drivers/soc/aspeed/aspeed-lpc-ctrl.c 
b/drivers/soc/aspeed/aspeed-lpc-ctrl.c
index 439bcd6b8c4a..c557ffd0992c 100644
--- a/drivers/soc/aspeed/aspeed-lpc-ctrl.c
+++ b/drivers/soc/aspeed/aspeed-lpc-ctrl.c
@@ -18,15 +18,15 @@
 
 #define DEVICE_NAME	"aspeed-lpc-ctrl"
 
-#define HICR5 0x0
+#define HICR5 0x80
 #define HICR5_ENL2H	BIT(8)
 #define HICR5_ENFWH	BIT(10)
 
-#define HICR6 0x4
+#define HICR6 0x84
 #define SW_FWH2AHB BIT(17)
 
-#define HICR7 0x8
-#define HICR8 0xc
+#define HICR7 0x88
+#define HICR8 0x8c
 
 struct aspeed_lpc_ctrl {
struct miscdevice   miscdev;
@@ -215,6 +215,7 @@ static int aspeed_lpc_ctrl_probe(struct platform_device 
*pdev)
struct device_node *node;
struct resource resm;
struct device *dev;
+   struct device_node *np;
int rc;
 
	dev = &pdev->dev;
@@ -270,8 +271,15 @@ static int aspeed_lpc_ctrl_probe(struct platform_device 
*pdev)
}
}
 
-   lpc_ctrl->regmap = syscon_node_to_regmap(
-   pdev->dev.parent->of_node);
+   np = pdev->dev.parent->of_node;
+   if (!of_device_is_compatible(np, "aspeed,ast2400-lpc-v2") &&
+   !of_device_is_compatible(np, "aspeed,ast2500-lpc-v2") &&
+   !of_device_is_compatible(np, "aspeed,ast2600-lpc-v2")) {
+   dev_err(dev, "unsupported LPC device binding\n");
+   return -ENODEV;
+   }
+
+   lpc_ctrl->regmap = syscon_node_to_regmap(np);
if (IS_ERR(lpc_ctrl->regmap)) {
dev_err(dev, "Couldn't get regmap\n");
return -ENODEV;
diff --git a/drivers/soc/aspeed/aspeed-lpc-snoop.c 
b/drivers/soc/aspeed/aspeed-lpc-snoop.c
index 682ba0eb4eba..ab0f0a54fea6 100644
--- a/drivers/soc/aspeed/aspeed-lpc-snoop.c
+++ b/drivers/soc/aspeed/aspeed-lpc-snoop.c
@@ -28,26 +28,25 @@
 #define NUM_SNOOP_CHANNELS 2
 #define SNOOP_FIFO_SIZE 2048
 
-#define HICR5  0x0
+#define HICR5  0x80
 #define HICR5_EN_SNP0W BIT(0)
 #define HICR5_ENINT_SNP0W  BIT(1)
 #define HICR5_EN_SNP1W BIT(2)
 #define HICR5_ENINT_SNP1W  BIT(3)
-
-#define HICR6  0x4
+#define HICR6  0x84
 #define HICR6_STR_SNP0W	BIT(0)
 #define HICR6_STR_SNP1W	BIT(1)
-#define SNPWADR0x10
+#define SNPWADR0x90
 #define SNPWADR_CH0_MASK   GENMASK(15, 0)
 #define SNPWADR_CH0_SHIFT  0
 #define SNPWADR_CH1_MASK   GENMASK(31, 16)
 #define SNPWADR_CH1_SHIFT  16
-#define SNPWDR 0x14
+#define SNPWDR 0x94
 #define SNPWDR_CH0_MASK	GENMASK(7, 0)
 #define SNPWDR_CH0_SHIFT	0
 #define SNPWDR_CH1_MASK	GENMASK(15, 8)
 #define SNPWDR_CH1_SHIFT	8
-#define HICRB  0x80
+#define HICRB  0x100
 #define HICRB_ENSNP0D  BIT(14)
 #define HICRB_ENSNP1D  BIT(15)
 
@@ -258,6 +257,7 @@ static int aspeed_lpc_snoop_probe(struct platform_device 
*pdev)
 {
struct aspeed_lpc_snoop *lpc_snoop;
struct device *dev;
+   struct device_node *np;
u32 port;
int rc;
 
@@ -267,8 +267,15 @@ static int aspeed_lpc_snoop_probe(struct platform_device 
*pdev)
if (!lpc_snoop)
return -ENOMEM;
 
-   lpc_snoop->regmap = syscon_node_to_regmap(
-   pdev->dev.parent->of_node);
+   np = pdev->dev.parent->of_node;
+   if (!of_device_is_compatible(np, "aspeed,ast2400-lpc-v2") &&
+   !of_device_is_compatible(np, "aspeed,ast2500-lpc-v2") &&
+   !of_device_is_compatible(np, "aspeed,ast2600-lpc-v2")) {
+   dev_err(dev, "unsupported LPC device binding\n");
+   return -ENODEV;
+   }
+
+   lpc_snoop->regmap = syscon_node_to_regmap(np);
if (IS_ERR(lpc_snoop->regmap)) {
dev_err(dev, "Couldn't get regmap\n");
return -ENODEV;
-- 
2.17.1



[PATCH v4 3/5] ipmi: kcs: aspeed: Adapt to new LPC DTS layout

2020-12-28 Thread Chia-Wei, Wang
Add check against LPC device v2 compatible string to
ensure that the fixed device tree layout is adopted.
The LPC register offsets are also fixed accordingly.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/char/ipmi/kcs_bmc_aspeed.c | 27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c 
b/drivers/char/ipmi/kcs_bmc_aspeed.c
index a140203c079b..eefe362f65f0 100644
--- a/drivers/char/ipmi/kcs_bmc_aspeed.c
+++ b/drivers/char/ipmi/kcs_bmc_aspeed.c
@@ -27,7 +27,6 @@
 
 #define KCS_CHANNEL_MAX 4
 
-/* mapped to lpc-bmc@0 IO space */
 #define LPC_HICR0	0x000
 #define LPC_HICR0_LPC3E  BIT(7)
 #define LPC_HICR0_LPC2E  BIT(6)
@@ -52,15 +51,13 @@
 #define LPC_STR1 0x03C
 #define LPC_STR2 0x040
 #define LPC_STR3 0x044
-
-/* mapped to lpc-host@80 IO space */
-#define LPC_HICRB	0x080
+#define LPC_HICRB	0x100
 #define LPC_HICRB_IBFIF4 BIT(1)
 #define LPC_HICRB_LPC4E  BIT(0)
-#define LPC_LADR4	0x090
-#define LPC_IDR4 0x094
-#define LPC_ODR4 0x098
-#define LPC_STR4 0x09C
+#define LPC_LADR4	0x110
+#define LPC_IDR4 0x114
+#define LPC_ODR4 0x118
+#define LPC_STR4 0x11C
 
 struct aspeed_kcs_bmc {
struct regmap *map;
@@ -348,12 +345,20 @@ static int aspeed_kcs_probe(struct platform_device *pdev)
struct device_node *np;
int rc;
 
-   np = pdev->dev.of_node;
+   np = dev->of_node->parent;
+   if (!of_device_is_compatible(np, "aspeed,ast2400-lpc-v2") &&
+   !of_device_is_compatible(np, "aspeed,ast2500-lpc-v2") &&
+   !of_device_is_compatible(np, "aspeed,ast2600-lpc-v2")) {
+   dev_err(dev, "unsupported LPC device binding\n");
+   return -ENODEV;
+   }
+
+   np = dev->of_node;
if (of_device_is_compatible(np, "aspeed,ast2400-kcs-bmc") ||
-   of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc"))
+   of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc"))
kcs_bmc = aspeed_kcs_probe_of_v1(pdev);
else if (of_device_is_compatible(np, "aspeed,ast2400-kcs-bmc-v2") ||
-   of_device_is_compatible(np, 
"aspeed,ast2500-kcs-bmc-v2"))
+of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc-v2"))
kcs_bmc = aspeed_kcs_probe_of_v2(pdev);
else
return -EINVAL;
-- 
2.17.1



[PATCH v4 2/5] ARM: dts: Remove LPC BMC and Host partitions

2020-12-28 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.

A concrete instance is that the HICRB[5:4] are for the I/O port address
configuration of KCS channel 1/2. However, the KCS driver cannot access
HICRB for channel 1/2 initialization via the syscon regmap interface due
to the partition boundary. (i.e. offset 80h)

In addition, for the HW design backward compatibility, a newly added HW
control bit could be located at any reserved bit over the LPC addressing
space. Thereby, this patch removes the lpc-bmc and lpc-host child nodes
and thus the LPC partitioning.

Note that this change requires the synchronization between device tree
change and the driver change. To prevent the misuse of old devicetrees
with new drivers, or vice versa, the v2 compatible strings are adopted
for the LPC device as listed:

"aspeed,ast2400-lpc-v2"
"aspeed,ast2500-lpc-v2"
"aspeed,ast2600-lpc-v2"

Signed-off-by: Chia-Wei, Wang 
---
 arch/arm/boot/dts/aspeed-g4.dtsi |  74 +++--
 arch/arm/boot/dts/aspeed-g5.dtsi | 135 ++-
 arch/arm/boot/dts/aspeed-g6.dtsi | 135 ++-
 3 files changed, 148 insertions(+), 196 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi b/arch/arm/boot/dts/aspeed-g4.dtsi
index b3dafbc8caca..ee22bc036440 100644
--- a/arch/arm/boot/dts/aspeed-g4.dtsi
+++ b/arch/arm/boot/dts/aspeed-g4.dtsi
@@ -343,58 +343,44 @@
};
 
lpc: lpc@1e789000 {
-   compatible = "aspeed,ast2400-lpc", "simple-mfd";
+   compatible = "aspeed,ast2400-lpc-v2", 
"simple-mfd", "syscon";
reg = <0x1e789000 0x1000>;
+   reg-io-width = <4>;
 
#address-cells = <1>;
#size-cells = <1>;
ranges = <0x0 0x1e789000 0x1000>;
 
-   lpc_bmc: lpc-bmc@0 {
-   compatible = "aspeed,ast2400-lpc-bmc";
-   reg = <0x0 0x80>;
+   lpc_ctrl: lpc-ctrl@80 {
+   compatible = "aspeed,ast2400-lpc-ctrl";
+   reg = <0x80 0x10>;
+				clocks = <&syscon ASPEED_CLK_GATE_LCLK>;
+   status = "disabled";
};
 
-   lpc_host: lpc-host@80 {
-   compatible = "aspeed,ast2400-lpc-host", 
"simple-mfd", "syscon";
-   reg = <0x80 0x1e0>;
-   reg-io-width = <4>;
-
-   #address-cells = <1>;
-   #size-cells = <1>;
-   ranges = <0x0 0x80 0x1e0>;
-
-   lpc_ctrl: lpc-ctrl@0 {
-   compatible = 
"aspeed,ast2400-lpc-ctrl";
-   reg = <0x0 0x10>;
-			clocks = <&syscon ASPEED_CLK_GATE_LCLK>;
-   status = "disabled";
-   };
-
-   lpc_snoop: lpc-snoop@10 {
-   compatible = 
"aspeed,ast2400-lpc-snoop";
-   reg = <0x10 0x8>;
-   interrupts = <8>;
-   status = "disabled";
-   };
-
-   lhc: lhc@20 {
-   compatible = 
"aspeed,ast2400-lhc";
-   reg = <0x20 0x24 0x48 0x8>;
-   };
-
-   lpc_reset: reset-controller@18 {
-   compatible = 
"aspeed,ast2400-lpc-reset";
-   reg = <0x18 0x4>;
-   #reset-cells = <1>;
-   };
-
-   ibt: ibt@c0  {
-   compatible = 
"aspeed,ast2400-ibt-bmc";
-   reg = <0xc0 0x18>;
-   

[PATCH v4 0/5] Remove LPC register partitioning

2020-12-28 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.
The incorrect partitioning can impose unnecessary range restrictions
on register access through the syscon regmap interface.

For instance, HICRB contains the I/O port address configuration
of KCS channel 1/2. However, the KCS#1/#2 drivers cannot access
HICRB as it is located at the other LPC partition.

In addition, to be backward compatible, the newly added HW control
bits could be located at any reserved bits over the LPC addressing
space.

Thereby, this patch series aims to remove the LPC partitioning for
better driver development and maintenance. This requires the change
to both the device tree and the driver implementation. To ensure
both sides are synchronously updated, a v2 binding check is added.

Changes since v3:
- Revise binding check as suggested by Haiyue Wang

Changes since v2:
- Add v2 binding check to ensure the synchronization between the
  device tree change and the driver register offset fix.

Changes since v1:
- Add the fix to the aspeed-lpc binding documentation.

Chia-Wei, Wang (5):
  dt-bindings: aspeed-lpc: Remove LPC partitioning
  ARM: dts: Remove LPC BMC and Host partitions
  ipmi: kcs: aspeed: Adapt to new LPC DTS layout
  pinctrl: aspeed-g5: Adapt to new LPC device tree layout
  soc: aspeed: Adapt to new LPC device tree layout

 .../devicetree/bindings/mfd/aspeed-lpc.txt|  99 +++--
 arch/arm/boot/dts/aspeed-g4.dtsi  |  74 --
 arch/arm/boot/dts/aspeed-g5.dtsi  | 135 --
 arch/arm/boot/dts/aspeed-g6.dtsi  | 135 --
 drivers/char/ipmi/kcs_bmc_aspeed.c|  27 ++--
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c|  17 ++-
 drivers/soc/aspeed/aspeed-lpc-ctrl.c  |  20 ++-
 drivers/soc/aspeed/aspeed-lpc-snoop.c |  23 +--
 8 files changed, 225 insertions(+), 305 deletions(-)

-- 
2.17.1



[PATCH v3 5/5] soc: aspeed: Adapt to new LPC device tree layout

2020-12-20 Thread Chia-Wei, Wang
Add check against LPC device v2 compatible string to
ensure that the fixed device tree layout is adopted.
The LPC register offsets are also fixed accordingly.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/soc/aspeed/aspeed-lpc-ctrl.c  | 20 ++--
 drivers/soc/aspeed/aspeed-lpc-snoop.c | 23 +++
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/drivers/soc/aspeed/aspeed-lpc-ctrl.c 
b/drivers/soc/aspeed/aspeed-lpc-ctrl.c
index 439bcd6b8c4a..b04074949240 100644
--- a/drivers/soc/aspeed/aspeed-lpc-ctrl.c
+++ b/drivers/soc/aspeed/aspeed-lpc-ctrl.c
@@ -18,15 +18,15 @@
 
 #define DEVICE_NAME	"aspeed-lpc-ctrl"
 
-#define HICR5 0x0
+#define HICR5 0x80
 #define HICR5_ENL2H	BIT(8)
 #define HICR5_ENFWH	BIT(10)
 
-#define HICR6 0x4
+#define HICR6 0x84
 #define SW_FWH2AHB BIT(17)
 
-#define HICR7 0x8
-#define HICR8 0xc
+#define HICR7 0x88
+#define HICR8 0x8c
 
 struct aspeed_lpc_ctrl {
struct miscdevice   miscdev;
@@ -215,6 +215,7 @@ static int aspeed_lpc_ctrl_probe(struct platform_device 
*pdev)
struct device_node *node;
struct resource resm;
struct device *dev;
+   struct device_node *lpc_np;
int rc;
 
	dev = &pdev->dev;
@@ -270,8 +271,15 @@ static int aspeed_lpc_ctrl_probe(struct platform_device 
*pdev)
}
}
 
-   lpc_ctrl->regmap = syscon_node_to_regmap(
-   pdev->dev.parent->of_node);
+   lpc_np = pdev->dev.parent->of_node;
+   if (!of_device_is_compatible(lpc_np, "aspeed,ast2400-lpc-v2") &&
+   !of_device_is_compatible(lpc_np, "aspeed,ast2500-lpc-v2") &&
+   !of_device_is_compatible(lpc_np, "aspeed,ast2600-lpc-v2")) {
+   dev_err(dev, "unsupported LPC device binding\n");
+   return -ENODEV;
+   }
+
+   lpc_ctrl->regmap = syscon_node_to_regmap(lpc_np);
if (IS_ERR(lpc_ctrl->regmap)) {
dev_err(dev, "Couldn't get regmap\n");
return -ENODEV;
diff --git a/drivers/soc/aspeed/aspeed-lpc-snoop.c 
b/drivers/soc/aspeed/aspeed-lpc-snoop.c
index 682ba0eb4eba..63c3d9b8ba61 100644
--- a/drivers/soc/aspeed/aspeed-lpc-snoop.c
+++ b/drivers/soc/aspeed/aspeed-lpc-snoop.c
@@ -28,26 +28,25 @@
 #define NUM_SNOOP_CHANNELS 2
 #define SNOOP_FIFO_SIZE 2048
 
-#define HICR5  0x0
+#define HICR5  0x80
 #define HICR5_EN_SNP0W BIT(0)
 #define HICR5_ENINT_SNP0W  BIT(1)
 #define HICR5_EN_SNP1W BIT(2)
 #define HICR5_ENINT_SNP1W  BIT(3)
-
-#define HICR6  0x4
+#define HICR6  0x84
 #define HICR6_STR_SNP0W	BIT(0)
 #define HICR6_STR_SNP1W	BIT(1)
-#define SNPWADR0x10
+#define SNPWADR0x90
 #define SNPWADR_CH0_MASK   GENMASK(15, 0)
 #define SNPWADR_CH0_SHIFT  0
 #define SNPWADR_CH1_MASK   GENMASK(31, 16)
 #define SNPWADR_CH1_SHIFT  16
-#define SNPWDR 0x14
+#define SNPWDR 0x94
 #define SNPWDR_CH0_MASK	GENMASK(7, 0)
 #define SNPWDR_CH0_SHIFT	0
 #define SNPWDR_CH1_MASK	GENMASK(15, 8)
 #define SNPWDR_CH1_SHIFT	8
-#define HICRB  0x80
+#define HICRB  0x100
 #define HICRB_ENSNP0D  BIT(14)
 #define HICRB_ENSNP1D  BIT(15)
 
@@ -258,6 +257,7 @@ static int aspeed_lpc_snoop_probe(struct platform_device 
*pdev)
 {
struct aspeed_lpc_snoop *lpc_snoop;
struct device *dev;
+   struct device_node *lpc_np;
u32 port;
int rc;
 
@@ -267,8 +267,15 @@ static int aspeed_lpc_snoop_probe(struct platform_device 
*pdev)
if (!lpc_snoop)
return -ENOMEM;
 
-   lpc_snoop->regmap = syscon_node_to_regmap(
-   pdev->dev.parent->of_node);
+   lpc_np = pdev->dev.parent->of_node;
+   if (!of_device_is_compatible(lpc_np, "aspeed,ast2400-lpc-v2") &&
+   !of_device_is_compatible(lpc_np, "aspeed,ast2500-lpc-v2") &&
+   !of_device_is_compatible(lpc_np, "aspeed,ast2600-lpc-v2")) {
+   dev_err(dev, "unsupported LPC device binding\n");
+   return -ENODEV;
+   }
+
+   lpc_snoop->regmap = syscon_node_to_regmap(lpc_np);
if (IS_ERR(lpc_snoop->regmap)) {
dev_err(dev, "Couldn't get regmap\n");
return -ENODEV;
-- 
2.17.1



[PATCH v3 3/5] ipmi: kcs: aspeed: Adapt to new LPC DTS layout

2020-12-20 Thread Chia-Wei, Wang
Add check against LPC device v2 compatible string to
ensure that the fixed device tree layout is adopted.
The LPC register offsets are also fixed accordingly.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/char/ipmi/kcs_bmc_aspeed.c | 35 ++
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c 
b/drivers/char/ipmi/kcs_bmc_aspeed.c
index a140203c079b..6283bfef4ea7 100644
--- a/drivers/char/ipmi/kcs_bmc_aspeed.c
+++ b/drivers/char/ipmi/kcs_bmc_aspeed.c
@@ -27,7 +27,6 @@
 
 #define KCS_CHANNEL_MAX 4
 
-/* mapped to lpc-bmc@0 IO space */
 #define LPC_HICR0	0x000
 #define LPC_HICR0_LPC3E  BIT(7)
 #define LPC_HICR0_LPC2E  BIT(6)
@@ -52,15 +51,13 @@
 #define LPC_STR1 0x03C
 #define LPC_STR2 0x040
 #define LPC_STR3 0x044
-
-/* mapped to lpc-host@80 IO space */
-#define LPC_HICRB	0x080
+#define LPC_HICRB	0x100
 #define LPC_HICRB_IBFIF4 BIT(1)
 #define LPC_HICRB_LPC4E  BIT(0)
-#define LPC_LADR4	0x090
-#define LPC_IDR4 0x094
-#define LPC_ODR4 0x098
-#define LPC_STR4 0x09C
+#define LPC_LADR4	0x110
+#define LPC_IDR4 0x114
+#define LPC_ODR4 0x118
+#define LPC_STR4 0x11C
 
 struct aspeed_kcs_bmc {
struct regmap *map;
@@ -345,15 +342,25 @@ static int aspeed_kcs_probe(struct platform_device *pdev)
 {
struct device *dev = >dev;
struct kcs_bmc *kcs_bmc;
-   struct device_node *np;
+   struct device_node *kcs_np;
+   struct device_node *lpc_np;
int rc;
 
-   np = pdev->dev.of_node;
-   if (of_device_is_compatible(np, "aspeed,ast2400-kcs-bmc") ||
-   of_device_is_compatible(np, "aspeed,ast2500-kcs-bmc"))
+   kcs_np = dev->of_node;
+   lpc_np = kcs_np->parent;
+
+   if (!of_device_is_compatible(lpc_np, "aspeed,ast2400-lpc-v2") &&
+   !of_device_is_compatible(lpc_np, "aspeed,ast2500-lpc-v2") &&
+   !of_device_is_compatible(lpc_np, "aspeed,ast2600-lpc-v2")) {
+   dev_err(dev, "unsupported LPC device binding\n");
+   return -ENODEV;
+   }
+
+   if (of_device_is_compatible(kcs_np, "aspeed,ast2400-kcs-bmc") ||
+   of_device_is_compatible(kcs_np, 
"aspeed,ast2500-kcs-bmc"))
kcs_bmc = aspeed_kcs_probe_of_v1(pdev);
-   else if (of_device_is_compatible(np, "aspeed,ast2400-kcs-bmc-v2") ||
-   of_device_is_compatible(np, 
"aspeed,ast2500-kcs-bmc-v2"))
+   else if (of_device_is_compatible(kcs_np, "aspeed,ast2400-kcs-bmc-v2") ||
+   of_device_is_compatible(kcs_np, 
"aspeed,ast2500-kcs-bmc-v2"))
kcs_bmc = aspeed_kcs_probe_of_v2(pdev);
else
return -EINVAL;
-- 
2.17.1



[PATCH v3 4/5] pinctrl: aspeed-g5: Adapt to new LPC device tree layout

2020-12-20 Thread Chia-Wei, Wang
Add check against LPC device v2 compatible string to
ensure that the fixed device tree layout is adopted.
The LPC register offsets are also fixed accordingly.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c 
b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
index 0cab4c2576e2..6e0e5b64e677 100644
--- a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
+++ b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
@@ -60,7 +60,7 @@
 #define COND2  { ASPEED_IP_SCU, SCU94, GENMASK(1, 0), 0, 0 }
 
 /* LHCR0 is offset from the end of the H8S/2168-compatible registers */
-#define LHCR0  0x20
+#define LHCR0  0xa0
 #define GFX064 0x64
 
 #define B14 0
@@ -2648,14 +2648,21 @@ static struct regmap *aspeed_g5_acquire_regmap(struct 
aspeed_pinmux_data *ctx,
}
 
if (ip == ASPEED_IP_LPC) {
-   struct device_node *node;
+   struct device_node *lhc_np;
+   struct device_node *lpc_np;
struct regmap *map;
 
-   node = of_parse_phandle(ctx->dev->of_node,
+   lhc_np = of_parse_phandle(ctx->dev->of_node,
"aspeed,external-nodes", 1);
-   if (node) {
-   map = syscon_node_to_regmap(node->parent);
-   of_node_put(node);
+   if (lhc_np) {
+   lpc_np = lhc_np->parent;
+   if (!of_device_is_compatible(lpc_np, 
"aspeed,ast2400-lpc-v2") &&
+   !of_device_is_compatible(lpc_np, 
"aspeed,ast2500-lpc-v2") &&
+   !of_device_is_compatible(lpc_np, 
"aspeed,ast2600-lpc-v2"))
+   return ERR_PTR(-ENODEV);
+
+   map = syscon_node_to_regmap(lpc_np);
+   of_node_put(lhc_np);
if (IS_ERR(map))
return map;
} else
-- 
2.17.1



[PATCH v3 1/5] dt-bindings: aspeed-lpc: Remove LPC partitioning

2020-12-20 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.
This patch fixes the documentation by removing the description on LPC
partitions. The register offsets illustrated in the DTS node examples
are also fixed to adapt to the LPC DTS change.

Signed-off-by: Chia-Wei, Wang 
---
 .../devicetree/bindings/mfd/aspeed-lpc.txt| 99 ---
 1 file changed, 21 insertions(+), 78 deletions(-)

diff --git a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt 
b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt
index d0a38ba8b9ce..90eb0ecc95d1 100644
--- a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt
+++ b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt
@@ -9,13 +9,7 @@ primary use case of the Aspeed LPC controller is as a slave on 
the bus
 conditions it can also take the role of bus master.
 
 The LPC controller is represented as a multi-function device to account for the
-mix of functionality it provides. The principle split is between the register
-layout at the start of the I/O space which is, to quote the Aspeed datasheet,
-"basically compatible with the [LPC registers from the] popular BMC controller
-H8S/2168[1]", and everything else, where everything else is an eclectic
-collection of functions with a esoteric register layout. "Everything else",
-here labeled the "host" portion of the controller, includes, but is not limited
-to:
+mix of functionality, which includes, but is not limited to:
 
 * An IPMI Block Transfer[2] Controller
 
@@ -44,80 +38,29 @@ Required properties
 ===
 
 - compatible:  One of:
-   "aspeed,ast2400-lpc", "simple-mfd"
-   "aspeed,ast2500-lpc", "simple-mfd"
-   "aspeed,ast2600-lpc", "simple-mfd"
+   "aspeed,ast2400-lpc-v2", "simple-mfd", "syscon"
+   "aspeed,ast2500-lpc-v2", "simple-mfd", "syscon"
+   "aspeed,ast2600-lpc-v2", "simple-mfd", "syscon"
 
 - reg: contains the physical address and length values of the Aspeed
 LPC memory region.
 
 - #address-cells: <1>
 - #size-cells: <1>
-- ranges:  Maps 0 to the physical address and length of the LPC memory
-region
-
-Required LPC Child nodes
-
-
-BMC Node
-
-
-- compatible:  One of:
-   "aspeed,ast2400-lpc-bmc"
-   "aspeed,ast2500-lpc-bmc"
-   "aspeed,ast2600-lpc-bmc"
-
-- reg: contains the physical address and length values of the
-H8S/2168-compatible LPC controller memory region
-
-Host Node
--
-
-- compatible:   One of:
-   "aspeed,ast2400-lpc-host", "simple-mfd", "syscon"
-   "aspeed,ast2500-lpc-host", "simple-mfd", "syscon"
-   "aspeed,ast2600-lpc-host", "simple-mfd", "syscon"
-
-- reg: contains the address and length values of the host-related
-register space for the Aspeed LPC controller
-
-- #address-cells: <1>
-- #size-cells: <1>
-- ranges:  Maps 0 to the address and length of the host-related LPC memory
+- ranges:  Maps 0 to the physical address and length of the LPC memory
 region
 
 Example:
 
 lpc: lpc@1e789000 {
-   compatible = "aspeed,ast2500-lpc", "simple-mfd";
+   compatible = "aspeed,ast2500-lpc-v2", "simple-mfd", "syscon";
reg = <0x1e789000 0x1000>;
 
#address-cells = <1>;
#size-cells = <1>;
ranges = <0x0 0x1e789000 0x1000>;
-
-   lpc_bmc: lpc-bmc@0 {
-   compatible = "aspeed,ast2500-lpc-bmc";
-   reg = <0x0 0x80>;
-   };
-
-   lpc_host: lpc-host@80 {
-   compatible = "aspeed,ast2500-lpc-host", "simple-mfd", "syscon";
-   reg = <0x80 0x1e0>;
-   reg-io-width = <4>;
-
-   #address-cells = <1>;
-   #size-cells = <1>;
-   ranges = <0x0 0x80 0x1e0>;
-   };
 };
 
-BMC Node Children
-==
-
-
-Host Node Children
-==
 
 LPC Host Interface Controller
 ---
@@ -149,14 +92,12 @@ Optional properties:
 
 Example:
 
-lpc-host@80 {
-   lpc_ctrl: lpc-ctrl@0 {
-   compatible = "aspeed,ast2500-lpc-ctrl";
-   reg = <0x0 0x80>;
-   clocks = < ASPEED_CLK_GATE_LCLK>;
-   memory-region = <_memory>;
-   flash = <>;
-   };
+lpc_ctrl: lpc-ctrl@80 {
+   compatible = "aspeed,ast2500-lpc-ctrl";
+   reg = <0x8

[PATCH v3 2/5] ARM: dts: Remove LPC BMC and Host partitions

2020-12-20 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.

A concrete instance is that HICRB[5:4] are for the I/O port address
configuration of KCS channel 1/2. However, the KCS driver cannot access
HICRB for channel 1/2 initialization via the syscon regmap interface due
to the partition boundary (i.e. offset 80h).

In addition, for HW design backward compatibility, a newly added HW
control bit could be located at any reserved bit over the LPC addressing
space. This patch therefore removes the lpc-bmc and lpc-host child nodes
and thus the LPC partitioning.

Note that this change requires the device tree change and the driver
change to be applied in sync. To prevent the misuse of old device trees
with new drivers, or vice versa, the v2 compatible strings are adopted
for the LPC device as listed:

"aspeed,ast2400-lpc-v2"
"aspeed,ast2500-lpc-v2"
"aspeed,ast2600-lpc-v2"

Signed-off-by: Chia-Wei, Wang 
---
 arch/arm/boot/dts/aspeed-g4.dtsi |  74 +++--
 arch/arm/boot/dts/aspeed-g5.dtsi | 135 ++-
 arch/arm/boot/dts/aspeed-g6.dtsi | 135 ++-
 3 files changed, 148 insertions(+), 196 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi b/arch/arm/boot/dts/aspeed-g4.dtsi
index b3dafbc8caca..ee22bc036440 100644
--- a/arch/arm/boot/dts/aspeed-g4.dtsi
+++ b/arch/arm/boot/dts/aspeed-g4.dtsi
@@ -343,58 +343,44 @@
};
 
lpc: lpc@1e789000 {
-   compatible = "aspeed,ast2400-lpc", "simple-mfd";
+   compatible = "aspeed,ast2400-lpc-v2", 
"simple-mfd", "syscon";
reg = <0x1e789000 0x1000>;
+   reg-io-width = <4>;
 
#address-cells = <1>;
#size-cells = <1>;
ranges = <0x0 0x1e789000 0x1000>;
 
-   lpc_bmc: lpc-bmc@0 {
-   compatible = "aspeed,ast2400-lpc-bmc";
-   reg = <0x0 0x80>;
+   lpc_ctrl: lpc-ctrl@80 {
+   compatible = "aspeed,ast2400-lpc-ctrl";
+   reg = <0x80 0x10>;
+   clocks = < ASPEED_CLK_GATE_LCLK>;
+   status = "disabled";
};
 
-   lpc_host: lpc-host@80 {
-   compatible = "aspeed,ast2400-lpc-host", 
"simple-mfd", "syscon";
-   reg = <0x80 0x1e0>;
-   reg-io-width = <4>;
-
-   #address-cells = <1>;
-   #size-cells = <1>;
-   ranges = <0x0 0x80 0x1e0>;
-
-   lpc_ctrl: lpc-ctrl@0 {
-   compatible = 
"aspeed,ast2400-lpc-ctrl";
-   reg = <0x0 0x10>;
-   clocks = < 
ASPEED_CLK_GATE_LCLK>;
-   status = "disabled";
-   };
-
-   lpc_snoop: lpc-snoop@10 {
-   compatible = 
"aspeed,ast2400-lpc-snoop";
-   reg = <0x10 0x8>;
-   interrupts = <8>;
-   status = "disabled";
-   };
-
-   lhc: lhc@20 {
-   compatible = 
"aspeed,ast2400-lhc";
-   reg = <0x20 0x24 0x48 0x8>;
-   };
-
-   lpc_reset: reset-controller@18 {
-   compatible = 
"aspeed,ast2400-lpc-reset";
-   reg = <0x18 0x4>;
-   #reset-cells = <1>;
-   };
-
-   ibt: ibt@c0  {
-   compatible = 
"aspeed,ast2400-ibt-bmc";
-   reg = <0xc0 0x18>;
-   

[PATCH v3 0/5] Remove LPC register partitioning

2020-12-20 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.
The incorrect partitioning can impose unnecessary range restrictions
on register access through the syscon regmap interface.

For instance, HICRB contains the I/O port address configuration
of KCS channel 1/2. However, the KCS#1/#2 drivers cannot access
HICRB as it is located in the other LPC partition.

In addition, to be backward compatible, the newly added HW control
bits could be located at any reserved bits over the LPC addressing
space.

This patch series therefore removes the LPC partitioning for
better driver development and maintenance. This requires changes
to both the device tree and the driver implementation. To ensure
both sides are updated in sync, a v2 binding check is added.

Changes since v2:
- Add v2 binding check to ensure the synchronization between the
  device tree change and the driver register offset fix.

Changes since v1:
- Add the fix to the aspeed-lpc binding documentation.

Chia-Wei, Wang (5):
  dt-bindings: aspeed-lpc: Remove LPC partitioning
  ARM: dts: Remove LPC BMC and Host partitions
  ipmi: kcs: aspeed: Adapt to new LPC DTS layout
  pinctrl: aspeed-g5: Adapt to new LPC device tree layout
  soc: aspeed: Adapt to new LPC device tree layout

 .../devicetree/bindings/mfd/aspeed-lpc.txt|  99 +++--
 arch/arm/boot/dts/aspeed-g4.dtsi  |  74 --
 arch/arm/boot/dts/aspeed-g5.dtsi  | 135 --
 arch/arm/boot/dts/aspeed-g6.dtsi  | 135 --
 drivers/char/ipmi/kcs_bmc_aspeed.c|  35 +++--
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c|  19 ++-
 drivers/soc/aspeed/aspeed-lpc-ctrl.c  |  20 ++-
 drivers/soc/aspeed/aspeed-lpc-snoop.c |  23 +--
 8 files changed, 232 insertions(+), 308 deletions(-)

-- 
2.17.1



[PATCH] sched: cpufreq_schedutil: restore cached freq when next_f is not changed

2020-10-16 Thread Wei Wang
We keep the raw cached freq to reduce the chance of calling into the
cpufreq driver, which can be costly on some archs/SoCs.

Currently, the raw cached freq is reset when next_f is changed for
correctness. This patch instead restores the cached value, honoring
the purpose of the cache.

This is adapted from https://android-review.googlesource.com/1352810/

Signed-off-by: Wei Wang 
---
 kernel/sched/cpufreq_schedutil.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5ae7b4e6e8d6..e254745a82cb 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -441,6 +441,7 @@ static void sugov_update_single(struct update_util_data 
*hook, u64 time,
unsigned long util, max;
unsigned int next_f;
bool busy;
+   unsigned int cached_freq = sg_policy->cached_raw_freq;
 
sugov_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
@@ -464,8 +465,8 @@ static void sugov_update_single(struct update_util_data 
*hook, u64 time,
if (busy && next_f < sg_policy->next_freq) {
next_f = sg_policy->next_freq;
 
-   /* Reset cached freq as next_freq has changed */
-   sg_policy->cached_raw_freq = 0;
+   /* Restore cached freq as next_freq has changed */
+   sg_policy->cached_raw_freq = cached_freq;
}
 
/*
-- 
2.29.0.rc1.297.gfa9743e501-goog



Re: [PATCH] sched: cpufreq_schedutil: maintain raw cache when next_f is not changed

2020-10-16 Thread Wei Wang
On Fri, Oct 16, 2020 at 10:36 AM Rafael J. Wysocki  wrote:
>
> On Fri, Oct 16, 2020 at 7:18 PM Wei Wang  wrote:
> >
> > On Fri, Oct 16, 2020 at 10:01 AM Rafael J. Wysocki  
> > wrote:
> > >
> > > On Fri, Oct 16, 2020 at 6:36 PM Wei Wang  wrote:
> > > >
> > > > Currently, raw cache will be reset when next_f is changed after
> > > > get_next_freq for correctness. However, it may introduce more
> > > > cycles. This patch changes it to maintain the cached value instead of
> > > > dropping it.
> > >
> > > IMV you need to be more specific about why this helps.
> > >
> >
> > I think the idea of cached_raw_freq is to reduce the chance of calling
> > cpufreq drivers (in some arch those may be costly) but sometimes the
> > cache will be wiped for correctness. The purpose of this patch is to
> > still keep the cached value instead of wiping them.
>
> Well, I see what the problem is and how the patch is attempting to
> address it (which is not the best way to do that because of the extra
> struct member that doesn't need to be added if I'm not mistaken), but
> IMO the changelog is way too vague from the problem statement
> perspective.

Just want to bring this up in the mainline kernel. I think we can
change the patch to use a local variable inside sugov_update_single().
This is adapted from the Android common kernel, where some out-of-tree
functions make a single local variable impossible but also make the
issue more obvious.


Re: [PATCH] sched: cpufreq_schedutil: maintain raw cache when next_f is not changed

2020-10-16 Thread Wei Wang
On Fri, Oct 16, 2020 at 10:01 AM Rafael J. Wysocki  wrote:
>
> On Fri, Oct 16, 2020 at 6:36 PM Wei Wang  wrote:
> >
> > Currently, raw cache will be reset when next_f is changed after
> > get_next_freq for correctness. However, it may introduce more
> > cycles. This patch changes it to maintain the cached value instead of
> > dropping it.
>
> IMV you need to be more specific about why this helps.
>

I think the idea of cached_raw_freq is to reduce the chance of calling
into cpufreq drivers (which may be costly on some archs), but sometimes
the cache is wiped for correctness. The purpose of this patch is to
keep the cached value instead of wiping it.

thx
wvw


>
> > This is adapted from https://android-review.googlesource.com/1352810/
> >
> > Signed-off-by: Wei Wang 
> > ---
> >  kernel/sched/cpufreq_schedutil.c | 7 +--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c 
> > b/kernel/sched/cpufreq_schedutil.c
> > index 5ae7b4e6e8d6..ae3ae7fcd027 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -31,6 +31,7 @@ struct sugov_policy {
> > s64 freq_update_delay_ns;
> > unsigned intnext_freq;
> > unsigned intcached_raw_freq;
> > +   unsigned intprev_cached_raw_freq;
> >
> > /* The next fields are only needed if fast switch cannot be used: */
> > struct  irq_work irq_work;
> > @@ -165,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy 
> > *sg_policy,
> > return sg_policy->next_freq;
> >
> > sg_policy->need_freq_update = false;
> > +   sg_policy->prev_cached_raw_freq = sg_policy->cached_raw_freq;
> > sg_policy->cached_raw_freq = freq;
> > return cpufreq_driver_resolve_freq(policy, freq);
> >  }
> > @@ -464,8 +466,8 @@ static void sugov_update_single(struct update_util_data 
> > *hook, u64 time,
> > if (busy && next_f < sg_policy->next_freq) {
> > next_f = sg_policy->next_freq;
> >
> > -   /* Reset cached freq as next_freq has changed */
> > -   sg_policy->cached_raw_freq = 0;
> > +   /* Restore cached freq as next_freq has changed */
> > +   sg_policy->cached_raw_freq = 
> > sg_policy->prev_cached_raw_freq;
> > }
> >
> > /*
> > @@ -828,6 +830,7 @@ static int sugov_start(struct cpufreq_policy *policy)
> > sg_policy->limits_changed   = false;
> > sg_policy->need_freq_update = false;
> > sg_policy->cached_raw_freq  = 0;
> > +   sg_policy->prev_cached_raw_freq = 0;
> >
> > for_each_cpu(cpu, policy->cpus) {
> > struct sugov_cpu *sg_cpu = _cpu(sugov_cpu, cpu);
> > --
> > 2.29.0.rc1.297.gfa9743e501-goog
> >


[PATCH] sched: cpufreq_schedutil: maintain raw cache when next_f is not changed

2020-10-16 Thread Wei Wang
Currently, the raw cache is reset when next_f is changed after
get_next_freq() for correctness. However, this may introduce extra
cycles. This patch maintains the cached value instead of dropping it.

This is adapted from https://android-review.googlesource.com/1352810/

Signed-off-by: Wei Wang 
---
 kernel/sched/cpufreq_schedutil.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5ae7b4e6e8d6..ae3ae7fcd027 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -31,6 +31,7 @@ struct sugov_policy {
s64 freq_update_delay_ns;
unsigned intnext_freq;
unsigned intcached_raw_freq;
+   unsigned intprev_cached_raw_freq;
 
/* The next fields are only needed if fast switch cannot be used: */
struct  irq_work irq_work;
@@ -165,6 +166,7 @@ static unsigned int get_next_freq(struct sugov_policy 
*sg_policy,
return sg_policy->next_freq;
 
sg_policy->need_freq_update = false;
+   sg_policy->prev_cached_raw_freq = sg_policy->cached_raw_freq;
sg_policy->cached_raw_freq = freq;
return cpufreq_driver_resolve_freq(policy, freq);
 }
@@ -464,8 +466,8 @@ static void sugov_update_single(struct update_util_data 
*hook, u64 time,
if (busy && next_f < sg_policy->next_freq) {
next_f = sg_policy->next_freq;
 
-   /* Reset cached freq as next_freq has changed */
-   sg_policy->cached_raw_freq = 0;
+   /* Restore cached freq as next_freq has changed */
+   sg_policy->cached_raw_freq = sg_policy->prev_cached_raw_freq;
}
 
/*
@@ -828,6 +830,7 @@ static int sugov_start(struct cpufreq_policy *policy)
sg_policy->limits_changed   = false;
sg_policy->need_freq_update = false;
sg_policy->cached_raw_freq  = 0;
+   sg_policy->prev_cached_raw_freq = 0;
 
for_each_cpu(cpu, policy->cpus) {
struct sugov_cpu *sg_cpu = _cpu(sugov_cpu, cpu);
-- 
2.29.0.rc1.297.gfa9743e501-goog



[PATCH v2 1/5] ARM: dts: Remove LPC BMC and Host partitions

2020-10-05 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.

A concrete instance is that HICRB[5:4] are for the I/O port address
configuration of KCS channel 1/2. However, the KCS driver cannot access
HICRB for channel 1/2 initialization via the syscon regmap interface due
to the partition boundary (i.e. offset 80h).

In addition, to be backward compatible, newly added HW control bits
could be located at any reserved bits over the LPC addressing space.

This patch therefore removes the lpc-bmc and lpc-host child nodes and
thus the LPC partitioning.

Signed-off-by: Chia-Wei, Wang 
---
 arch/arm/boot/dts/aspeed-g4.dtsi |  74 +++--
 arch/arm/boot/dts/aspeed-g5.dtsi | 135 ++-
 arch/arm/boot/dts/aspeed-g6.dtsi | 135 ++-
 3 files changed, 148 insertions(+), 196 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi b/arch/arm/boot/dts/aspeed-g4.dtsi
index 82f0213e3a3c..22996b3c4a00 100644
--- a/arch/arm/boot/dts/aspeed-g4.dtsi
+++ b/arch/arm/boot/dts/aspeed-g4.dtsi
@@ -338,58 +338,44 @@
};
 
lpc: lpc@1e789000 {
-   compatible = "aspeed,ast2400-lpc", "simple-mfd";
+   compatible = "aspeed,ast2400-lpc", 
"simple-mfd", "syscon";
reg = <0x1e789000 0x1000>;
+   reg-io-width = <4>;
 
#address-cells = <1>;
#size-cells = <1>;
ranges = <0x0 0x1e789000 0x1000>;
 
-   lpc_bmc: lpc-bmc@0 {
-   compatible = "aspeed,ast2400-lpc-bmc";
-   reg = <0x0 0x80>;
+   lpc_ctrl: lpc-ctrl@80 {
+   compatible = "aspeed,ast2400-lpc-ctrl";
+   reg = <0x80 0x10>;
+   clocks = < ASPEED_CLK_GATE_LCLK>;
+   status = "disabled";
};
 
-   lpc_host: lpc-host@80 {
-   compatible = "aspeed,ast2400-lpc-host", 
"simple-mfd", "syscon";
-   reg = <0x80 0x1e0>;
-   reg-io-width = <4>;
-
-   #address-cells = <1>;
-   #size-cells = <1>;
-   ranges = <0x0 0x80 0x1e0>;
-
-   lpc_ctrl: lpc-ctrl@0 {
-   compatible = 
"aspeed,ast2400-lpc-ctrl";
-   reg = <0x0 0x10>;
-   clocks = < 
ASPEED_CLK_GATE_LCLK>;
-   status = "disabled";
-   };
-
-   lpc_snoop: lpc-snoop@10 {
-   compatible = 
"aspeed,ast2400-lpc-snoop";
-   reg = <0x10 0x8>;
-   interrupts = <8>;
-   status = "disabled";
-   };
-
-   lhc: lhc@20 {
-   compatible = 
"aspeed,ast2400-lhc";
-   reg = <0x20 0x24 0x48 0x8>;
-   };
-
-   lpc_reset: reset-controller@18 {
-   compatible = 
"aspeed,ast2400-lpc-reset";
-   reg = <0x18 0x4>;
-   #reset-cells = <1>;
-   };
-
-   ibt: ibt@c0  {
-   compatible = 
"aspeed,ast2400-ibt-bmc";
-   reg = <0xc0 0x18>;
-   interrupts = <8>;
-   status = "disabled";
-   };
+   lpc_snoop: lpc-snoop@90 {
+   compatible = "aspeed,ast2400-lpc-snoop";
+   reg = <0x90 0x8>;
+   

[PATCH v2 3/5] ipmi: kcs: aspeed: Fix LPC register offsets

2020-10-05 Thread Chia-Wei, Wang
The LPC register offsets are fixed to adapt to the LPC DTS change,
where the LPC partitioning is removed.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/char/ipmi/kcs_bmc_aspeed.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c 
b/drivers/char/ipmi/kcs_bmc_aspeed.c
index a140203c079b..8843cf867a5d 100644
--- a/drivers/char/ipmi/kcs_bmc_aspeed.c
+++ b/drivers/char/ipmi/kcs_bmc_aspeed.c
@@ -27,7 +27,6 @@
 
 #define KCS_CHANNEL_MAX 4
 
-/* mapped to lpc-bmc@0 IO space */
 #define LPC_HICR00x000
 #define LPC_HICR0_LPC3E  BIT(7)
 #define LPC_HICR0_LPC2E  BIT(6)
@@ -52,15 +51,13 @@
 #define LPC_STR1 0x03C
 #define LPC_STR2 0x040
 #define LPC_STR3 0x044
-
-/* mapped to lpc-host@80 IO space */
-#define LPC_HICRB0x080
+#define LPC_HICRB0x100
 #define LPC_HICRB_IBFIF4 BIT(1)
 #define LPC_HICRB_LPC4E  BIT(0)
-#define LPC_LADR40x090
-#define LPC_IDR4 0x094
-#define LPC_ODR4 0x098
-#define LPC_STR4 0x09C
+#define LPC_LADR40x110
+#define LPC_IDR4 0x114
+#define LPC_ODR4 0x118
+#define LPC_STR4 0x11C
 
 struct aspeed_kcs_bmc {
struct regmap *map;
-- 
2.17.1



[PATCH v2 2/5] soc: aspeed: Fix LPC register offsets

2020-10-05 Thread Chia-Wei, Wang
The LPC register offsets are fixed to adapt to the LPC DTS change,
where the LPC partitioning is removed.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/soc/aspeed/aspeed-lpc-ctrl.c  |  6 +++---
 drivers/soc/aspeed/aspeed-lpc-snoop.c | 11 +--
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/soc/aspeed/aspeed-lpc-ctrl.c 
b/drivers/soc/aspeed/aspeed-lpc-ctrl.c
index 01ed21e8bfee..36faa0618ada 100644
--- a/drivers/soc/aspeed/aspeed-lpc-ctrl.c
+++ b/drivers/soc/aspeed/aspeed-lpc-ctrl.c
@@ -17,12 +17,12 @@
 
 #define DEVICE_NAME"aspeed-lpc-ctrl"
 
-#define HICR5 0x0
+#define HICR5 0x80
 #define HICR5_ENL2HBIT(8)
 #define HICR5_ENFWHBIT(10)
 
-#define HICR7 0x8
-#define HICR8 0xc
+#define HICR7 0x88
+#define HICR8 0x8c
 
 struct aspeed_lpc_ctrl {
struct miscdevice   miscdev;
diff --git a/drivers/soc/aspeed/aspeed-lpc-snoop.c 
b/drivers/soc/aspeed/aspeed-lpc-snoop.c
index f3d8d53ab84d..7ce5c9fcc73c 100644
--- a/drivers/soc/aspeed/aspeed-lpc-snoop.c
+++ b/drivers/soc/aspeed/aspeed-lpc-snoop.c
@@ -28,26 +28,25 @@
 #define NUM_SNOOP_CHANNELS 2
 #define SNOOP_FIFO_SIZE 2048
 
-#define HICR5  0x0
+#define HICR5  0x80
 #define HICR5_EN_SNP0W BIT(0)
 #define HICR5_ENINT_SNP0W  BIT(1)
 #define HICR5_EN_SNP1W BIT(2)
 #define HICR5_ENINT_SNP1W  BIT(3)
-
-#define HICR6  0x4
+#define HICR6  0x84
 #define HICR6_STR_SNP0WBIT(0)
 #define HICR6_STR_SNP1WBIT(1)
-#define SNPWADR0x10
+#define SNPWADR0x90
 #define SNPWADR_CH0_MASK   GENMASK(15, 0)
 #define SNPWADR_CH0_SHIFT  0
 #define SNPWADR_CH1_MASK   GENMASK(31, 16)
 #define SNPWADR_CH1_SHIFT  16
-#define SNPWDR 0x14
+#define SNPWDR 0x94
 #define SNPWDR_CH0_MASKGENMASK(7, 0)
 #define SNPWDR_CH0_SHIFT   0
 #define SNPWDR_CH1_MASKGENMASK(15, 8)
 #define SNPWDR_CH1_SHIFT   8
-#define HICRB  0x80
+#define HICRB  0x100
 #define HICRB_ENSNP0D  BIT(14)
 #define HICRB_ENSNP1D  BIT(15)
 
-- 
2.17.1



[PATCH v2 5/5] dt-bindings: aspeed-lpc: Remove LPC partitioning

2020-10-05 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.
This patch fixes the documentation by removing the description of the
LPC partitions. The register offsets illustrated in the DTS node
examples are also updated to match the LPC DTS change.

Signed-off-by: Chia-Wei, Wang 
---
 .../devicetree/bindings/mfd/aspeed-lpc.txt| 85 +++
 1 file changed, 14 insertions(+), 71 deletions(-)

diff --git a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt 
b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt
index a92acf1dd491..866f54a09e09 100644
--- a/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt
+++ b/Documentation/devicetree/bindings/mfd/aspeed-lpc.txt
@@ -9,13 +9,7 @@ primary use case of the Aspeed LPC controller is as a slave on 
the bus
 conditions it can also take the role of bus master.
 
 The LPC controller is represented as a multi-function device to account for the
-mix of functionality it provides. The principle split is between the register
-layout at the start of the I/O space which is, to quote the Aspeed datasheet,
-"basically compatible with the [LPC registers from the] popular BMC controller
-H8S/2168[1]", and everything else, where everything else is an eclectic
-collection of functions with a esoteric register layout. "Everything else",
-here labeled the "host" portion of the controller, includes, but is not limited
-to:
+mix of functionality, which includes, but is not limited to:
 
 * An IPMI Block Transfer[2] Controller
 
@@ -44,8 +38,8 @@ Required properties
 ===
 
 - compatible:  One of:
-   "aspeed,ast2400-lpc", "simple-mfd"
-   "aspeed,ast2500-lpc", "simple-mfd"
+   "aspeed,ast2400-lpc", "simple-mfd", "syscon"
+   "aspeed,ast2500-lpc", "simple-mfd", "syscon"
 
 - reg: contains the physical address and length values of the Aspeed
 LPC memory region.
@@ -55,66 +49,17 @@ Required properties
 - ranges:  Maps 0 to the physical address and length of the LPC memory
 region
 
-Required LPC Child nodes
-
-
-BMC Node
-
-
-- compatible:  One of:
-   "aspeed,ast2400-lpc-bmc"
-   "aspeed,ast2500-lpc-bmc"
-
-- reg: contains the physical address and length values of the
-H8S/2168-compatible LPC controller memory region
-
-Host Node
--
-
-- compatible:   One of:
-   "aspeed,ast2400-lpc-host", "simple-mfd", "syscon"
-   "aspeed,ast2500-lpc-host", "simple-mfd", "syscon"
-
-- reg: contains the address and length values of the host-related
-register space for the Aspeed LPC controller
-
-- #address-cells: <1>
-- #size-cells: <1>
-- ranges:  Maps 0 to the address and length of the host-related LPC memory
-region
-
 Example:
 
 lpc: lpc@1e789000 {
-   compatible = "aspeed,ast2500-lpc", "simple-mfd";
+   compatible = "aspeed,ast2500-lpc", "simple-mfd", "syscon";
reg = <0x1e789000 0x1000>;
 
#address-cells = <1>;
#size-cells = <1>;
ranges = <0x0 0x1e789000 0x1000>;
-
-   lpc_bmc: lpc-bmc@0 {
-   compatible = "aspeed,ast2500-lpc-bmc";
-   reg = <0x0 0x80>;
-   };
-
-   lpc_host: lpc-host@80 {
-   compatible = "aspeed,ast2500-lpc-host", "simple-mfd", "syscon";
-   reg = <0x80 0x1e0>;
-   reg-io-width = <4>;
-
-   #address-cells = <1>;
-   #size-cells = <1>;
-   ranges = <0x0 0x80 0x1e0>;
-   };
 };
 
-BMC Node Children
-==
-
-
-Host Node Children
-==
 
 LPC Host Interface Controller
 ---
@@ -145,14 +90,12 @@ Optional properties:
 
 Example:
 
-lpc-host@80 {
-   lpc_ctrl: lpc-ctrl@0 {
-   compatible = "aspeed,ast2500-lpc-ctrl";
-   reg = <0x0 0x80>;
-   clocks = < ASPEED_CLK_GATE_LCLK>;
-   memory-region = <_memory>;
-   flash = <>;
-   };
+lpc_ctrl: lpc-ctrl@80 {
+   compatible = "aspeed,ast2500-lpc-ctrl";
+   reg = <0x80 0x80>;
+   clocks = < ASPEED_CLK_GATE_LCLK>;
+   memory-region = <_memory>;
+   flash = <>;
 };
 
 LPC Host Controller
@@ -174,9 +117,9 @@ Required properties:
 
 Example:
 
-lhc: lhc@20 {
+lhc: lhc@a0 {
compatible = "aspeed,ast2500-lhc";
-   reg = <0x20 0x24 0x48 0x8>;
+   reg = <0xa0 0x24 0xc8 0x8>;
 };
 
 LPC reset control
@@ -194,8 +137,8 @@ Required properties:
 
 Example:
 
-lpc_reset: reset-controller@18 {
+lpc_reset: reset-controller@98 {
 compatible = "aspeed,ast2500-lpc-reset";
-reg = <0x18 0x4>;
+reg = <0x98 0x4>;
 #reset-cells = <1>;
 };
-- 
2.17.1



[PATCH v2 4/5] pinctrl: aspeed-g5: Fix LPC register offsets

2020-10-05 Thread Chia-Wei, Wang
The LPC register offsets are fixed to adapt to the LPC DTS change,
where the LPC partitioning is removed.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c 
b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
index 0cab4c2576e2..98e62333fa54 100644
--- a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
+++ b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
@@ -60,7 +60,7 @@
 #define COND2  { ASPEED_IP_SCU, SCU94, GENMASK(1, 0), 0, 0 }
 
 /* LHCR0 is offset from the end of the H8S/2168-compatible registers */
-#define LHCR0  0x20
+#define LHCR0  0xa0
 #define GFX064 0x64
 
 #define B14 0
-- 
2.17.1



[PATCH v2 0/5] Remove LPC register partitioning

2020-10-05 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.
The incorrect partitioning can impose unnecessary range restrictions
on register access through the syscon regmap interface.

For instance, HICRB contains the I/O port address configuration
of KCS channel 1/2. However, the KCS#1/#2 drivers cannot access
HICRB as it is located in the other LPC partition.

In addition, to be backward compatible, the newly added HW control
bits could be located at any reserved bits over the LPC addressing
space.

This patch series therefore removes the LPC partitioning for
better driver development and maintenance.


Changes since v1:
- Add the fix to the aspeed-lpc binding documentation.

Chia-Wei, Wang (5):
  ARM: dts: Remove LPC BMC and Host partitions
  soc: aspeed: Fix LPC register offsets
  ipmi: kcs: aspeed: Fix LPC register offsets
  pinctrl: aspeed-g5: Fix LPC register offsets
  dt-bindings: aspeed-lpc: Remove LPC partitioning

 .../devicetree/bindings/mfd/aspeed-lpc.txt|  85 ++-
 arch/arm/boot/dts/aspeed-g4.dtsi  |  74 --
 arch/arm/boot/dts/aspeed-g5.dtsi  | 135 --
 arch/arm/boot/dts/aspeed-g6.dtsi  | 135 --
 drivers/char/ipmi/kcs_bmc_aspeed.c|  13 +-
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c|   2 +-
 drivers/soc/aspeed/aspeed-lpc-ctrl.c  |   6 +-
 drivers/soc/aspeed/aspeed-lpc-snoop.c |  11 +-
 8 files changed, 176 insertions(+), 285 deletions(-)

-- 
2.17.1



Re: [PATCH] sched/rt: Disable RT_RUNTIME_SHARE by default

2020-09-22 Thread Wei Wang
On Tue, Sep 22, 2020 at 12:04 PM Daniel Bristot de Oliveira
 wrote:
>
> On 9/22/20 7:14 PM, Wei Wang wrote:
> > On Mon, Sep 21, 2020 at 7:40 AM Daniel Bristot de Oliveira
> >  wrote:
> >>
> >> The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime
> >> between CPUs, allowing a CPU to run a real-time task up to 100% of the
> >> time while leaving more space for non-real-time tasks to run on the CPU
> >> that lend rt_runtime.
> >>
> >> The problem is that a CPU can easily borrow enough rt_runtime to allow
> >> a spinning rt-task to run forever, starving per-cpu tasks like kworkers,
> >> which are non-real-time by design.
> >>
> >> This patch disables RT_RUNTIME_SHARE by default, avoiding this problem.
> >> The feature will still be present for users that want to enable it,
> >> though.
> >>
> >> Signed-off-by: Daniel Bristot de Oliveira 
> >> Cc: Ingo Molnar 
> >> Cc: Peter Zijlstra 
> >> Cc: Juri Lelli 
> >> Cc: Vincent Guittot 
> >> Cc: Dietmar Eggemann 
> >> Cc: Steven Rostedt 
> >> Cc: Ben Segall 
> >> Cc: Mel Gorman 
> >> Cc: Daniel Bristot de Oliveira 
> >> Cc: Thomas Gleixner 
> >> Cc: linux-kernel@vger.kernel.org
> >> ---
> >>  kernel/sched/features.h | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> >> index 7481cd96f391..68d369cba9e4 100644
> >> --- a/kernel/sched/features.h
> >> +++ b/kernel/sched/features.h
> >> @@ -77,7 +77,7 @@ SCHED_FEAT(WARN_DOUBLE_CLOCK, false)
> >>  SCHED_FEAT(RT_PUSH_IPI, true)
> >>  #endif
> >>
> >> -SCHED_FEAT(RT_RUNTIME_SHARE, true)
> >> +SCHED_FEAT(RT_RUNTIME_SHARE, false)
> >>  SCHED_FEAT(LB_MIN, false)
> >>  SCHED_FEAT(ATTACH_AGE_LOAD, true)
> >>
> >> --
> >> 2.26.2
> >>
> >
> > Tested on an Android device and can no longer see long running RT
> > tasks (yes, Android have those for reasons).
> >
>
> So:
>
> Tested-by: Wei Wang 
>
> ?

Yup, thanks for pushing this.

>
> Thanks!
> -- Daniel
>


Re: [PATCH] sched/rt: Disable RT_RUNTIME_SHARE by default

2020-09-22 Thread Wei Wang
On Mon, Sep 21, 2020 at 7:40 AM Daniel Bristot de Oliveira
 wrote:
>
> The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime
> between CPUs, allowing a CPU to run a real-time task up to 100% of the
> time while leaving more space for non-real-time tasks to run on the CPU
> that lend rt_runtime.
>
> The problem is that a CPU can easily borrow enough rt_runtime to allow
> a spinning rt-task to run forever, starving per-cpu tasks like kworkers,
> which are non-real-time by design.
>
> This patch disables RT_RUNTIME_SHARE by default, avoiding this problem.
> The feature will still be present for users that want to enable it,
> though.
>
> Signed-off-by: Daniel Bristot de Oliveira 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Juri Lelli 
> Cc: Vincent Guittot 
> Cc: Dietmar Eggemann 
> Cc: Steven Rostedt 
> Cc: Ben Segall 
> Cc: Mel Gorman 
> Cc: Daniel Bristot de Oliveira 
> Cc: Thomas Gleixner 
> Cc: linux-kernel@vger.kernel.org
> ---
>  kernel/sched/features.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 7481cd96f391..68d369cba9e4 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -77,7 +77,7 @@ SCHED_FEAT(WARN_DOUBLE_CLOCK, false)
>  SCHED_FEAT(RT_PUSH_IPI, true)
>  #endif
>
> -SCHED_FEAT(RT_RUNTIME_SHARE, true)
> +SCHED_FEAT(RT_RUNTIME_SHARE, false)
>  SCHED_FEAT(LB_MIN, false)
>  SCHED_FEAT(ATTACH_AGE_LOAD, true)
>
> --
> 2.26.2
>

Tested on an Android device; I can no longer see long-running RT
tasks (yes, Android has those, for reasons).


[PATCH 2/4] soc: aspeed: Fix LPC register offsets

2020-09-10 Thread Chia-Wei, Wang
The LPC register offsets are fixed to adapt to the LPC DTS change,
where the LPC partitioning is removed.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/soc/aspeed/aspeed-lpc-ctrl.c  |  6 +++---
 drivers/soc/aspeed/aspeed-lpc-snoop.c | 11 +--
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/soc/aspeed/aspeed-lpc-ctrl.c 
b/drivers/soc/aspeed/aspeed-lpc-ctrl.c
index 01ed21e8bfee..36faa0618ada 100644
--- a/drivers/soc/aspeed/aspeed-lpc-ctrl.c
+++ b/drivers/soc/aspeed/aspeed-lpc-ctrl.c
@@ -17,12 +17,12 @@
 
 #define DEVICE_NAME "aspeed-lpc-ctrl"
 
-#define HICR5 0x0
+#define HICR5 0x80
 #define HICR5_ENL2H BIT(8)
 #define HICR5_ENFWH BIT(10)
 
-#define HICR7 0x8
-#define HICR8 0xc
+#define HICR7 0x88
+#define HICR8 0x8c
 
 struct aspeed_lpc_ctrl {
struct miscdevice   miscdev;
diff --git a/drivers/soc/aspeed/aspeed-lpc-snoop.c 
b/drivers/soc/aspeed/aspeed-lpc-snoop.c
index f3d8d53ab84d..7ce5c9fcc73c 100644
--- a/drivers/soc/aspeed/aspeed-lpc-snoop.c
+++ b/drivers/soc/aspeed/aspeed-lpc-snoop.c
@@ -28,26 +28,25 @@
 #define NUM_SNOOP_CHANNELS 2
 #define SNOOP_FIFO_SIZE 2048
 
-#define HICR5  0x0
+#define HICR5  0x80
 #define HICR5_EN_SNP0W BIT(0)
 #define HICR5_ENINT_SNP0W  BIT(1)
 #define HICR5_EN_SNP1W BIT(2)
 #define HICR5_ENINT_SNP1W  BIT(3)
-
-#define HICR6  0x4
+#define HICR6  0x84
 #define HICR6_STR_SNP0W BIT(0)
 #define HICR6_STR_SNP1W BIT(1)
-#define SNPWADR 0x10
+#define SNPWADR 0x90
 #define SNPWADR_CH0_MASK   GENMASK(15, 0)
 #define SNPWADR_CH0_SHIFT  0
 #define SNPWADR_CH1_MASK   GENMASK(31, 16)
 #define SNPWADR_CH1_SHIFT  16
-#define SNPWDR 0x14
+#define SNPWDR 0x94
 #define SNPWDR_CH0_MASK GENMASK(7, 0)
 #define SNPWDR_CH0_SHIFT   0
 #define SNPWDR_CH1_MASK GENMASK(15, 8)
 #define SNPWDR_CH1_SHIFT   8
-#define HICRB  0x80
+#define HICRB  0x100
 #define HICRB_ENSNP0D  BIT(14)
 #define HICRB_ENSNP1D  BIT(15)
 
-- 
2.17.1



[PATCH 0/4] Remove LPC register partitioning

2020-09-10 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.
The incorrect partitioning can impose unnecessary range restrictions
on register access through the syscon regmap interface.

For instance, HICRB contains the I/O port address configuration
of KCS channel 1/2. However, the KCS#1/#2 drivers cannot access
HICRB as it is located in the other LPC partition.

In addition, for backward compatibility, newly added HW control
bits may be placed at any reserved bits across the LPC addressing space.

This patch series therefore removes the LPC partitioning for
better driver development and maintenance.

Chia-Wei, Wang (4):
  ARM: dts: Remove LPC BMC and Host partitions
  soc: aspeed: Fix LPC register offsets
  ipmi: kcs: aspeed: Fix LPC register offsets
  pinctrl: aspeed-g5: Fix LPC register offsets

 arch/arm/boot/dts/aspeed-g4.dtsi   |  74 +--
 arch/arm/boot/dts/aspeed-g5.dtsi   | 135 +
 arch/arm/boot/dts/aspeed-g6.dtsi   | 135 +
 drivers/char/ipmi/kcs_bmc_aspeed.c |  13 +-
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c |   2 +-
 drivers/soc/aspeed/aspeed-lpc-ctrl.c   |   6 +-
 drivers/soc/aspeed/aspeed-lpc-snoop.c  |  11 +-
 7 files changed, 162 insertions(+), 214 deletions(-)

-- 
2.17.1



[PATCH 4/4] pinctrl: aspeed-g5: Fix LPC register offsets

2020-09-10 Thread Chia-Wei, Wang
The LPC register offsets are adjusted to match the LPC DTS change
that removes the LPC partitioning.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c 
b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
index 0cab4c2576e2..98e62333fa54 100644
--- a/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
+++ b/drivers/pinctrl/aspeed/pinctrl-aspeed-g5.c
@@ -60,7 +60,7 @@
 #define COND2  { ASPEED_IP_SCU, SCU94, GENMASK(1, 0), 0, 0 }
 
 /* LHCR0 is offset from the end of the H8S/2168-compatible registers */
-#define LHCR0  0x20
+#define LHCR0  0xa0
 #define GFX064 0x64
 
 #define B14 0
-- 
2.17.1



[PATCH 3/4] ipmi: kcs: aspeed: Fix LPC register offsets

2020-09-10 Thread Chia-Wei, Wang
The LPC register offsets are adjusted to match the LPC DTS change
that removes the LPC partitioning.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/char/ipmi/kcs_bmc_aspeed.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/char/ipmi/kcs_bmc_aspeed.c 
b/drivers/char/ipmi/kcs_bmc_aspeed.c
index a140203c079b..8843cf867a5d 100644
--- a/drivers/char/ipmi/kcs_bmc_aspeed.c
+++ b/drivers/char/ipmi/kcs_bmc_aspeed.c
@@ -27,7 +27,6 @@
 
 #define KCS_CHANNEL_MAX 4
 
-/* mapped to lpc-bmc@0 IO space */
 #define LPC_HICR0 0x000
 #define LPC_HICR0_LPC3E  BIT(7)
 #define LPC_HICR0_LPC2E  BIT(6)
@@ -52,15 +51,13 @@
 #define LPC_STR1 0x03C
 #define LPC_STR2 0x040
 #define LPC_STR3 0x044
-
-/* mapped to lpc-host@80 IO space */
-#define LPC_HICRB 0x080
+#define LPC_HICRB 0x100
 #define LPC_HICRB_IBFIF4 BIT(1)
 #define LPC_HICRB_LPC4E  BIT(0)
-#define LPC_LADR4 0x090
-#define LPC_IDR4 0x094
-#define LPC_ODR4 0x098
-#define LPC_STR4 0x09C
+#define LPC_LADR4 0x110
+#define LPC_IDR4 0x114
+#define LPC_ODR4 0x118
+#define LPC_STR4 0x11C
 
 struct aspeed_kcs_bmc {
struct regmap *map;
-- 
2.17.1



[PATCH 1/4] ARM: dts: Remove LPC BMC and Host partitions

2020-09-10 Thread Chia-Wei, Wang
The LPC controller has no concept of the BMC and the Host partitions.

A concrete instance is that HICRB[5:4] are for the I/O port address
configuration of KCS channel 1/2. However, the KCS driver cannot access
HICRB for channel 1/2 initialization via the syscon regmap interface due
to the partition boundary (i.e. offset 80h).

In addition, to be backward compatible, the newly added HW control bits
could be located at any reserved bits over the LPC addressing space.

This patch therefore removes the lpc-bmc and lpc-host child nodes, and
thus the LPC partitioning, for better driver development and maintenance.

Signed-off-by: Chia-Wei, Wang 
---
 arch/arm/boot/dts/aspeed-g4.dtsi |  74 +++--
 arch/arm/boot/dts/aspeed-g5.dtsi | 135 ++-
 arch/arm/boot/dts/aspeed-g6.dtsi | 135 ++-
 3 files changed, 148 insertions(+), 196 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-g4.dtsi b/arch/arm/boot/dts/aspeed-g4.dtsi
index 82f0213e3a3c..22996b3c4a00 100644
--- a/arch/arm/boot/dts/aspeed-g4.dtsi
+++ b/arch/arm/boot/dts/aspeed-g4.dtsi
@@ -338,58 +338,44 @@
};
 
lpc: lpc@1e789000 {
-   compatible = "aspeed,ast2400-lpc", "simple-mfd";
+   compatible = "aspeed,ast2400-lpc", 
"simple-mfd", "syscon";
reg = <0x1e789000 0x1000>;
+   reg-io-width = <4>;
 
#address-cells = <1>;
#size-cells = <1>;
ranges = <0x0 0x1e789000 0x1000>;
 
-   lpc_bmc: lpc-bmc@0 {
-   compatible = "aspeed,ast2400-lpc-bmc";
-   reg = <0x0 0x80>;
+   lpc_ctrl: lpc-ctrl@80 {
+   compatible = "aspeed,ast2400-lpc-ctrl";
+   reg = <0x80 0x10>;
+   clocks = < ASPEED_CLK_GATE_LCLK>;
+   status = "disabled";
};
 
-   lpc_host: lpc-host@80 {
-   compatible = "aspeed,ast2400-lpc-host", 
"simple-mfd", "syscon";
-   reg = <0x80 0x1e0>;
-   reg-io-width = <4>;
-
-   #address-cells = <1>;
-   #size-cells = <1>;
-   ranges = <0x0 0x80 0x1e0>;
-
-   lpc_ctrl: lpc-ctrl@0 {
-   compatible = 
"aspeed,ast2400-lpc-ctrl";
-   reg = <0x0 0x10>;
-   clocks = < 
ASPEED_CLK_GATE_LCLK>;
-   status = "disabled";
-   };
-
-   lpc_snoop: lpc-snoop@10 {
-   compatible = 
"aspeed,ast2400-lpc-snoop";
-   reg = <0x10 0x8>;
-   interrupts = <8>;
-   status = "disabled";
-   };
-
-   lhc: lhc@20 {
-   compatible = 
"aspeed,ast2400-lhc";
-   reg = <0x20 0x24 0x48 0x8>;
-   };
-
-   lpc_reset: reset-controller@18 {
-   compatible = 
"aspeed,ast2400-lpc-reset";
-   reg = <0x18 0x4>;
-   #reset-cells = <1>;
-   };
-
-   ibt: ibt@c0  {
-   compatible = 
"aspeed,ast2400-ibt-bmc";
-   reg = <0xc0 0x18>;
-   interrupts = <8>;
-   status = "disabled";
-   };
+   lpc_snoop: lpc-snoop@90 {
+   compatible = "aspeed,ast2400-lpc-snoop";

Re: [PATCH] virtio_balloon: fix up endian-ness for free cmd id

2020-07-28 Thread Wei Wang

On 07/28/2020 12:03 AM, Michael S. Tsirkin wrote:

free cmd id is read using virtio endian, spec says all fields
in balloon are LE. Fix it up.

Fixes: 86a559787e6f ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
Cc: sta...@vger.kernel.org
Signed-off-by: Michael S. Tsirkin 
---
  drivers/virtio/virtio_balloon.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 774deb65a9bb..798ec304fe3e 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -578,10 +578,14 @@ static int init_vqs(struct virtio_balloon *vb)
  static u32 virtio_balloon_cmd_id_received(struct virtio_balloon *vb)
  {
if (test_and_clear_bit(VIRTIO_BALLOON_CONFIG_READ_CMD_ID,
-  >config_read_bitmap))
+  >config_read_bitmap)) {
virtio_cread(vb->vdev, struct virtio_balloon_config,
 free_page_hint_cmd_id,
 >cmd_id_received_cache);
+   /* Legacy balloon config space is LE, unlike all other devices. 
*/
+   if (!virtio_has_feature(vb->vdev, VIRTIO_F_VERSION_1))
+   vb->cmd_id_received_cache = le32_to_cpu((__force 
__le32)vb->cmd_id_received_cache);

Seems this exceeds the 80-character line length limit.

Reviewed-by: Wei Wang 

Best,
Wei


[tip: perf/core] perf/x86: Fix variable types for LBR registers

2020-07-03 Thread tip-bot2 for Wei Wang
The following commit has been merged into the perf/core branch of tip:

Commit-ID: 3cb9d5464c1ceea86f6225089b2f7965989cf316
Gitweb:
https://git.kernel.org/tip/3cb9d5464c1ceea86f6225089b2f7965989cf316
Author:Wei Wang 
AuthorDate:Sat, 13 Jun 2020 16:09:46 +08:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 02 Jul 2020 15:51:45 +02:00

perf/x86: Fix variable types for LBR registers

The MSR variable type can be 'unsigned int', which uses less memory than
the longer 'unsigned long'. Fix 'struct x86_pmu' for that. The lbr_nr won't
be a negative number, so make it 'unsigned int' as well.

Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Wei Wang 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200613080958.132489-2-like...@linux.intel.com
---
 arch/x86/events/perf_event.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e17a3d8..eb37f6c 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -673,8 +673,8 @@ struct x86_pmu {
/*
 * Intel LBR
 */
-   unsigned long   lbr_tos, lbr_from, lbr_to; /* MSR base regs   */
-   int lbr_nr;/* hardware stack size */
+   unsigned intlbr_tos, lbr_from, lbr_to,
+   lbr_nr;/* LBR base regs and size */
u64 lbr_sel_mask;  /* LBR_SELECT valid bits */
const int   *lbr_sel_map;  /* lbr_select mappings */
boollbr_double_abort;  /* duplicated lbr aborts */


Re: [PATCH] cpufreq: schedutil: force frequency update when limits change

2020-06-25 Thread Wei Wang
On Thu, Jun 25, 2020 at 7:32 PM Viresh Kumar  wrote:
>
> On 26-06-20, 07:44, Viresh Kumar wrote:
> > On 25-06-20, 13:47, Wei Wang wrote:
> > > On Thu, Jun 25, 2020 at 3:23 AM Viresh Kumar  
> > > wrote:
> > > > I am sorry but I am not fully sure of what the problem is. Can you
> > > > describe that by giving an example with some random frequency, and
> > > > tell the expected and actual behavior ?
> > > >
> > > The problem is that sugov thinks next_freq has already been updated
> > > (when the update was actually skipped by the rate limiting), so all
> > > following updates will be skipped.
>
> The rate-limiting thing is specific to android and not present in
> mainline. Even in android I see next_freq getting updated only after
> rate-limiting is verified.
>
> I think you maybe trying to fix an android only problem in mainline,
> which may not be required at all. And I am not sure if Android has a
> problem as well :)
>
Yes, that is Android-specific; I added you to the Gerrit already.

Thanks!
-Wei

> > I am sorry, can you please give a detailed example with existing
> > frequency and limits, then the limits changed to new values, then what
> > exactly happens ?
> >
> > > Actually this is specifically for Android common kernel 4.19's issue
> > > which has sugov_up_down_rate_limit in sugov_update_next_freq, let's
> > > continue discussion there.
> >
> > If it is a mainline problem, we will surely get it fixed here. Just
> > that I am not able to understand the problem yet. Sorry about that.
>
> --
> viresh


Re: [PATCH] cpufreq: schedutil: force frequency update when limits change

2020-06-25 Thread Wei Wang
On Thu, Jun 25, 2020 at 3:23 AM Viresh Kumar  wrote:
>
> On 24-06-20, 23:46, Wei Wang wrote:
> > To avoid reducing the frequency of a CPU prematurely, we skip reducing
> > the frequency if the CPU had been busy recently.
> >
> > This should not be done when the limits of the policy are changed, for
> > example due to thermal throttling. We should always get the frequency
> > within the new limits as soon as possible.
> >
> > There was a fix in
> > commit 600f5badb78c ("cpufreq: schedutil: Don't skip freq update when
> > limits change") upstream which introduced another flag. However, the
> > fix didn't address the case when next_freq is the same as previously
> > voted, which is then checked in sugov_update_next_freq. As a result, the
> > frequency would be stuck at low until the high demanding workload quits.
> >
> > test trace:
> >   kworker/u19:0-1872  ( 1872) [002]    347.878871: 
> > cpu_frequency_limits: min=60 max=2348000 cpu_id=6
> >  dhry64-11525 (11525) [007] d.h2   347.880012: 
> > sugov_should_update_freq: thermal limit on policy6
> >  dhry64-11525 (11525) [007] d.h2   347.880012: 
> > sugov_deferred_update: policy6 skipped update
> >  dhry64-11525 (11525) [007] d.h2   347.884040: 
> > sugov_deferred_update: policy6 skipped update
>
> I am not sure these are helpful in the logs as the code which
> generated them isn't there in the kernel.
>
Yes, those trace_printk() calls were added to those particular functions to help debugging.


> > ...
> >
> > This patch fixes this by skipping the check and forcing an update in
> > this case. The second flag was kept as the limits_change flag could be
> > updated in thermal kworker from another CPU.
>
> I am sorry but I am not fully sure of what the problem is. Can you
> describe that by giving an example with some random frequency, and
> tell the expected and actual behavior ?
>
The problem is that sugov thinks next_freq has already been updated
(when the update was actually skipped by the rate limiting), so all
following updates will be skipped.
Actually this is specifically for Android common kernel 4.19's issue
which has sugov_up_down_rate_limit in sugov_update_next_freq, let's
continue discussion there.

Thanks!
-Wei
> > Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
> > Signed-off-by: Wei Wang 
> > ---
> >  kernel/sched/cpufreq_schedutil.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c 
> > b/kernel/sched/cpufreq_schedutil.c
> > index 7fbaee24c824..dc2cd768022e 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -102,11 +102,12 @@ static bool sugov_should_update_freq(struct 
> > sugov_policy *sg_policy, u64 time)
> >  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 
> > time,
> >  unsigned int next_freq)
> >  {
> > - if (sg_policy->next_freq == next_freq)
> > + if (!sg_policy->need_freq_update && sg_policy->next_freq == next_freq)
>
> AFAIU, if the next freq is same as currently programmed one, there is
> no need to force update it.
>
> >   return false;
> >
> >   sg_policy->next_freq = next_freq;
> >   sg_policy->last_freq_update_time = time;
> > + sg_policy->need_freq_update = false;
> >
> >   return true;
> >  }
> > @@ -178,7 +179,6 @@ static unsigned int get_next_freq(struct sugov_policy 
> > *sg_policy,
> >   if (freq == sg_policy->cached_raw_freq && 
> > !sg_policy->need_freq_update)
> >   return sg_policy->next_freq;
> >
> > - sg_policy->need_freq_update = false;
> >   sg_policy->cached_raw_freq = freq;
> >   return cpufreq_driver_resolve_freq(policy, freq);
> >  }
> > --
> > 2.27.0.212.ge8ba1cc988-goog
>
> --
> viresh


[PATCH] cpufreq: schedutil: force frequency update when limits change

2020-06-25 Thread Wei Wang
To avoid reducing the frequency of a CPU prematurely, we skip reducing
the frequency if the CPU had been busy recently.

This should not be done when the limits of the policy are changed, for
example due to thermal throttling. We should always get the frequency
within the new limits as soon as possible.

There was a fix in
commit 600f5badb78c ("cpufreq: schedutil: Don't skip freq update when
limits change") upstream which introduced another flag. However, the
fix didn't address the case when next_freq is the same as the previously
voted value, which is then checked in sugov_update_next_freq. As a result,
the frequency stays stuck at the low value until the highly demanding
workload quits.

test trace:
  kworker/u19:0-1872  ( 1872) [002]    347.878871: cpu_frequency_limits: 
min=60 max=2348000 cpu_id=6
 dhry64-11525 (11525) [007] d.h2   347.880012: 
sugov_should_update_freq: thermal limit on policy6
 dhry64-11525 (11525) [007] d.h2   347.880012: sugov_deferred_update: 
policy6 skipped update
 dhry64-11525 (11525) [007] d.h2   347.884040: sugov_deferred_update: 
policy6 skipped update
...

This patch fixes this by skipping the check and forcing an update in
this case. The second flag was kept as the limits_change flag could be
updated from the thermal kworker on another CPU.

Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
Signed-off-by: Wei Wang 
---
 kernel/sched/cpufreq_schedutil.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 7fbaee24c824..dc2cd768022e 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -102,11 +102,12 @@ static bool sugov_should_update_freq(struct sugov_policy 
*sg_policy, u64 time)
 static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
   unsigned int next_freq)
 {
-   if (sg_policy->next_freq == next_freq)
+   if (!sg_policy->need_freq_update && sg_policy->next_freq == next_freq)
return false;
 
sg_policy->next_freq = next_freq;
sg_policy->last_freq_update_time = time;
+   sg_policy->need_freq_update = false;
 
return true;
 }
@@ -178,7 +179,6 @@ static unsigned int get_next_freq(struct sugov_policy 
*sg_policy,
if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;
 
-   sg_policy->need_freq_update = false;
sg_policy->cached_raw_freq = freq;
return cpufreq_driver_resolve_freq(policy, freq);
 }
-- 
2.27.0.212.ge8ba1cc988-goog



Re: [PATCH] thermal: create softlink by name for thermal_zone and cooling_device

2019-10-16 Thread Wei Wang
On Wed, Oct 16, 2019 at 10:16 AM Amit Kucheria
 wrote:
>
> On Wed, Oct 16, 2019 at 10:20 PM Amit Kucheria
>  wrote:
> >
> > On Tue, Oct 15, 2019 at 11:43 AM Wei Wang  wrote:
> > >
> > > The paths thermal_zone%d and cooling_device%d are not intuitive and the
> > > numbers are subject to change due to device tree change. This usually
> > > leads to tree traversal in userspace code.
> > > The patch creates `tz-by-name' and `cdev-by-name' for thermal zone and
> > > cooling_device respectively.
> >
> > I like this.
> >
> > > Signed-off-by: Wei Wang 
> > > ---
> > >  drivers/thermal/thermal_core.c | 23 +--
> > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/thermal/thermal_core.c 
> > > b/drivers/thermal/thermal_core.c
> > > index d4481cc8958f..0ff8fb1d7b0a 100644
> > > --- a/drivers/thermal/thermal_core.c
> > > +++ b/drivers/thermal/thermal_core.c
> > > @@ -22,6 +22,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >
> > >  #define CREATE_TRACE_POINTS
> > >  #include 
> > > @@ -46,6 +47,8 @@ static DEFINE_MUTEX(poweroff_lock);
> > >
> > >  static atomic_t in_suspend;
> > >  static bool power_off_triggered;
> > > +static struct kobject *cdev_link_kobj;
> > > +static struct kobject *tz_link_kobj;
> > >
> > >  static struct thermal_governor *def_governor;
> > >
> > > @@ -954,7 +957,7 @@ __thermal_cooling_device_register(struct device_node 
> > > *np,
> > > struct thermal_zone_device *pos = NULL;
> > > int result;
> > >
> > > -   if (type && strlen(type) >= THERMAL_NAME_LENGTH)
> > > +   if (!type || !type[0] || strlen(type) >= THERMAL_NAME_LENGTH)
> > > return ERR_PTR(-EINVAL);
> >
> > This should be a separate fix, if needed.
Agreed, but the link now requires that "" be treated as an invalid _type_.


> >
> > > if (!ops || !ops->get_max_state || !ops->get_cur_state ||
> > > @@ -989,9 +992,15 @@ __thermal_cooling_device_register(struct device_node 
> > > *np,
> > > return ERR_PTR(result);
> > > }
> > >
> > > -   /* Add 'this' new cdev to the global cdev list */
> > > +   /* Add 'this' new cdev to the global cdev list and create link*/
> > > mutex_lock(_list_lock);
> > > list_add(>node, _cdev_list);
> > > +   if (!cdev_link_kobj)
> > > +   cdev_link_kobj = kobject_create_and_add("cdev-by-name",
> > > +   cdev->device.kobj.parent);
> > > +   if (!cdev_link_kobj || sysfs_create_link(cdev_link_kobj,
> > > +   >device.kobj, 
> > > cdev->type))
> > > +   dev_err(>device, "Failed to create cdev-by-name 
> > > link\n");
> >
> > Any reason not to use the following form instead? It seems easier to read.
> >
> > if (!cdev_link_kobj) {
> >cdev_link_kobj = kobject_create_and_add("cdev-by-name",
> >cdev->device.kobj.parent);
> >   ret = sysfs_create_link(cdev_link_kobj,
> >   >device.kobj, 
> > cdev->type))
> >   if (ret)
> >dev_err(>device, "Failed to create
> > cdev-by-name link\n");
> > }
>
> I can now see why you had to do that - none of the other links would
> get created after the first one.
>
> Perhaps create the directories in the __init functions and only create
> the links here?
>
AFAICT, there is no such API except the private get_device_parent()
under drivers/base/. Also, lazy initialization makes sense for the case
where no thermal device is attached. Looks like the class dir is also
lazy-initialized when the first device is registered:
https://elixir.bootlin.com/linux/v5.3.5/source/drivers/base/core.c#L1790.

>
> > > mutex_unlock(_list_lock);
> > >
> > > /* Update binding information for 'this' new cdev */
> > > @@ -1157,6 +1166,8 @@ void thermal_cooling_device_unregister(struct 
> > > thermal_cooling_device *cdev)
> > > }
> > > }
> > > }
> > > +   if (cdev_link_k

[PATCH] thermal: create softlink by name for thermal_zone and cooling_device

2019-10-15 Thread Wei Wang
The paths thermal_zone%d and cooling_device%d are not intuitive, and the
numbers are subject to change due to device tree changes. This usually
leads to tree traversal in userspace code.
The patch creates `tz-by-name' and `cdev-by-name' for thermal zone and
cooling_device respectively.

Signed-off-by: Wei Wang 
---
 drivers/thermal/thermal_core.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index d4481cc8958f..0ff8fb1d7b0a 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -46,6 +47,8 @@ static DEFINE_MUTEX(poweroff_lock);
 
 static atomic_t in_suspend;
 static bool power_off_triggered;
+static struct kobject *cdev_link_kobj;
+static struct kobject *tz_link_kobj;
 
 static struct thermal_governor *def_governor;
 
@@ -954,7 +957,7 @@ __thermal_cooling_device_register(struct device_node *np,
struct thermal_zone_device *pos = NULL;
int result;
 
-   if (type && strlen(type) >= THERMAL_NAME_LENGTH)
+   if (!type || !type[0] || strlen(type) >= THERMAL_NAME_LENGTH)
return ERR_PTR(-EINVAL);
 
if (!ops || !ops->get_max_state || !ops->get_cur_state ||
@@ -989,9 +992,15 @@ __thermal_cooling_device_register(struct device_node *np,
return ERR_PTR(result);
}
 
-   /* Add 'this' new cdev to the global cdev list */
+   /* Add 'this' new cdev to the global cdev list and create link*/
mutex_lock(_list_lock);
list_add(>node, _cdev_list);
+   if (!cdev_link_kobj)
+   cdev_link_kobj = kobject_create_and_add("cdev-by-name",
+   cdev->device.kobj.parent);
+   if (!cdev_link_kobj || sysfs_create_link(cdev_link_kobj,
+   >device.kobj, cdev->type))
+   dev_err(>device, "Failed to create cdev-by-name link\n");
mutex_unlock(_list_lock);
 
/* Update binding information for 'this' new cdev */
@@ -1157,6 +1166,8 @@ void thermal_cooling_device_unregister(struct 
thermal_cooling_device *cdev)
}
}
}
+   if (cdev_link_kobj)
+   sysfs_remove_link(cdev_link_kobj, cdev->type);
 
mutex_unlock(_list_lock);
 
@@ -1340,6 +1351,12 @@ thermal_zone_device_register(const char *type, int 
trips, int mask,
 
mutex_lock(_list_lock);
list_add_tail(>node, _tz_list);
+   if (!tz_link_kobj)
+   tz_link_kobj = kobject_create_and_add("tz-by-name",
+   tz->device.kobj.parent);
+   if (!tz_link_kobj || sysfs_create_link(tz_link_kobj,
+   >device.kobj, tz->type))
+   dev_err(>device, "Failed to create tz-by-name link\n");
mutex_unlock(_list_lock);
 
/* Bind cooling devices for this zone */
@@ -1411,6 +1428,8 @@ void thermal_zone_device_unregister(struct 
thermal_zone_device *tz)
}
}
}
+   if (tz_link_kobj)
+   sysfs_remove_link(tz_link_kobj, tz->type);
 
mutex_unlock(_list_lock);
 
-- 
2.23.0.700.g56cf767bdb-goog
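
To illustrate what this buys userspace: the by-name links replace the
usual scan over the numbered directories. A rough usage sketch (the zone
type "cpu-thermal" and the cdev type are placeholders; paths assume the
standard /sys/class/thermal layout):

```shell
# Before: walk every numbered zone to find the one with a matching type.
for tz in /sys/class/thermal/thermal_zone*; do
    [ "$(cat "$tz/type")" = "cpu-thermal" ] && cat "$tz/temp"
done

# After: the symlink resolves directly to the same zone device.
cat /sys/class/thermal/tz-by-name/cpu-thermal/temp

# Cooling devices get equivalent links under cdev-by-name.
cat "/sys/class/thermal/cdev-by-name/$CDEV_TYPE/cur_state"
```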



[PATCH 2/2] dt-bindings: peci: aspeed: Add AST2600 compatible

2019-10-02 Thread Chia-Wei, Wang
Document the AST2600 PECI controller compatible string.

Signed-off-by: Chia-Wei, Wang 
---
 Documentation/devicetree/bindings/peci/peci-aspeed.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/peci/peci-aspeed.txt 
b/Documentation/devicetree/bindings/peci/peci-aspeed.txt
index cdca73a3b7d8..cddd2d2aa58f 100644
--- a/Documentation/devicetree/bindings/peci/peci-aspeed.txt
+++ b/Documentation/devicetree/bindings/peci/peci-aspeed.txt
@@ -4,6 +4,7 @@ Required properties:
 - compatible: Should be one of:
"aspeed,ast2400-peci"
"aspeed,ast2500-peci"
+   "aspeed,ast2600-peci"
 - reg   : Should contain PECI controller registers location and
  length.
 - #address-cells: Should be <1> required to define a client address.
-- 
2.17.1



[PATCH 0/2] peci: aspeed: Add AST2600 compatible

2019-10-02 Thread Chia-Wei, Wang
Update the Aspeed PECI driver with the AST2600 compatible string.
A new compatible string is needed for the extended HW features of
the AST2600.

Chia-Wei, Wang (2):
  peci: aspeed: Add AST2600 compatible string
  dt-bindings: peci: aspeed: Add AST2600 compatible

 Documentation/devicetree/bindings/peci/peci-aspeed.txt | 1 +
 drivers/peci/peci-aspeed.c | 1 +
 2 files changed, 2 insertions(+)

-- 
2.17.1



[PATCH 1/2] peci: aspeed: Add AST2600 compatible string

2019-10-02 Thread Chia-Wei, Wang
The AST2600 SoC contains the same register set as AST25xx.

Signed-off-by: Chia-Wei, Wang 
---
 drivers/peci/peci-aspeed.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/peci/peci-aspeed.c b/drivers/peci/peci-aspeed.c
index 51cb2563ceb6..4eed119dc83b 100644
--- a/drivers/peci/peci-aspeed.c
+++ b/drivers/peci/peci-aspeed.c
@@ -485,6 +485,7 @@ static int aspeed_peci_remove(struct platform_device *pdev)
 static const struct of_device_id aspeed_peci_of_table[] = {
{ .compatible = "aspeed,ast2400-peci", },
{ .compatible = "aspeed,ast2500-peci", },
+   { .compatible = "aspeed,ast2600-peci", },
{ }
 };
 MODULE_DEVICE_TABLE(of, aspeed_peci_of_table);
-- 
2.17.1



Re: [PATCH v2] ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule

2019-09-24 Thread Wei Wang
On Tue, Sep 24, 2019 at 7:01 AM Jason A. Donenfeld  wrote:
>
> Commit 7d9e5f422150 removed references from certain dsts, but accounting
> for this never translated down into the fib6 suppression code. This bug
> was triggered by WireGuard users who use wg-quick(8), which uses the
> "suppress-prefix" directive to ip-rule(8) for routing all of their
> internet traffic without routing loops. The test case added here
> causes the reference underflow by causing packets to evaluate a suppress
> rule.
>
> Cc: sta...@vger.kernel.org
> Fixes: 7d9e5f422150 ("ipv6: convert major tx path to use 
> RT6_LOOKUP_F_DST_NOREF")
> Signed-off-by: Jason A. Donenfeld 
> ---
Acked-by: Wei Wang 

Good catch. Thanks for the fix.

>  net/ipv6/fib6_rules.c|  3 ++-
>  tools/testing/selftests/net/fib_tests.sh | 17 -
>  2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
> index d22b6c140f23..f9e8fe3ff0c5 100644
> --- a/net/ipv6/fib6_rules.c
> +++ b/net/ipv6/fib6_rules.c
> @@ -287,7 +287,8 @@ static bool fib6_rule_suppress(struct fib_rule *rule, 
> struct fib_lookup_arg *arg
> return false;
>
>  suppress_route:
> -   ip6_rt_put(rt);
> +   if (!(arg->flags & FIB_LOOKUP_NOREF))
> +   ip6_rt_put(rt);
> return true;
>  }
>
> diff --git a/tools/testing/selftests/net/fib_tests.sh 
> b/tools/testing/selftests/net/fib_tests.sh
> index 4465fc2dae14..c2c5f2bf0f95 100755
> --- a/tools/testing/selftests/net/fib_tests.sh
> +++ b/tools/testing/selftests/net/fib_tests.sh
> @@ -9,7 +9,7 @@ ret=0
>  ksft_skip=4
>
>  # all tests in this script. Can be overridden with -t option
> -TESTS="unregister down carrier nexthop ipv6_rt ipv4_rt ipv6_addr_metric 
> ipv4_addr_metric ipv6_route_metrics ipv4_route_metrics ipv4_route_v6_gw 
> rp_filter"
> +TESTS="unregister down carrier nexthop suppress ipv6_rt ipv4_rt 
> ipv6_addr_metric ipv4_addr_metric ipv6_route_metrics ipv4_route_metrics 
> ipv4_route_v6_gw rp_filter"
>
>  VERBOSE=0
>  PAUSE_ON_FAIL=no
> @@ -614,6 +614,20 @@ fib_nexthop_test()
> cleanup
>  }
>
> +fib_suppress_test()
> +{
> +   $IP link add dummy1 type dummy
> +   $IP link set dummy1 up
> +   $IP -6 route add default dev dummy1
> +   $IP -6 rule add table main suppress_prefixlength 0
> +   ping -f -c 1000 -W 1 1234::1 || true
> +   $IP -6 rule del table main suppress_prefixlength 0
> +   $IP link del dummy1
> +
> +   # If we got here without crashing, we're good.
> +   return 0
> +}
> +
>  
> 
>  # Tests on route add and replace
>
> @@ -1591,6 +1605,7 @@ do
> fib_carrier_test|carrier)   fib_carrier_test;;
> fib_rp_filter_test|rp_filter)   fib_rp_filter_test;;
> fib_nexthop_test|nexthop)   fib_nexthop_test;;
> +   fib_suppress_test|suppress) fib_suppress_test;;
> ipv6_route_test|ipv6_rt)ipv6_route_test;;
> ipv4_route_test|ipv4_rt)ipv4_route_test;;
> ipv6_addr_metric)   ipv6_addr_metric_test;;
> --
> 2.21.0
>


Re: [PATCH v8 13/14] KVM/x86/vPMU: check the lbr feature before entering guest

2019-08-06 Thread Wei Wang

On 08/06/2019 03:16 PM, Wei Wang wrote:

The guest can access the lbr related msrs only when the vcpu's lbr event
has been assigned the lbr feature. A cpu pinned lbr event (though no such
event usages in the current upstream kernel) could reclaim the lbr feature
from the vcpu's lbr event (task pinned) via ipi calls. If the cpu is
running in the non-root mode, this will cause the cpu to vm-exit to handle
the host ipi and then vm-entry back to the guest. So on vm-entry (where
interrupt has been disabled), we double confirm that the vcpu's lbr event
is still assigned the lbr feature via checking event->oncpu.

The pass-through of the lbr related msrs will be cancelled if the lbr is
reclaimed, and the following guest accesses to the lbr related msrs will
vm-exit to the related msr emulation handler in kvm, which will prevent
the accesses.

Signed-off-by: Wei Wang 
---
  arch/x86/kvm/pmu.c   |  6 ++
  arch/x86/kvm/pmu.h   |  3 +++
  arch/x86/kvm/vmx/pmu_intel.c | 35 +++
  arch/x86/kvm/x86.c   | 13 +
  4 files changed, 57 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index afad092..ed10a57 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -339,6 +339,12 @@ bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu)
return false;
  }
  
+void kvm_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu)

+{
+   if (kvm_x86_ops->pmu_ops->enabled_feature_confirm)
+   kvm_x86_ops->pmu_ops->enabled_feature_confirm(vcpu);
+}
+
  void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
  {
if (lapic_in_kernel(vcpu))
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index f875721..7467907 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -30,6 +30,7 @@ struct kvm_pmu_ops {
int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx);
bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
bool (*lbr_enable)(struct kvm_vcpu *vcpu);
+   void (*enabled_feature_confirm)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void (*sched_in)(struct kvm_vcpu *vcpu, int cpu);
@@ -126,6 +127,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void 
__user *argp);
  
  bool is_vmware_backdoor_pmc(u32 pmc_idx);
  
+void kvm_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu);
+
  extern struct kvm_pmu_ops intel_pmu_ops;
  extern struct kvm_pmu_ops amd_pmu_ops;
  #endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 5580f1a..421051aa 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -781,6 +781,40 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
intel_pmu_free_lbr_event(vcpu);
  }
  
+void intel_pmu_lbr_confirm(struct kvm_vcpu *vcpu)
+{
+   struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+   /*
+* Either lbr_event being NULL or lbr_used being false indicates that
+* the lbr msrs haven't been passed through to the guest, so no need
+* to cancel passthrough.
+*/
+   if (!pmu->lbr_event || !pmu->lbr_used)
+   return;
+
+   /*
+* The lbr feature gets reclaimed via IPI calls, so checking of
+* lbr_event->oncpu needs to be in an atomic context. Just confirm
+* that irq has been disabled already.
+*/
+   lockdep_assert_irqs_disabled();
+
+   /*
+* Cancel the pass-through of the lbr msrs if lbr has been reclaimed
+* by the host perf.
+*/
+   if (pmu->lbr_event->oncpu != -1) {


A mistake here: it should be "pmu->lbr_event->oncpu == -1".
(It didn't seem to affect the profiling result, but it generated
more vm-exits due to mistakenly cancelling the pass-through.)

Best,
Wei


[PATCH v8 07/14] perf/x86: support to create a perf event without counter allocation

2019-08-06 Thread Wei Wang
Hypervisors may create an lbr event for a vcpu's lbr emulation, and the
emulation fundamentally doesn't need a counter. This makes the emulation
follow the x86 SDM's description of lbr, which doesn't include a counter,
and also avoids wasting a counter.

The perf scheduler is extended to support not assigning a counter to a
perf event that doesn't need one. Define a macro, X86_PMC_IDX_NA, to
replace "-1", which represents a never-assigned counter id.

Cc: Andi Kleen 
Cc: Peter Zijlstra 
Signed-off-by: Wei Wang 
https://lkml.kernel.org/r/20180920162407.ga24...@hirez.programming.kicks-ass.net
---
 arch/x86/events/core.c| 36 +++-
 arch/x86/events/intel/core.c  |  3 +++
 arch/x86/include/asm/perf_event.h |  1 +
 include/linux/perf_event.h| 11 +++
 4 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 81b005e..ffa27bb 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -73,7 +73,7 @@ u64 x86_perf_event_update(struct perf_event *event)
int idx = hwc->idx;
u64 delta;
 
-   if (idx == INTEL_PMC_IDX_FIXED_BTS)
+   if ((idx == INTEL_PMC_IDX_FIXED_BTS) || (idx == X86_PMC_IDX_NA))
return 0;
 
/*
@@ -595,7 +595,7 @@ static int __x86_pmu_event_init(struct perf_event *event)
	atomic_inc(&active_events);
event->destroy = hw_perf_event_destroy;
 
-   event->hw.idx = -1;
+   event->hw.idx = X86_PMC_IDX_NA;
event->hw.last_cpu = -1;
event->hw.last_tag = ~0ULL;
 
@@ -763,6 +763,8 @@ static bool perf_sched_restore_state(struct perf_sched 
*sched)
 static bool __perf_sched_find_counter(struct perf_sched *sched)
 {
struct event_constraint *c;
+   struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+   struct perf_event *e = cpuc->event_list[sched->state.event];
int idx;
 
if (!sched->state.unassigned)
@@ -772,6 +774,14 @@ static bool __perf_sched_find_counter(struct perf_sched 
*sched)
return false;
 
c = sched->constraints[sched->state.event];
+   if (c == &emptyconstraint)
+   return false;
+
+   if (is_no_counter_event(e)) {
+   idx = X86_PMC_IDX_NA;
+   goto done;
+   }
+
/* Prefer fixed purpose counters */
if (c->idxmsk64 & (~0ULL << INTEL_PMC_IDX_FIXED)) {
idx = INTEL_PMC_IDX_FIXED;
@@ -797,7 +807,7 @@ static bool __perf_sched_find_counter(struct perf_sched 
*sched)
 done:
sched->state.counter = idx;
 
-   if (c->overlap)
+   if ((idx != X86_PMC_IDX_NA) && c->overlap)
perf_sched_save_state(sched);
 
return true;
@@ -918,7 +928,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, 
int *assign)
c = cpuc->event_constraint[i];
 
/* never assigned */
-   if (hwc->idx == -1)
+   if (hwc->idx == X86_PMC_IDX_NA)
break;
 
/* constraint still honored */
@@ -969,7 +979,8 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, 
int *assign)
if (!unsched && assign) {
for (i = 0; i < n; i++) {
e = cpuc->event_list[i];
-   if (x86_pmu.commit_scheduling)
+   if (x86_pmu.commit_scheduling &&
+   (assign[i] != X86_PMC_IDX_NA))
x86_pmu.commit_scheduling(cpuc, i, assign[i]);
}
} else {
@@ -1038,7 +1049,8 @@ static inline void x86_assign_hw_event(struct perf_event 
*event,
hwc->last_cpu = smp_processor_id();
hwc->last_tag = ++cpuc->tags[i];
 
-   if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
+   if ((hwc->idx == INTEL_PMC_IDX_FIXED_BTS) ||
+   (hwc->idx == X86_PMC_IDX_NA)) {
hwc->config_base = 0;
hwc->event_base = 0;
} else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
@@ -1115,7 +1127,7 @@ static void x86_pmu_enable(struct pmu *pmu)
 * - running on same CPU as last time
 * - no other event has used the counter since
 */
-   if (hwc->idx == -1 ||
+   if (hwc->idx == X86_PMC_IDX_NA ||
match_prev_assignment(hwc, cpuc, i))
continue;
 
@@ -1169,7 +1181,7 @@ int x86_perf_event_set_period(struct perf_event *event)
s64 period = hwc->sample_period;
int ret = 0, idx = hwc->idx;
 
-   if (idx == INTEL_PMC_IDX_FIXED_BTS)
+   if ((idx == INTEL_PMC_IDX_FIXED_BTS) || (idx == X86_PMC_IDX_NA))
return 0;
 
/*
@@ -1306,7 +1318,7 @@ static void x86_pmu_start(struct perf_event

[PATCH v8 08/14] perf/core: set the event->owner before event_init

2019-08-06 Thread Wei Wang
Kernel and user events can be distinguished by checking event->owner. Some
pmu driver implementations may need to know event->owner in event_init. For
example, intel_pmu_setup_lbr_filter treats a kernel event with
exclude_host set as an lbr event created for guest lbr emulation, which
doesn't need a pmu counter.

So move the event->owner assignment into perf_event_alloc to have it
set before event_init is called.

Signed-off-by: Wei Wang 
---
 kernel/events/core.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0463c11..7663f85 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10288,6 +10288,7 @@ static void account_event(struct perf_event *event)
 static struct perf_event *
 perf_event_alloc(struct perf_event_attr *attr, int cpu,
 struct task_struct *task,
+struct task_struct *owner,
 struct perf_event *group_leader,
 struct perf_event *parent_event,
 perf_overflow_handler_t overflow_handler,
@@ -10340,6 +10341,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
event->group_leader = group_leader;
event->pmu  = NULL;
event->oncpu= -1;
+   event->owner= owner;
 
event->parent   = parent_event;
 
@@ -10891,7 +10893,7 @@ SYSCALL_DEFINE5(perf_event_open,
if (flags & PERF_FLAG_PID_CGROUP)
cgroup_fd = pid;
 
-   event = perf_event_alloc(&attr, cpu, task, group_leader, NULL,
+   event = perf_event_alloc(&attr, cpu, task, current, group_leader, NULL,
 NULL, NULL, cgroup_fd);
if (IS_ERR(event)) {
err = PTR_ERR(event);
@@ -11153,8 +11155,6 @@ SYSCALL_DEFINE5(perf_event_open,
perf_event__header_size(event);
perf_event__id_header_size(event);
 
-   event->owner = current;
-
perf_install_in_context(ctx, event, event->cpu);
perf_unpin_context(ctx);
 
@@ -11231,16 +11231,13 @@ perf_event_create_kernel_counter(struct 
perf_event_attr *attr, int cpu,
 * Get the target context (task or percpu):
 */
 
-   event = perf_event_alloc(attr, cpu, task, NULL, NULL,
+   event = perf_event_alloc(attr, cpu, task, TASK_TOMBSTONE, NULL, NULL,
 overflow_handler, context, -1);
if (IS_ERR(event)) {
err = PTR_ERR(event);
goto err;
}
 
-   /* Mark owner so we could distinguish it from user events. */
-   event->owner = TASK_TOMBSTONE;
-
ctx = find_get_context(event->pmu, task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
@@ -11677,6 +11674,7 @@ inherit_event(struct perf_event *parent_event,
 
	child_event = perf_event_alloc(&parent_event->attr,
   parent_event->cpu,
+  parent_event->owner,
   child,
   group_leader, parent_event,
   NULL, NULL, -1);
-- 
2.7.4



[PATCH v8 10/14] perf/x86/lbr: don't share lbr for the vcpu usage case

2019-08-06 Thread Wei Wang
Perf event scheduling lets multiple lbr events share the lbr if they
use the same config for LBR_SELECT. For the vcpu case, the vcpu's lbr
event created on the host deliberately sets the config to the user
callstack mode so that the host saves/restores the lbr state on vcpu
context switches, and the config won't be written to LBR_SELECT, as
LBR_SELECT is configured by the guest, which might not be the same as
the user callstack mode. So don't allow the vcpu's lbr event to share
the lbr with other host lbr events.

Signed-off-by: Wei Wang 
---
 arch/x86/events/intel/lbr.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 4f4bd18..a0f3686 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -45,6 +45,12 @@ static const enum {
 #define LBR_CALL_STACK_BIT 9 /* enable call stack */
 
 /*
+ * Set this hardware reserved bit if the lbr perf event is for the vcpu lbr
+ * emulation. This makes the reg->config different from other regular lbr
+ * events' config, so that they will not share the lbr feature.
+ */
+#define LBR_VCPU_BIT   62
+/*
  * Following bit only exists in Linux; we mask it out before writing it to
  * the actual MSR. But it helps the constraint perf code to understand
  * that this is a separate configuration.
@@ -62,6 +68,7 @@ static const enum {
 #define LBR_FAR(1 << LBR_FAR_BIT)
 #define LBR_CALL_STACK (1 << LBR_CALL_STACK_BIT)
 #define LBR_NO_INFO(1ULL << LBR_NO_INFO_BIT)
+#define LBR_VCPU   (1ULL << LBR_VCPU_BIT)
 
 #define LBR_PLM (LBR_KERNEL | LBR_USER)
 
@@ -818,6 +825,26 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event 
*event)
(x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO))
reg->config |= LBR_NO_INFO;
 
+   /*
+* An lbr perf event without a counter indicates this is for the vcpu
+* lbr emulation. The vcpu lbr emulation does not allow the lbr
+* feature to be shared with other lbr events on the host, because the
+* LBR_SELECT msr is configured by the guest itself. The reg->config
+* is deliberately configured to be user call stack mode via the
+* related attr fields to get the host perf's help to save/restore the
+* lbr state on vcpu context switching. It doesn't represent what
+* LBR_SELECT will be configured.
+*
+* Set the reserved LBR_VCPU bit for the vcpu usage case, so that the
+* vcpu's lbr perf event will not share the lbr feature with other perf
+* events. (see __intel_shared_reg_get_constraints, failing to share
+* makes it return the emptyconstraint, which finally makes
+* x86_schedule_events fail to schedule the lower priority lbr event on
+* the lbr feature).
+*/
+   if (is_no_counter_event(event))
+   reg->config |= LBR_VCPU;
+
return 0;
 }
 
-- 
2.7.4



[PATCH v8 09/14] KVM/x86/vPMU: APIs to create/free lbr perf event for a vcpu thread

2019-08-06 Thread Wei Wang
VMX transition is much more frequent than vcpu switching, and
saving/restoring tens of lbr msrs (e.g. 32 lbr stack entries) would add
too much overhead to the frequent vmx transition, which is not
necessary. So the vcpu's lbr state only gets saved/restored on the vcpu
context switching.

The main purposes of using the vcpu's lbr perf event are
- follow the host perf scheduling rules to manage the vcpu's usage of
  lbr (e.g. a cpu pinned lbr event could reclaim lbr and thus stopping
  the vcpu's use);
- have the host perf do context switching of the lbr state on the vcpu
  thread context switching.
Please see the comments in intel_pmu_create_lbr_event for more details.

To achieve the pure lbr emulation, the perf event is created only to
claim for the lbr feature, and no perf counter is needed for it.

The vcpu_lbr field is added to indicate to the host lbr driver that the
lbr is currently assigned to a vcpu to use. The guest driver inside the
vcpu has its own logic to use the lbr, thus the host side lbr driver
doesn't need to enable and use the lbr feature in this case.

Some design choice considerations:
- Why using "is_kernel_event", instead of checking the PF_VCPU flag, to
  determine that it is a vcpu perf event for lbr emulation?
  This is because PF_VCPU is set right before vm-entry into the guest,
  and cleared after the guest vm-exits to the host. So that flag doesn't
  remain set when running the host code.

Cc: Paolo Bonzini 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
Co-developed-by: Like Xu 
Signed-off-by: Like Xu 
Signed-off-by: Wei Wang 
---
 arch/x86/events/intel/lbr.c | 38 ++--
 arch/x86/events/perf_event.h|  1 +
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx/pmu_intel.c| 64 +
 include/linux/perf_event.h  |  7 +
 kernel/events/core.c|  7 -
 6 files changed, 108 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 9b2d05c..4f4bd18 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -462,6 +462,14 @@ void intel_pmu_lbr_add(struct perf_event *event)
if (!x86_pmu.lbr_nr)
return;
 
+   /*
+* An lbr event without a counter indicates this is for the vcpu lbr
+* emulation, so set the vcpu_lbr flag when the vcpu lbr event
+* gets scheduled on the lbr here.
+*/
+   if (is_no_counter_event(event))
+   cpuc->vcpu_lbr = 1;
+
cpuc->br_sel = event->hw.branch_reg.reg;
 
if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -509,6 +517,14 @@ void intel_pmu_lbr_del(struct perf_event *event)
task_ctx->lbr_callstack_users--;
}
 
+   /*
+* An lbr event without a counter indicates this is for the vcpu lbr
+* emulation, so clear the vcpu_lbr flag when the vcpu's lbr event
+* gets scheduled out from the lbr.
+*/
+   if (is_no_counter_event(event))
+   cpuc->vcpu_lbr = 0;
+
if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
cpuc->lbr_pebs_users--;
cpuc->lbr_users--;
@@ -521,7 +537,12 @@ void intel_pmu_lbr_enable_all(bool pmi)
 {
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-   if (cpuc->lbr_users)
+   /*
+* The vcpu lbr emulation doesn't need host to enable lbr at this
+* point, because the guest will set the enabling at a proper time
+* itself.
+*/
+   if (cpuc->lbr_users && !cpuc->vcpu_lbr)
__intel_pmu_lbr_enable(pmi);
 }
 
@@ -529,7 +550,11 @@ void intel_pmu_lbr_disable_all(void)
 {
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-   if (cpuc->lbr_users)
+   /*
+* Same as intel_pmu_lbr_enable_all, the guest is responsible for
+* clearing the enabling itself.
+*/
+   if (cpuc->lbr_users && !cpuc->vcpu_lbr)
__intel_pmu_lbr_disable();
 }
 
@@ -668,8 +693,12 @@ void intel_pmu_lbr_read(void)
 *
 * This could be smarter and actually check the event,
 * but this simple approach seems to work for now.
+*
+* And no need to read the lbr msrs here if the vcpu lbr event
+* is using it, as the guest will read them itself.
 */
-   if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
+   if (!cpuc->lbr_users || cpuc->vcpu_lbr ||
+   cpuc->lbr_users == cpuc->lbr_pebs_users)
return;
 
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
@@ -802,6 +831,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
if (!x86_pmu.lbr_nr)
return -EOPNOTSUPP;
 
+   if (event->attr.exclude_host && is_kernel_event(event))

[PATCH v8 14/14] KVM/x86: remove the common handling of the debugctl msr

2019-08-06 Thread Wei Wang
The debugctl msr is not completely identical on AMD and Intel CPUs; for
example, FREEZE_LBRS_ON_PMI is supported by Intel CPUs only. This msr is
now handled separately in svm.c and intel_pmu.c, so remove the common
debugctl msr handling code from kvm_get/set_msr_common.

Cc: Paolo Bonzini 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
Signed-off-by: Wei Wang 
---
 arch/x86/kvm/x86.c | 13 -
 1 file changed, 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index efaf0e8..3839ebd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2528,18 +2528,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
return 1;
}
break;
-   case MSR_IA32_DEBUGCTLMSR:
-   if (!data) {
-   /* We support the non-activated case already */
-   break;
-   } else if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)) {
-   /* Values other than LBR and BTF are vendor-specific,
-  thus reserved and should throw a #GP */
-   return 1;
-   }
-   vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
-   __func__, data);
-   break;
case 0x200 ... 0x2ff:
return kvm_mtrr_set_msr(vcpu, msr, data);
case MSR_IA32_APICBASE:
@@ -2800,7 +2788,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
switch (msr_info->index) {
case MSR_IA32_PLATFORM_ID:
case MSR_IA32_EBL_CR_POWERON:
-   case MSR_IA32_DEBUGCTLMSR:
case MSR_IA32_LASTBRANCHFROMIP:
case MSR_IA32_LASTBRANCHTOIP:
case MSR_IA32_LASTINTFROMIP:
-- 
2.7.4



[PATCH v8 11/14] perf/x86: save/restore LBR_SELECT on vcpu switching

2019-08-06 Thread Wei Wang
The regular host lbr perf event doesn't save/restore the LBR_SELECT msr
during a thread context switching, because the LBR_SELECT value is
generated from attr.branch_sample_type and already stored in
event->hw.branch_reg (please see intel_pmu_setup_hw_lbr_filter), which
doesn't get lost during thread context switching.

The attr.branch_sample_type for the vcpu lbr event is deliberately set
to the user call stack mode to enable the perf core to save/restore the
lbr related msrs on vcpu switching. So the attr.branch_sample_type
essentially doesn't represent what the guest pmu driver will write to
LBR_SELECT. Meanwhile, the host lbr driver doesn't configure the lbr msrs,
including the LBR_SELECT msr, for the vcpu thread case, as the pmu driver
inside the vcpu will do that.

So for the vcpu case, add the LBR_SELECT save/restore to ensure what the
guest writes to the LBR_SELECT msr doesn't get lost during the vcpu context
switching.

Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Kan Liang 
Signed-off-by: Wei Wang 
---
 arch/x86/events/intel/lbr.c  | 7 +++
 arch/x86/events/perf_event.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index a0f3686..236f8352 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -390,6 +390,9 @@ static void __intel_pmu_lbr_restore(struct 
x86_perf_task_context *task_ctx)
 
wrmsrl(x86_pmu.lbr_tos, tos);
task_ctx->lbr_stack_state = LBR_NONE;
+
+   if (cpuc->vcpu_lbr)
+   wrmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
 }
 
 static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
@@ -416,6 +419,10 @@ static void __intel_pmu_lbr_save(struct 
x86_perf_task_context *task_ctx)
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
rdmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
}
+
+   if (cpuc->vcpu_lbr)
+   rdmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
+
task_ctx->valid_lbrs = i;
task_ctx->tos = tos;
task_ctx->lbr_stack_state = LBR_VALID;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 8b90a25..0b2f660 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -699,6 +699,7 @@ struct x86_perf_task_context {
u64 lbr_from[MAX_LBR_ENTRIES];
u64 lbr_to[MAX_LBR_ENTRIES];
u64 lbr_info[MAX_LBR_ENTRIES];
+   u64 lbr_sel;
int tos;
int valid_lbrs;
int lbr_callstack_users;
-- 
2.7.4



[PATCH v8 13/14] KVM/x86/vPMU: check the lbr feature before entering guest

2019-08-06 Thread Wei Wang
The guest can access the lbr related msrs only when the vcpu's lbr event
has been assigned the lbr feature. A cpu pinned lbr event (though no such
event usage exists in the current upstream kernel) could reclaim the lbr
feature from the vcpu's lbr event (task pinned) via ipi calls. If the cpu
is running in non-root mode, this causes the cpu to vm-exit to handle the
host ipi and then vm-entry back to the guest. So on vm-entry (where
interrupts have been disabled), we double-check that the vcpu's lbr event
is still assigned the lbr feature by checking event->oncpu.

The pass-through of the lbr related msrs will be cancelled if the lbr is
reclaimed, and the following guest accesses to the lbr related msrs will
vm-exit to the related msr emulation handler in kvm, which will prevent
the accesses.

Signed-off-by: Wei Wang 
---
 arch/x86/kvm/pmu.c   |  6 ++
 arch/x86/kvm/pmu.h   |  3 +++
 arch/x86/kvm/vmx/pmu_intel.c | 35 +++
 arch/x86/kvm/x86.c   | 13 +
 4 files changed, 57 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index afad092..ed10a57 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -339,6 +339,12 @@ bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu)
return false;
 }
 
+void kvm_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu)
+{
+   if (kvm_x86_ops->pmu_ops->enabled_feature_confirm)
+   kvm_x86_ops->pmu_ops->enabled_feature_confirm(vcpu);
+}
+
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 {
if (lapic_in_kernel(vcpu))
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index f875721..7467907 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -30,6 +30,7 @@ struct kvm_pmu_ops {
int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx);
bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
bool (*lbr_enable)(struct kvm_vcpu *vcpu);
+   void (*enabled_feature_confirm)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void (*sched_in)(struct kvm_vcpu *vcpu, int cpu);
@@ -126,6 +127,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void 
__user *argp);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
+void kvm_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu);
+
 extern struct kvm_pmu_ops intel_pmu_ops;
 extern struct kvm_pmu_ops amd_pmu_ops;
 #endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 5580f1a..421051aa 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -781,6 +781,40 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
intel_pmu_free_lbr_event(vcpu);
 }
 
+void intel_pmu_lbr_confirm(struct kvm_vcpu *vcpu)
+{
+   struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+   /*
+* Either lbr_event being NULL or lbr_used being false indicates that
+* the lbr msrs haven't been passed through to the guest, so no need
+* to cancel passthrough.
+*/
+   if (!pmu->lbr_event || !pmu->lbr_used)
+   return;
+
+   /*
+* The lbr feature gets reclaimed via IPI calls, so checking of
+* lbr_event->oncpu needs to be in an atomic context. Just confirm
+* that irq has been disabled already.
+*/
+   lockdep_assert_irqs_disabled();
+
+   /*
+* Cancel the pass-through of the lbr msrs if lbr has been reclaimed
+* by the host perf.
+*/
+   if (pmu->lbr_event->oncpu != -1) {
+   pmu->lbr_used = false;
+   intel_pmu_set_intercept_for_lbr_msrs(vcpu, true);
+   }
+}
+
+void intel_pmu_enabled_feature_confirm(struct kvm_vcpu *vcpu)
+{
+   intel_pmu_lbr_confirm(vcpu);
+}
+
 struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
@@ -790,6 +824,7 @@ struct kvm_pmu_ops intel_pmu_ops = {
.is_valid_msr_idx = intel_is_valid_msr_idx,
.is_valid_msr = intel_is_valid_msr,
.lbr_enable = intel_pmu_lbr_enable,
+   .enabled_feature_confirm = intel_pmu_enabled_feature_confirm,
.get_msr = intel_pmu_get_msr,
.set_msr = intel_pmu_set_msr,
.sched_in = intel_pmu_sched_in,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b76f019..efaf0e8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7985,6 +7985,19 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
smp_mb__after_srcu_read_unlock();
 
/*
+* Higher priority host perf events (e.g. cpu pinned) could reclaim the
+* pmu resources (e.g. lbr) that were assigned to the vcpu. This is
+* usually done via ipi calls (see perf_install_in_context for
+* details).
+*
+* Before entering the non-root mode

[PATCH v8 12/14] KVM/x86/lbr: lbr emulation

2019-08-06 Thread Wei Wang
In general, the lbr emulation works in this way:
Guest first access (since vcpu scheduled in) to the lbr related msr
gets trapped to kvm, and the handler will do the following things:
  - create an lbr perf event to have the vcpu get the lbr feature
from host perf following the perf scheduling rules;
  - pass the lbr related msrs through to the guest for direct accesses
without vm-exits till the end of this vcpu time slice.

The guest's first access is made interceptible so that the kvm side lbr
emulation can always tell whether the lbr feature has been used during
the vcpu time slice. If the lbr feature isn't used during a time slice,
the lbr event created for the vcpu is freed.

Some considerations:
- Why not free the vcpu lbr event when the guest clears the lbr enable bit?
The guest may frequently clear the lbr enable bit (in the debugctl msr)
during its use of the lbr feature, e.g. in the PMI handler. This would
cause the kvm emulation to frequently alloc/free the vcpu lbr event, which
is unnecessary. Ideally, we want to free the vcpu lbr event when the guest
doesn't need to run lbr anymore. Heuristically, we free the vcpu lbr event
when the guest doesn't touch any of the lbr msrs during an entire vcpu
time slice.

Cc: Paolo Bonzini 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
Suggested-by: Andi Kleen 
Signed-off-by: Wei Wang 
---
 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/pmu.c  |   6 ++
 arch/x86/kvm/pmu.h  |   2 +
 arch/x86/kvm/vmx/pmu_intel.c| 206 
 arch/x86/kvm/vmx/vmx.c  |   4 +-
 arch/x86/kvm/vmx/vmx.h  |   2 +
 arch/x86/kvm/x86.c  |   2 +
 7 files changed, 222 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 692a0c2..ecd22b5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -469,6 +469,8 @@ struct kvm_pmu {
u64 global_ctrl_mask;
u64 global_ovf_ctrl_mask;
u64 reserved_bits;
+   /* Indicate if the lbr msrs were accessed in this vcpu time slice */
+   bool lbr_used;
u8 version;
struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 1a291ed..afad092 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -360,6 +360,12 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data 
*msr_info)
return kvm_x86_ops->pmu_ops->set_msr(vcpu, msr_info);
 }
 
+void kvm_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+   if (kvm_x86_ops->pmu_ops->sched_in)
+   kvm_x86_ops->pmu_ops->sched_in(vcpu, cpu);
+}
+
 /* refresh PMU settings. This function generally is called when underlying
  * settings are changed (such as changes of PMU CPUID by guest VMs), which
  * should rarely happen.
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index f61024e..f875721 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -32,6 +32,7 @@ struct kvm_pmu_ops {
bool (*lbr_enable)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+   void (*sched_in)(struct kvm_vcpu *vcpu, int cpu);
void (*refresh)(struct kvm_vcpu *vcpu);
void (*init)(struct kvm_vcpu *vcpu);
void (*reset)(struct kvm_vcpu *vcpu);
@@ -116,6 +117,7 @@ int kvm_pmu_is_valid_msr_idx(struct kvm_vcpu *vcpu, 
unsigned idx);
 bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
 int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+void kvm_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu);
 void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
 void kvm_pmu_reset(struct kvm_vcpu *vcpu);
 void kvm_pmu_init(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 89730f8..5580f1a 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -17,6 +17,7 @@
 #include "cpuid.h"
 #include "lapic.h"
 #include "pmu.h"
+#include "vmx.h"
 
 static struct kvm_event_hw_type_mapping intel_arch_events[] = {
/* Index must match CPUID 0x0A.EBX bit vector */
@@ -141,6 +142,19 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct 
kvm_vcpu *vcpu,
	return &counters[idx];
 }
 
+/* Return true if it is one of the lbr related msrs. */
+static inline bool is_lbr_msr(struct kvm_vcpu *vcpu, u32 index)
+{
+   struct x86_perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack;
+   int nr = stack->nr;
+
+   return !!(index == MSR_LBR_SELECT ||
+ index == stack->tos ||
+ (index >= stack->from && index < stack->from + nr) ||
+ (index >= stack->to && i

[PATCH v8 06/14] KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest

2019-08-06 Thread Wei Wang
Bits [5:0] of MSR_IA32_PERF_CAPABILITIES report the format of
the addresses stored in the lbr stack. Expose those bits to the guest
when the guest lbr feature is enabled.

Cc: Paolo Bonzini 
Cc: Andi Kleen 
Signed-off-by: Wei Wang 
---
 arch/x86/include/asm/perf_event.h |  2 ++
 arch/x86/kvm/cpuid.c  |  2 +-
 arch/x86/kvm/vmx/pmu_intel.c  | 16 
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 2606100..aa77da2 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -95,6 +95,8 @@
 #define PEBS_DATACFG_LBRS  BIT_ULL(3)
 #define PEBS_DATACFG_LBR_SHIFT 24
 
+#define X86_PERF_CAP_MASK_LBR_FMT  0x3f
+
 /*
  * Intel "Architectural Performance Monitoring" CPUID
  * detection/enumeration details:
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 22c2720..826b2dc 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -458,7 +458,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 
*entry, u32 function,
F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
0 /* DS-CPL, VMX, SMX, EST */ |
0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
-   F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
+   F(FMA) | F(CX16) | 0 /* xTPR Update*/ | F(PDCM) |
F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) |
F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 53bb95e..f0ad78f 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -151,6 +151,7 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 
msr)
case MSR_CORE_PERF_GLOBAL_STATUS:
case MSR_CORE_PERF_GLOBAL_CTRL:
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+   case MSR_IA32_PERF_CAPABILITIES:
ret = pmu->version > 1;
break;
default:
@@ -316,6 +317,19 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
msr_info->data = pmu->global_ovf_ctrl;
return 0;
+   case MSR_IA32_PERF_CAPABILITIES: {
+   u64 data;
+
+   if (!boot_cpu_has(X86_FEATURE_PDCM) ||
+   (!msr_info->host_initiated &&
+!guest_cpuid_has(vcpu, X86_FEATURE_PDCM)))
+   return 1;
+   data = native_read_msr(MSR_IA32_PERF_CAPABILITIES);
+   msr_info->data = 0;
+   if (vcpu->kvm->arch.lbr_in_guest)
+   msr_info->data |= (data & X86_PERF_CAP_MASK_LBR_FMT);
+   return 0;
+   }
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
u64 val = pmc_read_counter(pmc);
@@ -374,6 +388,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
return 0;
}
break;
+   case MSR_IA32_PERF_CAPABILITIES:
+   return 1; /* RO MSR */
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
if (msr_info->host_initiated)
-- 
2.7.4



[PATCH v8 04/14] KVM/x86: intel_pmu_lbr_enable

2019-08-06 Thread Wei Wang
The lbr stack is model specific, for example, SKX has 32 lbr stack
entries while HSW has 16 entries, so a HSW guest running on a SKX
machine may not get accurate perf results. Currently, we forbid the
guest lbr enabling when the guest and host see different lbr stack
entries or the host and guest see different lbr stack msr indices.

Cc: Paolo Bonzini 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
Cc: Kan Liang 
Signed-off-by: Wei Wang 
---
 arch/x86/kvm/pmu.c   |   8 +++
 arch/x86/kvm/pmu.h   |   2 +
 arch/x86/kvm/vmx/pmu_intel.c | 136 +++
 arch/x86/kvm/x86.c   |   3 +-
 4 files changed, 147 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 46875bb..26fac6c 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -331,6 +331,14 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
return 0;
 }
 
+bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu)
+{
+   if (kvm_x86_ops->pmu_ops->lbr_enable)
+   return kvm_x86_ops->pmu_ops->lbr_enable(vcpu);
+
+   return false;
+}
+
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 {
if (lapic_in_kernel(vcpu))
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 58265f7..d9eec9a 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -29,6 +29,7 @@ struct kvm_pmu_ops {
  u64 *mask);
int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx);
bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
+   bool (*lbr_enable)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void (*refresh)(struct kvm_vcpu *vcpu);
@@ -107,6 +108,7 @@ void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
 void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
 void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
 
+bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu);
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 4dea0e0..6294a86 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <asm/intel-family.h>
 #include "x86.h"
 #include "cpuid.h"
 #include "lapic.h"
@@ -162,6 +163,140 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
return ret;
 }
 
+static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
+{
+   struct kvm *kvm = vcpu->kvm;
+   u8 vcpu_model = guest_cpuid_model(vcpu);
+   unsigned int vcpu_lbr_from, vcpu_lbr_nr;
+
+   if (x86_perf_get_lbr_stack(&kvm->arch.lbr_stack))
+   return false;
+
+   if (guest_cpuid_family(vcpu) != boot_cpu_data.x86)
+   return false;
+
+   /*
+* It could be possible that people have vcpus of old model run on
+* physical cpus of newer model, for example a BDW guest on a SKX
+* machine (but not possible to be the other way around).
+* The BDW guest may not get accurate results on a SKX machine as it
+* only reads 16 entries of the lbr stack while there are 32 entries
+* of recordings. We currently forbid the lbr enabling when the vcpu
+* and physical cpu see different lbr stack entries or the guest lbr
+* msr indices are not compatible with the host.
+*/
+   switch (vcpu_model) {
+   case INTEL_FAM6_CORE2_MEROM:
+   case INTEL_FAM6_CORE2_MEROM_L:
+   case INTEL_FAM6_CORE2_PENRYN:
+   case INTEL_FAM6_CORE2_DUNNINGTON:
+   /* intel_pmu_lbr_init_core() */
+   vcpu_lbr_nr = 4;
+   vcpu_lbr_from = MSR_LBR_CORE_FROM;
+   break;
+   case INTEL_FAM6_NEHALEM:
+   case INTEL_FAM6_NEHALEM_EP:
+   case INTEL_FAM6_NEHALEM_EX:
+   /* intel_pmu_lbr_init_nhm() */
+   vcpu_lbr_nr = 16;
+   vcpu_lbr_from = MSR_LBR_NHM_FROM;
+   break;
+   case INTEL_FAM6_ATOM_BONNELL:
+   case INTEL_FAM6_ATOM_BONNELL_MID:
+   case INTEL_FAM6_ATOM_SALTWELL:
+   case INTEL_FAM6_ATOM_SALTWELL_MID:
+   case INTEL_FAM6_ATOM_SALTWELL_TABLET:
+   /* intel_pmu_lbr_init_atom() */
+   vcpu_lbr_nr = 8;
+   vcpu_lbr_from = MSR_LBR_CORE_FROM;
+   break;
+   case INTEL_FAM6_ATOM_SILVERMONT:
+   case INTEL_FAM6_ATOM_SILVERMONT_X:
+   case INTEL_FAM6_ATOM_SILVERMONT_MID:
+   case INTEL_FAM6_ATOM_AIRMONT:
+   case INTEL_FAM6_ATOM_AIRMONT_MID:
+   /* intel_pmu_lbr_init_slm() */
+   vcpu_lbr_nr = 8;
+   vcpu_lbr_from = MSR_LBR_CORE_FROM;
+   break;
+  

[PATCH v8 05/14] KVM/x86/vPMU: tweak kvm_pmu_get_msr

2019-08-06 Thread Wei Wang
Change kvm_pmu_get_msr to get the msr_data struct, as the host_initiated
field from the struct could be used by get_msr. This also makes this API
consistent with kvm_pmu_set_msr.

Cc: Paolo Bonzini 
Cc: Andi Kleen 
Signed-off-by: Wei Wang 
---
 arch/x86/kvm/pmu.c   |  4 ++--
 arch/x86/kvm/pmu.h   |  4 ++--
 arch/x86/kvm/pmu_amd.c   |  7 ---
 arch/x86/kvm/vmx/pmu_intel.c | 19 +++
 arch/x86/kvm/x86.c   |  4 ++--
 5 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 26fac6c..1a291ed 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -350,9 +350,9 @@ bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
return kvm_x86_ops->pmu_ops->is_valid_msr(vcpu, msr);
 }
 
-int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
-   return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr, data);
+   return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr_info);
 }
 
 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index d9eec9a..f61024e 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -30,7 +30,7 @@ struct kvm_pmu_ops {
int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx);
bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
bool (*lbr_enable)(struct kvm_vcpu *vcpu);
-   int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+   int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void (*refresh)(struct kvm_vcpu *vcpu);
void (*init)(struct kvm_vcpu *vcpu);
@@ -114,7 +114,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
 int kvm_pmu_is_valid_msr_idx(struct kvm_vcpu *vcpu, unsigned idx);
 bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
-int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
 void kvm_pmu_reset(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c
index c838838..4a64a3f 100644
--- a/arch/x86/kvm/pmu_amd.c
+++ b/arch/x86/kvm/pmu_amd.c
@@ -208,21 +208,22 @@ static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
return ret;
 }
 
-static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;
+   u32 msr = msr_info->index;
 
/* MSR_PERFCTRn */
pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
if (pmc) {
-   *data = pmc_read_counter(pmc);
+   msr_info->data = pmc_read_counter(pmc);
return 0;
}
/* MSR_EVNTSELn */
pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL);
if (pmc) {
-   *data = pmc->eventsel;
+   msr_info->data = pmc->eventsel;
return 0;
}
 
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 6294a86..53bb95e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -297,35 +297,38 @@ static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
return true;
 }
 
-static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;
+   u32 msr = msr_info->index;
 
switch (msr) {
case MSR_CORE_PERF_FIXED_CTR_CTRL:
-   *data = pmu->fixed_ctr_ctrl;
+   msr_info->data = pmu->fixed_ctr_ctrl;
return 0;
case MSR_CORE_PERF_GLOBAL_STATUS:
-   *data = pmu->global_status;
+   msr_info->data = pmu->global_status;
return 0;
case MSR_CORE_PERF_GLOBAL_CTRL:
-   *data = pmu->global_ctrl;
+   msr_info->data = pmu->global_ctrl;
return 0;
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
-   *data = pmu->global_ovf_ctrl;
+   msr_info->data = pmu->global_ovf_ctrl;
return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
u64 val = pmc_read_counter(pmc);
-   *data = val & pmu->counter_bitmask[KVM_PMC_GP];
+   msr_info->data =
+   val & pmu->count

[PATCH v8 01/14] perf/x86: fix the variable type of the lbr msrs

2019-08-06 Thread Wei Wang
The msr variable type can be "unsigned int", which uses less memory than
the longer unsigned long. The lbr nr won't be a negative number, so make
it "unsigned int" as well.

Cc: Peter Zijlstra 
Cc: Andi Kleen 
Suggested-by: Peter Zijlstra 
Signed-off-by: Wei Wang 
---
 arch/x86/events/perf_event.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 8751008..27e4d32 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -660,8 +660,8 @@ struct x86_pmu {
/*
 * Intel LBR
 */
-   unsigned long   lbr_tos, lbr_from, lbr_to; /* MSR base regs   */
-   int lbr_nr;/* hardware stack size */
+   unsigned intlbr_tos, lbr_from, lbr_to,
+   lbr_nr;/* lbr stack and size */
u64 lbr_sel_mask;  /* LBR_SELECT valid bits */
const int   *lbr_sel_map;  /* lbr_select mappings */
boollbr_double_abort;  /* duplicated lbr aborts */
-- 
2.7.4



[PATCH v8 03/14] KVM/x86: KVM_CAP_X86_GUEST_LBR

2019-08-06 Thread Wei Wang
Introduce KVM_CAP_X86_GUEST_LBR to allow per-VM enabling of the guest
lbr feature.

Signed-off-by: Wei Wang 
---
 Documentation/virt/kvm/api.txt  | 26 ++
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/x86.c  | 16 
 include/uapi/linux/kvm.h|  1 +
 4 files changed, 45 insertions(+)

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 2d06776..64632a8 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -5046,6 +5046,32 @@ it hard or impossible to use it correctly.  The availability of
 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 signals that those bugs are fixed.
 Userspace should not try to use KVM_CAP_MANUAL_DIRTY_LOG_PROTECT.
 
+7.19 KVM_CAP_X86_GUEST_LBR
+Architectures: x86
+Parameters: args[0] whether feature should be enabled or not
+args[1] pointer to the userspace memory to load the lbr stack info
+
+The lbr stack info is described by
+struct x86_perf_lbr_stack {
+   unsigned intnr;
+   unsigned inttos;
+   unsigned intfrom;
+   unsigned intto;
+   unsigned intinfo;
+};
+
+@nr: number of lbr stack entries
+@tos: index of the top of stack msr
+@from: index of the msr that stores a branch source address
+@to: index of the msr that stores a branch destination address
+@info: index of the msr that stores lbr related flags
+
+Enabling this capability allows guest accesses to the lbr feature. Otherwise,
+#GP will be injected to the guest when it accesses to the lbr related msrs.
+
+After the feature is enabled, before exiting to userspace, kvm handlers should
+fill the lbr stack info into the userspace memory pointed by args[1].
+
 8. Other capabilities.
 --
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7b0a4ee..d29 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -875,6 +875,7 @@ struct kvm_arch {
atomic_t vapics_in_nmi_mode;
struct mutex apic_map_lock;
struct kvm_apic_map *apic_map;
+   struct x86_perf_lbr_stack lbr_stack;
 
bool apic_access_page_done;
 
@@ -884,6 +885,7 @@ struct kvm_arch {
bool hlt_in_guest;
bool pause_in_guest;
bool cstate_in_guest;
+   bool lbr_in_guest;
 
unsigned long irq_sources_bitmap;
s64 kvmclock_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c6d951c..e1eb1be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3129,6 +3129,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_EXCEPTION_PAYLOAD:
r = 1;
break;
+   case KVM_CAP_X86_GUEST_LBR:
+   r = sizeof(struct x86_perf_lbr_stack);
+   break;
case KVM_CAP_SYNC_REGS:
r = KVM_SYNC_X86_VALID_FIELDS;
break;
@@ -4670,6 +4673,19 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
kvm->arch.exception_payload_enabled = cap->args[0];
r = 0;
break;
+   case KVM_CAP_X86_GUEST_LBR:
+   r = -EINVAL;
+   if (cap->args[0] &&
+   x86_perf_get_lbr_stack(&kvm->arch.lbr_stack))
+   break;
+
+   if (copy_to_user((void __user *)cap->args[1],
+&kvm->arch.lbr_stack,
+sizeof(struct x86_perf_lbr_stack)))
+   break;
+   kvm->arch.lbr_in_guest = cap->args[0];
+   r = 0;
+   break;
default:
r = -EINVAL;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5e3f12d..dd53edc 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -996,6 +996,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_PTRAUTH_ADDRESS 171
 #define KVM_CAP_ARM_PTRAUTH_GENERIC 172
 #define KVM_CAP_PMU_EVENT_FILTER 173
+#define KVM_CAP_X86_GUEST_LBR 174
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.7.4



[PATCH v8 02/14] perf/x86: add a function to get the addresses of the lbr stack msrs

2019-08-06 Thread Wei Wang
The lbr stack msrs are model specific. The perf subsystem has already
assigned the abstracted msr address values based on the cpu model.
So add a function to enable callers outside the perf subsystem to get the
lbr stack addresses. This is useful for hypervisors to emulate the lbr
feature for the guest.

Cc: Paolo Bonzini 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
Signed-off-by: Wei Wang 
---
 arch/x86/events/intel/lbr.c   | 23 +++
 arch/x86/include/asm/perf_event.h | 14 ++
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 6f814a2..9b2d05c 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1311,3 +1311,26 @@ void intel_pmu_lbr_init_knl(void)
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP)
x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;
 }
+
+/**
+ * x86_perf_get_lbr_stack - get the lbr stack related msrs
+ *
+ * @stack: the caller's memory to get the lbr stack
+ *
+ * Returns: 0 indicates that the lbr stack has been successfully obtained.
+ */
+int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack)
+{
+   stack->nr = x86_pmu.lbr_nr;
+   stack->tos = x86_pmu.lbr_tos;
+   stack->from = x86_pmu.lbr_from;
+   stack->to = x86_pmu.lbr_to;
+
+   if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+   stack->info = MSR_LBR_INFO_0;
+   else
+   stack->info = 0;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(x86_perf_get_lbr_stack);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 1392d5e..2606100 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -318,7 +318,16 @@ struct perf_guest_switch_msr {
u64 host, guest;
 };
 
+struct x86_perf_lbr_stack {
+   unsigned intnr;
+   unsigned inttos;
+   unsigned intfrom;
+   unsigned intto;
+   unsigned intinfo;
+};
+
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 extern void perf_check_microcode(void);
 extern int x86_perf_rdpmc_index(struct perf_event *event);
@@ -329,6 +338,11 @@ static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
return NULL;
 }
 
+static inline int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack)
+{
+   return -1;
+}
+
 static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 {
memset(cap, 0, sizeof(*cap));
-- 
2.7.4



[PATCH v8 00/14] Guest LBR Enabling

2019-08-06 Thread Wei Wang
   check.
  - Patch 7:
-- support perf scheduler to not assign a counter for the perf event
   that has PERF_EV_CAP_NO_COUNTER set (rather than skipping the perf
   scheduler). This allows the scheduler to detect lbr usage conflicts
   via get_event_constraints, and lower priority events will finally
   fail to use lbr.
-- define X86_PMC_IDX_NA as "-1", which represents a never assigned
   counter id. There are other places that use "-1", but could be
   updated to use the new macro in another patch series.
  - Patch 8:
-- move the event->owner assignment into perf_event_alloc to have it
   set before event_init is called. Please see this patch's commit for
   reasons.
  - Patch 9:
-- use "exclude_host" and "is_kernel_event" to decide if the lbr event
   is used for the vcpu lbr emulation, which doesn't need a counter,
   and removes the usage of the previous new perf_event_create API.
-- remove the unused attr fields.
  - Patch 10:
-- set a hardware reserved bit (bit 62 of LBR_SELECT) to reg->config
   for the vcpu lbr emulation event. This makes the config different
   from other host lbr event, so that they don't share the lbr.
   Please see the comments in the patch for the reasons why they
   shouldn't share.
  - Patch 12:
-- disable interrupt and check if the vcpu lbr event owns the lbr
   feature before kvm writing to the lbr related msr. This avoids kvm
   updating the lbr msrs after lbr has been reclaimed by other events
   via ipi.
-- remove arch v4 related support.
  - Patch 13:
-- double check if the vcpu lbr event owns the lbr feature before
   vm-entry into the guest. The lbr pass-through will be cancelled if
   lbr feature has been reclaimed by a cpu pinned lbr event.

Previous:
https://lkml.kernel.org/r/1562548999-37095-1-git-send-email-wei.w.w...@intel.com

Wei Wang (14):
  perf/x86: fix the variable type of the lbr msrs
  perf/x86: add a function to get the addresses of the lbr stack msrs
  KVM/x86: KVM_CAP_X86_GUEST_LBR
  KVM/x86: intel_pmu_lbr_enable
  KVM/x86/vPMU: tweak kvm_pmu_get_msr
  KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest
  perf/x86: support to create a perf event without counter allocation
  perf/core: set the event->owner before event_init
  KVM/x86/vPMU: APIs to create/free lbr perf event for a vcpu thread
  perf/x86/lbr: don't share lbr for the vcpu usage case
  perf/x86: save/restore LBR_SELECT on vcpu switching
  KVM/x86/lbr: lbr emulation
  KVM/x86/vPMU: check the lbr feature before entering guest
  KVM/x86: remove the common handling of the debugctl msr

 Documentation/virt/kvm/api.txt|  26 +++
 arch/x86/events/core.c|  36 ++-
 arch/x86/events/intel/core.c  |   3 +
 arch/x86/events/intel/lbr.c   |  95 +++-
 arch/x86/events/perf_event.h  |   6 +-
 arch/x86/include/asm/kvm_host.h   |   5 +
 arch/x86/include/asm/perf_event.h |  17 ++
 arch/x86/kvm/cpuid.c  |   2 +-
 arch/x86/kvm/pmu.c|  24 +-
 arch/x86/kvm/pmu.h|  11 +-
 arch/x86/kvm/pmu_amd.c|   7 +-
 arch/x86/kvm/vmx/pmu_intel.c  | 476 +-
 arch/x86/kvm/vmx/vmx.c|   4 +-
 arch/x86/kvm/vmx/vmx.h|   2 +
 arch/x86/kvm/x86.c|  47 ++--
 include/linux/perf_event.h|  18 ++
 include/uapi/linux/kvm.h  |   1 +
 kernel/events/core.c  |  19 +-
 18 files changed, 738 insertions(+), 61 deletions(-)

-- 
2.7.4



Re: [PATCH] KVM: x86: Add fixed counters to PMU filter

2019-07-19 Thread Wei Wang

On 07/19/2019 02:38 AM, Eric Hankland wrote:

From: ehankland 

Updates KVM_CAP_PMU_EVENT_FILTER so it can also whitelist or blacklist
fixed counters.

Signed-off-by: ehankland 
---
  Documentation/virtual/kvm/api.txt | 13 -
  arch/x86/include/uapi/asm/kvm.h   |  9 ++---
  arch/x86/kvm/pmu.c| 30 +-
  3 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 2cd6250b2896..96bcf1aa1931 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4090,17 +4090,20 @@ Parameters: struct kvm_pmu_event_filter (in)
  Returns: 0 on success, -1 on error
  
  struct kvm_pmu_event_filter {

-   __u32 action;
-   __u32 nevents;
-   __u64 events[0];
+   __u32 action;
+   __u32 nevents;
+   __u32 fixed_counter_bitmap;
+   __u32 flags;
+   __u32 pad[4];
+   __u64 events[0];
  };
  
  This ioctl restricts the set of PMU events that the guest can program.

  The argument holds a list of events which will be allowed or denied.
  The eventsel+umask of each event the guest attempts to program is compared
  against the events field to determine whether the guest should have access.
-This only affects general purpose counters; fixed purpose counters can
-be disabled by changing the perfmon CPUID leaf.
+The events field only controls general purpose counters; fixed purpose
+counters are controlled by the fixed_counter_bitmap.
  
  Valid values for 'action':

  #define KVM_PMU_EVENT_ALLOW 0
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index e901b0ab116f..503d3f42da16 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -435,9 +435,12 @@ struct kvm_nested_state {
  
  /* for KVM_CAP_PMU_EVENT_FILTER */

  struct kvm_pmu_event_filter {
-   __u32 action;
-   __u32 nevents;
-   __u64 events[0];
+   __u32 action;
+   __u32 nevents;
+   __u32 fixed_counter_bitmap;
+   __u32 flags;
+   __u32 pad[4];
+   __u64 events[0];
  };
  
  #define KVM_PMU_EVENT_ALLOW 0

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index aa5a2597305a..ae5cd1b02086 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -19,8 +19,8 @@
  #include "lapic.h"
  #include "pmu.h"
  
-/* This keeps the total size of the filter under 4k. */

-#define KVM_PMU_EVENT_FILTER_MAX_EVENTS 63
+/* This is enough to filter the vast majority of currently defined events. */
+#define KVM_PMU_EVENT_FILTER_MAX_EVENTS 300
  
  /* NOTE:

   * - Each perf counter is defined as "struct kvm_pmc";
@@ -206,12 +206,25 @@ void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int idx)
  {
unsigned en_field = ctrl & 0x3;
bool pmi = ctrl & 0x8;
+   struct kvm_pmu_event_filter *filter;
+   struct kvm *kvm = pmc->vcpu->kvm;
+


unnecessary white space here, other part looks good to me.

Reviewed-by: Wei Wang 

Best,
Wei


[PATCH v2] mm/balloon_compaction: avoid duplicate page removal

2019-07-18 Thread Wei Wang
Fixes: 418a3ab1e778 (mm/balloon_compaction: List interfaces)

A #GP is reported in the guest when requesting balloon inflation via
virtio-balloon. The reason is that the virtio-balloon driver has
removed the page from its internal page list (via balloon_page_pop),
but balloon_page_enqueue_one also calls "list_del" to do the removal.
This is necessary when it's used from balloon_page_enqueue_list, but
not from balloon_page_enqueue_one.

So remove the list_del in balloon_page_enqueue_one, and update some
comments as a reminder.

Signed-off-by: Wei Wang 
---
Changelog:
v1->v2: updated some comments

 mm/balloon_compaction.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 83a7b61..8639bfc 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -21,7 +21,6 @@ static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
 * memory corruption is possible and we should stop execution.
 */
BUG_ON(!trylock_page(page));
-   list_del(&page->lru);
balloon_page_insert(b_dev_info, page);
unlock_page(page);
__count_vm_event(BALLOON_INFLATE);
@@ -33,7 +32,7 @@ static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
  * @b_dev_info: balloon device descriptor where we will insert a new page to
  * @pages: pages to enqueue - allocated using balloon_page_alloc.
  *
- * Driver must call it to properly enqueue a balloon pages before definitively
+ * Driver must call it to properly enqueue balloon pages before definitively
  * removing it from the guest system.
  *
  * Return: number of pages that were enqueued.
@@ -47,6 +46,7 @@ size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
 
spin_lock_irqsave(&b_dev_info->pages_lock, flags);
list_for_each_entry_safe(page, tmp, pages, lru) {
+   list_del(&page->lru);
balloon_page_enqueue_one(b_dev_info, page);
n_pages++;
}
@@ -128,13 +128,19 @@ struct page *balloon_page_alloc(void)
 EXPORT_SYMBOL_GPL(balloon_page_alloc);
 
 /*
- * balloon_page_enqueue - allocates a new page and inserts it into the balloon
- *   page list.
+ * balloon_page_enqueue - inserts a new page into the balloon page list.
+ *
  * @b_dev_info: balloon device descriptor where we will insert a new page to
  * @page: new page to enqueue - allocated using balloon_page_alloc.
  *
  * Driver must call it to properly enqueue a new allocated balloon page
  * before definitively removing it from the guest system.
+ *
+ * Drivers must not call balloon_page_enqueue on pages that have been
+ * pushed to a list with balloon_page_push before removing them with
+ * balloon_page_pop. To all pages on a list, use balloon_page_list_enqueue
+ * instead.
+ *
  * This function returns the page address for the recently enqueued page or
  * NULL in the case we fail to allocate a new page this turn.
  */
-- 
2.7.4



Re: use of shrinker in virtio balloon free page hinting

2019-07-18 Thread Wei Wang

On 07/18/2019 02:47 PM, Michael S. Tsirkin wrote:

On Thu, Jul 18, 2019 at 02:30:01PM +0800, Wei Wang wrote:

On 07/18/2019 01:58 PM, Michael S. Tsirkin wrote:

what if it does not fail?



Shrinker is called on system memory pressure. On memory pressure
get_free_page_and_send will fail memory allocation, so it stops allocating
more.

Memory pressure could be triggered by an unrelated allocation
e.g. from another driver.

As memory pressure is system-wide (no matter who triggers it), free page
hinting will fail on memory pressure, same as other drivers.

That would be good.  Except instead of failing it can hit a race
condition where it will reallocate memory freed by shrinker. Not good.


OK, I could see this: when another module does an allocation, it triggers
kswapd to have the balloon's shrinker release some memory, which the balloon
could eat up again quickly before that module takes it, and in theory this
could happen repeatedly.

So add a vb->stop_free_page_report boolean, set it in shrinker_count,
and clear it in virtio_balloon_queue_free_page_work?

Best,
Wei


Re: [PATCH v1] mm/balloon_compaction: avoid duplicate page removal

2019-07-18 Thread Wei Wang

On 07/18/2019 12:31 PM, Michael S. Tsirkin wrote:

On Thu, Jul 18, 2019 at 10:23:30AM +0800, Wei Wang wrote:

Fixes: 418a3ab1e778 (mm/balloon_compaction: List interfaces)

A #GP is reported in the guest when requesting balloon inflation via
virtio-balloon. The reason is that the virtio-balloon driver has
removed the page from its internal page list (via balloon_page_pop),
but balloon_page_enqueue_one also calls "list_del" to do the removal.

I would add here "this is necessary when it's used from
balloon_page_enqueue_list but not when it's called
from balloon_page_enqueue".


So remove the list_del in balloon_page_enqueue_one, and have the callers
do the page removal from their own page lists.

Signed-off-by: Wei Wang 

Patch is good but comments need some work.


---
  mm/balloon_compaction.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 83a7b61..1a5ddc4 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -11,6 +11,7 @@
  #include 
  #include 
  
+/* Callers ensure that @page has been removed from its original list. */

This comment does not make sense. E.g. balloon_page_enqueue
does nothing to ensure this. And drivers are not supposed
to care how the page lists are managed. Pls drop.

Instead please add the following to balloon_page_enqueue:


Note: drivers must not call balloon_page_list_enqueue on


Probably, you meant balloon_page_enqueue here.

The description for balloon_page_enqueue also seems incorrect:
"allocates a new page and inserts it into the balloon page list."
This function doesn't do any allocation itself.
Plan to reword it: "inserts a new page into the balloon page list."

Best,
Wei


Re: use of shrinker in virtio balloon free page hinting

2019-07-18 Thread Wei Wang

On 07/18/2019 01:58 PM, Michael S. Tsirkin wrote:


what if it does not fail?



Shrinker is called on system memory pressure. On memory pressure
get_free_page_and_send will fail memory allocation, so it stops allocating
more.

Memory pressure could be triggered by an unrelated allocation
e.g. from another driver.


As memory pressure is system-wide (no matter who triggers it), free page
hinting will fail on memory pressure, same as other drivers.

As long as the page allocation succeeds, we could just think the system is
not in the memory pressure situation, and things can go on normally.

Also, the VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG includes NORETRY and NOMEMALLOC,
which makes it easier than most other drivers to fail allocation first.

Best,
Wei


Re: use of shrinker in virtio balloon free page hinting

2019-07-17 Thread Wei Wang

On 07/18/2019 12:13 PM, Michael S. Tsirkin wrote:


It makes sense for pages in the balloon (requested by hypervisor).
However free page hinting can freeze up lots of memory for its own
internal reasons. It does not make sense to ask hypervisor
to set flags in order to fix internal guest issues.


Sounds reasonable to me. Probably we could move the flag check to
shrinker_count and shrinker_scan as a reclaiming condition for
ballooning pages only?




Right. But that does not include the pages in the hint vq,
which could be a significant amount of memory.


I think it includes, as vb->num_free_page_blocks records the total number
of free page blocks that balloon has taken from mm.

For shrink_free_pages, it calls return_free_pages_to_mm, which pops pages
from vb->free_page_list (this is the list where pages get enlisted after
they are put to the hint vq, see get_free_page_and_send).






- if free pages are being reported, pages freed
   by shrinker will just get re-allocated again

fill_balloon will re-try the allocation after sleeping 200ms once allocation 
fails.

Even if ballon was never inflated, if shrinker frees some memory while
we are hinting, hint vq will keep going and allocate it back without
sleeping.


Still see get_free_page_and_send. -EINTR is returned when page allocation
fails, and reporting ends then.

Shrinker is called on system memory pressure. On memory pressure
get_free_page_and_send will fail memory allocation, so it stops 
allocating more.



Best,
Wei


[PATCH v1] mm/balloon_compaction: avoid duplicate page removal

2019-07-17 Thread Wei Wang
Fixes: 418a3ab1e778 (mm/balloon_compaction: List interfaces)

A #GP is reported in the guest when requesting balloon inflation via
virtio-balloon. The reason is that the virtio-balloon driver has
removed the page from its internal page list (via balloon_page_pop),
but balloon_page_enqueue_one also calls "list_del" to do the removal.
So remove the list_del in balloon_page_enqueue_one, and have the callers
do the page removal from their own page lists.

Signed-off-by: Wei Wang 
---
 mm/balloon_compaction.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 83a7b61..1a5ddc4 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 
+/* Callers ensure that @page has been removed from its original list. */
 static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
 struct page *page)
 {
@@ -21,7 +22,6 @@ static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
 * memory corruption is possible and we should stop execution.
 */
BUG_ON(!trylock_page(page));
-   list_del(&page->lru);
balloon_page_insert(b_dev_info, page);
unlock_page(page);
__count_vm_event(BALLOON_INFLATE);
@@ -47,6 +47,7 @@ size_t balloon_page_list_enqueue(struct balloon_dev_info 
*b_dev_info,
 
spin_lock_irqsave(&b_dev_info->pages_lock, flags);
list_for_each_entry_safe(page, tmp, pages, lru) {
+   list_del(&page->lru);
balloon_page_enqueue_one(b_dev_info, page);
n_pages++;
}
-- 
2.7.4



Re: [PATCH v2] KVM: x86: PMU Event Filter

2019-07-16 Thread Wei Wang

On 07/16/2019 08:10 AM, Eric Hankland wrote:

I think just disabling guest cpuid might not be enough, since guest
could write to the msr without checking the cpuid.

Why not just add a bitmap for fixed counter?
e.g. fixed_counter_reject_bitmap

At the beginning of reprogram_fixed_counter, we could add the check:

if (test_bit(idx, &kvm->arch.fixed_counter_reject_bitmap))
  return -EACCES;

(Please test with your old guest and see if they have issues if we
inject #GP when they try to set the fixed_ctrl msr. If there is, we could
drop -EACCES above)

The bitmap could be set at kvm_vm_ioctl_set_pmu_event_filter.

intel_pmu_refresh() checks the guest cpuid and sets the number of
fixed counters according to that:
pmu->nr_arch_fixed_counters = min_t(int, edx.split.num_counters_fixed,
INTEL_PMC_MAX_FIXED);

and reprogram_fixed_counters()/get_fixed_pmc() respect this so the
guest can't just ignore the cpuid.


Yes, but as you noticed, we couldn't disable fixed counter 2 while keeping
counter 3 running via that.



Adding a bitmap does let you do things like disable the first counter
but keep the second and third, but given that there are only three and
the events are likely to be on a whitelist anyway, it seemed like
adding the bitmap wasn't worth it. If you still feel the same way even
though we can disable them via the cpuid, I can add this in.


We need the design to be architecturally clean. For example, if the
hardware later comes up with fixed counters 5 and 6, where 5 is something
we really want to expose to the guest while 6 isn't, can our design here
(further) support that?

We don't want to re-design this at that time. However, extending what we
have would be acceptable. So, if you hesitate to add the bitmap method that
I described, please add GP tags to the ACTIONs defined, e.g.

enum kvm_pmu_action_type
{
  KVM_PMU_EVENT_ACTION_GP_NONE = 0,
  KVM_PMU_EVENT_ACTION_GP_ACCEPT = 1,
  KVM_PMU_EVENT_ACTION_GP_REJECT = 2,
  KVM_PMU_EVENT_ACTION_MAX
};

and add comments to explain something like below:

Those GP actions are for the filtering of guest events running on the
virtual general purpose counters. The actions to filter guest events
running on the virtual fixed function counters are not added currently, as
they all seem fine to be used by the guest so far, but that could be
supported on demand in the future via adding new actions.



I think it would be better to add more, please see below:

enum kvm_pmu_action_type {
  KVM_PMU_EVENT_ACTION_NONE = 0,
  KVM_PMU_EVENT_ACTION_ACCEPT = 1,
  KVM_PMU_EVENT_ACTION_REJECT = 2,
  KVM_PMU_EVENT_ACTION_MAX
};

and do a check in kvm_vm_ioctl_set_pmu_event_filter()
  if (filter->action >= KVM_PMU_EVENT_ACTION_MAX)
  return -EINVAL;

This is for detecting the case that we add a new action in
userspace, while the kvm hasn't been updated to support that.

KVM_PMU_EVENT_ACTION_NONE is for userspace to remove
the filter after they set it.

We can achieve the same result by using a reject action with an empty
set of events - is there some advantage to "none" over that? I can add
that check for valid actions.


Yes, we could also make it work via passing nevents=0.

I slightly prefer the use of the NONE action here. The advantage is a
simpler (shorter) userspace command. For QEMU, people could disable the
filter list via a qmp command, e.g. pmu-filter=none, instead of
pmu-filter=accept,nevents=0.
It also seems more straightforward when people read the usage manual - a
NONE command to cancel the filtering, instead of "first setting this, then
setting that.."
I don't see advantages of using nevents over NONE action.

Anyway, this one isn't a critical issue, and it's up to you here.





+#define KVM_PMU_EVENT_FILTER_MAX_EVENTS 63

Why is this limit needed?

Serves to keep the filters on the smaller side and ensures the size
calculation can't overflow if users attempt to. Keeping the filter
under 4k is nicer for allocation - also, if we want really large
filters we might want to do something smarter than a linear traversal
of the filter when guests program counters.


I think 63 is too small, and it looks like a random number being put here.
From the SDM table 19-3, it seems there are roughly 300 events.
Functionally, the design should support filtering most of them.
(Optimization with a smarter traversal is another story that could be done
later.)


Maybe

#define KVM_PMU_EVENT_FILTER_MAX_EVENTS \
	((PAGE_SIZE - sizeof(struct kvm_pmu_event_filter)) / sizeof(__u64))

?

and please add some comments above about the consideration behind this
number.




+   kvfree(filter);

Probably better to have it conditionally?

if (filter) {
  synchronize_srcu(&kvm->srcu);
  kvfree(filter);
}

You may want to factor it out, so that kvm_pmu_destroy could reuse.

Do you mean kvm_arch_destroy_vm? It looks like that's where kvm_arch
members are freed. I can do that.


Sounds good.
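Factored out, the helper could look roughly like this (a sketch; the function name is made up, and it assumes the filter is vmalloc'd and published under kvm->srcu as in the patch):

```c
/* Hypothetical helper: detach/free a filter once no reader can still
 * see it. Reusable from the set-filter ioctl and kvm_arch_destroy_vm. */
static void kvm_pmu_free_event_filter(struct kvm *kvm,
				      struct kvm_pmu_event_filter *filter)
{
	if (!filter)
		return;
	/* Wait until all srcu readers of the old filter have finished. */
	synchronize_srcu(&kvm->srcu);
	kvfree(filter);	/* the filter was allocated with vmalloc() */
}
```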

Best,
Wei




Re: [PATCH v2] KVM: x86: PMU Event Filter

2019-07-11 Thread Wei Wang

On 07/11/2019 09:25 AM, Eric Hankland wrote:

- Add a VM ioctl that can control which events the guest can monitor.

Signed-off-by: ehankland
---
Changes since v1:
-Moved to a vm ioctl rather than a vcpu one
-Changed from a whitelist to a configurable filter which can either be
white or black
-Only restrict GP counters since fixed counters require extra handling
and they can be disabled by setting the guest cpuid (though only by
setting the number - they can't be disabled individually)


I think just disabling guest cpuid might not be enough, since guest
could write to the msr without checking the cpuid.

Why not just add a bitmap for fixed counter?
e.g. fixed_counter_reject_bitmap

At the beginning of reprogram_fixed_counter, we could add the check:

if (test_bit(idx, &kvm->arch.fixed_counter_reject_bitmap))
return -EACCES;

(Please test with your old guest and see if they have issues if we inject
#GP when they try to set the fixed_ctrl msr. If there is, we could drop
-EACCES above)


The bitmap could be set at kvm_vm_ioctl_set_pmu_event_filter.
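As a rough sketch, the set path could look like this (the fixed_counter_reject_bitmap field and a fixed-counter list in the filter struct are assumptions for illustration; neither exists in the posted patch):

```c
/* Hypothetical: build the per-VM reject bitmap while installing the
 * event filter. All field and member names here are illustrative. */
static void kvm_pmu_set_fixed_reject_bitmap(struct kvm *kvm,
					    struct kvm_pmu_event_filter *filter)
{
	int i;

	kvm->arch.fixed_counter_reject_bitmap = 0;
	for (i = 0; i < filter->nfixed; i++)
		__set_bit(filter->fixed_idx[i],
			  (unsigned long *)&kvm->arch.fixed_counter_reject_bitmap);
}
```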


+/* for KVM_CAP_PMU_EVENT_FILTER */
+struct kvm_pmu_event_filter {
+   __u32 type;
+   __u32 nevents;
+   __u64 events[0];
+};
+
+#define KVM_PMU_EVENT_WHITELIST 0
+#define KVM_PMU_EVENT_BLACKLIST 1


I think it would be better to add more, please see below:

enum kvm_pmu_action_type {
KVM_PMU_EVENT_ACTION_NONE = 0,
KVM_PMU_EVENT_ACTION_ACCEPT = 1,
KVM_PMU_EVENT_ACTION_REJECT = 2,
KVM_PMU_EVENT_ACTION_MAX
};

and do a check in kvm_vm_ioctl_set_pmu_event_filter()
if (filter->action >= KVM_PMU_EVENT_ACTION_MAX)
return -EINVAL;

This is for detecting the case that we add a new action in
userspace, while the kvm hasn't been updated to support that.

KVM_PMU_EVENT_ACTION_NONE is for userspace to remove
the filter after they set it.



+
  #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index dd745b58ffd8..d674b79ff8da 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -22,6 +22,9 @@
  #include "lapic.h"
  #include "pmu.h"

+/* This keeps the total size of the filter under 4k. */
+#define KVM_PMU_EVENT_FILTER_MAX_EVENTS 63
+


Why is this limit needed?


  /* NOTE:
   * - Each perf counter is defined as "struct kvm_pmc";
   * - There are two types of perf counters: general purpose (gp) and fixed.
@@ -144,6 +147,10 @@ void reprogram_gp_counter(struct kvm_pmc *pmc,
u64 eventsel)
  {
 unsigned config, type = PERF_TYPE_RAW;
 u8 event_select, unit_mask;
+   struct kvm_arch *arch = &pmc->vcpu->kvm->arch;
+   struct kvm_pmu_event_filter *filter;
+   int i;
+   bool allow_event = true;

 if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
 printk_once("kvm pmu: pin control bit is ignored\n");
@@ -155,6 +162,24 @@ void reprogram_gp_counter(struct kvm_pmc *pmc,
u64 eventsel)
 if (!(eventsel & ARCH_PERFMON_EVENTSEL_ENABLE) || !pmc_is_enabled(pmc))
 return;

+   rcu_read_lock();
+   filter = rcu_dereference(arch->pmu_event_filter);
+   if (filter) {
+   for (i = 0; i < filter->nevents; i++)
+   if (filter->events[i] ==
+   (eventsel & AMD64_RAW_EVENT_MASK_NB))
+   break;
+   if (filter->type == KVM_PMU_EVENT_WHITELIST &&
+   i == filter->nevents)
+   allow_event = false;
+   if (filter->type == KVM_PMU_EVENT_BLACKLIST &&
+   i < filter->nevents)
+   allow_event = false;
+   }
+   rcu_read_unlock();
+   if (!allow_event)
+   return;
+


I think it looks tidier to wrap the changes above into a function:

if (kvm_pmu_filter_event(kvm, eventsel & AMD64_RAW_EVENT_MASK_NB))
return;


 event_select = eventsel & ARCH_PERFMON_EVENTSEL_EVENT;
 unit_mask = (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8;

@@ -351,3 +376,39 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
  {
 kvm_pmu_reset(vcpu);
  }
+
+int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
+{
+   struct kvm_pmu_event_filter tmp, *filter;
+   size_t size;
+   int r;
+
+   if (copy_from_user(&tmp, argp, sizeof(tmp)))
+   return -EFAULT;
+
+   if (tmp.nevents > KVM_PMU_EVENT_FILTER_MAX_EVENTS)
+   return -E2BIG;
+
+   size = sizeof(tmp) + sizeof(tmp.events[0]) * tmp.nevents;
+   filter = vmalloc(size);
+   if (!filter)
+   return -ENOMEM;
+
+   r = -EFAULT;
+   if (copy_from_user(filter, argp, size))


Though the above functions correctly, I would just move "r = -EFAULT" here
to have it executed conditionally.



+   goto cleanup;
+
+   /* Ensure nevents can't be changed between the user copies. */
+   *filter = tmp;
+
+   mutex_lock(>lock);
+   rcu_swap_protected(kvm->arch.pmu_event_filter, filter,
+  

Re: [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN

2019-07-10 Thread Wei Wang

On 07/09/2019 07:35 PM, Peter Zijlstra wrote:


Yeah; although I'm not sure if its an implementation or specification
problem. But as it exists it is of very limited use.

Fundamentally our events (with exception of event groups) are
independent. Events should always count, except when the PMI is running
-- so as to not include the measurement overhead in the measurement
itself. But this (mis)feature stops the entire PMU as soon as a single
counter overflows, inhibiting all other counters from running (as they
should) until the PMI has happened and reset the state.

(Note that, strictly speaking, we even expect the overflowing counter to
continue counting until the PMI happens. Having an overflow should not
mean we loose events. A sampling and !sampling event should produce the
same event count.)

So even when there's only a single event (group) scheduled, it isn't
strictly right. And when there's multiple events scheduled it is
definitely wrong.

And while I understand the purpose of the current semantics; it makes a
single event group sample count more coherent, the fact that it loses
events just bugs me something fierce -- and as shown, it breaks tools.


Thanks for sharing the finding.
If I understand this correctly, you observed the counter getting frozen
earlier than expected (it is expected to freeze at the time the PMI is
generated).

Have you talked to anyone for possible freeze adjustment from the hardware?

Best,
Wei



Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack

2019-07-10 Thread Wei Wang

On 07/09/2019 07:45 PM, Peter Zijlstra wrote:



+* config:Actually this field won't be used by the perf core
+*as this event doesn't have a perf counter.
+* sample_period: Same as above.

If it's unused; why do we need to set it at all?


OK, we'll remove the unused fields.




+* sample_type:   tells the perf core that it is an lbr event.
+* branch_sample_type: tells the perf core that the lbr event works in
+*the user callstack mode so that the lbr stack will be
+*saved/restored on vCPU switching.

Again; doesn't make sense. What does the user part have to do with
save/restore? What happens when this vcpu thread drops to userspace for
an assist?


This is a fake event which doesn't run the lbr on the host.
Returning to userspace doesn't need save/restore lbr, because userspace
wouldn't change those lbr msrs.

The event is created to save/restore lbr msrs on vcpu switching.
Host perf only does this save/restore for "user callstack mode" lbr
events, so we construct the event to be "user callstack mode" to reuse the
host lbr save/restore.


An alternative method is to have KVM vcpu sched_in/out callback
save/restore those msrs, which don't need to create this fake event.

Please let me know if you prefer the second method.

Best,
Wei


Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack

2019-07-10 Thread Wei Wang

On 07/09/2019 08:19 PM, Peter Zijlstra wrote:



For the lbr feature, could we thought of it as first come, first served?
For example, if we have 2 host threads who want to use lbr at the same time,
I think one of them would simply fail to use.

So if guest first gets the lbr, host wouldn't take over unless some
userspace command (we added to QEMU) is executed to have the vCPU
actively stop using lbr.

Doesn't work that way.

Say you start KVM with LBR emulation, it creates this task event, it
gets the LBR (nobody else wants it) and the guest works and starts using
the LBR.

Then the host creates a CPU LBR event and the vCPU suddenly gets denied
the LBR and the guest no longer functions correctly.

Or you should fail to VMENTER, in which case you starve the guest, but
at least it doesn't malfunction.



Ok, I think we could go with the first option for now, like this:

When there are other host threads starting to use lbr, host perf notifies
kvm to disable the guest use of lbr (add a notifier_block), which includes:
 - cancel passing through lbr msrs to the guest
 - upon guest access to the lbr msrs (which will trap to KVM now), return
   #GP (or 0).
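A rough shape of that notifier (everything here - the action value, the callback, and the KVM-side helpers - is hypothetical; no such perf notifier chain exists today):

```c
/* Hypothetical sketch: perf tells KVM that a host thread now wants the
 * LBR, so KVM stops passing the LBR MSRs through to the guest. */
static int kvm_lbr_host_user_notify(struct notifier_block *nb,
				    unsigned long action, void *data)
{
	struct kvm_vcpu *vcpu =
		container_of(nb, struct kvm_vcpu, lbr_host_user_nb);

	if (action == LBR_HOST_USER_ACTIVE) {
		/* Cancel MSR pass-through; subsequent guest accesses
		 * trap to KVM and return #GP (or 0). */
		kvm_vcpu_intercept_lbr_msrs(vcpu);
	}
	return NOTIFY_OK;
}
```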


Best,
Wei





Re: [PATCH v7 07/12] perf/x86: no counter allocation support

2019-07-09 Thread Wei Wang

On 07/09/2019 05:43 PM, Peter Zijlstra wrote:

That's almost a year ago; I really can't remember that and you didn't
put any of that in your Changelog to help me remember.

(also please use: https://lkml.kernel.org/r/$msgid style links)


OK, I'll put this link in the cover letter or commit log for a reminder.




In the previous version, we added a "no_counter" bit to perf_event_attr, and
that will be exposed to user ABI, which seems not good.
(https://lkml.org/lkml/2019/2/14/791)
So we wrap a new kernel API above to support this.

Do you have a different suggestion to do this?
(exclude host/guest just clears the enable bit when on VM-exit/entry,
still consumes the counter)

Just add an argument to perf_event_create_kernel_counter() ?


Yes. I didn't find a proper place to add this "no_counter" indicator, so I
added a wrapper to avoid changing existing callers of
perf_event_create_kernel_counter.


Best,
Wei


Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack

2019-07-09 Thread Wei Wang

On 07/09/2019 05:39 PM, Peter Zijlstra wrote:

On Tue, Jul 09, 2019 at 11:04:21AM +0800, Wei Wang wrote:

On 07/08/2019 10:48 PM, Peter Zijlstra wrote:

*WHY* does the host need to save/restore? Why not make VMENTER/VMEXIT do
this?

Because the VMX transition is much more frequent than the vCPU switching.
On SKL, saving 32 LBR entries could add 3000~4000 cycles overhead, this
would be too large for the frequent VMX transitions.

LBR state is saved when vCPU is scheduled out to ensure that this
vCPU's LBR data doesn't get lost (as another vCPU or host thread that
is scheduled in may use LBR)

But VMENTER/VMEXIT still have to enable/disable the LBR, right?
Otherwise the host will pollute LBR contents. And you then rely on this
'fake' event to ensure the host doesn't use LBR when the VCPU is
running.


Yes, only the debugctl msr is saved/restored on vmx transitions.




But what about the counter scheduling rules;


The counter is emulated independent of the lbr emulation.

Here is the background reason:

The direction we are going is the architectural emulation, where the
features are emulated based on the hardware behavior described in the spec.
So the lbr emulation path only offers the lbr feature to the guest (no
counters associated, as the lbr feature doesn't have a counter
essentially).

If the above isn't clear, please see this example: the guest could run any
software to use the lbr feature (non-perf or non-linux, or even a testing
kernel module to try lbr for its own purpose), and it could choose to use a
regular timer to do sampling. If the lbr emulation takes a counter to
generate a PMI to the guest to do sampling, that pmi isn't expected from
the guest perspective.

So the counter scheduling isn't considered by the lbr emulation here; it is
considered by the counter emulation. If the guest needs a counter, it
configures the related msr, which traps to KVM, and the counter emulation
has its own emulation path (e.g. reprogram_gp_counter, which is called when
the guest writes to the emulated eventsel msr).



what happens when a CPU
event claims the LBR before the task event can claim it? CPU events have
precedence over task events.


I think the precedence (cpu pinned and task pinned) is for the counter
multiplexing, right?

For the lbr feature, could we thought of it as first come, first served?
For example, if we have 2 host threads who want to use lbr at the same time,
I think one of them would simply fail to use.

So if the guest first gets the lbr, the host wouldn't take over unless some
userspace command (we added to QEMU) is executed to have the vCPU actively
stop using lbr.




I'm missing all these details in the Changelogs. Please describe the
whole setup and explain why this approach.


OK, just shared some important background above.
I'll see if any more important details missed.

Best,
Wei


Re: [PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN

2019-07-08 Thread Wei Wang

On 07/08/2019 11:09 PM, Peter Zijlstra wrote:

On Mon, Jul 08, 2019 at 09:23:19AM +0800, Wei Wang wrote:

This patch enables the LBR related features in Arch v4 in advance,
though the current vPMU only has v2 support. Other arch v4 related
support will be enabled later in another series.

Arch v4 supports streamlined Freeze_LBR_on_PMI. According to the SDM,
the LBR_FRZ bit is set to global status when debugctl.freeze_lbr_on_pmi
has been set and a PMI is generated. The CTR_FRZ bit is set when
debugctl.freeze_perfmon_on_pmi is set and a PMI is generated.

(that's still a misnomer; it is: freeze_perfmon_on_overflow)


OK. (but that was directly copied from the sdm 18.2.4.1)



Why?

Who uses that v4 crud?


I saw the native perf driver has been updated to v4.
After the vPMU gets updated to v4, the guest perf would use that.

If you prefer to hold on this patch until vPMU v4 support,
we could do that as well.



It's broken. It looses events between overflow
and PMI.


Do you mean it's a v4 hardware issue?

Best,
Wei


Re: [PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack

2019-07-08 Thread Wei Wang

On 07/08/2019 10:53 PM, Peter Zijlstra wrote:

On Mon, Jul 08, 2019 at 09:23:17AM +0800, Wei Wang wrote:

When the vCPU is scheduled in:
- if the lbr feature was used in the last vCPU time slice, set the lbr
   stack to be interceptible, so that the host can capture whether the
   lbr feature will be used in this time slice;
- if the lbr feature wasn't used in the last vCPU time slice, disable
   the vCPU support of the guest lbr switching.

Upon the first access to one of the lbr related MSRs (since the vCPU was
scheduled in):
- record that the guest has used the lbr;
- create a host perf event to help save/restore the guest lbr stack;
- pass the stack through to the guest.

I don't understand a word of that.

Who cares if the LBR MSRs are touched; the guest expects up-to-date
values when it does reads them.


Another host thread who shares the same pCPU with this vCPU thread may use
the lbr stack, so the host needs to save/restore the vCPU's lbr state.

Otherwise the guest perf inside the vCPU wouldn't read the correct data
from the lbr msr (as the msrs are changed by another host thread already).

As Andi also replied, if the vCPU isn't using lbr anymore, host doesn't need
to save the lbr msr then.

Best,
Wei


Re: [PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack

2019-07-08 Thread Wei Wang

On 07/08/2019 10:48 PM, Peter Zijlstra wrote:

On Mon, Jul 08, 2019 at 09:23:15AM +0800, Wei Wang wrote:

From: Like Xu 

This patch adds support to enable/disable the host side save/restore

This patch should be disqualified on Changelog alone...

   Documentation/process/submitting-patches.rst: instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy to do frotz"


OK, we will discard "This patch" in the description:

To enable/disable the host side save/restore for the guest lbr stack on
vCPU switching, the host creates a perf event for the vCPU..


for the guest lbr stack on vCPU switching. To enable that, the host
creates a perf event for the vCPU, and the event attributes are set
to the user callstack mode lbr so that all the conditions are meet in
the host perf subsystem to save the lbr stack on task switching.

The host side lbr perf event are created only for the purpose of saving
and restoring the lbr stack. There is no need to enable the lbr
functionality for this perf event, because the feature is essentially
used in the vCPU. So perf_event_create is invoked with need_counter=false
to get no counter assigned for the perf event.

The vcpu_lbr field is added to cpuc, to indicate if the lbr perf event is
used by the vCPU only for context switching. When the perf subsystem
handles this event (e.g. lbr enable or read lbr stack on PMI) and finds
it's non-zero, it simply returns.

*WHY* does the host need to save/restore? Why not make VMENTER/VMEXIT do
this?


Because the VMX transition is much more frequent than the vCPU switching.
On SKL, saving 32 LBR entries could add 3000~4000 cycles overhead, this
would be too large for the frequent VMX transitions.

LBR state is saved when the vCPU is scheduled out to ensure that this
vCPU's LBR data doesn't get lost (as another vCPU or host thread that is
scheduled in may use LBR).



Many of these patches don't explain why things are done; that's a
problem.


We'll improve; please point it out when you find anywhere that isn't clear
to you, thanks.


Best,
Wei


Re: [PATCH v7 07/12] perf/x86: no counter allocation support

2019-07-08 Thread Wei Wang

On 07/08/2019 10:29 PM, Peter Zijlstra wrote:

Thanks for the comments.




diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0ab99c7..19e6593 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -528,6 +528,7 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
   */
  #define PERF_EV_CAP_SOFTWARE  BIT(0)
  #define PERF_EV_CAP_READ_ACTIVE_PKG   BIT(1)
+#define PERF_EV_CAP_NO_COUNTER BIT(2)
  
  #define SWEVENT_HLIST_BITS		8

  #define SWEVENT_HLIST_SIZE(1 << SWEVENT_HLIST_BITS)
@@ -895,6 +896,13 @@ extern int perf_event_refresh(struct perf_event *event, 
int refresh);
  extern void perf_event_update_userpage(struct perf_event *event);
  extern int perf_event_release_kernel(struct perf_event *event);
  extern struct perf_event *
+perf_event_create(struct perf_event_attr *attr,
+ int cpu,
+ struct task_struct *task,
+ perf_overflow_handler_t overflow_handler,
+ void *context,
+ bool counter_assignment);
+extern struct perf_event *
  perf_event_create_kernel_counter(struct perf_event_attr *attr,
int cpu,
struct task_struct *task,

Why the heck are you creating this wrapper nonsense?


(please see the earlier discussions: https://lkml.org/lkml/2018/9/20/868)
I thought we agreed that the perf event created here doesn't need to
consume an extra counter.

In the previous version, we added a "no_counter" bit to perf_event_attr, and
that will be exposed to user ABI, which seems not good.
(https://lkml.org/lkml/2019/2/14/791)
So we wrap a new kernel API above to support this.

Do you have a different suggestion to do this?
(exclude host/guest just clears the enable bit when on VM-exit/entry,
still consumes the counter)

Best,
Wei


[PATCH v7 04/12] KVM/x86: intel_pmu_lbr_enable

2019-07-07 Thread Wei Wang
The lbr stack is architecturally specific, for example, SKX has 32 lbr
stack entries while HSW has 16 entries, so a HSW guest running on a SKX
machine may not get accurate perf results. Currently, we forbid the
guest lbr enabling when the guest and host see different lbr stack
entries or the host and guest see different lbr stack msr indices.

Signed-off-by: Wei Wang 
Cc: Paolo Bonzini 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
---
 arch/x86/kvm/pmu.c   |   8 +++
 arch/x86/kvm/pmu.h   |   2 +
 arch/x86/kvm/vmx/pmu_intel.c | 136 +++
 arch/x86/kvm/x86.c   |   3 +-
 4 files changed, 147 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 132d149..7d7ac18 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -296,6 +296,14 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 
*data)
return 0;
 }
 
+bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu)
+{
+   if (kvm_x86_ops->pmu_ops->lbr_enable)
+   return kvm_x86_ops->pmu_ops->lbr_enable(vcpu);
+
+   return false;
+}
+
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
 {
if (lapic_in_kernel(vcpu))
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 22dff66..c099b4b 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -29,6 +29,7 @@ struct kvm_pmu_ops {
  u64 *mask);
int (*is_valid_msr_idx)(struct kvm_vcpu *vcpu, unsigned idx);
bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
+   bool (*lbr_enable)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void (*refresh)(struct kvm_vcpu *vcpu);
@@ -107,6 +108,7 @@ void reprogram_gp_counter(struct kvm_pmc *pmc, u64 
eventsel);
 void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
 void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
 
+bool kvm_pmu_lbr_enable(struct kvm_vcpu *vcpu);
 void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
 void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
 int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 68d231d..ef8ebd4 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "x86.h"
 #include "cpuid.h"
 #include "lapic.h"
@@ -162,6 +163,140 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 
msr)
return ret;
 }
 
+static bool intel_pmu_lbr_enable(struct kvm_vcpu *vcpu)
+{
+   struct kvm *kvm = vcpu->kvm;
+   u8 vcpu_model = guest_cpuid_model(vcpu);
+   unsigned int vcpu_lbr_from, vcpu_lbr_nr;
+
+   if (x86_perf_get_lbr_stack(&kvm->arch.lbr_stack))
+   return false;
+
+   if (guest_cpuid_family(vcpu) != boot_cpu_data.x86)
+   return false;
+
+   /*
+* It could be possible that people have vcpus of old model run on
+* physical cpus of newer model, for example a BDW guest on a SKX
+* machine (but not possible to be the other way around).
+* The BDW guest may not get accurate results on a SKX machine as it
+* only reads 16 entries of the lbr stack while there are 32 entries
+* of recordings. We currently forbid the lbr enabling when the vcpu
+* and physical cpu see different lbr stack entries or the guest lbr
+* msr indices are not compatible with the host.
+*/
+   switch (vcpu_model) {
+   case INTEL_FAM6_CORE2_MEROM:
+   case INTEL_FAM6_CORE2_MEROM_L:
+   case INTEL_FAM6_CORE2_PENRYN:
+   case INTEL_FAM6_CORE2_DUNNINGTON:
+   /* intel_pmu_lbr_init_core() */
+   vcpu_lbr_nr = 4;
+   vcpu_lbr_from = MSR_LBR_CORE_FROM;
+   break;
+   case INTEL_FAM6_NEHALEM:
+   case INTEL_FAM6_NEHALEM_EP:
+   case INTEL_FAM6_NEHALEM_EX:
+   /* intel_pmu_lbr_init_nhm() */
+   vcpu_lbr_nr = 16;
+   vcpu_lbr_from = MSR_LBR_NHM_FROM;
+   break;
+   case INTEL_FAM6_ATOM_BONNELL:
+   case INTEL_FAM6_ATOM_BONNELL_MID:
+   case INTEL_FAM6_ATOM_SALTWELL:
+   case INTEL_FAM6_ATOM_SALTWELL_MID:
+   case INTEL_FAM6_ATOM_SALTWELL_TABLET:
+   /* intel_pmu_lbr_init_atom() */
+   vcpu_lbr_nr = 8;
+   vcpu_lbr_from = MSR_LBR_CORE_FROM;
+   break;
+   case INTEL_FAM6_ATOM_SILVERMONT:
+   case INTEL_FAM6_ATOM_SILVERMONT_X:
+   case INTEL_FAM6_ATOM_SILVERMONT_MID:
+   case INTEL_FAM6_ATOM_AIRMONT:
+   case INTEL_FAM6_ATOM_AIRMONT_MID:
+   /* intel_pmu_lbr_init_slm() */
+   vcpu_lbr_nr = 8;
+   vcpu_lbr_from = MSR_LBR_CORE_FROM;
+   break;
+  

[PATCH v7 12/12] KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN

2019-07-07 Thread Wei Wang
This patch enables the LBR related features in Arch v4 in advance,
though the current vPMU only has v2 support. Other arch v4 related
support will be enabled later in another series.

Arch v4 supports streamlined Freeze_LBR_on_PMI. According to the SDM,
the LBR_FRZ bit is set to global status when debugctl.freeze_lbr_on_pmi
has been set and a PMI is generated. The CTR_FRZ bit is set when
debugctl.freeze_perfmon_on_pmi is set and a PMI is generated.

Signed-off-by: Wei Wang 
Cc: Andi Kleen 
Cc: Paolo Bonzini 
Cc: Kan Liang 
---
 arch/x86/kvm/pmu.c   | 11 +--
 arch/x86/kvm/pmu.h   |  1 +
 arch/x86/kvm/vmx/pmu_intel.c | 20 
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 323bb45..89bff8f 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -52,6 +52,13 @@ static void kvm_pmi_trigger_fn(struct irq_work *irq_work)
kvm_pmu_deliver_pmi(vcpu);
 }
 
+static void kvm_perf_set_global_status(struct kvm_pmu *pmu, u8 idx)
+{
+   __set_bit(idx, (unsigned long *)&pmu->global_status);
+   if (kvm_x86_ops->pmu_ops->set_global_status)
+   kvm_x86_ops->pmu_ops->set_global_status(pmu, idx);
+}
+
 static void kvm_perf_overflow(struct perf_event *perf_event,
  struct perf_sample_data *data,
  struct pt_regs *regs)
@@ -61,7 +68,7 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
 
if (!test_and_set_bit(pmc->idx,
  (unsigned long *)&pmu->reprogram_pmi)) {
-   __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+   kvm_perf_set_global_status(pmu, pmc->idx);
kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
}
 }
@@ -75,7 +82,7 @@ static void kvm_perf_overflow_intr(struct perf_event 
*perf_event,
 
if (!test_and_set_bit(pmc->idx,
  (unsigned long *)&pmu->reprogram_pmi)) {
-   __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+   kvm_perf_set_global_status(pmu, pmc->idx);
kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
 
/*
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index cadf91a..408ddc2 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -24,6 +24,7 @@ struct kvm_pmu_ops {
u8 unit_mask);
unsigned (*find_fixed_event)(int idx);
bool (*pmc_is_enabled)(struct kvm_pmc *pmc);
+   void (*set_global_status)(struct kvm_pmu *pmu, u8 idx);
struct kvm_pmc *(*pmc_idx_to_pmc)(struct kvm_pmu *pmu, int pmc_idx);
struct kvm_pmc *(*msr_idx_to_pmc)(struct kvm_vcpu *vcpu, unsigned idx,
  u64 *mask);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index fd09777..6f74b69 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -413,6 +413,22 @@ static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu,
return ret;
 }
 
+static void intel_pmu_set_global_status(struct kvm_pmu *pmu, u8 idx)
+{
+   u64 guest_debugctl;
+
+   if (pmu->version >= 4) {
+   guest_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+
+   if (guest_debugctl & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI)
+   __set_bit(GLOBAL_STATUS_LBRS_FROZEN,
+ (unsigned long *)&pmu->global_status);
+   if (guest_debugctl & DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI)
+   __set_bit(GLOBAL_STATUS_COUNTERS_FROZEN,
+ (unsigned long *)&pmu->global_status);
+   }
+}
+
 static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -597,6 +613,9 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->global_ovf_ctrl_mask = pmu->global_ctrl_mask
& ~(MSR_CORE_PERF_GLOBAL_OVF_CTRL_OVF_BUF |
MSR_CORE_PERF_GLOBAL_OVF_CTRL_COND_CHGD);
+   if (pmu->version >= 4)
+   pmu->global_ovf_ctrl_mask &= ~(GLOBAL_STATUS_LBRS_FROZEN |
+  GLOBAL_STATUS_COUNTERS_FROZEN);
if (kvm_x86_ops->pt_supported())
pmu->global_ovf_ctrl_mask &=
~MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI;
@@ -711,6 +730,7 @@ void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu)
 struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
+   .set_global_status = intel_pmu_set_global_status,
.pmc_is_enabled = intel_pmc_is_enabled,
.pmc_idx_to_pmc = intel_pmc_idx_to_pmc,
.msr_idx_to_pmc = intel_msr_idx_to_pmc,
-- 
2.7.4



[PATCH v7 02/12] perf/x86: add a function to get the lbr stack

2019-07-07 Thread Wei Wang
The LBR stack MSRs are architecturally specific. The perf subsystem has
already assigned the abstracted MSR values based on the CPU architecture.

This patch enables a caller outside the perf subsystem to get the LBR
stack info. This is useful for hypervisors to prepare the lbr feature
for the guest.

Signed-off-by: Wei Wang 
Cc: Paolo Bonzini 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
---
 arch/x86/events/intel/lbr.c   | 23 +++
 arch/x86/include/asm/perf_event.h | 14 ++
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 6f814a2..784642a 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1311,3 +1311,26 @@ void intel_pmu_lbr_init_knl(void)
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP)
x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;
 }
+
+/**
+ * x86_perf_get_lbr_stack - get the lbr stack related MSRs
+ *
+ * @stack: the caller's memory to get the lbr stack
+ *
+ * Returns: 0 indicates that the lbr stack has been successfully obtained.
+ */
+int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack)
+{
+   stack->nr = x86_pmu.lbr_nr;
+   stack->tos = x86_pmu.lbr_tos;
+   stack->from = x86_pmu.lbr_from;
+   stack->to = x86_pmu.lbr_to;
+
+   if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+   stack->info = MSR_LBR_INFO_0;
+   else
+   stack->info = 0;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(x86_perf_get_lbr_stack);
diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 1392d5e..2606100 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -318,7 +318,16 @@ struct perf_guest_switch_msr {
u64 host, guest;
 };
 
+struct x86_perf_lbr_stack {
+   unsigned intnr;
+   unsigned inttos;
+   unsigned intfrom;
+   unsigned intto;
+   unsigned intinfo;
+};
+
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 extern void perf_check_microcode(void);
 extern int x86_perf_rdpmc_index(struct perf_event *event);
@@ -329,6 +338,11 @@ static inline struct perf_guest_switch_msr 
*perf_guest_get_msrs(int *nr)
return NULL;
 }
 
+static inline int x86_perf_get_lbr_stack(struct x86_perf_lbr_stack *stack)
+{
+   return -1;
+}
+
 static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 {
memset(cap, 0, sizeof(*cap));
-- 
2.7.4



[PATCH v7 06/12] KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest

2019-07-07 Thread Wei Wang
Bits [0, 5] of MSR_IA32_PERF_CAPABILITIES tell about the format of
the addresses stored in the LBR stack. Expose those bits to the guest
when the guest lbr feature is enabled.

Signed-off-by: Wei Wang 
Cc: Paolo Bonzini 
Cc: Andi Kleen 
---
 arch/x86/include/asm/perf_event.h |  2 ++
 arch/x86/kvm/cpuid.c  |  2 +-
 arch/x86/kvm/vmx/pmu_intel.c  | 16 
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 2606100..aa77da2 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -95,6 +95,8 @@
 #define PEBS_DATACFG_LBRS  BIT_ULL(3)
 #define PEBS_DATACFG_LBR_SHIFT 24
 
+#define X86_PERF_CAP_MASK_LBR_FMT  0x3f
+
 /*
  * Intel "Architectural Performance Monitoring" CPUID
  * detection/enumeration details:
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 4992e7c..4b9e713 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -361,7 +361,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
0 /* DS-CPL, VMX, SMX, EST */ |
0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
-   F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
+   F(FMA) | F(CX16) | 0 /* xTPR Update*/ | F(PDCM) |
F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) |
F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 1e19b01..09ae6ff 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -151,6 +151,7 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 
msr)
case MSR_CORE_PERF_GLOBAL_STATUS:
case MSR_CORE_PERF_GLOBAL_CTRL:
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+   case MSR_IA32_PERF_CAPABILITIES:
ret = pmu->version > 1;
break;
default:
@@ -316,6 +317,19 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
msr_info->data = pmu->global_ovf_ctrl;
return 0;
+   case MSR_IA32_PERF_CAPABILITIES: {
+   u64 data;
+
+   if (!boot_cpu_has(X86_FEATURE_PDCM) ||
+   (!msr_info->host_initiated &&
+!guest_cpuid_has(vcpu, X86_FEATURE_PDCM)))
+   return 1;
+   data = native_read_msr(MSR_IA32_PERF_CAPABILITIES);
+   msr_info->data = 0;
+   if (vcpu->kvm->arch.lbr_in_guest)
+   msr_info->data |= (data & X86_PERF_CAP_MASK_LBR_FMT);
+   return 0;
+   }
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
u64 val = pmc_read_counter(pmc);
@@ -374,6 +388,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
return 0;
}
break;
+   case MSR_IA32_PERF_CAPABILITIES:
+   return 1; /* RO MSR */
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
if (msr_info->host_initiated)
-- 
2.7.4



[PATCH v7 11/12] KVM/x86: remove the common handling of the debugctl msr

2019-07-07 Thread Wei Wang
The debugctl msr is not completely identical on AMD and Intel CPUs, for
example, FREEZE_LBRS_ON_PMI is supported by Intel CPUs only. Now, this
msr is handled separately in svm.c and intel_pmu.c. So remove the
common debugctl msr handling code in kvm_get/set_msr_common.

Signed-off-by: Wei Wang 
Cc: Paolo Bonzini 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
---
 arch/x86/kvm/x86.c | 13 -
 1 file changed, 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3bc7f2..08aa34b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2513,18 +2513,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
return 1;
}
break;
-   case MSR_IA32_DEBUGCTLMSR:
-   if (!data) {
-   /* We support the non-activated case already */
-   break;
-   } else if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)) {
-   /* Values other than LBR and BTF are vendor-specific,
-  thus reserved and should throw a #GP */
-   return 1;
-   }
-   vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
-   __func__, data);
-   break;
case 0x200 ... 0x2ff:
return kvm_mtrr_set_msr(vcpu, msr, data);
case MSR_IA32_APICBASE:
@@ -2766,7 +2754,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
switch (msr_info->index) {
case MSR_IA32_PLATFORM_ID:
case MSR_IA32_EBL_CR_POWERON:
-   case MSR_IA32_DEBUGCTLMSR:
case MSR_IA32_LASTBRANCHFROMIP:
case MSR_IA32_LASTBRANCHTOIP:
case MSR_IA32_LASTINTFROMIP:
-- 
2.7.4



[PATCH v7 10/12] KVM/x86/lbr: lazy save the guest lbr stack

2019-07-07 Thread Wei Wang
When the vCPU is scheduled in:
- if the lbr feature was used in the last vCPU time slice, set the lbr
  stack to be interceptible, so that the host can capture whether the
  lbr feature will be used in this time slice;
- if the lbr feature wasn't used in the last vCPU time slice, disable
  the vCPU support of the guest lbr switching.

Upon the first access to one of the lbr related MSRs (since the vCPU was
scheduled in):
- record that the guest has used the lbr;
- create a host perf event to help save/restore the guest lbr stack;
- pass the stack through to the guest.

Suggested-by: Andi Kleen 
Signed-off-by: Wei Wang 
Cc: Paolo Bonzini 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
---
 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/pmu.c  |   6 ++
 arch/x86/kvm/pmu.h  |   2 +
 arch/x86/kvm/vmx/pmu_intel.c| 141 
 arch/x86/kvm/vmx/vmx.c  |   4 +-
 arch/x86/kvm/vmx/vmx.h  |   2 +
 arch/x86/kvm/x86.c  |   2 +
 7 files changed, 157 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 79e9c92..cf8996e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -469,6 +469,8 @@ struct kvm_pmu {
u64 global_ctrl_mask;
u64 global_ovf_ctrl_mask;
u64 reserved_bits;
+   /* Indicate if the lbr msrs were accessed in this vCPU time slice */
+   bool lbr_used;
u8 version;
struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index ee6ed47..323bb45 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -325,6 +325,12 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data 
*msr_info)
return kvm_x86_ops->pmu_ops->set_msr(vcpu, msr_info);
 }
 
+void kvm_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+   if (kvm_x86_ops->pmu_ops->sched_in)
+   kvm_x86_ops->pmu_ops->sched_in(vcpu, cpu);
+}
+
 /* refresh PMU settings. This function generally is called when underlying
  * settings are changed (such as changes of PMU CPUID by guest VMs), which
  * should rarely happen.
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 384a0b7..cadf91a 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -32,6 +32,7 @@ struct kvm_pmu_ops {
bool (*lbr_enable)(struct kvm_vcpu *vcpu);
int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+   void (*sched_in)(struct kvm_vcpu *vcpu, int cpu);
void (*refresh)(struct kvm_vcpu *vcpu);
void (*init)(struct kvm_vcpu *vcpu);
void (*reset)(struct kvm_vcpu *vcpu);
@@ -116,6 +117,7 @@ int kvm_pmu_is_valid_msr_idx(struct kvm_vcpu *vcpu, 
unsigned idx);
 bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
 int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+void kvm_pmu_sched_in(struct kvm_vcpu *vcpu, int cpu);
 void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
 void kvm_pmu_reset(struct kvm_vcpu *vcpu);
 void kvm_pmu_init(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 24a544e..fd09777 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -13,10 +13,12 @@
 #include 
 #include 
 #include 
+#include 
 #include "x86.h"
 #include "cpuid.h"
 #include "lapic.h"
 #include "pmu.h"
+#include "vmx.h"
 
 static struct kvm_event_hw_type_mapping intel_arch_events[] = {
/* Index must match CPUID 0x0A.EBX bit vector */
@@ -141,6 +143,18 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct 
kvm_vcpu *vcpu,
return &counters[idx];
 }
 
+static inline bool is_lbr_msr(struct kvm_vcpu *vcpu, u32 index)
+{
+   struct x86_perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack;
+   int nr = stack->nr;
+
+   return !!(index == MSR_LBR_SELECT ||
+ index == stack->tos ||
+ (index >= stack->from && index < stack->from + nr) ||
+ (index >= stack->to && index < stack->to + nr) ||
+ (index >= stack->info && index < stack->info + nr));
+}
+
 static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
 {
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -152,9 +166,12 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 
msr)
case MSR_CORE_PERF_GLOBAL_CTRL:
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
case MSR_IA32_PERF_CAPABILITIES:
+   case MSR_IA32_DEBUGCTLMSR:
ret = pmu->version > 1;
break;
default:
+   if (is_lbr_msr(vcpu, msr))
+   retu

[PATCH v7 09/12] perf/x86: save/restore LBR_SELECT on vCPU switching

2019-07-07 Thread Wei Wang
The vCPU lbr event relies on the host to save/restore all the lbr
related MSRs. So add the LBR_SELECT save/restore to the related
functions for the vCPU case.

Signed-off-by: Wei Wang 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
---
 arch/x86/events/intel/lbr.c  | 7 +++
 arch/x86/events/perf_event.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 118764b..4861a9d 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -383,6 +383,9 @@ static void __intel_pmu_lbr_restore(struct 
x86_perf_task_context *task_ctx)
 
wrmsrl(x86_pmu.lbr_tos, tos);
task_ctx->lbr_stack_state = LBR_NONE;
+
+   if (cpuc->vcpu_lbr)
+   wrmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
 }
 
 static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
@@ -409,6 +412,10 @@ static void __intel_pmu_lbr_save(struct 
x86_perf_task_context *task_ctx)
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
rdmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
}
+
+   if (cpuc->vcpu_lbr)
+   rdmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
+
task_ctx->valid_lbrs = i;
task_ctx->tos = tos;
task_ctx->lbr_stack_state = LBR_VALID;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 86605d1..e37ff82 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -721,6 +721,7 @@ struct x86_perf_task_context {
u64 lbr_from[MAX_LBR_ENTRIES];
u64 lbr_to[MAX_LBR_ENTRIES];
u64 lbr_info[MAX_LBR_ENTRIES];
+   u64 lbr_sel;
int tos;
int valid_lbrs;
int lbr_callstack_users;
-- 
2.7.4



[PATCH v7 07/12] perf/x86: no counter allocation support

2019-07-07 Thread Wei Wang
In some cases, an event may be created without needing a counter
allocation. For example, an lbr event may be created by the host
only to help save/restore the lbr stack on the vCPU context switching.

This patch adds a new interface to allow users to create a perf event
without the need of counter assignment.

Signed-off-by: Wei Wang 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
---
 arch/x86/events/core.c | 12 
 include/linux/perf_event.h | 13 +
 kernel/events/core.c   | 37 +
 3 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f315425..eebbd65 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -410,6 +410,9 @@ int x86_setup_perfctr(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;
u64 config;
 
+   if (is_no_counter_event(event))
+   return 0;
+
if (!is_sampling_event(event)) {
hwc->sample_period = x86_pmu.max_period;
hwc->last_period = hwc->sample_period;
@@ -1248,6 +1251,12 @@ static int x86_pmu_add(struct perf_event *event, int 
flags)
hwc = &event->hw;
 
n0 = cpuc->n_events;
+
+   if (is_no_counter_event(event)) {
+   n = n0;
+   goto done_collect;
+   }
+
ret = n = collect_events(cpuc, event, false);
if (ret < 0)
goto out;
@@ -1422,6 +1431,9 @@ static void x86_pmu_del(struct perf_event *event, int 
flags)
if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
goto do_del;
 
+   if (is_no_counter_event(event))
+   goto do_del;
+
/*
 * Not a TXN, therefore cleanup properly.
 */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0ab99c7..19e6593 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -528,6 +528,7 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *,
  */
 #define PERF_EV_CAP_SOFTWARE   BIT(0)
 #define PERF_EV_CAP_READ_ACTIVE_PKGBIT(1)
+#define PERF_EV_CAP_NO_COUNTER BIT(2)
 
 #define SWEVENT_HLIST_BITS 8
 #define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS)
@@ -895,6 +896,13 @@ extern int perf_event_refresh(struct perf_event *event, 
int refresh);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
+perf_event_create(struct perf_event_attr *attr,
+ int cpu,
+ struct task_struct *task,
+ perf_overflow_handler_t overflow_handler,
+ void *context,
+ bool counter_assignment);
+extern struct perf_event *
 perf_event_create_kernel_counter(struct perf_event_attr *attr,
int cpu,
struct task_struct *task,
@@ -1032,6 +1040,11 @@ static inline bool is_sampling_event(struct perf_event 
*event)
return event->attr.sample_period != 0;
 }
 
+static inline bool is_no_counter_event(struct perf_event *event)
+{
+   return !!(event->event_caps & PERF_EV_CAP_NO_COUNTER);
+}
+
 /*
  * Return 1 for a software event, 0 for a hardware event
  */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index abbd4b3..70884df 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11162,18 +11162,10 @@ SYSCALL_DEFINE5(perf_event_open,
return err;
 }
 
-/**
- * perf_event_create_kernel_counter
- *
- * @attr: attributes of the counter to create
- * @cpu: cpu in which the counter is bound
- * @task: task to profile (NULL for percpu)
- */
-struct perf_event *
-perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
-struct task_struct *task,
-perf_overflow_handler_t overflow_handler,
-void *context)
+struct perf_event *perf_event_create(struct perf_event_attr *attr, int cpu,
+struct task_struct *task,
+perf_overflow_handler_t overflow_handler,
+void *context, bool need_counter)
 {
struct perf_event_context *ctx;
struct perf_event *event;
@@ -11193,6 +11185,9 @@ perf_event_create_kernel_counter(struct perf_event_attr 
*attr, int cpu,
/* Mark owner so we could distinguish it from user events. */
event->owner = TASK_TOMBSTONE;
 
+   if (!need_counter)
+   event->event_caps |= PERF_EV_CAP_NO_COUNTER;
+
ctx = find_get_context(event->pmu, task, event);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
@@ -11241,6 +11236,24 @@ perf_event_create_kernel_counter(struct 
perf_event_attr *attr, int cpu,
 err:
return ERR_PTR(err);
 }
+EXPORT_SYMBOL_GPL(perf_event

[PATCH v7 01/12] perf/x86: fix the variable type of the LBR MSRs

2019-07-07 Thread Wei Wang
The MSR variable type can be "unsigned int", which uses less memory than
the longer unsigned long. The lbr nr won't be a negative number, so make
it "unsigned int" as well.

Suggested-by: Peter Zijlstra 
Signed-off-by: Wei Wang 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
---
 arch/x86/events/perf_event.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index a6ac2f4..186c1c7 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -682,8 +682,8 @@ struct x86_pmu {
/*
 * Intel LBR
 */
-   unsigned long   lbr_tos, lbr_from, lbr_to; /* MSR base regs   */
-   int lbr_nr;/* hardware stack size */
+   unsigned intlbr_tos, lbr_from, lbr_to,
+   lbr_nr;/* lbr stack and size */
u64 lbr_sel_mask;  /* LBR_SELECT valid bits */
const int   *lbr_sel_map;  /* lbr_select mappings */
boollbr_double_abort;  /* duplicated lbr aborts */
-- 
2.7.4



[PATCH v7 08/12] KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr stack

2019-07-07 Thread Wei Wang
From: Like Xu 

This patch adds support to enable/disable the host side save/restore
for the guest lbr stack on vCPU switching. To enable that, the host
creates a perf event for the vCPU, and the event attributes are set
to the user callstack mode lbr so that all the conditions are met in
the host perf subsystem to save the lbr stack on task switching.

The host side lbr perf event is created only for the purpose of saving
and restoring the lbr stack. There is no need to enable the lbr
functionality for this perf event, because the feature is essentially
used in the vCPU. So perf_event_create is invoked with need_counter=false
so that no counter is assigned to the perf event.

The vcpu_lbr field is added to cpuc, to indicate if the lbr perf event is
used by the vCPU only for context switching. When the perf subsystem
handles this event (e.g. lbr enable or read lbr stack on PMI) and finds
vcpu_lbr non-zero, it simply returns.

Signed-off-by: Like Xu 
Signed-off-by: Wei Wang 
Cc: Paolo Bonzini 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
---
 arch/x86/events/intel/lbr.c | 13 +++--
 arch/x86/events/perf_event.h|  1 +
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/pmu.h  |  3 ++
 arch/x86/kvm/vmx/pmu_intel.c| 61 +
 5 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 784642a..118764b 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -462,6 +462,9 @@ void intel_pmu_lbr_add(struct perf_event *event)
if (!x86_pmu.lbr_nr)
return;
 
+   if (event->attr.exclude_guest && is_no_counter_event(event))
+   cpuc->vcpu_lbr = 1;
+
cpuc->br_sel = event->hw.branch_reg.reg;
 
if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -509,6 +512,9 @@ void intel_pmu_lbr_del(struct perf_event *event)
task_ctx->lbr_callstack_users--;
}
 
+   if (event->attr.exclude_guest && is_no_counter_event(event))
+   cpuc->vcpu_lbr = 0;
+
if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
cpuc->lbr_pebs_users--;
cpuc->lbr_users--;
@@ -521,7 +527,7 @@ void intel_pmu_lbr_enable_all(bool pmi)
 {
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-   if (cpuc->lbr_users)
+   if (cpuc->lbr_users && !cpuc->vcpu_lbr)
__intel_pmu_lbr_enable(pmi);
 }
 
@@ -529,7 +535,7 @@ void intel_pmu_lbr_disable_all(void)
 {
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-   if (cpuc->lbr_users)
+   if (cpuc->lbr_users && !cpuc->vcpu_lbr)
__intel_pmu_lbr_disable();
 }
 
@@ -669,7 +675,8 @@ void intel_pmu_lbr_read(void)
 * This could be smarter and actually check the event,
 * but this simple approach seems to work for now.
 */
-   if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
+   if (!cpuc->lbr_users || cpuc->vcpu_lbr ||
+   cpuc->lbr_users == cpuc->lbr_pebs_users)
return;
 
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 186c1c7..86605d1 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -238,6 +238,7 @@ struct cpu_hw_events {
/*
 * Intel LBR bits
 */
+   u8  vcpu_lbr;
int lbr_users;
int lbr_pebs_users;
struct perf_branch_stacklbr_stack;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8d80925..79e9c92 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -474,6 +474,7 @@ struct kvm_pmu {
struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
struct irq_work irq_work;
u64 reprogram_pmi;
+   struct perf_event *vcpu_lbr_event;
 };
 
 struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7926b65..384a0b7 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -123,6 +123,9 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
 
 bool is_vmware_backdoor_pmc(u32 pmc_idx);
 
+extern int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu);
+extern void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu);
+
 extern struct kvm_pmu_ops intel_pmu_ops;
 extern struct kvm_pmu_ops amd_pmu_ops;
 #endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 09ae6ff..24a544e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -507,6 +507,67 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
pmu->glo

[PATCH v7 03/12] KVM/x86: KVM_CAP_X86_GUEST_LBR

2019-07-07 Thread Wei Wang
Introduce KVM_CAP_X86_GUEST_LBR to allow per-VM enabling of the guest
lbr feature.

Signed-off-by: Wei Wang 
Cc: Paolo Bonzini 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/x86.c  | 14 ++
 include/uapi/linux/kvm.h|  1 +
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26d1eb8..8d80925 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -871,6 +871,7 @@ struct kvm_arch {
atomic_t vapics_in_nmi_mode;
struct mutex apic_map_lock;
struct kvm_apic_map *apic_map;
+   struct x86_perf_lbr_stack lbr_stack;
 
bool apic_access_page_done;
 
@@ -879,6 +880,7 @@ struct kvm_arch {
bool mwait_in_guest;
bool hlt_in_guest;
bool pause_in_guest;
+   bool lbr_in_guest;
 
unsigned long irq_sources_bitmap;
s64 kvmclock_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9857992..b35a118 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3086,6 +3086,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
case KVM_CAP_GET_MSR_FEATURES:
case KVM_CAP_MSR_PLATFORM_INFO:
case KVM_CAP_EXCEPTION_PAYLOAD:
+   case KVM_CAP_X86_GUEST_LBR:
r = 1;
break;
case KVM_CAP_SYNC_REGS:
@@ -4622,6 +4623,19 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
kvm->arch.exception_payload_enabled = cap->args[0];
r = 0;
break;
+   case KVM_CAP_X86_GUEST_LBR:
+   r = -EINVAL;
+   if (cap->args[0] &&
+   x86_perf_get_lbr_stack(&kvm->arch.lbr_stack))
+   break;
+
+   if (copy_to_user((void __user *)cap->args[1],
+&kvm->arch.lbr_stack,
+sizeof(struct x86_perf_lbr_stack)))
+   break;
+   kvm->arch.lbr_in_guest = cap->args[0];
+   r = 0;
+   break;
default:
r = -EINVAL;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2fe12b4..5391cbc 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -993,6 +993,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_SVE 170
 #define KVM_CAP_ARM_PTRAUTH_ADDRESS 171
 #define KVM_CAP_ARM_PTRAUTH_GENERIC 172
+#define KVM_CAP_X86_GUEST_LBR 173
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.7.4



[PATCH v7 00/12] Guest LBR Enabling

2019-07-07 Thread Wei Wang
Last Branch Recording (LBR) is a performance monitor unit (PMU) feature
on Intel CPUs that captures branch related info. This patch series enables
this feature for KVM guests.

Here is a conclusion of the fundamental methods that we use:
1) the LBR feature is enabled per guest via QEMU setting of
   KVM_CAP_X86_GUEST_LBR;
2) the LBR stack is passed through to the guest for direct accesses after
   the guest's first access to any of the lbr related MSRs;
3) the host will help save/restore the LBR stack when the vCPU is
   scheduled out/in.

ChangeLog:
pmu.c:
- patch 4: remove guest_cpuid_is_intel, which can be avoided by
   directly checking pmu_ops->lbr_enable.
pmu_intel.c:
- patch 10: include MSR_LBR_SELECT into is_lbr_msr check, and pass
this msr through to the guest, instead of trapping
accesses.
- patch 12: fix the condition check of setting LBRS_FROZEN and
COUNTERS_FROZEN;
update the global_ovf_ctrl_mask for LBRS_FROZEN and
COUNTERS_FROZEN.

previous:
https://lkml.org/lkml/2019/6/6/133

Like Xu (1):
  KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr
stack

Wei Wang (11):
  perf/x86: fix the variable type of the LBR MSRs
  perf/x86: add a function to get the lbr stack
  KVM/x86: KVM_CAP_X86_GUEST_LBR
  KVM/x86: intel_pmu_lbr_enable
  KVM/x86/vPMU: tweak kvm_pmu_get_msr
  KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest
  perf/x86: no counter allocation support
  perf/x86: save/restore LBR_SELECT on vCPU switching
  KVM/x86/lbr: lazy save the guest lbr stack
  KVM/x86: remove the common handling of the debugctl msr
  KVM/VMX/vPMU: support to report GLOBAL_STATUS_LBRS_FROZEN

 arch/x86/events/core.c|  12 ++
 arch/x86/events/intel/lbr.c   |  43 -
 arch/x86/events/perf_event.h  |   6 +-
 arch/x86/include/asm/kvm_host.h   |   5 +
 arch/x86/include/asm/perf_event.h |  16 ++
 arch/x86/kvm/cpuid.c  |   2 +-
 arch/x86/kvm/pmu.c|  29 ++-
 arch/x86/kvm/pmu.h|  12 +-
 arch/x86/kvm/pmu_amd.c|   7 +-
 arch/x86/kvm/vmx/pmu_intel.c  | 393 +-
 arch/x86/kvm/vmx/vmx.c|   4 +-
 arch/x86/kvm/vmx/vmx.h|   2 +
 arch/x86/kvm/x86.c|  32 ++--
 include/linux/perf_event.h|  13 ++
 include/uapi/linux/kvm.h  |   1 +
 kernel/events/core.c  |  37 ++--
 16 files changed, 562 insertions(+), 52 deletions(-)

-- 
2.7.4


