date:20151029

From: Shannon Zhao 

Since the reset value of PMSELR_EL0 is UNKNOWN, use reset_unknown for
its reset handler. As it doesn't need to deal with the acsessing action
specially, it uses default case to emulate writing and reading PMSELR
register.

Add a helper for CP15 registers reset to UNKNOWN.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c | 5 +++--
 arch/arm64/kvm/sys_regs.h | 8 
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 5b591d6..35d232e 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -707,7 +707,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  trap_raz_wi },
/* PMSELR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b101),
- trap_raz_wi },
+ access_pmu_regs, reset_unknown, PMSELR_EL0 },
/* PMCEID0_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b110),
  trap_raz_wi },
@@ -998,7 +998,8 @@ static const struct sys_reg_desc cp15_regs[] = {
{ Op1( 0), CRn( 9), CRm(12), Op2( 1), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(12), Op2( 2), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(12), Op2( 3), trap_raz_wi },
-   { Op1( 0), CRn( 9), CRm(12), Op2( 5), trap_raz_wi },
+   { Op1( 0), CRn( 9), CRm(12), Op2( 5), access_pmu_cp15_regs,
+ reset_unknown_cp15, c9_PMSELR },
{ Op1( 0), CRn( 9), CRm(12), Op2( 6), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(12), Op2( 7), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(13), Op2( 0), trap_raz_wi },
diff --git a/arch/arm64/kvm/sys_regs.h b/arch/arm64/kvm/sys_regs.h
index eaa324e..8afeff7 100644
--- a/arch/arm64/kvm/sys_regs.h
+++ b/arch/arm64/kvm/sys_regs.h
@@ -110,6 +110,14 @@ static inline void reset_unknown(struct kvm_vcpu *vcpu,
vcpu_sys_reg(vcpu, r->reg) = 0x1de7ec7edbadc0deULL;
 }
 
+static inline void reset_unknown_cp15(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *r)
+{
+   BUG_ON(!r->reg);
+   BUG_ON(r->reg >= NR_COPRO_REGS);
+   vcpu_cp15(vcpu, r->reg) = 0xdecafbad;
+}
+
 static inline void reset_val(struct kvm_vcpu *vcpu, const struct sys_reg_desc 
*r)
 {
BUG_ON(!r->reg);
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 10/21] KVM: ARM64: Add reset and access handlers for PMCCNTR register

From: Shannon Zhao 

Since the reset value of PMCCNTR is UNKNOWN, use reset_unknown for its
reset handler. Add a new case to emulate reading and writing to PMCCNTR
register.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index b7ca2cd..059c84c 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -491,6 +491,13 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
 
if (p->is_write) {
switch (r->reg) {
+   case PMCCNTR_EL0: {
+   val = kvm_pmu_get_counter_value(vcpu,
+   ARMV8_MAX_COUNTERS - 1);
+   vcpu_sys_reg(vcpu, r->reg) +=
+ (s64)*vcpu_reg(vcpu, p->Rt) - val;
+   break;
+   }
case PMXEVCNTR_EL0: {
int index = PMEVCNTR0_EL0
+ vcpu_sys_reg(vcpu, PMSELR_EL0);
@@ -529,6 +536,12 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
}
} else {
switch (r->reg) {
+   case PMCCNTR_EL0: {
+   val = kvm_pmu_get_counter_value(vcpu,
+   ARMV8_MAX_COUNTERS - 1);
+   *vcpu_reg(vcpu, p->Rt) = val;
+   break;
+   }
case PMXEVCNTR_EL0: {
val = kvm_pmu_get_counter_value(vcpu,
vcpu_sys_reg(vcpu, PMSELR_EL0));
@@ -759,7 +772,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  access_pmu_regs, reset_pmceid, PMCEID1_EL0 },
/* PMCCNTR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b000),
- trap_raz_wi },
+ access_pmu_regs, reset_unknown, PMCCNTR_EL0 },
/* PMXEVTYPER_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b001),
  access_pmu_regs, reset_unknown, PMXEVTYPER_EL0 },
@@ -978,6 +991,13 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
 
if (p->is_write) {
switch (r->reg) {
+   case c9_PMCCNTR: {
+   val = kvm_pmu_get_counter_value(vcpu,
+   ARMV8_MAX_COUNTERS - 1);
+   vcpu_cp15(vcpu, r->reg) += (s64)*vcpu_reg(vcpu, p->Rt)
+  - val;
+   break;
+   }
case c9_PMXEVCNTR: {
int index = c14_PMEVCNTR0 + vcpu_cp15(vcpu, c9_PMSELR);
 
@@ -1014,6 +1034,12 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
}
} else {
switch (r->reg) {
+   case c9_PMCCNTR: {
+   val = kvm_pmu_get_counter_value(vcpu,
+   ARMV8_MAX_COUNTERS - 1);
+   *vcpu_reg(vcpu, p->Rt) = val;
+   break;
+   }
case c9_PMXEVCNTR: {
val = kvm_pmu_get_counter_value(vcpu,
vcpu_cp15(vcpu, c9_PMSELR));
@@ -1075,7 +1101,8 @@ static const struct sys_reg_desc cp15_regs[] = {
  reset_pmceid, c9_PMCEID0 },
{ Op1( 0), CRn( 9), CRm(12), Op2( 7), access_pmu_cp15_regs,
  reset_pmceid, c9_PMCEID1 },
-   { Op1( 0), CRn( 9), CRm(13), Op2( 0), trap_raz_wi },
+   { Op1( 0), CRn( 9), CRm(13), Op2( 0), access_pmu_cp15_regs,
+ reset_unknown_cp15, c9_PMCCNTR },
{ Op1( 0), CRn( 9), CRm(13), Op2( 1), access_pmu_cp15_regs,
  reset_unknown_cp15, c9_PMXEVTYPER },
{ Op1( 0), CRn( 9), CRm(13), Op2( 2), access_pmu_cp15_regs,
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-29 Thread Yunhong Jiang

On Thu, Oct 29, 2015 at 10:45:44AM +0100, Paolo Bonzini wrote:
> 
> 
> On 29/10/2015 04:11, Alex Williamson wrote:
> > > The irqfd is already able to schedule a work item, because it runs with
> > > interrupts disabled, so I think we can always return IRQ_HANDLED.
> >
> > I'm confused by this.  The problem with adding IRQF_NO_THREAD to our
> > current handler is that it hits the spinlock that can sleep in
> > eventfd_signal() and the waitqueue further down the stack before we get
> > to the irqfd.  So if we split to a non-threaded handler vs a threaded
> > handler, where the non-threaded handler either returns IRQ_HANDLED or
> > IRQ_WAKE_THREAD to queue the threaded handler, there's only so much that
> > the non-threaded handler can do before we start running into the same
> > problem.
> 
> You're right.  I thought schedule_work used raw spinlocks (and then
> everything would be done in the inject callback), but I was wrong.
> 
> Basically where irqfd_wakeup now does schedule_work, it would need to
> return IRQ_WAKE_THREAD.  The threaded handler then can just do the
> eventfd_signal.
> 

And with this change, we even don't need the module option anymore, we first 
try the primary handler, which is in hard irq context, and if failed, then
threaded irq handler. Am I right?

Paolo/Alex, do you want to work on the patch yourself? If not, I will be 
happy to try this method.

Thanks
--jyh

> Paolo
> 
> > I think that means that the non-threaded handler needs to
> > return IRQ_WAKE_THREAD if we need to use the current eventfd_signal()
> > path, such as if the bypass path is not available.  If we can get
> > through the bypass path and the KVM irqfd side is safe for the
> > non-threaded handler, inject succeeds and we return IRQ_HANDLED, right?
> > Thanks,
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 12/21] KVM: ARM64: Add reset and access handlers for PMINTENSET and PMINTENCLR register

From: Shannon Zhao 

Since the reset value of PMINTENSET and PMINTENCLR is UNKNOWN, use
reset_unknown for its reset handler. Add a new case to emulate writing
PMINTENSET or PMINTENCLR register.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c | 34 ++
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index c358ae0..6d2febf 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -540,6 +540,18 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
vcpu_sys_reg(vcpu, PMCNTENSET_EL0) &= ~val;
break;
}
+   case PMINTENSET_EL1: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   vcpu_sys_reg(vcpu, r->reg) |= val;
+   vcpu_sys_reg(vcpu, PMINTENCLR_EL1) |= val;
+   break;
+   }
+   case PMINTENCLR_EL1: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   vcpu_sys_reg(vcpu, r->reg) &= ~val;
+   vcpu_sys_reg(vcpu, PMINTENSET_EL1) &= ~val;
+   break;
+   }
case PMCR_EL0: {
/* Only update writeable bits of PMCR */
val = vcpu_sys_reg(vcpu, r->reg);
@@ -729,10 +741,10 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 
/* PMINTENSET_EL1 */
{ Op0(0b11), Op1(0b000), CRn(0b1001), CRm(0b1110), Op2(0b001),
- trap_raz_wi },
+ access_pmu_regs, reset_unknown, PMINTENSET_EL1 },
/* PMINTENCLR_EL1 */
{ Op0(0b11), Op1(0b000), CRn(0b1001), CRm(0b1110), Op2(0b010),
- trap_raz_wi },
+ access_pmu_regs, reset_unknown, PMINTENCLR_EL1 },
 
/* MAIR_EL1 */
{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0010), Op2(0b000),
@@ -1059,6 +1071,18 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
vcpu_cp15(vcpu, c9_PMCNTENSET) &= ~val;
break;
}
+   case c9_PMINTENSET: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   vcpu_cp15(vcpu, r->reg) |= val;
+   vcpu_cp15(vcpu, c9_PMINTENCLR) |= val;
+   break;
+   }
+   case c9_PMINTENCLR: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   vcpu_cp15(vcpu, r->reg) &= ~val;
+   vcpu_cp15(vcpu, c9_PMINTENSET) &= ~val;
+   break;
+   }
case c9_PMCR: {
/* Only update writeable bits of PMCR */
val = vcpu_cp15(vcpu, r->reg);
@@ -1152,8 +1176,10 @@ static const struct sys_reg_desc cp15_regs[] = {
{ Op1( 0), CRn( 9), CRm(13), Op2( 2), access_pmu_cp15_regs,
  reset_unknown_cp15, c9_PMXEVCNTR },
{ Op1( 0), CRn( 9), CRm(14), Op2( 0), trap_raz_wi },
-   { Op1( 0), CRn( 9), CRm(14), Op2( 1), trap_raz_wi },
-   { Op1( 0), CRn( 9), CRm(14), Op2( 2), trap_raz_wi },
+   { Op1( 0), CRn( 9), CRm(14), Op2( 1), access_pmu_cp15_regs,
+ reset_unknown_cp15, c9_PMINTENSET },
+   { Op1( 0), CRn( 9), CRm(14), Op2( 2), access_pmu_cp15_regs,
+ reset_unknown_cp15, c9_PMINTENCLR },
 
{ Op1( 0), CRn(10), CRm( 2), Op2( 0), access_vm_reg, NULL, c10_PRRR },
{ Op1( 0), CRn(10), CRm( 2), Op2( 1), access_vm_reg, NULL, c10_NMRR },
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 09/21] KVM: ARM64: Add reset and access handlers for PMXEVCNTR register

From: Shannon Zhao 

Since the reset value of PMXEVCNTR is UNKNOWN, use reset_unknown for
its reset handler. Add access handler which emulates writing and reading
PMXEVCNTR register. When reading PMXEVCNTR, call perf_event_read_value
to get the count value of the perf event.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c | 36 ++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 4e606ea..b7ca2cd 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -491,6 +491,16 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
 
if (p->is_write) {
switch (r->reg) {
+   case PMXEVCNTR_EL0: {
+   int index = PMEVCNTR0_EL0
+   + vcpu_sys_reg(vcpu, PMSELR_EL0);
+
+   val = kvm_pmu_get_counter_value(vcpu,
+   vcpu_sys_reg(vcpu, PMSELR_EL0));
+   vcpu_sys_reg(vcpu, index) += (s64)*vcpu_reg(vcpu, p->Rt)
+- val;
+   break;
+   }
case PMXEVTYPER_EL0: {
val = vcpu_sys_reg(vcpu, PMSELR_EL0);
kvm_pmu_set_counter_event_type(vcpu,
@@ -519,6 +529,12 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
}
} else {
switch (r->reg) {
+   case PMXEVCNTR_EL0: {
+   val = kvm_pmu_get_counter_value(vcpu,
+   vcpu_sys_reg(vcpu, PMSELR_EL0));
+   *vcpu_reg(vcpu, p->Rt) = val;
+   break;
+   }
case PMCR_EL0: {
/* PMCR.P & PMCR.C are RAZ */
val = vcpu_sys_reg(vcpu, r->reg)
@@ -749,7 +765,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  access_pmu_regs, reset_unknown, PMXEVTYPER_EL0 },
/* PMXEVCNTR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b010),
- trap_raz_wi },
+ access_pmu_regs, reset_unknown, PMXEVCNTR_EL0 },
/* PMUSERENR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b000),
  trap_raz_wi },
@@ -962,6 +978,15 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
 
if (p->is_write) {
switch (r->reg) {
+   case c9_PMXEVCNTR: {
+   int index = c14_PMEVCNTR0 + vcpu_cp15(vcpu, c9_PMSELR);
+
+   val = kvm_pmu_get_counter_value(vcpu,
+   vcpu_cp15(vcpu, c9_PMSELR));
+   vcpu_cp15(vcpu, index) += (s64)*vcpu_reg(vcpu, p->Rt)
+ - val;
+   break;
+   }
case c9_PMXEVTYPER: {
val = vcpu_cp15(vcpu, c9_PMSELR);
kvm_pmu_set_counter_event_type(vcpu,
@@ -989,6 +1014,12 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
}
} else {
switch (r->reg) {
+   case c9_PMXEVCNTR: {
+   val = kvm_pmu_get_counter_value(vcpu,
+   vcpu_cp15(vcpu, c9_PMSELR));
+   *vcpu_reg(vcpu, p->Rt) = val;
+   break;
+   }
case c9_PMCR: {
/* PMCR.P & PMCR.C are RAZ */
val = vcpu_cp15(vcpu, r->reg)
@@ -1047,7 +1078,8 @@ static const struct sys_reg_desc cp15_regs[] = {
{ Op1( 0), CRn( 9), CRm(13), Op2( 0), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(13), Op2( 1), access_pmu_cp15_regs,
  reset_unknown_cp15, c9_PMXEVTYPER },
-   { Op1( 0), CRn( 9), CRm(13), Op2( 2), trap_raz_wi },
+   { Op1( 0), CRn( 9), CRm(13), Op2( 2), access_pmu_cp15_regs,
+ reset_unknown_cp15, c9_PMXEVCNTR },
{ Op1( 0), CRn( 9), CRm(14), Op2( 0), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(14), Op2( 1), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(14), Op2( 2), trap_raz_wi },
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 16/21] KVM: ARM64: Add access handlers for PMEVCNTRn and PMEVTYPERn register

From: Shannon Zhao 

Add access handler which emulates writing and reading PMEVCNTRn and
PMEVTYPERn.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c | 164 ++
 1 file changed, 164 insertions(+)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index c86f8dd..50bf3fb 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -634,6 +634,20 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
{ Op0(0b10), Op1(0b000), CRn(0b), CRm((n)), Op2(0b111), \
  trap_wcr, reset_wcr, n, 0,  get_wcr, set_wcr }
 
+/* Macro to expand the PMEVCNTRn_EL0 register */
+#define PMU_PMEVCNTR_EL0(n)\
+   /* PMEVCNTRn_EL0 */ \
+   { Op0(0b11), Op1(0b011), CRn(0b1110),   \
+ CRm((0b1000 | (((n) >> 3) & 0x3))), Op2(((n) & 0x7)), \
+ access_pmu_regs, reset_unknown, (PMEVCNTR0_EL0 + n), }
+
+/* Macro to expand the PMEVTYPERn_EL0 register */
+#define PMU_PMEVTYPER_EL0(n)   \
+   /* PMEVTYPERn_EL0 */\
+   { Op0(0b11), Op1(0b011), CRn(0b1110),   \
+ CRm((0b1100 | (((n) >> 3) & 0x3))), Op2(((n) & 0x7)), \
+ access_pmu_regs, reset_unknown, (PMEVTYPER0_EL0 + n), }
+
 /*
  * Architected system registers.
  * Important: Must be sorted ascending by Op0, Op1, CRn, CRm, Op2
@@ -848,6 +862,74 @@ static const struct sys_reg_desc sys_reg_descs[] = {
{ Op0(0b11), Op1(0b011), CRn(0b1101), CRm(0b), Op2(0b011),
  NULL, reset_unknown, TPIDRRO_EL0 },
 
+   /* PMEVCNTRn_EL0 */
+   PMU_PMEVCNTR_EL0(0),
+   PMU_PMEVCNTR_EL0(1),
+   PMU_PMEVCNTR_EL0(2),
+   PMU_PMEVCNTR_EL0(3),
+   PMU_PMEVCNTR_EL0(4),
+   PMU_PMEVCNTR_EL0(5),
+   PMU_PMEVCNTR_EL0(6),
+   PMU_PMEVCNTR_EL0(7),
+   PMU_PMEVCNTR_EL0(8),
+   PMU_PMEVCNTR_EL0(9),
+   PMU_PMEVCNTR_EL0(10),
+   PMU_PMEVCNTR_EL0(11),
+   PMU_PMEVCNTR_EL0(12),
+   PMU_PMEVCNTR_EL0(13),
+   PMU_PMEVCNTR_EL0(14),
+   PMU_PMEVCNTR_EL0(15),
+   PMU_PMEVCNTR_EL0(16),
+   PMU_PMEVCNTR_EL0(17),
+   PMU_PMEVCNTR_EL0(18),
+   PMU_PMEVCNTR_EL0(19),
+   PMU_PMEVCNTR_EL0(20),
+   PMU_PMEVCNTR_EL0(21),
+   PMU_PMEVCNTR_EL0(22),
+   PMU_PMEVCNTR_EL0(23),
+   PMU_PMEVCNTR_EL0(24),
+   PMU_PMEVCNTR_EL0(25),
+   PMU_PMEVCNTR_EL0(26),
+   PMU_PMEVCNTR_EL0(27),
+   PMU_PMEVCNTR_EL0(28),
+   PMU_PMEVCNTR_EL0(29),
+   PMU_PMEVCNTR_EL0(30),
+   /* PMEVTYPERn_EL0 */
+   PMU_PMEVTYPER_EL0(0),
+   PMU_PMEVTYPER_EL0(1),
+   PMU_PMEVTYPER_EL0(2),
+   PMU_PMEVTYPER_EL0(3),
+   PMU_PMEVTYPER_EL0(4),
+   PMU_PMEVTYPER_EL0(5),
+   PMU_PMEVTYPER_EL0(6),
+   PMU_PMEVTYPER_EL0(7),
+   PMU_PMEVTYPER_EL0(8),
+   PMU_PMEVTYPER_EL0(9),
+   PMU_PMEVTYPER_EL0(10),
+   PMU_PMEVTYPER_EL0(11),
+   PMU_PMEVTYPER_EL0(12),
+   PMU_PMEVTYPER_EL0(13),
+   PMU_PMEVTYPER_EL0(14),
+   PMU_PMEVTYPER_EL0(15),
+   PMU_PMEVTYPER_EL0(16),
+   PMU_PMEVTYPER_EL0(17),
+   PMU_PMEVTYPER_EL0(18),
+   PMU_PMEVTYPER_EL0(19),
+   PMU_PMEVTYPER_EL0(20),
+   PMU_PMEVTYPER_EL0(21),
+   PMU_PMEVTYPER_EL0(22),
+   PMU_PMEVTYPER_EL0(23),
+   PMU_PMEVTYPER_EL0(24),
+   PMU_PMEVTYPER_EL0(25),
+   PMU_PMEVTYPER_EL0(26),
+   PMU_PMEVTYPER_EL0(27),
+   PMU_PMEVTYPER_EL0(28),
+   PMU_PMEVTYPER_EL0(29),
+   PMU_PMEVTYPER_EL0(30),
+   /* PMCCFILTR_EL0 */
+   { Op0(0b11), Op1(0b011), CRn(0b1110), CRm(0b), Op2(0b111),
+ access_pmu_regs, reset_unknown, PMCCFILTR_EL0, },
+
/* DACR32_EL2 */
{ Op0(0b11), Op1(0b100), CRn(0b0011), CRm(0b), Op2(0b000),
  NULL, reset_unknown, DACR32_EL2 },
@@ -1172,6 +1254,20 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
return true;
 }
 
+/* Macro to expand the PMEVCNTRn register */
+#define PMU_PMEVCNTR(n)
\
+   /* PMEVCNTRn */ \
+   { Op1(0), CRn(0b1110),  \
+ CRm((0b1000 | (((n) >> 3) & 0x3))), Op2(((n) & 0x7)), \
+ access_pmu_cp15_regs, reset_unknown_cp15, (c14_PMEVCNTR0 + n), }
+
+/* Macro to expand the PMEVTYPERn register */
+#define PMU_PMEVTYPER(n)   \
+   /* PMEVTYPERn */\
+   { Op1(0), CRn(0b1110),  \
+ CRm((0b1100 | (((n) >> 3) & 0x3))), Op2(((n) & 0x7)), \
+ access_pmu_cp15_regs, reset_unknown_cp15, (c14_PMEVTYPER0 + n), }
+
 /*
  * Trapped cp15 registers. TTBR0/TTBR1 get a double encoding,
  * d

[PATCH v4 06/21] KVM: ARM64: Add reset and access handlers for PMCEID0 and PMCEID1 register

From: Shannon Zhao 

Add reset handler which gets host value of PMCEID0 or PMCEID1. Since
write action to PMCEID0 or PMCEID1 is ignored, add a new case for this.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c | 29 +
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 35d232e..cb82b15 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -469,6 +469,19 @@ static void reset_pmcr(struct kvm_vcpu *vcpu, const struct 
sys_reg_desc *r)
vcpu_sysreg_write(vcpu, r, val);
 }
 
+static void reset_pmceid(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
+{
+   u64 pmceid;
+
+   if (r->reg == PMCEID0_EL0 || r->reg == c9_PMCEID0)
+   asm volatile("mrs %0, pmceid0_el0\n" : "=r" (pmceid));
+   else
+   /* PMCEID1_EL0 or c9_PMCEID1 */
+   asm volatile("mrs %0, pmceid1_el0\n" : "=r" (pmceid));
+
+   vcpu_sysreg_write(vcpu, r, pmceid);
+}
+
 /* PMU registers accessor. */
 static bool access_pmu_regs(struct kvm_vcpu *vcpu,
const struct sys_reg_params *p,
@@ -486,6 +499,9 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
vcpu_sys_reg(vcpu, r->reg) = val;
break;
}
+   case PMCEID0_EL0:
+   case PMCEID1_EL0:
+   return ignore_write(vcpu, p);
default:
vcpu_sys_reg(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
break;
@@ -710,10 +726,10 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  access_pmu_regs, reset_unknown, PMSELR_EL0 },
/* PMCEID0_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b110),
- trap_raz_wi },
+ access_pmu_regs, reset_pmceid, PMCEID0_EL0 },
/* PMCEID1_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b111),
- trap_raz_wi },
+ access_pmu_regs, reset_pmceid, PMCEID1_EL0 },
/* PMCCNTR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b000),
  trap_raz_wi },
@@ -943,6 +959,9 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
vcpu_cp15(vcpu, r->reg) = val;
break;
}
+   case c9_PMCEID0:
+   case c9_PMCEID1:
+   return ignore_write(vcpu, p);
default:
vcpu_cp15(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
break;
@@ -1000,8 +1019,10 @@ static const struct sys_reg_desc cp15_regs[] = {
{ Op1( 0), CRn( 9), CRm(12), Op2( 3), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(12), Op2( 5), access_pmu_cp15_regs,
  reset_unknown_cp15, c9_PMSELR },
-   { Op1( 0), CRn( 9), CRm(12), Op2( 6), trap_raz_wi },
-   { Op1( 0), CRn( 9), CRm(12), Op2( 7), trap_raz_wi },
+   { Op1( 0), CRn( 9), CRm(12), Op2( 6), access_pmu_cp15_regs,
+ reset_pmceid, c9_PMCEID0 },
+   { Op1( 0), CRn( 9), CRm(12), Op2( 7), access_pmu_cp15_regs,
+ reset_pmceid, c9_PMCEID1 },
{ Op1( 0), CRn( 9), CRm(13), Op2( 0), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(13), Op2( 1), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(13), Op2( 2), trap_raz_wi },
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 08/21] KVM: ARM64: Add reset and access handlers for PMXEVTYPER register

From: Shannon Zhao 

Since the reset value of PMXEVTYPER is UNKNOWN, use reset_unknown or
reset_unknown_cp15 for its reset handler. Add access handler which
emulates writing and reading PMXEVTYPER register. When writing to
PMXEVTYPER, call kvm_pmu_set_counter_event_type to create a perf_event
for the selected event type.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index cb82b15..4e606ea 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -491,6 +491,17 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
 
if (p->is_write) {
switch (r->reg) {
+   case PMXEVTYPER_EL0: {
+   val = vcpu_sys_reg(vcpu, PMSELR_EL0);
+   kvm_pmu_set_counter_event_type(vcpu,
+  *vcpu_reg(vcpu, p->Rt),
+  val);
+   vcpu_sys_reg(vcpu, PMXEVTYPER_EL0) =
+*vcpu_reg(vcpu, p->Rt);
+   vcpu_sys_reg(vcpu, PMEVTYPER0_EL0 + val) =
+*vcpu_reg(vcpu, p->Rt);
+   break;
+   }
case PMCR_EL0: {
/* Only update writeable bits of PMCR */
val = vcpu_sys_reg(vcpu, r->reg);
@@ -735,7 +746,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  trap_raz_wi },
/* PMXEVTYPER_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b001),
- trap_raz_wi },
+ access_pmu_regs, reset_unknown, PMXEVTYPER_EL0 },
/* PMXEVCNTR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b010),
  trap_raz_wi },
@@ -951,6 +962,16 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
 
if (p->is_write) {
switch (r->reg) {
+   case c9_PMXEVTYPER: {
+   val = vcpu_cp15(vcpu, c9_PMSELR);
+   kvm_pmu_set_counter_event_type(vcpu,
+  *vcpu_reg(vcpu, p->Rt),
+  val);
+   vcpu_cp15(vcpu, c9_PMXEVTYPER) = *vcpu_reg(vcpu, p->Rt);
+   vcpu_cp15(vcpu, c14_PMEVTYPER0 + val) =
+*vcpu_reg(vcpu, p->Rt);
+   break;
+   }
case c9_PMCR: {
/* Only update writeable bits of PMCR */
val = vcpu_cp15(vcpu, r->reg);
@@ -1024,7 +1045,8 @@ static const struct sys_reg_desc cp15_regs[] = {
{ Op1( 0), CRn( 9), CRm(12), Op2( 7), access_pmu_cp15_regs,
  reset_pmceid, c9_PMCEID1 },
{ Op1( 0), CRn( 9), CRm(13), Op2( 0), trap_raz_wi },
-   { Op1( 0), CRn( 9), CRm(13), Op2( 1), trap_raz_wi },
+   { Op1( 0), CRn( 9), CRm(13), Op2( 1), access_pmu_cp15_regs,
+ reset_unknown_cp15, c9_PMXEVTYPER },
{ Op1( 0), CRn( 9), CRm(13), Op2( 2), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(14), Op2( 0), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(14), Op2( 1), trap_raz_wi },
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 07/21] KVM: ARM64: PMU: Add perf event map and introduce perf event creating function

From: Shannon Zhao 

When we use tools like perf on host, perf passes the event type and the
id of this event type category to kernel, then kernel will map them to
hardware event number and write this number to PMU PMEVTYPER_EL0
register. When getting the event number in KVM, directly use raw event
type to create a perf_event for it.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/include/asm/pmu.h |   2 +
 arch/arm64/kvm/Makefile  |   1 +
 include/kvm/arm_pmu.h|  13 +
 virt/kvm/arm/pmu.c   | 117 +++
 4 files changed, 133 insertions(+)
 create mode 100644 virt/kvm/arm/pmu.c

diff --git a/arch/arm64/include/asm/pmu.h b/arch/arm64/include/asm/pmu.h
index b9f394a..2c025f2 100644
--- a/arch/arm64/include/asm/pmu.h
+++ b/arch/arm64/include/asm/pmu.h
@@ -31,6 +31,8 @@
 #define ARMV8_PMCR_D   (1 << 3) /* CCNT counts every 64th cpu cycle */
 #define ARMV8_PMCR_X   (1 << 4) /* Export to ETM */
 #define ARMV8_PMCR_DP  (1 << 5) /* Disable CCNT if non-invasive debug*/
+/* Determines which PMCCNTR_EL0 bit generates an overflow */
+#define ARMV8_PMCR_LC  (1 << 6)
 #defineARMV8_PMCR_N_SHIFT  11   /* Number of counters 
supported */
 #defineARMV8_PMCR_N_MASK   0x1f
 #defineARMV8_PMCR_MASK 0x3f /* Mask for writable bits */
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 1949fe5..18d56d8 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -27,3 +27,4 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic-v3.o
 kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic-v3-emul.o
 kvm-$(CONFIG_KVM_ARM_HOST) += vgic-v3-switch.o
 kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arch_timer.o
+kvm-$(CONFIG_KVM_ARM_PMU) += $(KVM)/arm/pmu.o
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 254d2b4..1908c88 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -38,4 +38,17 @@ struct kvm_pmu {
 #endif
 };
 
+#ifdef CONFIG_KVM_ARM_PMU
+unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 select_idx);
+void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u32 data,
+   u32 select_idx);
+#else
+unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 select_idx)
+{
+   return 0;
+}
+void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u32 data,
+   u32 select_idx) {}
+#endif
+
 #endif
diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
new file mode 100644
index 000..900a64c
--- /dev/null
+++ b/virt/kvm/arm/pmu.c
@@ -0,0 +1,117 @@
+/*
+ * Copyright (C) 2015 Linaro Ltd.
+ * Author: Shannon Zhao 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see .
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/**
+ * kvm_pmu_get_counter_value - get PMU counter value
+ * @vcpu: The vcpu pointer
+ * @select_idx: The counter index
+ */
+unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 select_idx)
+{
+   u64 counter, enabled, running;
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+   struct kvm_pmc *pmc = &pmu->pmc[select_idx];
+
+   if (!vcpu_mode_is_32bit(vcpu))
+   counter = vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + select_idx);
+   else
+   counter = vcpu_cp15(vcpu, c14_PMEVCNTR0 + select_idx);
+
+   if (pmc->perf_event)
+   counter += perf_event_read_value(pmc->perf_event, &enabled,
+&running);
+
+   return counter & pmc->bitmask;
+}
+
+/**
+ * kvm_pmu_stop_counter - stop PMU counter
+ * @pmc: The PMU counter pointer
+ *
+ * If this counter has been configured to monitor some event, release it here.
+ */
+static void kvm_pmu_stop_counter(struct kvm_pmc *pmc)
+{
+   struct kvm_vcpu *vcpu = pmc->vcpu;
+   u64 counter;
+
+   if (pmc->perf_event) {
+   counter = kvm_pmu_get_counter_value(vcpu, pmc->idx);
+   if (!vcpu_mode_is_32bit(vcpu))
+   vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + pmc->idx) = counter;
+   else
+   vcpu_cp15(vcpu, c14_PMEVCNTR0 + pmc->idx) = counter;
+
+   perf_event_release_kernel(pmc->perf_event);
+   pmc->perf_event = NULL;
+   }
+}
+
+/**
+ * kvm_pmu_set_counter_event_type - set selected counter to monitor some event
+ * @vcpu: The vcpu pointer
+ * @data: The data

[PATCH v4 11/21] KVM: ARM64: Add reset and access handlers for PMCNTENSET and PMCNTENCLR register

From: Shannon Zhao 

Since the reset value of PMCNTENSET and PMCNTENCLR is UNKNOWN, use
reset_unknown for its reset handler. Add a new case to emulate writing
PMCNTENSET or PMCNTENCLR register.

When writing to PMCNTENSET, call perf_event_enable to enable the perf
event. When writing to PMCNTENCLR, call perf_event_disable to disable
the perf event.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c | 52 +++
 include/kvm/arm_pmu.h |  4 
 virt/kvm/arm/pmu.c| 52 +++
 3 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 059c84c..c358ae0 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -519,6 +519,27 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
 *vcpu_reg(vcpu, p->Rt);
break;
}
+   case PMCNTENSET_EL0: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   kvm_pmu_enable_counter(vcpu, val,
+  vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMCR_E);
+   /* Value 1 of PMCNTENSET_EL0 and PMCNTENCLR_EL0 means
+* corresponding counter enabled.
+*/
+   vcpu_sys_reg(vcpu, r->reg) |= val;
+   vcpu_sys_reg(vcpu, PMCNTENCLR_EL0) |= val;
+   break;
+   }
+   case PMCNTENCLR_EL0: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   kvm_pmu_disable_counter(vcpu, val);
+   /* Value 0 of PMCNTENSET_EL0 and PMCNTENCLR_EL0 means
+* corresponding counter disabled.
+*/
+   vcpu_sys_reg(vcpu, r->reg) &= ~val;
+   vcpu_sys_reg(vcpu, PMCNTENSET_EL0) &= ~val;
+   break;
+   }
case PMCR_EL0: {
/* Only update writeable bits of PMCR */
val = vcpu_sys_reg(vcpu, r->reg);
@@ -751,10 +772,10 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  access_pmu_regs, reset_pmcr, PMCR_EL0, },
/* PMCNTENSET_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b001),
- trap_raz_wi },
+ access_pmu_regs, reset_unknown, PMCNTENSET_EL0 },
/* PMCNTENCLR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b010),
- trap_raz_wi },
+ access_pmu_regs, reset_unknown, PMCNTENCLR_EL0 },
/* PMOVSCLR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b011),
  trap_raz_wi },
@@ -1017,6 +1038,27 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
 *vcpu_reg(vcpu, p->Rt);
break;
}
+   case c9_PMCNTENSET: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   kvm_pmu_enable_counter(vcpu, val,
+  vcpu_cp15(vcpu, c9_PMCR) & ARMV8_PMCR_E);
+   /* Value 1 of PMCNTENSET_EL0 and PMCNTENCLR_EL0 means
+* corresponding counter enabled.
+*/
+   vcpu_cp15(vcpu, r->reg) |= val;
+   vcpu_cp15(vcpu, c9_PMCNTENCLR) |= val;
+   break;
+   }
+   case c9_PMCNTENCLR: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   kvm_pmu_disable_counter(vcpu, val);
+   /* Value 0 of PMCNTENSET_EL0 and PMCNTENCLR_EL0 means
+* corresponding counter disabled.
+*/
+   vcpu_cp15(vcpu, r->reg) &= ~val;
+   vcpu_cp15(vcpu, c9_PMCNTENSET) &= ~val;
+   break;
+   }
case c9_PMCR: {
/* Only update writeable bits of PMCR */
val = vcpu_cp15(vcpu, r->reg);
@@ -1092,8 +1134,10 @@ static const struct sys_reg_desc cp15_regs[] = {
/* PMU */
{ Op1( 0), CRn( 9), CRm(12), Op2( 0), access_pmu_cp15_regs,
  reset_pmcr, c9_PMCR },
-   { Op1( 0), CRn( 9), CRm(12), Op2( 1), trap_raz_wi },
-   { Op1( 0), CRn( 9), CRm(12), Op2( 2), trap_raz_wi },
+   { Op1( 0), CRn( 9), CRm(12), Op2( 1), access_pmu_cp15_regs,
+ reset_unknown_cp15, c9_PMCNTENSET },
+   { Op1( 0), CRn( 9), CRm(12), Op2( 2), access_pmu_cp15_regs,
+ reset_unknown_cp15, c9_PMCNTENCLR },
{ Op1( 0), CRn( 9), CRm(12), Op2( 3), trap_raz_wi },
{ Op1( 0), CRn( 9), CRm(12), Op2( 5), access_pmu_cp15_regs,
  reset_unknown_cp15, c9_PMSELR },
diff --git a/include/kvm/arm_pmu.

[PATCH v4 15/21] KVM: ARM64: Add reset and access handlers for PMSWINC register

From: Shannon Zhao 

Add access handler which emulates writing and reading PMSWINC
register and add support for creating software increment event.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c | 18 +++-
 include/kvm/arm_pmu.h |  2 ++
 virt/kvm/arm/pmu.c| 55 +++
 3 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index c44c8e1..c86f8dd 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -567,6 +567,11 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
vcpu_sys_reg(vcpu, PMOVSSET_EL0) &= ~val;
break;
}
+   case PMSWINC_EL0: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   kvm_pmu_software_increment(vcpu, val);
+   break;
+   }
case PMCR_EL0: {
/* Only update writeable bits of PMCR */
val = vcpu_sys_reg(vcpu, r->reg);
@@ -596,6 +601,8 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
*vcpu_reg(vcpu, p->Rt) = val;
break;
}
+   case PMSWINC_EL0:
+   return read_zero(vcpu, p);
case PMCR_EL0: {
/* PMCR.P & PMCR.C are RAZ */
val = vcpu_sys_reg(vcpu, r->reg)
@@ -808,7 +815,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  access_pmu_regs, reset_unknown, PMOVSCLR_EL0 },
/* PMSWINC_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b100),
- trap_raz_wi },
+ access_pmu_regs, reset_unknown, PMSWINC_EL0 },
/* PMSELR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b101),
  access_pmu_regs, reset_unknown, PMSELR_EL0 },
@@ -1113,6 +1120,11 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
vcpu_cp15(vcpu, c9_PMOVSSET) &= ~val;
break;
}
+   case c9_PMSWINC: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   kvm_pmu_software_increment(vcpu, val);
+   break;
+   }
case c9_PMCR: {
/* Only update writeable bits of PMCR */
val = vcpu_cp15(vcpu, r->reg);
@@ -1142,6 +1154,8 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
*vcpu_reg(vcpu, p->Rt) = val;
break;
}
+   case c9_PMSWINC:
+   return read_zero(vcpu, p);
case c9_PMCR: {
/* PMCR.P & PMCR.C are RAZ */
val = vcpu_cp15(vcpu, r->reg)
@@ -1194,6 +1208,8 @@ static const struct sys_reg_desc cp15_regs[] = {
  reset_unknown_cp15, c9_PMCNTENCLR },
{ Op1( 0), CRn( 9), CRm(12), Op2( 3), access_pmu_cp15_regs,
  reset_unknown_cp15, c9_PMOVSCLR },
+   { Op1( 0), CRn( 9), CRm(12), Op2( 4), access_pmu_cp15_regs,
+ reset_unknown_cp15, c9_PMSWINC },
{ Op1( 0), CRn( 9), CRm(12), Op2( 5), access_pmu_cp15_regs,
  reset_unknown_cp15, c9_PMSELR },
{ Op1( 0), CRn( 9), CRm(12), Op2( 6), access_pmu_cp15_regs,
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index ff17578..d7de7f1 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -44,6 +44,7 @@ void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u32 val);
 void kvm_pmu_enable_counter(struct kvm_vcpu *vcpu, u32 val, bool all_enable);
 void kvm_pmu_overflow_clear(struct kvm_vcpu *vcpu, u32 val, u32 reg);
 void kvm_pmu_overflow_set(struct kvm_vcpu *vcpu, u32 val);
+void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u32 val);
 void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u32 data,
u32 select_idx);
 #else
@@ -55,6 +56,7 @@ void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u32 val) 
{}
 void kvm_pmu_enable_counter(struct kvm_vcpu *vcpu, u32 val, bool all_enable) {}
 void kvm_pmu_overflow_clear(struct kvm_vcpu *vcpu, u32 val, u32 reg) {}
 void kvm_pmu_overflow_set(struct kvm_vcpu *vcpu, u32 val) {}
+void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u32 val) {}
 void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u32 data,
u32 select_idx) {}
 #endif
diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index 5761386..ae21089 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -151,6 +151,57 @@ void kvm_pmu_overflow_set(struct kvm_vcpu *vcpu, u32 val)
 }
 
 /**
+ * kvm_pmu_software_increment - do software increment
+ * @vcpu: The vcpu pointer
+ * @val: the value guest writes to PMSWINC register
+ */
+void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u32 val)
+{

[PATCH v4 19/21] KVM: ARM64: Reset PMU state when resetting vcpu

From: Shannon Zhao 

When resetting vcpu, it needs to reset the PMU state to initial status.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/reset.c |  3 +++
 include/kvm/arm_pmu.h  |  2 ++
 virt/kvm/arm/pmu.c | 19 +++
 3 files changed, 24 insertions(+)

diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 91cf535..4da7f6c 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -120,6 +120,9 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
/* Reset system registers */
kvm_reset_sys_regs(vcpu);
 
+   /* Reset PMU */
+   kvm_pmu_vcpu_reset(vcpu);
+
/* Reset timer */
return kvm_timer_vcpu_reset(vcpu, cpu_vtimer_irq);
 }
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 5e7f943..e708c49 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -39,6 +39,7 @@ struct kvm_pmu {
 };
 
 #ifdef CONFIG_KVM_ARM_PMU
+void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu);
 void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu);
 void kvm_pmu_post_sync_hwstate(struct kvm_vcpu *vcpu);
 unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 select_idx);
@@ -51,6 +52,7 @@ void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, 
u32 data,
u32 select_idx);
 void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u32 val);
 #else
+void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu) {}
 void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu) {}
 void kvm_pmu_post_sync_hwstate(struct kvm_vcpu *vcpu) {}
 unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 select_idx)
diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index 6d48d9a..84720a2 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -70,6 +70,25 @@ static void kvm_pmu_stop_counter(struct kvm_pmc *pmc)
 }
 
 /**
+ * kvm_pmu_vcpu_reset - reset pmu state for cpu
+ * @vcpu: The vcpu pointer
+ *
+ */
+void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu)
+{
+   int i;
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+
+   for (i = 0; i < ARMV8_MAX_COUNTERS; i++) {
+   kvm_pmu_stop_counter(&pmu->pmc[i]);
+   pmu->pmc[i].idx = i;
+   pmu->pmc[i].vcpu = vcpu;
+   pmu->pmc[i].bitmask = 0xUL;
+   }
+   pmu->irq_pending = false;
+}
+
+/**
  * kvm_pmu_sync_hwstate - sync pmu state for cpu
  * @vcpu: The vcpu pointer
  *
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 17/21] KVM: ARM64: Add helper to handle PMCR register bits

From: Shannon Zhao 

According to ARMv8 spec, when writing 1 to PMCR.E, all counters are
enabled by PMCNTENSET, while writing 0 to PMCR.E, all counters are
disabled. When writing 1 to PMCR.P, reset all event counters, not
including PMCCNTR, to zero. When writing 1 to PMCR.C, reset PMCCNTR to
zero.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c |  2 ++
 include/kvm/arm_pmu.h |  2 ++
 virt/kvm/arm/pmu.c| 50 +++
 3 files changed, 54 insertions(+)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 50bf3fb..a0bb9d2 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -578,6 +578,7 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
val &= ~ARMV8_PMCR_MASK;
val |= *vcpu_reg(vcpu, p->Rt) & ARMV8_PMCR_MASK;
vcpu_sys_reg(vcpu, r->reg) = val;
+   kvm_pmu_handle_pmcr(vcpu, val);
break;
}
case PMCEID0_EL0:
@@ -1213,6 +1214,7 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
val &= ~ARMV8_PMCR_MASK;
val |= *vcpu_reg(vcpu, p->Rt) & ARMV8_PMCR_MASK;
vcpu_cp15(vcpu, r->reg) = val;
+   kvm_pmu_handle_pmcr(vcpu, val);
break;
}
case c9_PMCEID0:
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index d7de7f1..acd025a 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -47,6 +47,7 @@ void kvm_pmu_overflow_set(struct kvm_vcpu *vcpu, u32 val);
 void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u32 val);
 void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u32 data,
u32 select_idx);
+void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u32 val);
 #else
 unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 select_idx)
 {
@@ -59,6 +60,7 @@ void kvm_pmu_overflow_set(struct kvm_vcpu *vcpu, u32 val) {}
 void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u32 val) {}
 void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u32 data,
u32 select_idx) {}
+void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u32 val) {}
 #endif
 
 #endif
diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index ae21089..11d1bfb 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -121,6 +121,56 @@ void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u32 
val)
 }
 
 /**
+ * kvm_pmu_handle_pmcr - handle PMCR register
+ * @vcpu: The vcpu pointer
+ * @val: the value guest writes to PMCR register
+ */
+void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u32 val)
+{
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+   struct kvm_pmc *pmc;
+   u32 enable;
+   int i;
+
+   if (val & ARMV8_PMCR_E) {
+   if (!vcpu_mode_is_32bit(vcpu))
+   enable = vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
+   else
+   enable = vcpu_cp15(vcpu, c9_PMCNTENSET);
+
+   kvm_pmu_enable_counter(vcpu, enable, true);
+   } else
+   kvm_pmu_disable_counter(vcpu, 0xUL);
+
+   if (val & ARMV8_PMCR_C) {
+   pmc = &pmu->pmc[ARMV8_MAX_COUNTERS - 1];
+   if (pmc->perf_event)
+   local64_set(&pmc->perf_event->count, 0);
+   if (!vcpu_mode_is_32bit(vcpu))
+   vcpu_sys_reg(vcpu, PMCCNTR_EL0) = 0;
+   else
+   vcpu_cp15(vcpu, c9_PMCCNTR) = 0;
+   }
+
+   if (val & ARMV8_PMCR_P) {
+   for (i = 0; i < ARMV8_MAX_COUNTERS - 1; i++) {
+   pmc = &pmu->pmc[i];
+   if (pmc->perf_event)
+   local64_set(&pmc->perf_event->count, 0);
+   if (!vcpu_mode_is_32bit(vcpu))
+   vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i) = 0;
+   else
+   vcpu_cp15(vcpu, c14_PMEVCNTR0 + i) = 0;
+   }
+   }
+
+   if (val & ARMV8_PMCR_LC) {
+   pmc = &pmu->pmc[ARMV8_MAX_COUNTERS - 1];
+   pmc->bitmask = 0xUL;
+   }
+}
+
+/**
  * kvm_pmu_overflow_clear - clear PMU overflow interrupt
  * @vcpu: The vcpu pointer
  * @val: the value guest writes to PMOVSCLR register
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 21/21] KVM: ARM64: Add a new kvm ARM PMU device

From: Shannon Zhao 

Add a new kvm device type KVM_DEV_TYPE_ARM_PMU_V3 for ARM PMU. Implement
the kvm_device_ops for it.

Signed-off-by: Shannon Zhao 
---
 Documentation/virtual/kvm/devices/arm-pmu.txt | 15 +
 arch/arm64/include/uapi/asm/kvm.h |  3 +
 include/linux/kvm_host.h  |  1 +
 include/uapi/linux/kvm.h  |  2 +
 virt/kvm/arm/pmu.c| 92 +++
 virt/kvm/arm/vgic.c   |  8 +++
 virt/kvm/arm/vgic.h   |  1 +
 virt/kvm/kvm_main.c   |  4 ++
 8 files changed, 126 insertions(+)
 create mode 100644 Documentation/virtual/kvm/devices/arm-pmu.txt

diff --git a/Documentation/virtual/kvm/devices/arm-pmu.txt 
b/Documentation/virtual/kvm/devices/arm-pmu.txt
new file mode 100644
index 000..49481c4
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/arm-pmu.txt
@@ -0,0 +1,15 @@
+ARM Virtual Performance Monitor Unit (vPMU)
+===
+
+Device types supported:
+  KVM_DEV_TYPE_ARM_PMU_V3 ARM Performance Monitor Unit v3
+
+Instantiate one PMU instance for per VCPU through this API.
+
+Groups:
+  KVM_DEV_ARM_PMU_GRP_IRQ
+  Attributes:
+A value describing the interrupt number of PMU overflow interrupt.
+
+  Errors:
+-EINVAL: Value set is out of the expected range
diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index 0cd7b59..1309a93 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -204,6 +204,9 @@ struct kvm_arch_memory_slot {
 #define KVM_DEV_ARM_VGIC_GRP_CTRL  4
 #define   KVM_DEV_ARM_VGIC_CTRL_INIT   0
 
+/* Device Control API: ARM PMU */
+#define KVM_DEV_ARM_PMU_GRP_IRQ0
+
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_TYPE_SHIFT 24
 #define KVM_ARM_IRQ_TYPE_MASK  0xff
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1bef9e2..f6be696 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1122,6 +1122,7 @@ extern struct kvm_device_ops kvm_mpic_ops;
 extern struct kvm_device_ops kvm_xics_ops;
 extern struct kvm_device_ops kvm_arm_vgic_v2_ops;
 extern struct kvm_device_ops kvm_arm_vgic_v3_ops;
+extern struct kvm_device_ops kvm_arm_pmu_ops;
 
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a9256f0..f41e6b6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1025,6 +1025,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_FLIC  KVM_DEV_TYPE_FLIC
KVM_DEV_TYPE_ARM_VGIC_V3,
 #define KVM_DEV_TYPE_ARM_VGIC_V3   KVM_DEV_TYPE_ARM_VGIC_V3
+   KVM_DEV_TYPE_ARM_PMU_V3,
+#defineKVM_DEV_TYPE_ARM_PMU_V3 KVM_DEV_TYPE_ARM_PMU_V3
KVM_DEV_TYPE_MAX,
 };
 
diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index d78ce7b..0a00d04 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -19,10 +19,13 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 
+#include "vgic.h"
+
 /**
  * kvm_pmu_get_counter_value - get PMU counter value
  * @vcpu: The vcpu pointer
@@ -416,3 +419,92 @@ void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, 
u32 data,
 
pmc->perf_event = event;
 }
+
+static int kvm_arm_pmu_set_irq(struct kvm *kvm, int irq)
+{
+   int j;
+   struct kvm_vcpu *vcpu;
+
+   kvm_for_each_vcpu(j, vcpu, kvm) {
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+
+   kvm_debug("Set kvm ARM PMU irq: %d\n", irq);
+   pmu->irq_num = irq;
+   vgic_dist_irq_set_cfg(vcpu, irq, true);
+   }
+
+   return 0;
+}
+
+static int kvm_arm_pmu_create(struct kvm_device *dev, u32 type)
+{
+   int i, j;
+   struct kvm_vcpu *vcpu;
+   struct kvm *kvm = dev->kvm;
+
+   kvm_for_each_vcpu(j, vcpu, kvm) {
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+
+   memset(pmu, 0, sizeof(*pmu));
+   for (i = 0; i < ARMV8_MAX_COUNTERS; i++) {
+   pmu->pmc[i].idx = i;
+   pmu->pmc[i].vcpu = vcpu;
+   pmu->pmc[i].bitmask = 0xUL;
+   }
+   pmu->irq_num = -1;
+   }
+
+   return 0;
+}
+
+static void kvm_arm_pmu_destroy(struct kvm_device *dev)
+{
+   kfree(dev);
+}
+
+static int kvm_arm_pmu_set_attr(struct kvm_device *dev,
+   struct kvm_device_attr *attr)
+{
+   switch (attr->group) {
+   case KVM_DEV_ARM_PMU_GRP_IRQ: {
+   int __user *uaddr = (int __user *)(long)attr->addr;
+   int reg;
+
+   if (get_user(reg, uaddr))
+   return -EFAULT;
+
+   if (reg < VGIC_NR_SGIS || reg > dev->kvm->arch.vgic.nr_irqs)
+   return -EINVAL;
+
+   return kvm_arm_pmu_set_irq(dev->kvm,

[PATCH v4 18/21] KVM: ARM64: Add PMU overflow interrupt routing

From: Shannon Zhao 

When calling perf_event_create_kernel_counter to create perf_event,
assign a overflow handler. Then when perf event overflows, set
irq_pending and call kvm_vcpu_kick() to sync the interrupt.

Signed-off-by: Shannon Zhao 
---
 arch/arm/kvm/arm.c|  4 +++
 include/kvm/arm_pmu.h |  4 +++
 virt/kvm/arm/pmu.c| 76 ++-
 3 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 78b2869..9c0fec4 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -551,6 +552,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
 
if (ret <= 0 || need_new_vmid_gen(vcpu->kvm)) {
local_irq_enable();
+   kvm_pmu_sync_hwstate(vcpu);
kvm_vgic_sync_hwstate(vcpu);
preempt_enable();
kvm_timer_sync_hwstate(vcpu);
@@ -598,6 +600,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
kvm_guest_exit();
trace_kvm_exit(kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
 
+   kvm_pmu_post_sync_hwstate(vcpu);
+
kvm_vgic_sync_hwstate(vcpu);
 
preempt_enable();
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index acd025a..5e7f943 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -39,6 +39,8 @@ struct kvm_pmu {
 };
 
 #ifdef CONFIG_KVM_ARM_PMU
+void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu);
+void kvm_pmu_post_sync_hwstate(struct kvm_vcpu *vcpu);
 unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 select_idx);
 void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u32 val);
 void kvm_pmu_enable_counter(struct kvm_vcpu *vcpu, u32 val, bool all_enable);
@@ -49,6 +51,8 @@ void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, 
u32 data,
u32 select_idx);
 void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u32 val);
 #else
+void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu) {}
+void kvm_pmu_post_sync_hwstate(struct kvm_vcpu *vcpu) {}
 unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 select_idx)
 {
return 0;
diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index 11d1bfb..6d48d9a 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /**
  * kvm_pmu_get_counter_value - get PMU counter value
@@ -69,6 +70,78 @@ static void kvm_pmu_stop_counter(struct kvm_pmc *pmc)
 }
 
 /**
+ * kvm_pmu_sync_hwstate - sync pmu state for cpu
+ * @vcpu: The vcpu pointer
+ *
+ * Inject virtual PMU IRQ if IRQ is pending for this cpu.
+ */
+void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu)
+{
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+   u32 overflow;
+
+   if (!vcpu_mode_is_32bit(vcpu))
+   overflow = vcpu_sys_reg(vcpu, PMOVSSET_EL0);
+   else
+   overflow = vcpu_cp15(vcpu, c9_PMOVSSET);
+
+   if ((pmu->irq_pending || overflow != 0) && (pmu->irq_num != -1))
+   kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id, pmu->irq_num, 1);
+
+   pmu->irq_pending = false;
+}
+
+/**
+ * kvm_pmu_post_sync_hwstate - post sync pmu state for cpu
+ * @vcpu: The vcpu pointer
+ *
+ * Inject virtual PMU IRQ if IRQ is pending for this cpu when back from guest.
+ */
+void kvm_pmu_post_sync_hwstate(struct kvm_vcpu *vcpu)
+{
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+
+   if (pmu->irq_pending && (pmu->irq_num != -1))
+   kvm_vgic_inject_irq(vcpu->kvm, vcpu->vcpu_id, pmu->irq_num, 1);
+
+   pmu->irq_pending = false;
+}
+
+/**
+ * When perf event overflows, set irq_pending and call kvm_vcpu_kick() to 
inject
+ * the interrupt.
+ */
+static void kvm_pmu_perf_overflow(struct perf_event *perf_event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+   struct kvm_pmc *pmc = perf_event->overflow_handler_context;
+   struct kvm_vcpu *vcpu = pmc->vcpu;
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+   int idx = pmc->idx;
+
+   if (!vcpu_mode_is_32bit(vcpu)) {
+   if ((vcpu_sys_reg(vcpu, PMINTENSET_EL1) >> idx) & 0x1) {
+   __set_bit(idx,
+   (unsigned long *)&vcpu_sys_reg(vcpu, PMOVSSET_EL0));
+   __set_bit(idx,
+   (unsigned long *)&vcpu_sys_reg(vcpu, PMOVSCLR_EL0));
+   pmu->irq_pending = true;
+   kvm_vcpu_kick(vcpu);
+   }
+   } else {
+   if ((vcpu_cp15(vcpu, c9_PMINTENSET) >> idx) & 0x1) {
+   __set_bit(idx,
+   (unsigned long *)&vcpu_cp15(vcpu, c9_PMOVSSET));
+

[PATCH v4 13/21] KVM: ARM64: Add reset and access handlers for PMOVSSET and PMOVSCLR register

From: Shannon Zhao 

Since the reset value of PMOVSSET and PMOVSCLR is UNKNOWN, use
reset_unknown for its reset handler. Add a new case to emulate writing
PMOVSSET or PMOVSCLR register.

When writing non-zero value to PMOVSSET, pend PMU interrupt. When the
value writing to PMOVSCLR is equal to the current value, clear the PMU
pending interrupt.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c | 39 ---
 include/kvm/arm_pmu.h |  4 
 virt/kvm/arm/pmu.c| 30 ++
 3 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 6d2febf..e03d3b8d 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -552,6 +552,21 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
vcpu_sys_reg(vcpu, PMINTENSET_EL1) &= ~val;
break;
}
+   case PMOVSSET_EL0: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   kvm_pmu_overflow_set(vcpu, val);
+   vcpu_sys_reg(vcpu, r->reg) |= val;
+   vcpu_sys_reg(vcpu, PMOVSCLR_EL0) |= val;
+   break;
+   }
+   case PMOVSCLR_EL0: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   kvm_pmu_overflow_clear(vcpu, val,
+  vcpu_sys_reg(vcpu, r->reg));
+   vcpu_sys_reg(vcpu, r->reg) &= ~val;
+   vcpu_sys_reg(vcpu, PMOVSSET_EL0) &= ~val;
+   break;
+   }
case PMCR_EL0: {
/* Only update writeable bits of PMCR */
val = vcpu_sys_reg(vcpu, r->reg);
@@ -790,7 +805,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  access_pmu_regs, reset_unknown, PMCNTENCLR_EL0 },
/* PMOVSCLR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b011),
- trap_raz_wi },
+ access_pmu_regs, reset_unknown, PMOVSCLR_EL0 },
/* PMSWINC_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b100),
  trap_raz_wi },
@@ -817,7 +832,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  trap_raz_wi },
/* PMOVSSET_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b011),
- trap_raz_wi },
+ access_pmu_regs, reset_unknown, PMOVSSET_EL0 },
 
/* TPIDR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1101), CRm(0b), Op2(0b010),
@@ -1083,6 +1098,21 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
vcpu_cp15(vcpu, c9_PMINTENSET) &= ~val;
break;
}
+   case c9_PMOVSSET: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   kvm_pmu_overflow_set(vcpu, val);
+   vcpu_cp15(vcpu, r->reg) |= val;
+   vcpu_cp15(vcpu, c9_PMOVSCLR) |= val;
+   break;
+   }
+   case c9_PMOVSCLR: {
+   val = *vcpu_reg(vcpu, p->Rt);
+   kvm_pmu_overflow_clear(vcpu, val,
+  vcpu_cp15(vcpu, r->reg));
+   vcpu_cp15(vcpu, r->reg) &= ~val;
+   vcpu_cp15(vcpu, c9_PMOVSSET) &= ~val;
+   break;
+   }
case c9_PMCR: {
/* Only update writeable bits of PMCR */
val = vcpu_cp15(vcpu, r->reg);
@@ -1162,7 +1192,8 @@ static const struct sys_reg_desc cp15_regs[] = {
  reset_unknown_cp15, c9_PMCNTENSET },
{ Op1( 0), CRn( 9), CRm(12), Op2( 2), access_pmu_cp15_regs,
  reset_unknown_cp15, c9_PMCNTENCLR },
-   { Op1( 0), CRn( 9), CRm(12), Op2( 3), trap_raz_wi },
+   { Op1( 0), CRn( 9), CRm(12), Op2( 3), access_pmu_cp15_regs,
+ reset_unknown_cp15, c9_PMOVSCLR },
{ Op1( 0), CRn( 9), CRm(12), Op2( 5), access_pmu_cp15_regs,
  reset_unknown_cp15, c9_PMSELR },
{ Op1( 0), CRn( 9), CRm(12), Op2( 6), access_pmu_cp15_regs,
@@ -1180,6 +1211,8 @@ static const struct sys_reg_desc cp15_regs[] = {
  reset_unknown_cp15, c9_PMINTENSET },
{ Op1( 0), CRn( 9), CRm(14), Op2( 2), access_pmu_cp15_regs,
  reset_unknown_cp15, c9_PMINTENCLR },
+   { Op1( 0), CRn( 9), CRm(14), Op2( 3), access_pmu_cp15_regs,
+ reset_unknown_cp15, c9_PMOVSSET },
 
{ Op1( 0), CRn(10), CRm( 2), Op2( 0), access_vm_reg, NULL, c10_PRRR },
{ Op1( 0), CRn(10), CRm( 2), Op2( 1), access_vm_reg, NULL, c10_NMRR },
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 53d5907..ff17578 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -42,6 +42,8 @@ struct kvm_pmu {
 unsigned long kvm_pmu_get_counter_v

[PATCH v4 04/21] KVM: ARM64: Add reset and access handlers for PMCR_EL0 register

From: Shannon Zhao 

Add reset handler which gets host value of PMCR_EL0 and make writable
bits architecturally UNKNOWN except PMCR.E to zero. Add a common access
handler for PMU registers which emulates writing and reading register
and add emulation for PMCR.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c | 106 +-
 1 file changed, 104 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index d03d3af..5b591d6 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -446,6 +447,67 @@ static void reset_mpidr(struct kvm_vcpu *vcpu, const 
struct sys_reg_desc *r)
vcpu_sys_reg(vcpu, MPIDR_EL1) = (1ULL << 31) | mpidr;
 }
 
+static void vcpu_sysreg_write(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *r, u64 val)
+{
+   if (!vcpu_mode_is_32bit(vcpu))
+   vcpu_sys_reg(vcpu, r->reg) = val;
+   else
+   vcpu_cp15(vcpu, r->reg) = lower_32_bits(val);
+}
+
+static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
+{
+   u64 pmcr, val;
+
+   asm volatile("mrs %0, pmcr_el0\n" : "=r" (pmcr));
+   /* Writable bits of PMCR_EL0 (ARMV8_PMCR_MASK) is reset to UNKNOWN
+* except PMCR.E resetting to zero.
+*/
+   val = ((pmcr & ~ARMV8_PMCR_MASK) | (ARMV8_PMCR_MASK & 0xdecafbad))
+ & (~ARMV8_PMCR_E);
+   vcpu_sysreg_write(vcpu, r, val);
+}
+
+/* PMU registers accessor. */
+static bool access_pmu_regs(struct kvm_vcpu *vcpu,
+   const struct sys_reg_params *p,
+   const struct sys_reg_desc *r)
+{
+   unsigned long val;
+
+   if (p->is_write) {
+   switch (r->reg) {
+   case PMCR_EL0: {
+   /* Only update writeable bits of PMCR */
+   val = vcpu_sys_reg(vcpu, r->reg);
+   val &= ~ARMV8_PMCR_MASK;
+   val |= *vcpu_reg(vcpu, p->Rt) & ARMV8_PMCR_MASK;
+   vcpu_sys_reg(vcpu, r->reg) = val;
+   break;
+   }
+   default:
+   vcpu_sys_reg(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
+   break;
+   }
+   } else {
+   switch (r->reg) {
+   case PMCR_EL0: {
+   /* PMCR.P & PMCR.C are RAZ */
+   val = vcpu_sys_reg(vcpu, r->reg)
+ & ~(ARMV8_PMCR_P | ARMV8_PMCR_C);
+   *vcpu_reg(vcpu, p->Rt) = val;
+   break;
+   }
+   default:
+   *vcpu_reg(vcpu, p->Rt) = vcpu_sys_reg(vcpu, r->reg);
+   break;
+   }
+   }
+
+   return true;
+}
+
 /* Silly macro to expand the DBG{BCR,BVR,WVR,WCR}n_EL1 registers in one go */
 #define DBG_BCR_BVR_WCR_WVR_EL1(n) \
/* DBGBVRn_EL1 */   \
@@ -630,7 +692,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 
/* PMCR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b000),
- trap_raz_wi },
+ access_pmu_regs, reset_pmcr, PMCR_EL0, },
/* PMCNTENSET_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b001),
  trap_raz_wi },
@@ -864,6 +926,45 @@ static const struct sys_reg_desc cp14_64_regs[] = {
{ Op1( 0), CRm( 2), .access = trap_raz_wi },
 };
 
+/* PMU CP15 registers accessor. */
+static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
+const struct sys_reg_params *p,
+const struct sys_reg_desc *r)
+{
+   unsigned long val;
+
+   if (p->is_write) {
+   switch (r->reg) {
+   case c9_PMCR: {
+   /* Only update writeable bits of PMCR */
+   val = vcpu_cp15(vcpu, r->reg);
+   val &= ~ARMV8_PMCR_MASK;
+   val |= *vcpu_reg(vcpu, p->Rt) & ARMV8_PMCR_MASK;
+   vcpu_cp15(vcpu, r->reg) = val;
+   break;
+   }
+   default:
+   vcpu_cp15(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
+   break;
+   }
+   } else {
+   switch (r->reg) {
+   case c9_PMCR: {
+   /* PMCR.P & PMCR.C are RAZ */
+   val = vcpu_cp15(vcpu, r->reg)
+ & ~(ARMV8_PMCR_P | ARMV8_PMCR_C);
+   *vcpu_reg(vcpu, p->Rt) = val;
+   break;
+   }
+   default:
+   *vcpu_reg(vcpu, p->Rt) = vcpu_cp15(vcpu, r->reg);
+

[PATCH v4 01/21] ARM64: Move PMU register related defines to asm/pmu.h

From: Shannon Zhao 

To use the ARMv8 PMU related register defines from the KVM code,
we move the relevant definitions to asm/pmu.h header file.

Signed-off-by: Anup Patel 
Signed-off-by: Shannon Zhao 
---
 arch/arm64/include/asm/pmu.h   | 45 ++
 arch/arm64/kernel/perf_event.c | 35 
 2 files changed, 45 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/include/asm/pmu.h b/arch/arm64/include/asm/pmu.h
index b7710a5..b9f394a 100644
--- a/arch/arm64/include/asm/pmu.h
+++ b/arch/arm64/include/asm/pmu.h
@@ -19,6 +19,51 @@
 #ifndef __ASM_PMU_H
 #define __ASM_PMU_H
 
+#define ARMV8_MAX_COUNTERS  32
+#define ARMV8_COUNTER_MASK  (ARMV8_MAX_COUNTERS - 1)
+
+/*
+ * Per-CPU PMCR: config reg
+ */
+#define ARMV8_PMCR_E   (1 << 0) /* Enable all counters */
+#define ARMV8_PMCR_P   (1 << 1) /* Reset all counters */
+#define ARMV8_PMCR_C   (1 << 2) /* Cycle counter reset */
+#define ARMV8_PMCR_D   (1 << 3) /* CCNT counts every 64th cpu cycle */
+#define ARMV8_PMCR_X   (1 << 4) /* Export to ETM */
+#define ARMV8_PMCR_DP  (1 << 5) /* Disable CCNT if non-invasive debug*/
+#defineARMV8_PMCR_N_SHIFT  11   /* Number of counters 
supported */
+#defineARMV8_PMCR_N_MASK   0x1f
+#defineARMV8_PMCR_MASK 0x3f /* Mask for writable bits */
+
+/*
+ * PMCNTEN: counters enable reg
+ */
+#defineARMV8_CNTEN_MASK0x  /* Mask for writable 
bits */
+
+/*
+ * PMINTEN: counters interrupt enable reg
+ */
+#defineARMV8_INTEN_MASK0x  /* Mask for writable 
bits */
+
+/*
+ * PMOVSR: counters overflow flag status reg
+ */
+#defineARMV8_OVSR_MASK 0x  /* Mask for writable 
bits */
+#defineARMV8_OVERFLOWED_MASK   ARMV8_OVSR_MASK
+
+/*
+ * PMXEVTYPER: Event selection reg
+ */
+#defineARMV8_EVTYPE_MASK   0xc80003ff  /* Mask for writable 
bits */
+#defineARMV8_EVTYPE_EVENT  0x3ff   /* Mask for EVENT bits 
*/
+
+/*
+ * Event filters for PMUv3
+ */
+#defineARMV8_EXCLUDE_EL1   (1 << 31)
+#defineARMV8_EXCLUDE_EL0   (1 << 30)
+#defineARMV8_INCLUDE_EL2   (1 << 27)
+
 #ifdef CONFIG_HW_PERF_EVENTS
 
 /* The events for a given PMU register set. */
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index f9a74d4..534e8ad 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -741,9 +741,6 @@ static const unsigned 
armv8_pmuv3_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
 #defineARMV8_IDX_COUNTER0  1
 #defineARMV8_IDX_COUNTER_LAST  (ARMV8_IDX_CYCLE_COUNTER + 
cpu_pmu->num_events - 1)
 
-#defineARMV8_MAX_COUNTERS  32
-#defineARMV8_COUNTER_MASK  (ARMV8_MAX_COUNTERS - 1)
-
 /*
  * ARMv8 low level PMU access
  */
@@ -754,38 +751,6 @@ static const unsigned 
armv8_pmuv3_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
 #defineARMV8_IDX_TO_COUNTER(x) \
(((x) - ARMV8_IDX_COUNTER0) & ARMV8_COUNTER_MASK)
 
-/*
- * Per-CPU PMCR: config reg
- */
-#define ARMV8_PMCR_E   (1 << 0) /* Enable all counters */
-#define ARMV8_PMCR_P   (1 << 1) /* Reset all counters */
-#define ARMV8_PMCR_C   (1 << 2) /* Cycle counter reset */
-#define ARMV8_PMCR_D   (1 << 3) /* CCNT counts every 64th cpu cycle */
-#define ARMV8_PMCR_X   (1 << 4) /* Export to ETM */
-#define ARMV8_PMCR_DP  (1 << 5) /* Disable CCNT if non-invasive debug*/
-#defineARMV8_PMCR_N_SHIFT  11   /* Number of counters 
supported */
-#defineARMV8_PMCR_N_MASK   0x1f
-#defineARMV8_PMCR_MASK 0x3f /* Mask for writable bits */
-
-/*
- * PMOVSR: counters overflow flag status reg
- */
-#defineARMV8_OVSR_MASK 0x  /* Mask for writable 
bits */
-#defineARMV8_OVERFLOWED_MASK   ARMV8_OVSR_MASK
-
-/*
- * PMXEVTYPER: Event selection reg
- */
-#defineARMV8_EVTYPE_MASK   0xc80003ff  /* Mask for writable 
bits */
-#defineARMV8_EVTYPE_EVENT  0x3ff   /* Mask for EVENT bits 
*/
-
-/*
- * Event filters for PMUv3
- */
-#defineARMV8_EXCLUDE_EL1   (1 << 31)
-#defineARMV8_EXCLUDE_EL0   (1 << 30)
-#defineARMV8_INCLUDE_EL2   (1 << 27)
-
 static inline u32 armv8pmu_pmcr_read(void)
 {
u32 val;
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 20/21] KVM: ARM64: Free perf event of PMU when destroying vcpu

From: Shannon Zhao 

When KVM frees VCPU, it needs to free the perf_event of PMU.

Signed-off-by: Shannon Zhao 
---
 arch/arm/kvm/arm.c|  1 +
 include/kvm/arm_pmu.h |  2 ++
 virt/kvm/arm/pmu.c| 21 +
 3 files changed, 24 insertions(+)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 9c0fec4..90ddb93 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -259,6 +259,7 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
kvm_mmu_free_memory_caches(vcpu);
kvm_timer_vcpu_terminate(vcpu);
kvm_vgic_vcpu_destroy(vcpu);
+   kvm_pmu_vcpu_destroy(vcpu);
kmem_cache_free(kvm_vcpu_cache, vcpu);
 }
 
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index e708c49..f2cd8d9 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -40,6 +40,7 @@ struct kvm_pmu {
 
 #ifdef CONFIG_KVM_ARM_PMU
 void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu);
+void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu);
 void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu);
 void kvm_pmu_post_sync_hwstate(struct kvm_vcpu *vcpu);
 unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 select_idx);
@@ -53,6 +54,7 @@ void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, 
u32 data,
 void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u32 val);
 #else
 void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu) {}
+void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu) {}
 void kvm_pmu_sync_hwstate(struct kvm_vcpu *vcpu) {}
 void kvm_pmu_post_sync_hwstate(struct kvm_vcpu *vcpu) {}
 unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 select_idx)
diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
index 84720a2..d78ce7b 100644
--- a/virt/kvm/arm/pmu.c
+++ b/virt/kvm/arm/pmu.c
@@ -89,6 +89,27 @@ void kvm_pmu_vcpu_reset(struct kvm_vcpu *vcpu)
 }
 
 /**
+ * kvm_pmu_vcpu_destroy - free perf event of PMU for cpu
+ * @vcpu: The vcpu pointer
+ *
+ */
+void kvm_pmu_vcpu_destroy(struct kvm_vcpu *vcpu)
+{
+   int i;
+   struct kvm_pmu *pmu = &vcpu->arch.pmu;
+
+   for (i = 0; i < ARMV8_MAX_COUNTERS; i++) {
+   struct kvm_pmc *pmc = &pmu->pmc[i];
+
+   if (pmc->perf_event) {
+   perf_event_disable(pmc->perf_event);
+   perf_event_release_kernel(pmc->perf_event);
+   pmc->perf_event = NULL;
+   }
+   }
+}
+
+/**
  * kvm_pmu_sync_hwstate - sync pmu state for cpu
  * @vcpu: The vcpu pointer
  *
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 03/21] KVM: ARM64: Add offset defines for PMU registers

From: Shannon Zhao 

We are about to trap and emulate acccesses to each PMU register
individually. This adds the context offsets for the AArch64 PMU
registers and their AArch32 counterparts.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/include/asm/kvm_asm.h | 55 
 1 file changed, 50 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 5e37710..4f804c1 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -48,12 +48,34 @@
 #define MDSCR_EL1  22  /* Monitor Debug System Control Register */
 #define MDCCINT_EL123  /* Monitor Debug Comms Channel Interrupt Enable 
Reg */
 
+/* Performance Monitors Registers */
+#define PMCR_EL0   24  /* Control Register */
+#define PMOVSSET_EL0   25  /* Overflow Flag Status Set Register */
+#define PMOVSCLR_EL0   26  /* Overflow Flag Status Clear Register */
+#define PMSELR_EL0 27  /* Event Counter Selection Register */
+#define PMCEID0_EL028  /* Common Event Identification Register 0 */
+#define PMCEID1_EL029  /* Common Event Identification Register 1 */
+#define PMEVCNTR0_EL0  30  /* Event Counter Register (0-30) */
+#define PMEVCNTR30_EL0 60
+#define PMCCNTR_EL061  /* Cycle Counter Register */
+#define PMEVTYPER0_EL0 62  /* Event Type Register (0-30) */
+#define PMEVTYPER30_EL092
+#define PMCCFILTR_EL0  93  /* Cycle Count Filter Register */
+#define PMXEVCNTR_EL0  94  /* Selected Event Count Register */
+#define PMXEVTYPER_EL0 95  /* Selected Event Type Register */
+#define PMCNTENSET_EL0 96  /* Count Enable Set Register */
+#define PMCNTENCLR_EL0 97  /* Count Enable Clear Register */
+#define PMINTENSET_EL1 98  /* Interrupt Enable Set Register */
+#define PMINTENCLR_EL1 99  /* Interrupt Enable Clear Register */
+#define PMUSERENR_EL0  100 /* User Enable Register */
+#define PMSWINC_EL0101 /* Software Increment Register */
+
 /* 32bit specific registers. Keep them at the end of the range */
-#defineDACR32_EL2  24  /* Domain Access Control Register */
-#defineIFSR32_EL2  25  /* Instruction Fault Status Register */
-#defineFPEXC32_EL2 26  /* Floating-Point Exception Control 
Register */
-#defineDBGVCR32_EL227  /* Debug Vector Catch Register */
-#defineNR_SYS_REGS 28
+#defineDACR32_EL2  102 /* Domain Access Control Register */
+#defineIFSR32_EL2  103 /* Instruction Fault Status Register */
+#defineFPEXC32_EL2 104 /* Floating-Point Exception Control 
Register */
+#defineDBGVCR32_EL2105 /* Debug Vector Catch Register */
+#defineNR_SYS_REGS 106
 
 /* 32bit mapping */
 #define c0_MPIDR   (MPIDR_EL1 * 2) /* MultiProcessor ID Register */
@@ -75,6 +97,24 @@
 #define c6_IFAR(c6_DFAR + 1)   /* Instruction Fault Address 
Register */
 #define c7_PAR (PAR_EL1 * 2)   /* Physical Address Register */
 #define c7_PAR_high(c7_PAR + 1)/* PAR top 32 bits */
+
+/* Performance Monitors*/
+#define c9_PMCR(PMCR_EL0 * 2)
+#define c9_PMOVSSET(PMOVSSET_EL0 * 2)
+#define c9_PMOVSCLR(PMOVSCLR_EL0 * 2)
+#define c9_PMCCNTR (PMCCNTR_EL0 * 2)
+#define c9_PMSELR  (PMSELR_EL0 * 2)
+#define c9_PMCEID0 (PMCEID0_EL0 * 2)
+#define c9_PMCEID1 (PMCEID1_EL0 * 2)
+#define c9_PMXEVCNTR   (PMXEVCNTR_EL0 * 2)
+#define c9_PMXEVTYPER  (PMXEVTYPER_EL0 * 2)
+#define c9_PMCNTENSET  (PMCNTENSET_EL0 * 2)
+#define c9_PMCNTENCLR  (PMCNTENCLR_EL0 * 2)
+#define c9_PMINTENSET  (PMINTENSET_EL1 * 2)
+#define c9_PMINTENCLR  (PMINTENCLR_EL1 * 2)
+#define c9_PMUSERENR   (PMUSERENR_EL0 * 2)
+#define c9_PMSWINC (PMSWINC_EL0 * 2)
+
 #define c10_PRRR   (MAIR_EL1 * 2)  /* Primary Region Remap Register */
 #define c10_NMRR   (c10_PRRR + 1)  /* Normal Memory Remap Register */
 #define c12_VBAR   (VBAR_EL1 * 2)  /* Vector Base Address Register */
@@ -86,6 +126,11 @@
 #define c10_AMAIR1 (c10_AMAIR0 + 1)/* Aux Memory Attr Indirection Reg */
 #define c14_CNTKCTL(CNTKCTL_EL1 * 2) /* Timer Control Register (PL1) */
 
+/* Performance Monitors*/
+#define c14_PMEVCNTR0  (PMEVCNTR0_EL0 * 2)
+#define c14_PMEVTYPER0 (PMEVTYPER0_EL0 * 2)
+#define c14_PMCCFILTR  (PMCCFILTR_EL0 * 2)
+
 #define cp14_DBGDSCRext(MDSCR_EL1 * 2)
 #define cp14_DBGBCR0   (DBGBCR0_EL1 * 2)
 #define cp14_DBGBVR0   (DBGBVR0_EL1 * 2)
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 00/21] KVM: ARM64: Add guest PMU support

From: Shannon Zhao 

This patchset adds guest PMU support for KVM on ARM64. It takes
trap-and-emulate approach. When guest wants to monitor one event, it
will be trapped by KVM and KVM will call perf_event API to create a perf
event and call relevant perf_event APIs to get the count value of event.

Use perf to test this patchset in guest. When using "perf list", it
shows the list of the hardware events and hardware cache events perf
supports. Then use "perf stat -e EVENT" to monitor some event. For
example, use "perf stat -e cycles" to count cpu cycles and
"perf stat -e cache-misses" to count cache misses.

Below are the outputs of "perf stat -r 5 sleep 5" when running in host
and guest.

Host:
 Performance counter stats for 'sleep 5' (5 runs):

  0.522048  task-clock (msec) #0.000 CPUs utilized  
  ( +-  1.50% )
 1  context-switches  #0.002 M/sec
 0  cpu-migrations#0.383 K/sec  
  ( +-100.00% )
48  page-faults   #0.092 M/sec  
  ( +-  0.66% )
   1088597  cycles#2.085 GHz
  ( +-  1.50% )
 stalled-cycles-frontend
 stalled-cycles-backend
524457  instructions  #0.48  insns per cycle
  ( +-  0.89% )
 branches
  9688  branch-misses #   18.557 M/sec  
  ( +-  1.78% )

   5.000851736 seconds time elapsed 
 ( +-  0.00% )

Guest:
 Performance counter stats for 'sleep 5' (5 runs):

  0.632288  task-clock (msec) #0.000 CPUs utilized  
  ( +-  1.11% )
 1  context-switches  #0.002 M/sec
 0  cpu-migrations#0.000 K/sec
49  page-faults   #0.078 M/sec  
  ( +-  1.19% )
   1119933  cycles#1.771 GHz
  ( +-  1.19% )
 stalled-cycles-frontend
 stalled-cycles-backend
568318  instructions  #0.51  insns per cycle
  ( +-  0.91% )
 branches
 10227  branch-misses #   16.175 M/sec  
  ( +-  1.71% )

   5.001170616 seconds time elapsed 
 ( +-  0.00% )

Have a cycle counter read test like below in guest and host:

static void test(void)
{
unsigned long count, count1, count2;
count1 = read_cycles();
count++;
count2 = read_cycles();
}

Host:
count1: 3044948797
count2: 3044948931
delta: 134

Guest:
count1: 5782364731
count2: 5782364885
delta: 154

The gap between guest and host is very small. One reason for this I
think is that it doesn't count the cycles in EL2 and host since we add
exclude_hv = 1. So the cycles spent to store/restore registers which
happens at EL2 are not included.

This patchset can be fetched from [1] and the relevant QEMU version for
test can be fetched from [2].

The results of 'perf test' can be found from [3][4].
The results of perf_event_tests test suite can be found from [5][6].

Thanks,
Shannon

[1] https://git.linaro.org/people/shannon.zhao/linux-mainline.git  
KVM_ARM64_PMU_v4
[2] https://git.linaro.org/people/shannon.zhao/qemu.git  virtual_PMU
[3] http://people.linaro.org/~shannon.zhao/PMU/perf-test-host.txt
[4] http://people.linaro.org/~shannon.zhao/PMU/perf-test-guest.txt
[5] http://people.linaro.org/~shannon.zhao/PMU/perf_event_tests-host.txt
[6] http://people.linaro.org/~shannon.zhao/PMU/perf_event_tests-guest.txt

Changes since v3:
* Rebase on new linux kernel mainline 
* Use ARMV8_MAX_COUNTERS instead of 32
* Reset PMCR.E to zero.
* Trigger overflow for software increment.
* Optimize PMU interrupt inject logic.
* Add handler for E,C,P bits of PMCR
* Fix the overflow bug found by perf_event_tests
* Run 'perf test', 'perf top' and perf_event_tests test suite
* Add exclude_hv = 1 configuration to not count in EL2

Changes since v2:
* Directly use perf raw event type to create perf_event in KVM
* Add a helper vcpu_sysreg_write
* remove unrelated header file

Changes since v1:
* Use switch...case for registers access handler instead of adding
  alone handler for each register
* Try to use the sys_regs to store the register value instead of adding
  new variables in struct kvm_pmc
* Fix the handle of cp15 regs
* Create a new kvm device vPMU, then userspace could choose whether to
  create PMU
* Fix the handle of PMU overflow interrupt

Shannon Zhao (21):
  ARM64: Move PMU register related defines to asm/pmu.h
  KVM: ARM64: Define PMU data structure for each vcpu
  KVM: ARM64: Add offset defines for PMU registers
  KVM: ARM64: Add reset and access handlers for PMCR_EL0 register
  KVM: ARM64: Add reset and access handlers for PMSELR register
  KVM: ARM64: A

[PATCH v4 14/21] KVM: ARM64: Add reset and access handlers for PMUSERENR register

From: Shannon Zhao 

The reset value of PMUSERENR_EL0 is UNKNOWN, use reset_unknown. While
the reset value of PMUSERENR is zero, use reset_val_cp15 with zero for
its reset handler.

Add a helper for CP15 registers reset to specified value.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/kvm/sys_regs.c | 5 +++--
 arch/arm64/kvm/sys_regs.h | 8 
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index e03d3b8d..c44c8e1 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -829,7 +829,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  access_pmu_regs, reset_unknown, PMXEVCNTR_EL0 },
/* PMUSERENR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b000),
- trap_raz_wi },
+ access_pmu_regs, reset_unknown, PMUSERENR_EL0 },
/* PMOVSSET_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b011),
  access_pmu_regs, reset_unknown, PMOVSSET_EL0 },
@@ -1206,7 +1206,8 @@ static const struct sys_reg_desc cp15_regs[] = {
  reset_unknown_cp15, c9_PMXEVTYPER },
{ Op1( 0), CRn( 9), CRm(13), Op2( 2), access_pmu_cp15_regs,
  reset_unknown_cp15, c9_PMXEVCNTR },
-   { Op1( 0), CRn( 9), CRm(14), Op2( 0), trap_raz_wi },
+   { Op1( 0), CRn( 9), CRm(14), Op2( 0), access_pmu_cp15_regs,
+ reset_val_cp15,  c9_PMUSERENR, 0 },
{ Op1( 0), CRn( 9), CRm(14), Op2( 1), access_pmu_cp15_regs,
  reset_unknown_cp15, c9_PMINTENSET },
{ Op1( 0), CRn( 9), CRm(14), Op2( 2), access_pmu_cp15_regs,
diff --git a/arch/arm64/kvm/sys_regs.h b/arch/arm64/kvm/sys_regs.h
index 8afeff7..aba997d 100644
--- a/arch/arm64/kvm/sys_regs.h
+++ b/arch/arm64/kvm/sys_regs.h
@@ -125,6 +125,14 @@ static inline void reset_val(struct kvm_vcpu *vcpu, const 
struct sys_reg_desc *r
vcpu_sys_reg(vcpu, r->reg) = r->val;
 }
 
+static inline void reset_val_cp15(struct kvm_vcpu *vcpu,
+ const struct sys_reg_desc *r)
+{
+   BUG_ON(!r->reg);
+   BUG_ON(r->reg >= NR_SYS_REGS);
+   vcpu_cp15(vcpu, r->reg) = r->val;
+}
+
 static inline int cmp_sys_reg(const struct sys_reg_desc *i1,
  const struct sys_reg_desc *i2)
 {
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 02/21] KVM: ARM64: Define PMU data structure for each vcpu

From: Shannon Zhao 

Here we plan to support virtual PMU for guest by full software
emulation, so define some basic structs and functions preparing for
futher steps. Define struct kvm_pmc for performance monitor counter and
struct kvm_pmu for performance monitor unit for each vcpu. According to
ARMv8 spec, the PMU contains at most 32(ARMV8_MAX_COUNTERS) counters.

Since this only supports ARM64 (or PMUv3), add a separate config symbol
for it.

Signed-off-by: Shannon Zhao 
---
 arch/arm64/include/asm/kvm_host.h |  2 ++
 arch/arm64/kvm/Kconfig|  8 
 include/kvm/arm_pmu.h | 41 +++
 3 files changed, 51 insertions(+)
 create mode 100644 include/kvm/arm_pmu.h

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index ed03968..cc843ca 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -37,6 +37,7 @@
 
 #include 
 #include 
+#include 
 
 #define KVM_MAX_VCPUS VGIC_V3_MAX_CPUS
 
@@ -132,6 +133,7 @@ struct kvm_vcpu_arch {
/* VGIC state */
struct vgic_cpu vgic_cpu;
struct arch_timer_cpu timer_cpu;
+   struct kvm_pmu pmu;
 
/*
 * Anything that is not used directly from assembly code goes
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 5c7e920..8f321b1 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -31,6 +31,7 @@ config KVM
select KVM_VFIO
select HAVE_KVM_EVENTFD
select HAVE_KVM_IRQFD
+   select KVM_ARM_PMU
---help---
  Support hosting virtualized guest machines.
 
@@ -41,4 +42,11 @@ config KVM_ARM_HOST
---help---
  Provides host support for ARM processors.
 
+config KVM_ARM_PMU
+   bool
+   depends on KVM_ARM_HOST
+   ---help---
+ Adds support for a virtual Performance Monitoring Unit (PMU) in
+ virtual machines.
+
 endif # VIRTUALIZATION
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
new file mode 100644
index 000..254d2b4
--- /dev/null
+++ b/include/kvm/arm_pmu.h
@@ -0,0 +1,41 @@
+/*
+ * Copyright (C) 2015 Linaro Ltd.
+ * Author: Shannon Zhao 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see .
+ */
+
+#ifndef __ASM_ARM_KVM_PMU_H
+#define __ASM_ARM_KVM_PMU_H
+
+#include 
+#include 
+
+struct kvm_pmc {
+   u8 idx;/* index into the pmu->pmc array */
+   struct perf_event *perf_event;
+   struct kvm_vcpu *vcpu;
+   u64 bitmask;
+};
+
+struct kvm_pmu {
+#ifdef CONFIG_KVM_ARM_PMU
+   /* PMU IRQ Number per VCPU */
+   int irq_num;
+   /* IRQ pending flag */
+   bool irq_pending;
+   struct kvm_pmc pmc[ARMV8_MAX_COUNTERS];
+#endif
+};
+
+#endif
-- 
2.0.4


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 01/33] acpi: add aml_derefof

Implement DeRefOf term which is used by NVDIMM _DSM method in later patch

Reviewed-by: Igor Mammedov 
Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 8 
 include/hw/acpi/aml-build.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 0d4b324..cbd53f4 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1135,6 +1135,14 @@ Aml *aml_unicode(const char *str)
 return var;
 }
 
+/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefDerefOf */
+Aml *aml_derefof(Aml *arg)
+{
+Aml *var = aml_opcode(0x83 /* DerefOfOp */);
+aml_append(var, arg);
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 1b632dc..5a03d33 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -274,6 +274,7 @@ Aml *aml_create_dword_field(Aml *srcbuf, Aml *index, const 
char *name);
 Aml *aml_varpackage(uint32_t num_elements);
 Aml *aml_touuid(const char *uuid);
 Aml *aml_unicode(const char *str);
+Aml *aml_derefof(Aml *arg);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 25/33] nvdimm acpi: build ACPI nvdimm devices

NVDIMM devices is defined in ACPI 6.0 9.20 NVDIMM Devices

There is a root device under \_SB and specified NVDIMM devices are under the
root device. Each NVDIMM device has _ADR which returns its handle used to
associate MEMDEV structure in NFIT

We reserve handle 0 for root device. In this patch, we save handle, handle,
arg1 and arg2 to dsm memory. Arg3 is conditionally saved in later patch

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 184 +++
 1 file changed, 184 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index dd84e5f..53ed675 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -368,6 +368,15 @@ static void nvdimm_build_nfit(GSList *device_list, GArray 
*table_offsets,
 g_array_free(structures, true);
 }
 
+struct NvdimmDsmIn {
+uint32_t handle;
+uint32_t revision;
+uint32_t function;
+   /* the remaining size in the page is used by arg3. */
+uint8_t arg3[0];
+} QEMU_PACKED;
+typedef struct NvdimmDsmIn NvdimmDsmIn;
+
 static uint64_t
 nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
 {
@@ -377,6 +386,7 @@ nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
 static void
 nvdimm_dsm_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
 {
+fprintf(stderr, "BUG: we never write DSM notification IO Port.\n");
 }
 
 static const MemoryRegionOps nvdimm_dsm_ops = {
@@ -402,6 +412,179 @@ void nvdimm_init_acpi_state(MemoryRegion *memory, 
MemoryRegion *io,
 memory_region_add_subregion(io, NVDIMM_ACPI_IO_BASE, &state->io_mr);
 }
 
+#define BUILD_STA_METHOD(_dev_, _method_)  \
+do {   \
+_method_ = aml_method("_STA", 0);  \
+aml_append(_method_, aml_return(aml_int(0x0f)));   \
+aml_append(_dev_, _method_);   \
+} while (0)
+
+#define BUILD_DSM_METHOD(_dev_, _method_, _handle_, _uuid_)\
+do {   \
+Aml *ifctx, *uuid; \
+_method_ = aml_method("_DSM", 4);  \
+/* check UUID if it is we expect, return the errorcode if not.*/   \
+uuid = aml_touuid(_uuid_); \
+ifctx = aml_if(aml_lnot(aml_equal(aml_arg(0), uuid))); \
+aml_append(ifctx, aml_return(aml_int(1 /* Not Supported */))); \
+aml_append(method, ifctx); \
+aml_append(method, aml_return(aml_call4("NCAL", aml_int(_handle_), \
+   aml_arg(1), aml_arg(2), aml_arg(3;  \
+aml_append(_dev_, _method_);   \
+} while (0)
+
+#define BUILD_FIELD_UNIT_SIZE(_field_, _byte_, _name_) \
+aml_append(_field_, aml_named_field(_name_, (_byte_) * BITS_PER_BYTE))
+
+#define BUILD_FIELD_UNIT_STRUCT(_field_, _s_, _f_, _name_) \
+BUILD_FIELD_UNIT_SIZE(_field_, sizeof(typeof_field(_s_, _f_)), _name_)
+
+static void build_nvdimm_devices(GSList *device_list, Aml *root_dev)
+{
+for (; device_list; device_list = device_list->next) {
+NVDIMMDevice *nvdimm = device_list->data;
+int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
+   NULL);
+uint32_t handle = nvdimm_slot_to_handle(slot);
+Aml *dev, *method;
+
+dev = aml_device("NV%02X", slot);
+aml_append(dev, aml_name_decl("_ADR", aml_int(handle)));
+
+BUILD_STA_METHOD(dev, method);
+
+/*
+ * Chapter 4: _DSM Interface for NVDIMM Device (non-root) - Example
+ * in DSM Spec Rev1.
+ */
+BUILD_DSM_METHOD(dev, method,
+ handle /* NVDIMM Device Handle */,
+ "4309AC30-0D11-11E4-9191-0800200C9A66"
+ /* UUID for NVDIMM Devices. */);
+
+aml_append(root_dev, dev);
+}
+}
+
+static void nvdimm_build_acpi_devices(GSList *device_list, Aml *sb_scope)
+{
+Aml *dev, *method, *field;
+uint64_t page_size = TARGET_PAGE_SIZE;
+
+dev = aml_device("NVDR");
+aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0012")));
+
+/* map DSM memory and IO into ACPI namespace. */
+aml_append(dev, aml_operation_region("NPIO", AML_SYSTEM_IO,
+   NVDIMM_ACPI_IO_BASE, NVDIMM_ACPI_IO_LEN));
+aml_append(dev, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
+   NVDIMM_ACPI_MEM_BASE, page_size));
+
+/*
+ * DSM notifier:
+ * @NOTI: Read it will notify QEMU that _DSM method is being
+ *called and the parameters can be found in NvdimmDsmIn.
+ *The value read from it is the buffer size of

[PATCH v6 30/33] nvdimm acpi: support Set Namespace Label Data function

Function 6 is used to set Namespace Label Data

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 8c27b25..5c8be41 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -587,6 +587,34 @@ exit:
 nvdimm_dsm_write_status(out, status);
 }
 
+/*
+ * DSM Spec Rev1 4.6 Set Namespace Label Data (Function Index 6).
+ */
+static void nvdimm_dsm_func_set_label_data(NVDIMMDevice *nvdimm,
+   NvdimmDsmIn *in, GArray *out)
+{
+NVDIMMClass *nvc = NVDIMM_GET_CLASS(nvdimm);
+NvdimmFuncInSetLabelData *set_label_data = &in->func_set_label_data;
+uint32_t status;
+
+le32_to_cpus(&set_label_data->offset);
+le32_to_cpus(&set_label_data->length);
+
+nvdimm_debug("Write Label Data: offset %#x length %#x.\n",
+ set_label_data->offset, set_label_data->length);
+
+status = nvdimm_rw_label_data_check(nvdimm, set_label_data->offset,
+set_label_data->length);
+if (status != NVDIMM_DSM_STATUS_SUCCESS) {
+goto exit;
+}
+
+nvc->write_label_data(nvdimm, set_label_data->in_buf,
+  set_label_data->length, set_label_data->offset);
+exit:
+nvdimm_dsm_write_status(out, status);
+}
+
 static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
 {
 GSList *list = nvdimm_get_plugged_device_list();
@@ -617,6 +645,9 @@ static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
 case 0x5 /* Get Namespace Label Data */:
 nvdimm_dsm_func_get_label_data(nvdimm, in, out);
 goto free;
+case 0x6 /* Set Namespace Label Data */:
+nvdimm_dsm_func_set_label_data(nvdimm, in, out);
+goto free;
 default:
 status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
 };
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 23/33] nvdimm acpi: init the resource used by NVDIMM ACPI

A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest are
reserved for NVDIMM ACPI emulation, refer to docs/specs/acpi_nvdimm.txt
for detailed design

A parameter, 'nvdimm-support', is introduced for PIIX4_PM and ICH9-LPC
that controls if nvdimm support is enabled, it is true on default and
it is false on 2.4 and its earlier version to keep compatibility

Signed-off-by: Xiao Guangrong 
---
 default-configs/i386-softmmu.mak |  1 +
 default-configs/mips-softmmu.mak |  1 +
 default-configs/mips64-softmmu.mak   |  1 +
 default-configs/mips64el-softmmu.mak |  1 +
 default-configs/mipsel-softmmu.mak   |  1 +
 default-configs/x86_64-softmmu.mak   |  1 +
 hw/acpi/Makefile.objs|  1 +
 hw/acpi/ich9.c   | 24 ++
 hw/acpi/nvdimm.c | 63 
 hw/acpi/piix4.c  | 27 
 include/hw/acpi/ich9.h   |  3 ++
 include/hw/i386/pc.h | 10 ++
 include/hw/mem/nvdimm.h  | 34 +++
 13 files changed, 161 insertions(+), 7 deletions(-)
 create mode 100644 hw/acpi/nvdimm.c

diff --git a/default-configs/i386-softmmu.mak b/default-configs/i386-softmmu.mak
index 4e84a1c..51e71d4 100644
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -48,6 +48,7 @@ CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
 CONFIG_NVDIMM=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/default-configs/mips-softmmu.mak b/default-configs/mips-softmmu.mak
index 44467c3..6b8b70e 100644
--- a/default-configs/mips-softmmu.mak
+++ b/default-configs/mips-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
 CONFIG_I8257=y
diff --git a/default-configs/mips64-softmmu.mak 
b/default-configs/mips64-softmmu.mak
index 66ed5f9..ea820f6 100644
--- a/default-configs/mips64-softmmu.mak
+++ b/default-configs/mips64-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
 CONFIG_I8257=y
diff --git a/default-configs/mips64el-softmmu.mak 
b/default-configs/mips64el-softmmu.mak
index bfca2b2..8993851 100644
--- a/default-configs/mips64el-softmmu.mak
+++ b/default-configs/mips64el-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
 CONFIG_I8257=y
diff --git a/default-configs/mipsel-softmmu.mak 
b/default-configs/mipsel-softmmu.mak
index 0162ef0..87ab964 100644
--- a/default-configs/mipsel-softmmu.mak
+++ b/default-configs/mipsel-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
 CONFIG_I8257=y
diff --git a/default-configs/x86_64-softmmu.mak 
b/default-configs/x86_64-softmmu.mak
index e877a86..0a7dc10 100644
--- a/default-configs/x86_64-softmmu.mak
+++ b/default-configs/x86_64-softmmu.mak
@@ -48,6 +48,7 @@ CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
 CONFIG_NVDIMM=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/hw/acpi/Makefile.objs b/hw/acpi/Makefile.objs
index 7d3230c..84c082d 100644
--- a/hw/acpi/Makefile.objs
+++ b/hw/acpi/Makefile.objs
@@ -2,6 +2,7 @@ common-obj-$(CONFIG_ACPI_X86) += core.o piix4.o pcihp.o
 common-obj-$(CONFIG_ACPI_X86_ICH) += ich9.o tco.o
 common-obj-$(CONFIG_ACPI_CPU_HOTPLUG) += cpu_hotplug.o
 common-obj-$(CONFIG_ACPI_MEMORY_HOTPLUG) += memory_hotplug.o
+obj-$(CONFIG_ACPI_NVDIMM) += nvdimm.o
 common-obj-$(CONFIG_ACPI) += acpi_interface.o
 common-obj-$(CONFIG_ACPI) += bios-linker-loader.o
 common-obj-$(CONFIG_ACPI) += aml-build.o
diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c
index 1e9ae20..603c1bd 100644
--- a/hw/acpi/ich9.c
+++ b/hw/acpi/ich9.c
@@ -280,6 +280,12 @@ void ich9_pm_init(PCIDevice *lpc_pci, ICH9LPCPMRegs *pm,
 acpi_memory_hotplug_init(pci_address_space_io(lpc_pci), 
OBJECT(lpc_pci),
  &pm->acpi_memory_hotplug);
 }
+
+if (pm->acpi_nvdimm_state.is_enabled) {
+nvdimm_init_acpi_state(pci_address_space(lpc_pci),
+   pci_address_space_io(lpc_pci), OBJECT(lpc_pci),
+   &pm->acpi_nvdimm_state);
+}
 }
 
 static void ich9_pm_get_gpe0_blk(Object *obj, Visitor *v,
@@ -307,6 +313,20 @@ static void ich9_pm_set_memory_hotplug_support(Object 
*obj, bool value,
 s->pm.acpi_memory_hotplug.is_enabled = value;
 }
 
+static bool ich9_pm_get_nvdimm_support(Object *obj, Error **errp)
+{
+ICH9LPCState *s = ICH9_LPC_DEVICE(obj);
+
+return s->pm.acpi_nvdimm_state.is_enabled;
+}
+
+static void ich9_pm_set_nvdimm_support(Object *obj, bool value, Error **

[PATCH v6 28/33] nvdimm acpi: support Get Namespace Label Size function

Function 4 is used to get Namespace label size

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 87 +++-
 1 file changed, 86 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 300a3aa..67c4699 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -407,15 +407,48 @@ enum {
 NVDIMM_DSM_DEV_STATUS_VENDOR_SPECIFIC_ERROR = 4,
 };
 
+struct NvdimmFuncInGetLabelData {
+uint32_t offset; /* the offset in the namespace label data area. */
+uint32_t length; /* the size of data is to be read via the function. */
+} QEMU_PACKED;
+typedef struct NvdimmFuncInGetLabelData NvdimmFuncInGetLabelData;
+
+struct NvdimmFuncInSetLabelData {
+uint32_t offset; /* the offset in the namespace label data area. */
+uint32_t length; /* the size of data is to be written via the function. */
+uint8_t in_buf[0]; /* the data written to label data area. */
+} QEMU_PACKED;
+typedef struct NvdimmFuncInSetLabelData NvdimmFuncInSetLabelData;
+
 struct NvdimmDsmIn {
 uint32_t handle;
 uint32_t revision;
 uint32_t function;
/* the remaining size in the page is used by arg3. */
-uint8_t arg3[0];
+union {
+uint8_t arg3[0];
+NvdimmFuncInSetLabelData func_set_label_data;
+};
 } QEMU_PACKED;
 typedef struct NvdimmDsmIn NvdimmDsmIn;
 
+struct NvdimmFuncOutLabelSize {
+uint32_t status; /* return status code. */
+uint32_t label_size; /* the size of label data area. */
+/*
+ * Maximum size of the namespace label data length supported by
+ * the platform in Get/Set Namespace Label Data functions.
+ */
+uint32_t max_xfer;
+} QEMU_PACKED;
+typedef struct NvdimmFuncOutLabelSize NvdimmFuncOutLabelSize;
+
+struct NvdimmFuncOutGetLabelData {
+uint32_t status;/*return status code. */
+uint8_t out_buf[0]; /* the data got via Get Namesapce Label function. */
+} QEMU_PACKED;
+typedef struct NvdimmFuncOutGetLabelData NvdimmFuncOutGetLabelData;
+
 static void nvdimm_dsm_write_status(GArray *out, uint32_t status)
 {
 status = cpu_to_le32(status);
@@ -445,6 +478,55 @@ static void nvdimm_dsm_root(NvdimmDsmIn *in, GArray *out)
 nvdimm_dsm_write_status(out, status);
 }
 
+/*
+ * the max transfer size is the max size transferred by both a
+ * 'Get Namespace Label Data' function and a 'Set Namespace Label Data'
+ * function.
+ */
+static uint32_t nvdimm_get_max_xfer_label_size(void)
+{
+NvdimmDsmIn *in;
+uint32_t max_get_size, max_set_size, dsm_memory_size = TARGET_PAGE_SIZE;
+
+/*
+ * the max data ACPI can read one time which is transferred by
+ * the response of 'Get Namespace Label Data' function.
+ */
+max_get_size = dsm_memory_size - sizeof(NvdimmFuncOutGetLabelData);
+
+/*
+ * the max data ACPI can write one time which is transferred by
+ * 'Set Namespace Label Data' function.
+ */
+max_set_size = dsm_memory_size - offsetof(NvdimmDsmIn, arg3) -
+   sizeof(in->func_set_label_data);
+
+return MIN(max_get_size, max_set_size);
+}
+
+/*
+ * DSM Spec Rev1 4.4 Get Namespace Label Size (Function Index 4).
+ *
+ * It gets the size of Namespace Label data area and the max data size
+ * that Get/Set Namespace Label Data functions can transfer.
+ */
+static void nvdimm_dsm_func_label_size(NVDIMMDevice *nvdimm, GArray *out)
+{
+NvdimmFuncOutLabelSize func_label_size;
+uint32_t label_size, mxfer;
+
+label_size = nvdimm->label_size;
+mxfer = nvdimm_get_max_xfer_label_size();
+
+nvdimm_debug("label_size %#x, max_xfer %#x.\n", label_size, mxfer);
+
+func_label_size.status = cpu_to_le32(NVDIMM_DSM_STATUS_SUCCESS);
+func_label_size.label_size = cpu_to_le32(label_size);
+func_label_size.max_xfer = cpu_to_le32(mxfer);
+
+g_array_append_vals(out, &func_label_size, sizeof(func_label_size));
+}
+
 static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
 {
 GSList *list = nvdimm_get_plugged_device_list();
@@ -469,6 +551,9 @@ static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
1 << 6 /* Set Namespace Label Data */);
 build_append_int_noprefix(out, cmd_list, sizeof(cmd_list));
 goto free;
+case 0x4 /* Get Namespace Label Size */:
+nvdimm_dsm_func_label_size(nvdimm, out);
+goto free;
 default:
 status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
 };
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 00/33] implement vNVDIMM

This patchset can be found at:
  https://github.com/xiaogr/qemu.git nvdimm-v6

It is based on pci branch on Michael's tree and the top commit is:
commit 6f96a31a06c2a1 (tests: re-enable vhost-user-test).

Changelog in v6:
- changes from Stefan's comments:
  1) fix code style of struct naming by CamelCase way
  2) fix offset + length overflow when read/write label data
  3) compile hw/acpi/nvdimm.c for per target so that TARGET_PAGE_SIZE can
 be used to replace getpagesize()

Changelog in v5:
- changes from Michael's comments:
  1) prefix nvdimm_ to everything in NVDIMM source files
  2) make parsing _DSM Arg3 more clear
  3) comment style fix
  5) drop single used definition
  6) fix dirty dsm buffer lost due to memory write happened on host
  7) check dsm buffer if it is big enough to contain input data
  8) use build_append_int_noprefix to store single value to GArray

- changes from Michael's and Igor's comments:
  1) introduce 'nvdimm-support' parameter to control nvdimm
 enablement and it is disabled for 2.4 and its earlier versions
 to make live migration compatible
  2) only reserve 1 RAM page and 4 bytes IO Port for NVDIMM ACPI
 virtualization

- changes from Stefan's comments:
  1) do endian adjustment for the buffer length

- changes from Bharata B Rao's comments:
  1) fix compile on ppc

- others:
  1) the buffer length is directly got from IO read rather than got
 from dsm memory
  2) fix dirty label data lost due to memory write happened on host

Changelog in v4:
- changes from Michael's comments:
  1) show the message, "Memory is not allocated from HugeTlbfs", if file
 based memory is not allocated from hugetlbfs.
  2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState
 from Machine.
  3) statically define UUID and make its operation more clear
  4) use GArray to build device structures to avoid potential buffer
 overflow
  4) improve comments in the code
  5) improve code style

- changes from Igor's comments:
  1) add NVDIMM ACPI spec document
  2) use serialized method to avoid Mutex
  3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c
  4) introduce a common ASL method used by _DSM for all devices to reduce
 ACPI size
  5) handle UUID in ACPI AML code. BTW, i'd keep handling revision in QEMU
 it's better to upgrade QEMU to support Rev2 in the future

- changes from Stefan's comments:
  1) copy input data from DSM memory to local buffer to avoid potential
 issues as DSM memory is visible to guest. Output data is handled
 in a similar way

- changes from Dan's comments:
  1) drop static namespace as Linux has already supported label-less
 nvdimm devices

- changes from Vladimir's comments:
  1) print better message, "failed to get file size for %s, can't create
 backend on it", if any file operation filed to obtain file size

- others:
  create a git repo on github.com for better review/test

Also, thanks for Eric Blake's review on QAPI's side.

Thank all of you to review this patchset.

Changelog in v3:
There is huge change in this version, thank Igor, Stefan, Paolo, Eduardo,
Michael for their valuable comments, the patchset finally gets better shape.
- changes from Igor's comments:
  1) abstract dimm device type from pc-dimm and create nvdimm device based on
 dimm, then it uses memory backend device as nvdimm's memory and NUMA has
 easily been implemented.
  2) let file-backend device support any kind of filesystem not only for
 hugetlbfs and let it work on file not only for directory which is
 achieved by extending 'mem-path' - if it's a directory then it works as
 current behavior, otherwise if it's file then directly allocates memory
 from it.
  3) we figure out a unused memory hole below 4G that is 0xFF0 ~ 
 0xFFF0, this range is large enough for NVDIMM ACPI as build 64-bit
 ACPI SSDT/DSDT table will break windows XP.
 BTW, only make SSDT.rev = 2 can not work since the width is only depended
 on DSDT.rev based on 19.6.28 DefinitionBlock (Declare Definition Block)
 in ACPI spec:
| Note: For compatibility with ACPI versions before ACPI 2.0, the bit 
| width of Integer objects is dependent on the ComplianceRevision of the DSDT.
| If the ComplianceRevision is less than 2, all integers are restricted to 32 
| bits. Otherwise, full 64-bit integers are used. The version of the DSDT sets 
| the global integer width for all integers, including integers in SSDTs.
  4) use the lowest ACPI spec version to document AML terms.
  5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm"

- changes from Stefan's comments:
  1) do not do endian adjustment in-place since _DSM memory is visible to guest
  2) use target platform's target page size instead of fixed PAGE_SIZE
 definition
  3) lots of code style improvement and typo fixes.
  4) live migration fix
- changes from Paolo's comments:
  1) improve the name of memory region
  
- other changes:
  1) return exact buffer siz

[PATCH v6 02/33] acpi: add aml_sizeof

Implement SizeOf term which is used by NVDIMM _DSM method in later patch

Reviewed-by: Igor Mammedov 
Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 8 
 include/hw/acpi/aml-build.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index cbd53f4..a72214d 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1143,6 +1143,14 @@ Aml *aml_derefof(Aml *arg)
 return var;
 }
 
+/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefSizeOf */
+Aml *aml_sizeof(Aml *arg)
+{
+Aml *var = aml_opcode(0x87 /* SizeOfOp */);
+aml_append(var, arg);
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 5a03d33..7296efb 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -275,6 +275,7 @@ Aml *aml_varpackage(uint32_t num_elements);
 Aml *aml_touuid(const char *uuid);
 Aml *aml_unicode(const char *str);
 Aml *aml_derefof(Aml *arg);
+Aml *aml_sizeof(Aml *arg);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 27/33] nvdimm acpi: support function 0

__DSM is defined in ACPI 6.0: 9.14.1 _DSM (Device Specific Method)

Function 0 is a query function. We do not support any function on root
device and only 3 functions are support for NVDIMM device, Get Namespace
Label Size, Get Namespace Label Data and Set Namespace Label Data, that
means we currently only allow to access device's Label Namespace

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c |   2 +-
 hw/acpi/nvdimm.c| 156 +++-
 include/hw/acpi/aml-build.h |   1 +
 3 files changed, 157 insertions(+), 2 deletions(-)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 8bee8b2..90229c5 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -231,7 +231,7 @@ static void build_extop_package(GArray *package, uint8_t op)
 build_prepend_byte(package, 0x5B); /* ExtOpPrefix */
 }
 
-static void build_append_int_noprefix(GArray *table, uint64_t value, int size)
+void build_append_int_noprefix(GArray *table, uint64_t value, int size)
 {
 int i;
 
diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index e179a72..300a3aa 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -212,6 +212,22 @@ static uint32_t nvdimm_slot_to_dcr_index(int slot)
 return nvdimm_slot_to_spa_index(slot) + 1;
 }
 
+static NVDIMMDevice
+*nvdimm_get_device_by_handle(GSList *list, uint32_t handle)
+{
+for (; list; list = list->next) {
+NVDIMMDevice *nvdimm = list->data;
+int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
+   NULL);
+
+if (nvdimm_slot_to_handle(slot) == handle) {
+return nvdimm;
+}
+}
+
+return NULL;
+}
+
 /* ACPI 6.0: 5.2.25.1 System Physical Address Range Structure */
 static void
 nvdimm_build_structure_spa(GArray *structures, NVDIMMDevice *nvdimm)
@@ -368,6 +384,29 @@ static void nvdimm_build_nfit(GSList *device_list, GArray 
*table_offsets,
 g_array_free(structures, true);
 }
 
+/* define NVDIMM DSM return status codes according to DSM Spec Rev1. */
+enum {
+/* Common return status codes. */
+/* Success */
+NVDIMM_DSM_STATUS_SUCCESS = 0,
+/* Not Supported */
+NVDIMM_DSM_STATUS_NOT_SUPPORTED = 1,
+
+/* NVDIMM Root Device _DSM function return status codes*/
+/* Invalid Input Parameters */
+NVDIMM_DSM_ROOT_DEV_STATUS_INVALID_PARAS = 2,
+/* Function-Specific Error */
+NVDIMM_DSM_ROOT_DEV_STATUS_FUNCTION_SPECIFIC_ERROR = 3,
+
+/* NVDIMM Device (non-root) _DSM function return status codes*/
+/* Non-Existing Memory Device */
+NVDIMM_DSM_DEV_STATUS_NON_EXISTING_MEM_DEV = 2,
+/* Invalid Input Parameters */
+NVDIMM_DSM_DEV_STATUS_INVALID_PARAS = 3,
+/* Vendor Specific Error */
+NVDIMM_DSM_DEV_STATUS_VENDOR_SPECIFIC_ERROR = 4,
+};
+
 struct NvdimmDsmIn {
 uint32_t handle;
 uint32_t revision;
@@ -377,10 +416,125 @@ struct NvdimmDsmIn {
 } QEMU_PACKED;
 typedef struct NvdimmDsmIn NvdimmDsmIn;
 
+static void nvdimm_dsm_write_status(GArray *out, uint32_t status)
+{
+status = cpu_to_le32(status);
+build_append_int_noprefix(out, status, sizeof(status));
+}
+
+static void nvdimm_dsm_root(NvdimmDsmIn *in, GArray *out)
+{
+uint32_t status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
+
+/*
+ * Query command implemented per ACPI Specification, it is defined in
+ * ACPI 6.0: 9.14.1 _DSM (Device Specific Method).
+ */
+if (in->function == 0x0) {
+/*
+ * Set it to zero to indicate no function is supported for NVDIMM
+ * root.
+ */
+uint64_t cmd_list = cpu_to_le64(0);
+
+build_append_int_noprefix(out, cmd_list, sizeof(cmd_list));
+return;
+}
+
+nvdimm_debug("Return status %#x.\n", status);
+nvdimm_dsm_write_status(out, status);
+}
+
+static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
+{
+GSList *list = nvdimm_get_plugged_device_list();
+NVDIMMDevice *nvdimm = nvdimm_get_device_by_handle(list, in->handle);
+uint32_t status = NVDIMM_DSM_DEV_STATUS_NON_EXISTING_MEM_DEV;
+uint64_t cmd_list;
+
+if (!nvdimm) {
+goto set_status_free;
+}
+
+/* Encode DSM function according to DSM Spec Rev1. */
+switch (in->function) {
+/* see comments in nvdimm_dsm_root(). */
+case 0x0:
+cmd_list = cpu_to_le64(0x1 /* Bit 0 indicates whether there is
+  support for any functions other
+  than function 0.
+*/   |
+   1 << 4 /* Get Namespace Label Size */ |
+   1 << 5 /* Get Namespace Label Data */ |
+   1 << 6 /* Set Namespace Label Data */);
+build_append_int_noprefix(out, cmd_list, sizeof(cmd_list));
+goto free;
+default:
+status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
+

[PATCH v6 11/33] hostmem-file: use whole file size if possible

Use the whole file size if @size is not specified which is useful
if we want to directly pass a file to guest

Signed-off-by: Xiao Guangrong 
---
 backends/hostmem-file.c | 48 
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 9097a57..e1bc9ff 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -9,6 +9,9 @@
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
  */
+#include 
+#include 
+
 #include "qemu-common.h"
 #include "sysemu/hostmem.h"
 #include "sysemu/sysemu.h"
@@ -33,20 +36,57 @@ struct HostMemoryBackendFile {
 char *mem_path;
 };
 
+static uint64_t get_file_size(const char *file)
+{
+struct stat stat_buf;
+uint64_t size = 0;
+int fd;
+
+fd = open(file, O_RDONLY);
+if (fd < 0) {
+return 0;
+}
+
+if (stat(file, &stat_buf) < 0) {
+goto exit;
+}
+
+if ((S_ISBLK(stat_buf.st_mode)) && !ioctl(fd, BLKGETSIZE64, &size)) {
+goto exit;
+}
+
+size = lseek(fd, 0, SEEK_END);
+if (size == -1) {
+size = 0;
+}
+exit:
+close(fd);
+return size;
+}
+
 static void
 file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
 HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
 
-if (!backend->size) {
-error_setg(errp, "can't create backend with size 0");
-return;
-}
 if (!fb->mem_path) {
 error_setg(errp, "mem-path property not set");
 return;
 }
 
+if (!backend->size) {
+/*
+ * use the whole file size if @size is not specified.
+ */
+backend->size = get_file_size(fb->mem_path);
+}
+
+if (!backend->size) {
+error_setg(errp, "failed to get file size for %s, can't create "
+ "backend on it", mem_path);
+return;
+}
+
 backend->force_prealloc = mem_prealloc;
 memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
  object_get_canonical_path(OBJECT(backend)),
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 08/33] exec: allow memory to be allocated from any kind of path

Currently file_ram_alloc() is designed for hugetlbfs, however, the memory
of nvdimm can come from either raw pmem device eg, /dev/pmem, or the file
locates at DAX enabled filesystem

So this patch let it work on any kind of path

Signed-off-by: Xiao Guangrong 
---
 exec.c | 56 +---
 1 file changed, 17 insertions(+), 39 deletions(-)

diff --git a/exec.c b/exec.c
index 8af2570..3ca7e50 100644
--- a/exec.c
+++ b/exec.c
@@ -1174,32 +1174,6 @@ void qemu_mutex_unlock_ramlist(void)
 }
 
 #ifdef __linux__
-
-#include 
-
-#define HUGETLBFS_MAGIC   0x958458f6
-
-static long gethugepagesize(const char *path, Error **errp)
-{
-struct statfs fs;
-int ret;
-
-do {
-ret = statfs(path, &fs);
-} while (ret != 0 && errno == EINTR);
-
-if (ret != 0) {
-error_setg_errno(errp, errno, "failed to get page size of file %s",
- path);
-return 0;
-}
-
-if (fs.f_type != HUGETLBFS_MAGIC)
-fprintf(stderr, "Warning: path not on HugeTLBFS: %s\n", path);
-
-return fs.f_bsize;
-}
-
 static void *file_ram_alloc(RAMBlock *block,
 ram_addr_t memory,
 const char *path,
@@ -1210,20 +1184,24 @@ static void *file_ram_alloc(RAMBlock *block,
 char *c;
 void *area;
 int fd;
-uint64_t hpagesize;
-Error *local_err = NULL;
+uint64_t pagesize;
 
-hpagesize = gethugepagesize(path, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
+pagesize = qemu_file_get_page_size(path);
+if (!pagesize) {
+error_setg(errp, "can't get page size for %s", path);
 goto error;
 }
-block->mr->align = hpagesize;
 
-if (memory < hpagesize) {
+if (pagesize == getpagesize()) {
+fprintf(stderr, "Memory is not allocated from HugeTlbfs.\n");
+}
+
+block->mr->align = pagesize;
+
+if (memory < pagesize) {
 error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
-   "or larger than huge page size 0x%" PRIx64,
-   memory, hpagesize);
+   "or larger than page size 0x%" PRIx64,
+   memory, pagesize);
 goto error;
 }
 
@@ -1247,14 +1225,14 @@ static void *file_ram_alloc(RAMBlock *block,
 fd = mkstemp(filename);
 if (fd < 0) {
 error_setg_errno(errp, errno,
- "unable to create backing store for hugepages");
+ "unable to create backing store for path %s", path);
 g_free(filename);
 goto error;
 }
 unlink(filename);
 g_free(filename);
 
-memory = ROUND_UP(memory, hpagesize);
+memory = ROUND_UP(memory, pagesize);
 
 /*
  * ftruncate is not supported by hugetlbfs in older
@@ -1266,10 +1244,10 @@ static void *file_ram_alloc(RAMBlock *block,
 perror("ftruncate");
 }
 
-area = qemu_ram_mmap(fd, memory, hpagesize, block->flags & RAM_SHARED);
+area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
 if (area == MAP_FAILED) {
 error_setg_errno(errp, errno,
- "unable to map backing store for hugepages");
+ "unable to map backing store for path %s", path);
 close(fd);
 goto error;
 }
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 31/33] nvdimm: allow using whole backend memory as pmem

Introduce a parameter, named "reserve-label-data", if it is
false which indicates that QEMU does not reserve any region
on the backend memory to support label data. It is a
'label-less' NVDIMM device mode that linux will use whole
memory on the device as a single namesapce

This is useful for the users who want to pass whole nvdimm
device and make its data completely be visible to guest

The parameter is false on default

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c| 12 
 hw/mem/nvdimm.c | 43 ---
 include/hw/mem/nvdimm.h |  6 ++
 3 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 5c8be41..f8d7d19 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -531,6 +531,12 @@ static void nvdimm_dsm_func_label_size(NVDIMMDevice 
*nvdimm, GArray *out)
 static uint32_t nvdimm_rw_label_data_check(NVDIMMDevice *nvdimm,
uint32_t offset, uint32_t length)
 {
+if (!nvdimm->reserve_label_data) {
+nvdimm_debug("read/write label request on the device without "
+ "label data reserved.\n");
+return NVDIMM_DSM_STATUS_NOT_SUPPORTED;
+}
+
 if (offset + length < offset) {
 nvdimm_debug("offset %#x + length %#x is overflow.\n", offset,
  length);
@@ -637,6 +643,12 @@ static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
1 << 4 /* Get Namespace Label Size */ |
1 << 5 /* Get Namespace Label Data */ |
1 << 6 /* Set Namespace Label Data */);
+
+/* no function support if the device does not have label data. */
+if (!nvdimm->reserve_label_data) {
+cmd_list = cpu_to_le64(0);
+}
+
 build_append_int_noprefix(out, cmd_list, sizeof(cmd_list));
 goto free;
 case 0x4 /* Get Namespace Label Size */:
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 185aa1a..1d89165 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -36,14 +36,15 @@ static void nvdimm_realize(DIMMDevice *dimm, Error **errp)
 {
 MemoryRegion *mr;
 NVDIMMDevice *nvdimm = NVDIMM(dimm);
-uint64_t size;
+uint64_t reserved_label_size, size;
 
 nvdimm->label_size = MIN_NAMESPACE_LABEL_SIZE;
+reserved_label_size = nvdimm->reserve_label_data ? nvdimm->label_size : 0;
 
 mr = host_memory_backend_get_memory(dimm->hostmem, errp);
 size = memory_region_size(mr);
 
-if (size <= nvdimm->label_size) {
+if (size <= reserved_label_size) {
 char *path = 
object_get_canonical_path_component(OBJECT(dimm->hostmem));
 error_setg(errp, "the size of memdev %s (0x%" PRIx64 ") is too small"
" to contain nvdimm namespace label (0x%" PRIx64 ")", path,
@@ -52,15 +53,19 @@ static void nvdimm_realize(DIMMDevice *dimm, Error **errp)
 }
 
 memory_region_init_alias(&nvdimm->nvdimm_mr, OBJECT(dimm), "nvdimm-memory",
- mr, 0, size - nvdimm->label_size);
-nvdimm->label_data = memory_region_get_ram_ptr(mr) +
- memory_region_size(&nvdimm->nvdimm_mr);
+ mr, 0, size - reserved_label_size);
+
+if (reserved_label_size) {
+nvdimm->label_data = memory_region_get_ram_ptr(mr) +
+ memory_region_size(&nvdimm->nvdimm_mr);
+}
 }
 
 static void nvdimm_read_label_data(NVDIMMDevice *nvdimm, void *buf,
uint64_t size, uint64_t offset)
 {
-assert((nvdimm->label_size >= size + offset) && (offset + size > offset));
+assert(nvdimm->reserve_label_data &&
+   (nvdimm->label_size >= size + offset) && (offset + size > offset));
 
 memcpy(buf, nvdimm->label_data + offset, size);
 }
@@ -72,7 +77,8 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, 
const void *buf,
 DIMMDevice *dimm = DIMM(nvdimm);
 uint64_t backend_offset;
 
-assert((nvdimm->label_size >= size + offset) && (offset + size > offset));
+assert(nvdimm->reserve_label_data &&
+   (nvdimm->label_size >= size + offset) && (offset + size > offset));
 
 memcpy(nvdimm->label_data + offset, buf, size);
 
@@ -97,10 +103,33 @@ static void nvdimm_class_init(ObjectClass *oc, void *data)
 nvc->write_label_data = nvdimm_write_label_data;
 }
 
+static bool nvdimm_get_reserve_label_data(Object *obj, Error **errp)
+{
+NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+return nvdimm->reserve_label_data;
+}
+
+static void
+nvdimm_set_reserve_label_data(Object *obj, bool value, Error **errp)
+{
+NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+nvdimm->reserve_label_data = value;
+}
+
+static void nvdimm_init(Object *obj)
+{
+object_property_add_bool(obj, "reserve-label-data",
+ nvdimm_get_reserve_label_data,
+ nvdimm_set_reserve_labe

[PATCH v6 10/33] hostmem-file: clean up memory allocation

- hostmem-file.c is compiled only if CONFIG_LINUX is enabled so that is
  unnecessary to do the same check in the source file

- the interface, HostMemoryBackendClass->alloc(), is not called many
  times, do not need to check if the memory-region is initialized

Signed-off-by: Xiao Guangrong 
---
 backends/hostmem-file.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index e9b6d21..9097a57 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -46,17 +46,12 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 error_setg(errp, "mem-path property not set");
 return;
 }
-#ifndef CONFIG_LINUX
-error_setg(errp, "-mem-path not supported on this host");
-#else
-if (!memory_region_size(&backend->mr)) {
-backend->force_prealloc = mem_prealloc;
-memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
+
+backend->force_prealloc = mem_prealloc;
+memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
  object_get_canonical_path(OBJECT(backend)),
  backend->size, fb->share,
  fb->mem_path, errp);
-}
-#endif
 }
 
 static void
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 29/33] nvdimm acpi: support Get Namespace Label Data function

Function 5 is used to get Namespace Label Data

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 63 
 1 file changed, 63 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 67c4699..8c27b25 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -428,6 +428,7 @@ struct NvdimmDsmIn {
 union {
 uint8_t arg3[0];
 NvdimmFuncInSetLabelData func_set_label_data;
+NvdimmFuncInGetLabelData func_get_label_data;
 };
 } QEMU_PACKED;
 typedef struct NvdimmDsmIn NvdimmDsmIn;
@@ -527,6 +528,65 @@ static void nvdimm_dsm_func_label_size(NVDIMMDevice 
*nvdimm, GArray *out)
 g_array_append_vals(out, &func_label_size, sizeof(func_label_size));
 }
 
+static uint32_t nvdimm_rw_label_data_check(NVDIMMDevice *nvdimm,
+   uint32_t offset, uint32_t length)
+{
+if (offset + length < offset) {
+nvdimm_debug("offset %#x + length %#x is overflow.\n", offset,
+ length);
+return NVDIMM_DSM_DEV_STATUS_INVALID_PARAS;
+}
+
+if (nvdimm->label_size < offset + length) {
+nvdimm_debug("position %#x is beyond label data (len = %#lx).\n",
+ offset + length, nvdimm->label_size);
+return NVDIMM_DSM_DEV_STATUS_INVALID_PARAS;
+}
+
+if (length > nvdimm_get_max_xfer_label_size()) {
+nvdimm_debug("length (%#x) is larger than max_xfer (%#x).\n",
+ length, nvdimm_get_max_xfer_label_size());
+return NVDIMM_DSM_DEV_STATUS_INVALID_PARAS;
+}
+
+return NVDIMM_DSM_STATUS_SUCCESS;
+}
+
+/*
+ * DSM Spec Rev1 4.5 Get Namespace Label Data (Function Index 5).
+ */
+static void nvdimm_dsm_func_get_label_data(NVDIMMDevice *nvdimm,
+   NvdimmDsmIn *in, GArray *out)
+{
+NVDIMMClass *nvc = NVDIMM_GET_CLASS(nvdimm);
+NvdimmFuncInGetLabelData *get_label_data = &in->func_get_label_data;
+void *buf;
+uint32_t status;
+
+le32_to_cpus(&get_label_data->offset);
+le32_to_cpus(&get_label_data->length);
+
+nvdimm_debug("Read Label Data: offset %#x length %#x.\n",
+ get_label_data->offset, get_label_data->length);
+
+status = nvdimm_rw_label_data_check(nvdimm, get_label_data->offset,
+get_label_data->length);
+if (status != NVDIMM_DSM_STATUS_SUCCESS) {
+goto exit;
+}
+
+/* write nvdimm_func_out_get_label_data.status. */
+nvdimm_dsm_write_status(out, status);
+/* write nvdimm_func_out_get_label_data.out_buf. */
+buf = acpi_data_push(out, get_label_data->length);
+nvc->read_label_data(nvdimm, buf, get_label_data->length,
+ get_label_data->offset);
+return;
+
+exit:
+nvdimm_dsm_write_status(out, status);
+}
+
 static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
 {
 GSList *list = nvdimm_get_plugged_device_list();
@@ -554,6 +614,9 @@ static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
 case 0x4 /* Get Namespace Label Size */:
 nvdimm_dsm_func_label_size(nvdimm, out);
 goto free;
+case 0x5 /* Get Namespace Label Data */:
+nvdimm_dsm_func_get_label_data(nvdimm, in, out);
+goto free;
 default:
 status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
 };
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 26/33] nvdimm acpi: save arg3 for NVDIMM device _DSM method

Check if the input Arg3 is valid then store it into dsm_in if needed

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 53ed675..e179a72 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -524,13 +524,38 @@ static void nvdimm_build_acpi_devices(GSList 
*device_list, Aml *sb_scope)
 
 method = aml_method_serialized("NCAL", 4);
 {
-Aml *buffer_size = aml_local(0);
+Aml *ifctx, *pckg, *buffer_size = aml_local(0);
 
 aml_append(method, aml_store(aml_arg(0), aml_name("HDLE")));
 aml_append(method, aml_store(aml_arg(1), aml_name("REVS")));
 aml_append(method, aml_store(aml_arg(2), aml_name("FUNC")));
 
 /*
+ * The fourth parameter (Arg3) of _DSM is a package which contains
+ * a buffer, the layout of the buffer is specified by UUID (Arg0),
+ * Revision ID (Arg1) and Function Index (Arg2) which are documented
+ * in the DSM Spec.
+ */
+pckg = aml_arg(3);
+ifctx = aml_if(aml_and(aml_equal(aml_object_type(pckg),
+ aml_int(4 /* Package */)),
+   aml_equal(aml_sizeof(pckg),
+ aml_int(1;
+{
+Aml *pckg_index, *pckg_buf;
+
+pckg_index = aml_local(2);
+pckg_buf = aml_local(3);
+
+aml_append(ifctx, aml_store(aml_index(pckg, aml_int(0)),
+pckg_index));
+aml_append(ifctx, aml_store(aml_derefof(pckg_index),
+pckg_buf));
+aml_append(ifctx, aml_store(pckg_buf, aml_name("ARG3")));
+}
+aml_append(method, ifctx);
+
+/*
  * transfer control to QEMU and the buffer size filled by
  * QEMU is returned.
  */
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 32/33] nvdimm acpi: support _FIT method

FIT buffer is not completely mapped into guest address space, so a new
function, Read FIT, function index 0x, is reserved by QEMU to
read the piece of FIT buffer. The buffer is concatenated before _FIT
return

Refer to docs/specs/acpi-nvdimm.txt for detailed design

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 168 +--
 1 file changed, 164 insertions(+), 4 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index f8d7d19..3f35220 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -384,6 +384,18 @@ static void nvdimm_build_nfit(GSList *device_list, GArray 
*table_offsets,
 g_array_free(structures, true);
 }
 
+/*
+ * define UUID for NVDIMM Root Device according to Chapter 3 DSM Interface
+ * for NVDIMM Root Device - Example in DSM Spec Rev1.
+ */
+#define NVDIMM_DSM_ROOT_UUID "2F10E7A4-9E91-11E4-89D3-123B93F75CBA"
+
+/*
+ * Read FIT Function, which is a QEMU internal use only function, more detail
+ * refer to docs/specs/acpi_nvdimm.txt
+ */
+#define NVDIMM_DSM_FUNC_READ_FIT 0x
+
 /* define NVDIMM DSM return status codes according to DSM Spec Rev1. */
 enum {
 /* Common return status codes. */
@@ -420,6 +432,11 @@ struct NvdimmFuncInSetLabelData {
 } QEMU_PACKED;
 typedef struct NvdimmFuncInSetLabelData NvdimmFuncInSetLabelData;
 
+struct NvdimmFuncInReadFit {
+uint32_t offset; /* fit offset */
+} QEMU_PACKED;
+typedef struct NvdimmFuncInReadFit NvdimmFuncInReadFit;
+
 struct NvdimmDsmIn {
 uint32_t handle;
 uint32_t revision;
@@ -429,6 +446,7 @@ struct NvdimmDsmIn {
 uint8_t arg3[0];
 NvdimmFuncInSetLabelData func_set_label_data;
 NvdimmFuncInGetLabelData func_get_label_data;
+NvdimmFuncInReadFit func_read_fit;
 };
 } QEMU_PACKED;
 typedef struct NvdimmDsmIn NvdimmDsmIn;
@@ -450,13 +468,71 @@ struct NvdimmFuncOutGetLabelData {
 } QEMU_PACKED;
 typedef struct NvdimmFuncOutGetLabelData NvdimmFuncOutGetLabelData;
 
+struct NvdimmFuncOutReadFit {
+uint32_t status;/* return status code. */
+uint32_t length;/* the length of fit data we read. */
+uint8_t fit_data[0]; /* fit data. */
+} QEMU_PACKED;
+typedef struct NvdimmFuncOutReadFit NvdimmFuncOutReadFit;
+
 static void nvdimm_dsm_write_status(GArray *out, uint32_t status)
 {
 status = cpu_to_le32(status);
 build_append_int_noprefix(out, status, sizeof(status));
 }
 
-static void nvdimm_dsm_root(NvdimmDsmIn *in, GArray *out)
+/* Build fit memory which is presented to guest via _FIT method. */
+static void nvdimm_build_fit(AcpiNVDIMMState *state)
+{
+if (!state->fit) {
+GSList *device_list = nvdimm_get_plugged_device_list();
+
+nvdimm_debug("Rebuild FIT...\n");
+state->fit = nvdimm_build_device_structure(device_list);
+g_slist_free(device_list);
+}
+}
+
+/* Read FIT data, defined in docs/specs/acpi_nvdimm.txt. */
+static void nvdimm_dsm_func_read_fit(AcpiNVDIMMState *state,
+ NvdimmDsmIn *in, GArray *out)
+{
+NvdimmFuncInReadFit *read_fit = &in->func_read_fit;
+NvdimmFuncOutReadFit fit_out;
+uint32_t read_length = TARGET_PAGE_SIZE - sizeof(NvdimmFuncOutReadFit);
+uint32_t status = NVDIMM_DSM_ROOT_DEV_STATUS_INVALID_PARAS;
+
+nvdimm_build_fit(state);
+
+le32_to_cpus(&read_fit->offset);
+
+nvdimm_debug("Read FIT offset %#x.\n", read_fit->offset);
+
+if (read_fit->offset > state->fit->len) {
+nvdimm_debug("offset %#x is beyond fit size (%#x).\n",
+ read_fit->offset, state->fit->len);
+goto exit;
+}
+
+read_length = MIN(read_length, state->fit->len - read_fit->offset);
+nvdimm_debug("read length %#x.\n", read_length);
+
+fit_out.status = cpu_to_le32(NVDIMM_DSM_STATUS_SUCCESS);
+fit_out.length = cpu_to_le32(read_length);
+g_array_append_vals(out, &fit_out, sizeof(fit_out));
+
+if (read_length) {
+g_array_append_vals(out, state->fit->data + read_fit->offset,
+read_length);
+}
+return;
+
+exit:
+nvdimm_dsm_write_status(out, status);
+}
+
+static void nvdimm_dsm_root(AcpiNVDIMMState *state, NvdimmDsmIn *in,
+GArray *out)
 {
 uint32_t status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
 
@@ -475,6 +551,10 @@ static void nvdimm_dsm_root(NvdimmDsmIn *in, GArray *out)
 return;
 }
 
+if (in->function == NVDIMM_DSM_FUNC_READ_FIT /* FIT Read */) {
+return nvdimm_dsm_func_read_fit(state, in, out);
+}
+
 nvdimm_debug("Return status %#x.\n", status);
 nvdimm_dsm_write_status(out, status);
 }
@@ -710,7 +790,7 @@ nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
 
 /* Handle 0 is reserved for NVDIMM Root Device. */
 if (!in->handle) {
-nvdimm_dsm_root(in, out);
+nvdimm_dsm_root(state, in, out);
 goto exit;
 }
 
@@ -925,8 +1005,88 @@ static void nvdimm_build_acpi_devices(GSList 
*device_list, A

[PATCH v6 07/33] util: introduce qemu_file_get_page_size()

There are three places use the some logic to get the page size on
the file path or file fd

This patch introduces qemu_file_get_page_size() to unify the code

Signed-off-by: Xiao Guangrong 
---
 include/qemu/osdep.h |  1 +
 target-ppc/kvm.c | 21 +++--
 util/oslib-posix.c   | 16 
 util/oslib-win32.c   |  5 +
 4 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index b568424..d4dde02 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -302,4 +302,5 @@ int qemu_read_password(char *buf, int buf_size);
  */
 pid_t qemu_fork(Error **errp);
 
+size_t qemu_file_get_page_size(const char *mem_path);
 #endif
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index ac70f08..c661f1c 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -308,28 +308,13 @@ static void kvm_get_smmu_info(PowerPCCPU *cpu, struct 
kvm_ppc_smmu_info *info)
 
 static long gethugepagesize(const char *mem_path)
 {
-struct statfs fs;
-int ret;
-
-do {
-ret = statfs(mem_path, &fs);
-} while (ret != 0 && errno == EINTR);
+long size = qemu_file_get_page_size(mem_path);
 
-if (ret != 0) {
-fprintf(stderr, "Couldn't statfs() memory path: %s\n",
-strerror(errno));
+if (!size) {
 exit(1);
 }
 
-#define HUGETLBFS_MAGIC   0x958458f6
-
-if (fs.f_type != HUGETLBFS_MAGIC) {
-/* Explicit mempath, but it's ordinary pages */
-return getpagesize();
-}
-
-/* It's hugepage, return the huge page size */
-return fs.f_bsize;
+return size;
 }
 
 static int find_max_supported_pagesize(Object *obj, void *opaque)
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 914cef5..ad94c5a 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -360,6 +360,22 @@ static size_t fd_getpagesize(int fd)
 return getpagesize();
 }
 
+size_t qemu_file_get_page_size(const char *path)
+{
+size_t size = 0;
+int fd = qemu_open(path, O_RDONLY);
+
+if (fd < 0) {
+fprintf(stderr, "Could not open %s.\n", path);
+goto exit;
+}
+
+size = fd_getpagesize(fd);
+qemu_close(fd);
+exit:
+return size;
+}
+
 void os_mem_prealloc(int fd, char *area, size_t memory)
 {
 int ret;
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index 09f9e98..a18aa87 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -462,6 +462,11 @@ size_t getpagesize(void)
 return system_info.dwPageSize;
 }
 
+size_t qemu_file_get_page_size(const char *path)
+{
+return getpagesize();
+}
+
 void os_mem_prealloc(int fd, char *area, size_t memory)
 {
 int i;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 09/33] exec: allow file_ram_alloc to work on file

Currently, file_ram_alloc() only works on directory - it creates a file
under @path and do mmap on it

This patch tries to allow it to work on file directly, if @path is a
directory it works as before, otherwise it treats @path as the target
file then directly allocate memory from it

Signed-off-by: Xiao Guangrong 
---
 exec.c | 80 ++
 1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/exec.c b/exec.c
index 3ca7e50..f219010 100644
--- a/exec.c
+++ b/exec.c
@@ -1174,14 +1174,60 @@ void qemu_mutex_unlock_ramlist(void)
 }
 
 #ifdef __linux__
+static bool path_is_dir(const char *path)
+{
+struct stat fs;
+
+return stat(path, &fs) == 0 && S_ISDIR(fs.st_mode);
+}
+
+static int open_file_path(RAMBlock *block, const char *path, size_t size)
+{
+char *filename;
+char *sanitized_name;
+char *c;
+int fd;
+
+if (!path_is_dir(path)) {
+int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY;
+
+flags |= O_EXCL;
+return open(path, flags);
+}
+
+/* Make name safe to use with mkstemp by replacing '/' with '_'. */
+sanitized_name = g_strdup(memory_region_name(block->mr));
+for (c = sanitized_name; *c != '\0'; c++) {
+if (*c == '/') {
+*c = '_';
+}
+}
+filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
+   sanitized_name);
+g_free(sanitized_name);
+fd = mkstemp(filename);
+if (fd >= 0) {
+unlink(filename);
+/*
+ * ftruncate is not supported by hugetlbfs in older
+ * hosts, so don't bother bailing out on errors.
+ * If anything goes wrong with it under other filesystems,
+ * mmap will fail.
+ */
+if (ftruncate(fd, size)) {
+perror("ftruncate");
+}
+}
+g_free(filename);
+
+return fd;
+}
+
 static void *file_ram_alloc(RAMBlock *block,
 ram_addr_t memory,
 const char *path,
 Error **errp)
 {
-char *filename;
-char *sanitized_name;
-char *c;
 void *area;
 int fd;
 uint64_t pagesize;
@@ -1211,38 +1257,14 @@ static void *file_ram_alloc(RAMBlock *block,
 goto error;
 }
 
-/* Make name safe to use with mkstemp by replacing '/' with '_'. */
-sanitized_name = g_strdup(memory_region_name(block->mr));
-for (c = sanitized_name; *c != '\0'; c++) {
-if (*c == '/')
-*c = '_';
-}
-
-filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
-   sanitized_name);
-g_free(sanitized_name);
+memory = ROUND_UP(memory, pagesize);
 
-fd = mkstemp(filename);
+fd = open_file_path(block, path, memory);
 if (fd < 0) {
 error_setg_errno(errp, errno,
  "unable to create backing store for path %s", path);
-g_free(filename);
 goto error;
 }
-unlink(filename);
-g_free(filename);
-
-memory = ROUND_UP(memory, pagesize);
-
-/*
- * ftruncate is not supported by hugetlbfs in older
- * hosts, so don't bother bailing out on errors.
- * If anything goes wrong with it under other filesystems,
- * mmap will fail.
- */
-if (ftruncate(fd, memory)) {
-perror("ftruncate");
-}
 
 area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
 if (area == MAP_FAILED) {
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 24/33] nvdimm acpi: build ACPI NFIT table

NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)

Currently, we only support PMEM mode. Each device has 3 structures:
- SPA structure, defines the PMEM region info

- MEM DEV structure, it has the @handle which is used to associate specified
  ACPI NVDIMM  device we will introduce in later patch.
  Also we can happily ignored the memory device's interleave, the real
  nvdimm hardware access is hidden behind host

- DCR structure, it defines vendor ID used to associate specified vendor
  nvdimm driver. Since we only implement PMEM mode this time, Command
  window and Data window are not needed

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c| 355 
 hw/i386/acpi-build.c|   6 +
 include/hw/mem/nvdimm.h |  10 ++
 3 files changed, 371 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 1223da2..dd84e5f 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -26,8 +26,348 @@
  * License along with this library; if not, see 
  */
 
+#include "hw/acpi/acpi.h"
+#include "hw/acpi/aml-build.h"
 #include "hw/mem/nvdimm.h"
 
+static int nvdimm_plugged_device_list(Object *obj, void *opaque)
+{
+GSList **list = opaque;
+
+if (object_dynamic_cast(obj, TYPE_NVDIMM)) {
+NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+if (memory_region_is_mapped(&nvdimm->nvdimm_mr)) {
+*list = g_slist_append(*list, DEVICE(obj));
+}
+}
+
+object_child_foreach(obj, nvdimm_plugged_device_list, opaque);
+return 0;
+}
+
+/*
+ * inquire plugged NVDIMM devices and link them into the list which is
+ * returned to the caller.
+ *
+ * Note: it is the caller's responsibility to free the list to avoid
+ * memory leak.
+ */
+static GSList *nvdimm_get_plugged_device_list(void)
+{
+GSList *list = NULL;
+
+object_child_foreach(qdev_get_machine(), nvdimm_plugged_device_list,
+ &list);
+return list;
+}
+
+#define NVDIMM_UUID_LE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7) \
+   { (a) & 0xff, ((a) >> 8) & 0xff, ((a) >> 16) & 0xff, ((a) >> 24) & 0xff, \
+ (b) & 0xff, ((b) >> 8) & 0xff, (c) & 0xff, ((c) >> 8) & 0xff,  \
+ (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }
+/*
+ * define Byte Addressable Persistent Memory (PM) Region according to
+ * ACPI 6.0: 5.2.25.1 System Physical Address Range Structure.
+ */
+static const uint8_t nvdimm_nfit_spa_uuid[] =
+  NVDIMM_UUID_LE(0x66f0d379, 0xb4f3, 0x4074, 0xac, 0x43, 0x0d, 0x33,
+ 0x18, 0xb7, 0x8c, 0xdb);
+
+/*
+ * NVDIMM Firmware Interface Table
+ * @signature: "NFIT"
+ *
+ * It provides information that allows OSPM to enumerate NVDIMM present in
+ * the platform and associate system physical address ranges created by the
+ * NVDIMMs.
+ *
+ * It is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
+ */
+struct NvdimmNfitHeader {
+ACPI_TABLE_HEADER_DEF
+uint32_t reserved;
+} QEMU_PACKED;
+typedef struct NvdimmNfitHeader NvdimmNfitHeader;
+
+/*
+ * define NFIT structures according to ACPI 6.0: 5.2.25 NVDIMM Firmware
+ * Interface Table (NFIT).
+ */
+
+/*
+ * System Physical Address Range Structure
+ *
+ * It describes the system physical address ranges occupied by NVDIMMs and
+ * the types of the regions.
+ */
+struct NvdimmNfitSpa {
+uint16_t type;
+uint16_t length;
+uint16_t spa_index;
+uint16_t flags;
+uint32_t reserved;
+uint32_t proximity_domain;
+uint8_t type_guid[16];
+uint64_t spa_base;
+uint64_t spa_length;
+uint64_t mem_attr;
+} QEMU_PACKED;
+typedef struct NvdimmNfitSpa NvdimmNfitSpa;
+
+/*
+ * Memory Device to System Physical Address Range Mapping Structure
+ *
+ * It enables identifying each NVDIMM region and the corresponding SPA
+ * describing the memory interleave
+ */
+struct NvdimmNfitMemDev {
+uint16_t type;
+uint16_t length;
+uint32_t nfit_handle;
+uint16_t phys_id;
+uint16_t region_id;
+uint16_t spa_index;
+uint16_t dcr_index;
+uint64_t region_len;
+uint64_t region_offset;
+uint64_t region_dpa;
+uint16_t interleave_index;
+uint16_t interleave_ways;
+uint16_t flags;
+uint16_t reserved;
+} QEMU_PACKED;
+typedef struct NvdimmNfitMemDev NvdimmNfitMemDev;
+
+/*
+ * NVDIMM Control Region Structure
+ *
+ * It describes the NVDIMM and if applicable, Block Control Window.
+ */
+struct NvdimmNfitControlRegion {
+uint16_t type;
+uint16_t length;
+uint16_t dcr_index;
+uint16_t vendor_id;
+uint16_t device_id;
+uint16_t revision_id;
+uint16_t sub_vendor_id;
+uint16_t sub_device_id;
+uint16_t sub_revision_id;
+uint8_t reserved[6];
+uint32_t serial_number;
+uint16_t fic;
+uint16_t num_bcw;
+uint64_t bcw_size;
+uint64_t cmd_offset;
+uint64_t cmd_size;
+uint64_t status_offset;
+uint64_t status_size;
+uint16_t flags;
+uint8_t reserved2[6];
+} QEMU_PACK

[PATCH v6 15/33] stubs: rename qmp_pc_dimm_device_list.c

Rename qmp_pc_dimm_device_list.c to qmp_dimm_device_list.c

Signed-off-by: Xiao Guangrong 
---
 stubs/Makefile.objs | 2 +-
 stubs/{qmp_pc_dimm_device_list.c => qmp_dimm_device_list.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename stubs/{qmp_pc_dimm_device_list.c => qmp_dimm_device_list.c} (100%)

diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index 251443b..d5c862a 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -32,6 +32,6 @@ stub-obj-y += vmstate.o
 stub-obj-$(CONFIG_WIN32) += fd-register.o
 stub-obj-y += cpus.o
 stub-obj-y += kvm.o
-stub-obj-y += qmp_pc_dimm_device_list.o
+stub-obj-y += qmp_dimm_device_list.o
 stub-obj-y += target-monitor-defs.o
 stub-obj-y += vhost.o
diff --git a/stubs/qmp_pc_dimm_device_list.c b/stubs/qmp_dimm_device_list.c
similarity index 100%
rename from stubs/qmp_pc_dimm_device_list.c
rename to stubs/qmp_dimm_device_list.c
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 33/33] nvdimm: add maintain info

Add NVDIMM maintainer

Signed-off-by: Xiao Guangrong 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3144113..865c0cf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -907,6 +907,13 @@ M: Jiri Pirko 
 S: Maintained
 F: hw/net/rocker/
 
+NVDIMM
+M: Xiao Guangrong 
+S: Maintained
+F: hw/acpi/nvdimm.c
+F: hw/mem/nvdimm.c
+F: include/hw/mem/nvdimm.h
+
 Subsystems
 --
 Audio
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 04/33] acpi: add aml_concatenate

Implement Concatenate term which is used by NVDIMM _DSM method
in later patch

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 14 ++
 include/hw/acpi/aml-build.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 9fe5e7b..efc06ab 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1164,6 +1164,20 @@ Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, 
const char *name)
 return var;
 }
 
+/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefConcat */
+Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target)
+{
+Aml *var = aml_opcode(0x73 /* ConcatOp */);
+aml_append(var, source1);
+aml_append(var, source2);
+
+if (target) {
+aml_append(var, target);
+}
+
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 7e1c43b..325782d 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -277,6 +277,7 @@ Aml *aml_unicode(const char *str);
 Aml *aml_derefof(Aml *arg);
 Aml *aml_sizeof(Aml *arg);
 Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
+Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 20/33] dimm: introduce realize callback

nvdimm need check if the backend memory is large enough to contain label
data and init its memory region when the device is realized, so introduce
realize callback which is called after common dimm has been realize

Signed-off-by: Xiao Guangrong 
---
 hw/mem/dimm.c | 5 +
 include/hw/mem/dimm.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 7d1..0ae23ce 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -426,6 +426,7 @@ static void dimm_init(Object *obj)
 static void dimm_realize(DeviceState *dev, Error **errp)
 {
 DIMMDevice *dimm = DIMM(dev);
+DIMMDeviceClass *ddc = DIMM_GET_CLASS(dimm);
 
 if (!dimm->hostmem) {
 error_setg(errp, "'" DIMM_MEMDEV_PROP "' property is not set");
@@ -438,6 +439,10 @@ static void dimm_realize(DeviceState *dev, Error **errp)
dimm->node, nb_numa_nodes ? nb_numa_nodes : 1);
 return;
 }
+
+if (ddc->realize) {
+ddc->realize(dimm, errp);
+}
 }
 
 static void dimm_class_init(ObjectClass *oc, void *data)
diff --git a/include/hw/mem/dimm.h b/include/hw/mem/dimm.h
index 50f768a..72ec24c 100644
--- a/include/hw/mem/dimm.h
+++ b/include/hw/mem/dimm.h
@@ -65,6 +65,7 @@ typedef struct DIMMDeviceClass {
 DeviceClass parent_class;
 
 /* public */
+void (*realize)(DIMMDevice *dimm, Error **errp);
 MemoryRegion *(*get_memory_region)(DIMMDevice *dimm);
 } DIMMDeviceClass;
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 06/33] acpi: add aml_method_serialized

It avoid explicit Mutex and will be used by NVDIMM ACPI

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 26 --
 include/hw/acpi/aml-build.h |  1 +
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 9f792ab..8bee8b2 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -696,14 +696,36 @@ Aml *aml_while(Aml *predicate)
 }
 
 /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefMethod */
-Aml *aml_method(const char *name, int arg_count)
+static Aml *__aml_method(const char *name, int arg_count, bool serialized)
 {
 Aml *var = aml_bundle(0x14 /* MethodOp */, AML_PACKAGE);
+int methodflags;
+
+/*
+ * MethodFlags:
+ *   bit 0-2: ArgCount (0-7)
+ *   bit 3: SerializeFlag
+ * 0: NotSerialized
+ * 1: Serialized
+ *   bit 4-7: reserved (must be 0)
+ */
+assert(!(arg_count & ~7));
+methodflags = arg_count | (serialized << 3);
 build_append_namestring(var->buf, "%s", name);
-build_append_byte(var->buf, arg_count); /* MethodFlags: ArgCount */
+build_append_byte(var->buf, methodflags);
 return var;
 }
 
+Aml *aml_method(const char *name, int arg_count)
+{
+return __aml_method(name, arg_count, false);
+}
+
+Aml *aml_method_serialized(const char *name, int arg_count)
+{
+return __aml_method(name, arg_count, true);
+}
+
 /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefDevice */
 Aml *aml_device(const char *name_format, ...)
 {
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 5b8a118..00cf40e 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -263,6 +263,7 @@ Aml *aml_qword_memory(AmlDecode dec, AmlMinFixed min_fixed,
 Aml *aml_scope(const char *name_format, ...) GCC_FMT_ATTR(1, 2);
 Aml *aml_device(const char *name_format, ...) GCC_FMT_ATTR(1, 2);
 Aml *aml_method(const char *name, int arg_count);
+Aml *aml_method_serialized(const char *name, int arg_count);
 Aml *aml_if(Aml *predicate);
 Aml *aml_else(void);
 Aml *aml_while(Aml *predicate);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 13/33] pc-dimm: make pc_existing_dimms_capacity static and rename it

pc_existing_dimms_capacity() can be static since it is not used out of
pc-dimm.c and drop the pc_ prefix to prepare the work which abstracts
dimm device type from pc-dimm

Signed-off-by: Xiao Guangrong 
---
 hw/mem/pc-dimm.c | 73 
 include/hw/mem/pc-dimm.h |  1 -
 2 files changed, 36 insertions(+), 38 deletions(-)

diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 80f424b..2dcbbcd 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -32,6 +32,38 @@ typedef struct pc_dimms_capacity {
  Error**errp;
 } pc_dimms_capacity;
 
+static int existing_dimms_capacity_internal(Object *obj, void *opaque)
+{
+pc_dimms_capacity *cap = opaque;
+uint64_t *size = &cap->size;
+
+if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
+DeviceState *dev = DEVICE(obj);
+
+if (dev->realized) {
+(*size) += object_property_get_int(obj, PC_DIMM_SIZE_PROP,
+cap->errp);
+}
+
+if (cap->errp && *cap->errp) {
+return 1;
+}
+}
+object_child_foreach(obj, existing_dimms_capacity_internal, opaque);
+return 0;
+}
+
+static uint64_t existing_dimms_capacity(Error **errp)
+{
+pc_dimms_capacity cap;
+
+cap.size = 0;
+cap.errp = errp;
+
+existing_dimms_capacity_internal(qdev_get_machine(), &cap);
+return cap.size;
+}
+
 void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
  MemoryRegion *mr, uint64_t align, Error **errp)
 {
@@ -39,7 +71,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState 
*hpms,
 MachineState *machine = MACHINE(qdev_get_machine());
 PCDIMMDevice *dimm = PC_DIMM(dev);
 Error *local_err = NULL;
-uint64_t existing_dimms_capacity = 0;
+uint64_t dimms_capacity = 0;
 uint64_t addr;
 
 addr = object_property_get_int(OBJECT(dimm), PC_DIMM_ADDR_PROP, 
&local_err);
@@ -55,17 +87,16 @@ void pc_dimm_memory_plug(DeviceState *dev, 
MemoryHotplugState *hpms,
 goto out;
 }
 
-existing_dimms_capacity = pc_existing_dimms_capacity(&local_err);
+dimms_capacity = existing_dimms_capacity(&local_err);
 if (local_err) {
 goto out;
 }
 
-if (existing_dimms_capacity + memory_region_size(mr) >
+if (dimms_capacity + memory_region_size(mr) >
 machine->maxram_size - machine->ram_size) {
 error_setg(&local_err, "not enough space, currently 0x%" PRIx64
" in use of total hot pluggable 0x" RAM_ADDR_FMT,
-   existing_dimms_capacity,
-   machine->maxram_size - machine->ram_size);
+   dimms_capacity, machine->maxram_size - machine->ram_size);
 goto out;
 }
 
@@ -120,38 +151,6 @@ void pc_dimm_memory_unplug(DeviceState *dev, 
MemoryHotplugState *hpms,
 vmstate_unregister_ram(mr, dev);
 }
 
-static int pc_existing_dimms_capacity_internal(Object *obj, void *opaque)
-{
-pc_dimms_capacity *cap = opaque;
-uint64_t *size = &cap->size;
-
-if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
-DeviceState *dev = DEVICE(obj);
-
-if (dev->realized) {
-(*size) += object_property_get_int(obj, PC_DIMM_SIZE_PROP,
-cap->errp);
-}
-
-if (cap->errp && *cap->errp) {
-return 1;
-}
-}
-object_child_foreach(obj, pc_existing_dimms_capacity_internal, opaque);
-return 0;
-}
-
-uint64_t pc_existing_dimms_capacity(Error **errp)
-{
-pc_dimms_capacity cap;
-
-cap.size = 0;
-cap.errp = errp;
-
-pc_existing_dimms_capacity_internal(qdev_get_machine(), &cap);
-return cap.size;
-}
-
 int qmp_pc_dimm_device_list(Object *obj, void *opaque)
 {
 MemoryDeviceInfoList ***prev = opaque;
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index 11a8937..8a43548 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -87,7 +87,6 @@ uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
 int pc_dimm_get_free_slot(const int *hint, int max_slots, Error **errp);
 
 int qmp_pc_dimm_device_list(Object *obj, void *opaque);
-uint64_t pc_existing_dimms_capacity(Error **errp);
 void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
  MemoryRegion *mr, uint64_t align, Error **errp);
 void pc_dimm_memory_unplug(DeviceState *dev, MemoryHotplugState *hpms,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 21/33] nvdimm: implement NVDIMM device abstract

Introduce "nvdimm" device which is based on dimm device type

128K memory region which is the minimum namespace label size
required by NVDIMM Namespace Spec locates at the end of
backend memory device is reserved for label data

We can use "-m 1G,maxmem=100G,slots=10 -object memory-backend-file,
id=mem1,size=1G,mem-path=/dev/pmem0 -device nvdimm,memdev=mem1" to
create NVDIMM device for guest

Signed-off-by: Xiao Guangrong 
---
 default-configs/i386-softmmu.mak   |   1 +
 default-configs/x86_64-softmmu.mak |   1 +
 hw/acpi/memory_hotplug.c   |   6 ++
 hw/mem/Makefile.objs   |   1 +
 hw/mem/nvdimm.c| 113 +
 include/hw/mem/nvdimm.h|  83 +++
 6 files changed, 205 insertions(+)
 create mode 100644 hw/mem/nvdimm.c
 create mode 100644 include/hw/mem/nvdimm.h

diff --git a/default-configs/i386-softmmu.mak b/default-configs/i386-softmmu.mak
index 3ece8bb..4e84a1c 100644
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -47,6 +47,7 @@ CONFIG_APIC=y
 CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
+CONFIG_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/default-configs/x86_64-softmmu.mak 
b/default-configs/x86_64-softmmu.mak
index 92ea7c1..e877a86 100644
--- a/default-configs/x86_64-softmmu.mak
+++ b/default-configs/x86_64-softmmu.mak
@@ -47,6 +47,7 @@ CONFIG_APIC=y
 CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
+CONFIG_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/hw/acpi/memory_hotplug.c b/hw/acpi/memory_hotplug.c
index 20d3093..bb5a29f 100644
--- a/hw/acpi/memory_hotplug.c
+++ b/hw/acpi/memory_hotplug.c
@@ -1,6 +1,7 @@
 #include "hw/acpi/memory_hotplug.h"
 #include "hw/acpi/pc-hotplug.h"
 #include "hw/mem/dimm.h"
+#include "hw/mem/nvdimm.h"
 #include "hw/boards.h"
 #include "hw/qdev-core.h"
 #include "trace.h"
@@ -231,6 +232,11 @@ void acpi_memory_plug_cb(ACPIREGS *ar, qemu_irq irq, 
MemHotplugState *mem_st,
 {
 MemStatus *mdev;
 
+/* Currently, NVDIMM hotplug has not been supported yet. */
+if (object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM)) {
+return;
+}
+
 mdev = acpi_memory_slot_status(mem_st, dev, errp);
 if (!mdev) {
 return;
diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
index cebb4b1..12d9b72 100644
--- a/hw/mem/Makefile.objs
+++ b/hw/mem/Makefile.objs
@@ -1,2 +1,3 @@
 common-obj-$(CONFIG_DIMM) += dimm.o
 common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
+common-obj-$(CONFIG_NVDIMM) += nvdimm.o
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
new file mode 100644
index 000..185aa1a
--- /dev/null
+++ b/hw/mem/nvdimm.c
@@ -0,0 +1,113 @@
+/*
+ * Non-Volatile Dual In-line Memory Module Virtualization Implementation
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ *  Xiao Guangrong 
+ *
+ * Currently, it only supports PMEM Virtualization.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see 
+ */
+
+#include "qapi/visitor.h"
+#include "hw/mem/nvdimm.h"
+
+static MemoryRegion *nvdimm_get_memory_region(DIMMDevice *dimm)
+{
+NVDIMMDevice *nvdimm = NVDIMM(dimm);
+
+return memory_region_size(&nvdimm->nvdimm_mr) ? &nvdimm->nvdimm_mr : NULL;
+}
+
+static void nvdimm_realize(DIMMDevice *dimm, Error **errp)
+{
+MemoryRegion *mr;
+NVDIMMDevice *nvdimm = NVDIMM(dimm);
+uint64_t size;
+
+nvdimm->label_size = MIN_NAMESPACE_LABEL_SIZE;
+
+mr = host_memory_backend_get_memory(dimm->hostmem, errp);
+size = memory_region_size(mr);
+
+if (size <= nvdimm->label_size) {
+char *path = 
object_get_canonical_path_component(OBJECT(dimm->hostmem));
+error_setg(errp, "the size of memdev %s (0x%" PRIx64 ") is too small"
+   " to contain nvdimm namespace label (0x%" PRIx64 ")", path,
+   memory_region_size(mr), nvdimm->label_size);
+return;
+}
+
+memory_region_init_alias(&nvdimm->nvdimm_mr, OBJECT(dimm), "nvdimm-memory",
+ mr, 0, size - nvdimm->label_size);
+nvdimm->label_data = memory_region_get_ram_ptr(mr) +
+ memory_region_size(&nvdimm->nvdimm_mr);
+}
+
+static void nvdimm_read_label_data(NVDIMMDevice *nvdimm, void *buf,
+   uint64_t size, uint64_t offset)
+{

[PATCH v6 03/33] acpi: add aml_create_field

Implement CreateField term which is used by NVDIMM _DSM method in later patch

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 13 +
 include/hw/acpi/aml-build.h |  1 +
 2 files changed, 14 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index a72214d..9fe5e7b 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1151,6 +1151,19 @@ Aml *aml_sizeof(Aml *arg)
 return var;
 }
 
+/* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefCreateField */
+Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name)
+{
+Aml *var = aml_alloc();
+build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
+build_append_byte(var->buf, 0x13); /* CreateFieldOp */
+aml_append(var, srcbuf);
+aml_append(var, index);
+aml_append(var, len);
+build_append_namestring(var->buf, "%s", name);
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 7296efb..7e1c43b 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -276,6 +276,7 @@ Aml *aml_touuid(const char *uuid);
 Aml *aml_unicode(const char *str);
 Aml *aml_derefof(Aml *arg);
 Aml *aml_sizeof(Aml *arg);
+Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 14/33] pc-dimm: drop the prefix of pc-dimm

This patch is generated by this script:

find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \
| xargs sed -i "s/PC_DIMM/DIMM/g"

find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \
| xargs sed -i "s/PCDIMM/DIMM/g"

find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \
| xargs sed -i "s/pc_dimm/dimm/g"

find ./ -name "trace-events" -type f | xargs sed -i "s/pc-dimm/dimm/g"

It prepares the work which abstracts dimm device type for both pc-dimm and
nvdimm

Signed-off-by: Xiao Guangrong 
---
 hmp.c   |   2 +-
 hw/acpi/ich9.c  |   6 +-
 hw/acpi/memory_hotplug.c|  16 ++---
 hw/acpi/piix4.c |   6 +-
 hw/i386/pc.c|  32 -
 hw/mem/pc-dimm.c| 148 
 hw/ppc/spapr.c  |  18 ++---
 include/hw/mem/pc-dimm.h|  62 -
 numa.c  |   2 +-
 qapi-schema.json|   8 +--
 qmp.c   |   2 +-
 stubs/qmp_pc_dimm_device_list.c |   2 +-
 trace-events|   8 +--
 13 files changed, 156 insertions(+), 156 deletions(-)

diff --git a/hmp.c b/hmp.c
index 5048eee..5c617d2 100644
--- a/hmp.c
+++ b/hmp.c
@@ -1952,7 +1952,7 @@ void hmp_info_memory_devices(Monitor *mon, const QDict 
*qdict)
 MemoryDeviceInfoList *info_list = qmp_query_memory_devices(&err);
 MemoryDeviceInfoList *info;
 MemoryDeviceInfo *value;
-PCDIMMDeviceInfo *di;
+DIMMDeviceInfo *di;
 
 for (info = info_list; info; info = info->next) {
 value = info->value;
diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c
index 1c7fcfa..b0d6a67 100644
--- a/hw/acpi/ich9.c
+++ b/hw/acpi/ich9.c
@@ -440,7 +440,7 @@ void ich9_pm_add_properties(Object *obj, ICH9LPCPMRegs *pm, 
Error **errp)
 void ich9_pm_device_plug_cb(ICH9LPCPMRegs *pm, DeviceState *dev, Error **errp)
 {
 if (pm->acpi_memory_hotplug.is_enabled &&
-object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
+object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) {
 acpi_memory_plug_cb(&pm->acpi_regs, pm->irq, &pm->acpi_memory_hotplug,
 dev, errp);
 } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
@@ -455,7 +455,7 @@ void ich9_pm_device_unplug_request_cb(ICH9LPCPMRegs *pm, 
DeviceState *dev,
   Error **errp)
 {
 if (pm->acpi_memory_hotplug.is_enabled &&
-object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
+object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) {
 acpi_memory_unplug_request_cb(&pm->acpi_regs, pm->irq,
   &pm->acpi_memory_hotplug, dev, errp);
 } else {
@@ -468,7 +468,7 @@ void ich9_pm_device_unplug_cb(ICH9LPCPMRegs *pm, 
DeviceState *dev,
   Error **errp)
 {
 if (pm->acpi_memory_hotplug.is_enabled &&
-object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
+object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) {
 acpi_memory_unplug_cb(&pm->acpi_memory_hotplug, dev, errp);
 } else {
 error_setg(errp, "acpi: device unplug for not supported device"
diff --git a/hw/acpi/memory_hotplug.c b/hw/acpi/memory_hotplug.c
index ce428df..e687852 100644
--- a/hw/acpi/memory_hotplug.c
+++ b/hw/acpi/memory_hotplug.c
@@ -54,23 +54,23 @@ static uint64_t acpi_memory_hotplug_read(void *opaque, 
hwaddr addr,
 o = OBJECT(mdev->dimm);
 switch (addr) {
 case 0x0: /* Lo part of phys address where DIMM is mapped */
-val = o ? object_property_get_int(o, PC_DIMM_ADDR_PROP, NULL) : 0;
+val = o ? object_property_get_int(o, DIMM_ADDR_PROP, NULL) : 0;
 trace_mhp_acpi_read_addr_lo(mem_st->selector, val);
 break;
 case 0x4: /* Hi part of phys address where DIMM is mapped */
-val = o ? object_property_get_int(o, PC_DIMM_ADDR_PROP, NULL) >> 32 : 
0;
+val = o ? object_property_get_int(o, DIMM_ADDR_PROP, NULL) >> 32 : 0;
 trace_mhp_acpi_read_addr_hi(mem_st->selector, val);
 break;
 case 0x8: /* Lo part of DIMM size */
-val = o ? object_property_get_int(o, PC_DIMM_SIZE_PROP, NULL) : 0;
+val = o ? object_property_get_int(o, DIMM_SIZE_PROP, NULL) : 0;
 trace_mhp_acpi_read_size_lo(mem_st->selector, val);
 break;
 case 0xc: /* Hi part of DIMM size */
-val = o ? object_property_get_int(o, PC_DIMM_SIZE_PROP, NULL) >> 32 : 
0;
+val = o ? object_property_get_int(o, DIMM_SIZE_PROP, NULL) >> 32 : 0;
 trace_mhp_acpi_read_size_hi(mem_st->selector, val);
 break;
 case 0x10: /* node proximity for _PXM method */
-val = o ? object_property_get_int(o, PC_DIMM_NODE_PROP, NULL) : 0;
+val = o ? object_property_get_int(o, DIMM_NODE_PROP, NULL) : 0;
 trace_mhp_acpi_read_pxm(mem_st->selector, val);
 break;
 case 0x14: /* pack and return is_* fields */

[PATCH v6 22/33] docs: add NVDIMM ACPI documentation

It describes the basic concepts of NVDIMM ACPI and the interface
between QEMU and the ACPI BIOS

Signed-off-by: Xiao Guangrong 
---
 docs/specs/acpi_nvdimm.txt | 179 +
 1 file changed, 179 insertions(+)
 create mode 100644 docs/specs/acpi_nvdimm.txt

diff --git a/docs/specs/acpi_nvdimm.txt b/docs/specs/acpi_nvdimm.txt
new file mode 100644
index 000..cc5db2c
--- /dev/null
+++ b/docs/specs/acpi_nvdimm.txt
@@ -0,0 +1,179 @@
+QEMU<->ACPI BIOS NVDIMM interface
+-
+
+QEMU supports NVDIMM via ACPI. This document describes the basic concepts of
+NVDIMM ACPI and the interface between QEMU and the ACPI BIOS.
+
+NVDIMM ACPI Background
+--
+NVDIMM is introduced in ACPI 6.0 which defines an NVDIMM root device under
+_SB scope with a _HID of “ACPI0012”. For each NVDIMM present or intended
+to be supported by platform, platform firmware also exposes an ACPI
+Namespace Device under the root device.
+
+The NVDIMM child devices under the NVDIMM root device are defined with _ADR
+corresponding to the NFIT device handle. The NVDIMM root device and the
+NVDIMM devices can have device specific methods (_DSM) to provide additional
+functions specific to a particular NVDIMM implementation.
+
+This is an example from ACPI 6.0, a platform contains one NVDIMM:
+
+Scope (\_SB){
+   Device (NVDR) // Root device
+   {
+  Name (_HID, “ACPI0012”)
+  Method (_STA) {...}
+  Method (_FIT) {...}
+  Method (_DSM, ...) {...}
+  Device (NVD)
+  {
+ Name(_ADR, h) //where h is NFIT Device Handle for this NVDIMM
+ Method (_DSM, ...) {...}
+  }
+   }
+}
+
+Methods supported on both NVDIMM root device and NVDIMM device are
+1) _STA(Status)
+   It returns the current status of a device, which can be one of the
+   following: enabled, disabled, or removed.
+
+   Arguments: None
+
+   Return Value:
+   It returns an An Integer which is defined as followings:
+   Bit [0] – Set if the device is present.
+   Bit [1] – Set if the device is enabled and decoding its resources.
+   Bit [2] – Set if the device should be shown in the UI.
+   Bit [3] – Set if the device is functioning properly (cleared if device
+ failed its diagnostics).
+   Bit [4] – Set if the battery is present.
+   Bits [31:5] – Reserved (must be cleared).
+
+2) _DSM (Device Specific Method)
+   It is a control method that enables devices to provide device specific
+   control functions that are consumed by the device driver.
+   The NVDIMM DSM specification can be found at:
+http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+
+   Arguments:
+   Arg0 – A Buffer containing a UUID (16 Bytes)
+   Arg1 – An Integer containing the Revision ID (4 Bytes)
+   Arg2 – An Integer containing the Function Index (4 Bytes)
+   Arg3 – A package containing parameters for the function specified by the
+  UUID, Revision ID, and Function Index
+
+   Return Value:
+   If Function Index = 0, a Buffer containing a function index bitfield.
+   Otherwise, the return value and type depends on the UUID, revision ID
+   and function index which are described in the DSM specification.
+
+Methods on NVDIMM ROOT Device
+_FIT(Firmware Interface Table)
+   It evaluates to a buffer returning data in the format of a series of NFIT
+   Type Structure.
+
+   Arguments: None
+
+   Return Value:
+   A Buffer containing a list of NFIT Type structure entries.
+
+   The detailed definition of the structure can be found at ACPI 6.0: 5.2.25
+   NVDIMM Firmware Interface Table (NFIT).
+
+QEMU NVDIMM Implemention
+
+QEMU reserves a page starting from 0xFF0 and 4 bytes IO Port starting
+from 0x0a18 for NVDIMM ACPI.
+
+Memory 0xFF0 - 0xFF00FFF:
+   This page is RAM-based and it is used to transfer data between _DSM
+   method and QEMU. If ACPI has control, this pages is owned by ACPI which
+   writes _DSM input data to it, otherwise, it is owned by QEMU which
+   emulates _DSM access and writes the output data to it.
+
+   ACPI Writes _DSM Input Data:
+   [0xFF0 - 0xFF3]: 4 bytes, NVDIMM Devcie Handle, 0 is reserved
+for NVDIMM Root device.
+   [0xFF4 - 0xFF7]: 4 bytes, Revision ID, that is the Arg1 of _DSM
+method.
+   [0xFF8 - 0xFFB]: 4 bytes. Function Index, that is the Arg2 of
+_DSM method.
+   [0xFFC - 0xFF00FFF]: 4084 bytes, the Arg3 of _DSM method
+
+   QEMU Writes Output Data:
+   [0xFF0 - 0xFF00FFF]: the DSM return result filled by QEMU
+
+IO Port 0x0a18 - 0xa1b:
+   ACPI uses it to transfer control from guest to QEMU and read the size
+   of return result filled by QEMU
+
+   Read Access:
+   [0x0a18 - 0xa1b]: 4 bytes, the buffer size of _DSM output data.
+
+_DSM process diagram:
+-
+The page, 0xFF0 - 0xFF00FFF, is used by _DSM Virtualization.
+
+ +--+  +

[PATCH v6 17/33] dimm: abstract dimm device from pc-dimm

A base device, dimm, is abstracted from pc-dimm, so that we can
build nvdimm device based on dimm in the later patch

Signed-off-by: Xiao Guangrong 
---
 default-configs/i386-softmmu.mak   |  1 +
 default-configs/ppc64-softmmu.mak  |  1 +
 default-configs/x86_64-softmmu.mak |  1 +
 hw/mem/Makefile.objs   |  3 ++-
 hw/mem/dimm.c  | 11 ++---
 hw/mem/pc-dimm.c   | 46 ++
 include/hw/mem/dimm.h  |  4 ++--
 include/hw/mem/pc-dimm.h   |  7 ++
 8 files changed, 62 insertions(+), 12 deletions(-)
 create mode 100644 hw/mem/pc-dimm.c
 create mode 100644 include/hw/mem/pc-dimm.h

diff --git a/default-configs/i386-softmmu.mak b/default-configs/i386-softmmu.mak
index 43c96d1..3ece8bb 100644
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -18,6 +18,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_X86_ICH=y
+CONFIG_DIMM=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
diff --git a/default-configs/ppc64-softmmu.mak 
b/default-configs/ppc64-softmmu.mak
index bb71b23..482b8a1 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -54,3 +54,4 @@ CONFIG_XICS_KVM=$(and $(CONFIG_PSERIES),$(CONFIG_KVM))
 CONFIG_MC146818RTC=y
 CONFIG_ISA_TESTDEV=y
 CONFIG_MEM_HOTPLUG=y
+CONFIG_DIMM=y
diff --git a/default-configs/x86_64-softmmu.mak 
b/default-configs/x86_64-softmmu.mak
index dfb8095..92ea7c1 100644
--- a/default-configs/x86_64-softmmu.mak
+++ b/default-configs/x86_64-softmmu.mak
@@ -18,6 +18,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_X86_ICH=y
+CONFIG_DIMM=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
index 7563ef5..cebb4b1 100644
--- a/hw/mem/Makefile.objs
+++ b/hw/mem/Makefile.objs
@@ -1 +1,2 @@
-common-obj-$(CONFIG_MEM_HOTPLUG) += dimm.o
+common-obj-$(CONFIG_DIMM) += dimm.o
+common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 9f55cee..4a63409 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -1,5 +1,5 @@
 /*
- * Dimm device for Memory Hotplug
+ * Dimm device abstraction
  *
  * Copyright ProfitBricks GmbH 2012
  * Copyright (C) 2014 Red Hat Inc
@@ -429,21 +429,13 @@ static void dimm_realize(DeviceState *dev, Error **errp)
 }
 }
 
-static MemoryRegion *dimm_get_memory_region(DIMMDevice *dimm)
-{
-return host_memory_backend_get_memory(dimm->hostmem, &error_abort);
-}
-
 static void dimm_class_init(ObjectClass *oc, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(oc);
-DIMMDeviceClass *ddc = DIMM_CLASS(oc);
 
 dc->realize = dimm_realize;
 dc->props = dimm_properties;
 dc->desc = "DIMM memory module";
-
-ddc->get_memory_region = dimm_get_memory_region;
 }
 
 static TypeInfo dimm_info = {
@@ -453,6 +445,7 @@ static TypeInfo dimm_info = {
 .instance_init = dimm_init,
 .class_init= dimm_class_init,
 .class_size= sizeof(DIMMDeviceClass),
+.abstract  = true,
 };
 
 static void dimm_register_types(void)
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
new file mode 100644
index 000..38323e9
--- /dev/null
+++ b/hw/mem/pc-dimm.c
@@ -0,0 +1,46 @@
+/*
+ * Dimm device for Memory Hotplug
+ *
+ * Copyright ProfitBricks GmbH 2012
+ * Copyright (C) 2014 Red Hat Inc
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see 
+ */
+
+#include "hw/mem/pc-dimm.h"
+
+static MemoryRegion *pc_dimm_get_memory_region(DIMMDevice *dimm)
+{
+return host_memory_backend_get_memory(dimm->hostmem, &error_abort);
+}
+
+static void pc_dimm_class_init(ObjectClass *oc, void *data)
+{
+DIMMDeviceClass *ddc = DIMM_CLASS(oc);
+
+ddc->get_memory_region = pc_dimm_get_memory_region;
+}
+
+static TypeInfo pc_dimm_info = {
+.name  = TYPE_PC_DIMM,
+.parent= TYPE_DIMM,
+.class_init= pc_dimm_class_init,
+};
+
+static void pc_dimm_register_types(void)
+{
+type_register_static(&pc_dimm_info);
+}
+
+type_init(pc_dimm_register_types)
diff --git a/include/hw/mem/dimm.h b/include/hw/mem/dimm.h
index ece8786..50f768a 100644
--- a/include/hw/mem/dimm.h
+++ b/include/hw/mem/dimm.h
@@ -1,5 +1,5 @@
 /*
- * PC DIMM device
+ * Dimm device abstraction
  *
  * Copyright ProfitBricks GmbH 2012
  * Copyright (C) 201

[PATCH v6 12/33] pc-dimm: remove DEFAULT_PC_DIMMSIZE

It's not used any more

Signed-off-by: Xiao Guangrong 
---
 include/hw/mem/pc-dimm.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index d83bf30..11a8937 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -20,8 +20,6 @@
 #include "sysemu/hostmem.h"
 #include "hw/qdev.h"
 
-#define DEFAULT_PC_DIMMSIZE (1024*1024*1024)
-
 #define TYPE_PC_DIMM "pc-dimm"
 #define PC_DIMM(obj) \
 OBJECT_CHECK(PCDIMMDevice, (obj), TYPE_PC_DIMM)
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 05/33] acpi: add aml_object_type

Implement ObjectType which is used by NVDIMM _DSM method in
later patch

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 8 
 include/hw/acpi/aml-build.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index efc06ab..9f792ab 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1178,6 +1178,14 @@ Aml *aml_concatenate(Aml *source1, Aml *source2, Aml 
*target)
 return var;
 }
 
+/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefObjectType */
+Aml *aml_object_type(Aml *object)
+{
+Aml *var = aml_opcode(0x8E /* ObjectTypeOp */);
+aml_append(var, object);
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 325782d..5b8a118 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -278,6 +278,7 @@ Aml *aml_derefof(Aml *arg);
 Aml *aml_sizeof(Aml *arg);
 Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
 Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target);
+Aml *aml_object_type(Aml *object);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 16/33] pc-dimm: rename pc-dimm.c and pc-dimm.h

Rename:
   pc-dimm.c => dimm.c
   pc-dimm.h => dimm.h

It prepares the work which abstracts dimm device type for both pc-dimm and
nvdimm

Signed-off-by: Xiao Guangrong 
---
 hw/Makefile.objs | 2 +-
 hw/acpi/ich9.c   | 2 +-
 hw/acpi/memory_hotplug.c | 4 ++--
 hw/acpi/piix4.c  | 2 +-
 hw/i386/pc.c | 2 +-
 hw/mem/Makefile.objs | 2 +-
 hw/mem/{pc-dimm.c => dimm.c} | 2 +-
 hw/ppc/spapr.c   | 2 +-
 include/hw/i386/pc.h | 2 +-
 include/hw/mem/{pc-dimm.h => dimm.h} | 0
 include/hw/ppc/spapr.h   | 2 +-
 numa.c   | 2 +-
 qmp.c| 2 +-
 stubs/qmp_dimm_device_list.c | 2 +-
 14 files changed, 14 insertions(+), 14 deletions(-)
 rename hw/mem/{pc-dimm.c => dimm.c} (99%)
 rename include/hw/mem/{pc-dimm.h => dimm.h} (100%)

diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index 7e7c241..12ecda9 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -30,8 +30,8 @@ devices-dirs-$(CONFIG_SOFTMMU) += vfio/
 devices-dirs-$(CONFIG_VIRTIO) += virtio/
 devices-dirs-$(CONFIG_SOFTMMU) += watchdog/
 devices-dirs-$(CONFIG_SOFTMMU) += xen/
-devices-dirs-$(CONFIG_MEM_HOTPLUG) += mem/
 devices-dirs-$(CONFIG_SMBIOS) += smbios/
+devices-dirs-y += mem/
 devices-dirs-y += core/
 common-obj-y += $(devices-dirs-y)
 obj-y += $(devices-dirs-y)
diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c
index b0d6a67..1e9ae20 100644
--- a/hw/acpi/ich9.c
+++ b/hw/acpi/ich9.c
@@ -35,7 +35,7 @@
 #include "exec/address-spaces.h"
 
 #include "hw/i386/ich9.h"
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/dimm.h"
 
 //#define DEBUG
 
diff --git a/hw/acpi/memory_hotplug.c b/hw/acpi/memory_hotplug.c
index e687852..20d3093 100644
--- a/hw/acpi/memory_hotplug.c
+++ b/hw/acpi/memory_hotplug.c
@@ -1,6 +1,6 @@
 #include "hw/acpi/memory_hotplug.h"
 #include "hw/acpi/pc-hotplug.h"
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/dimm.h"
 #include "hw/boards.h"
 #include "hw/qdev-core.h"
 #include "trace.h"
@@ -148,7 +148,7 @@ static void acpi_memory_hotplug_write(void *opaque, hwaddr 
addr, uint64_t data,
 
 dev = DEVICE(mdev->dimm);
 hotplug_ctrl = qdev_get_hotplug_handler(dev);
-/* call pc-dimm unplug cb */
+/* call dimm unplug cb */
 hotplug_handler_unplug(hotplug_ctrl, dev, &local_err);
 if (local_err) {
 trace_mhp_acpi_dimm_delete_failed(mem_st->selector);
diff --git a/hw/acpi/piix4.c b/hw/acpi/piix4.c
index 0b2cb6e..b2f5b2c 100644
--- a/hw/acpi/piix4.c
+++ b/hw/acpi/piix4.c
@@ -33,7 +33,7 @@
 #include "hw/acpi/pcihp.h"
 #include "hw/acpi/cpu_hotplug.h"
 #include "hw/hotplug.h"
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/dimm.h"
 #include "hw/acpi/memory_hotplug.h"
 #include "hw/acpi/acpi_dev_interface.h"
 #include "hw/xen/xen.h"
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 67ecc4f..6bf569a 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -62,7 +62,7 @@
 #include "hw/boards.h"
 #include "hw/pci/pci_host.h"
 #include "acpi-build.h"
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/dimm.h"
 #include "qapi/visitor.h"
 #include "qapi-visit.h"
 
diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
index b000fb4..7563ef5 100644
--- a/hw/mem/Makefile.objs
+++ b/hw/mem/Makefile.objs
@@ -1 +1 @@
-common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
+common-obj-$(CONFIG_MEM_HOTPLUG) += dimm.o
diff --git a/hw/mem/pc-dimm.c b/hw/mem/dimm.c
similarity index 99%
rename from hw/mem/pc-dimm.c
rename to hw/mem/dimm.c
index 67afc53..9f55cee 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/dimm.c
@@ -18,7 +18,7 @@
  * License along with this library; if not, see 
  */
 
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/dimm.h"
 #include "qemu/config-file.h"
 #include "qapi/visitor.h"
 #include "qemu/range.h"
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index ab6eb83..9ff24fd 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2199,7 +2199,7 @@ static void spapr_machine_device_plug(HotplugHandler 
*hotplug_dev,
  *
  * - Memory gets hotplugged to a different node than what the user
  *   specified.
- * - Since pc-dimm subsystem in QEMU still thinks that memory belongs
+ * - Since dimm subsystem in QEMU still thinks that memory belongs
  *   to memory-less node, a reboot will set things accordingly
  *   and the previously hotplugged memory now ends in the right node.
  *   This appears as if some memory moved from one node to another.
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 606dbc2..62e8fb5 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -16,7 +16,7 @@
 #include "hw/pci/pci.h"
 #include "hw/boards.h"
 #include "hw/compat.h"
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/dimm.h"
 
 #define HPET_INTCAP "hpet-intcap"
 
diff --git a/include/hw/mem/pc-dimm.h b

[PATCH v6 19/33] dimm: keep the state of the whole backend memory

QEMU keeps the state of memory of dimm device during live migration,
however, it is not enough for nvdimm device as its memory does not
contain its label data, so that we should protect the whole backend
memory instead

Signed-off-by: Xiao Guangrong 
---
 hw/mem/dimm.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 498d380..7d1 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -134,9 +134,16 @@ void dimm_memory_plug(DeviceState *dev, MemoryHotplugState 
*hpms,
 }
 
 memory_region_add_subregion(&hpms->mr, addr - hpms->base, mr);
-vmstate_register_ram(mr, dev);
 numa_set_mem_node_id(addr, memory_region_size(mr), dimm->node);
 
+/*
+ * save the state only for @mr is not enough as it does not contain
+ * the label data of NVDIMM device, so that we keep the state of
+ * whole hostmem instead.
+ */
+vmstate_register_ram(host_memory_backend_get_memory(dimm->hostmem, errp),
+ dev);
+
 out:
 error_propagate(errp, local_err);
 }
@@ -145,10 +152,13 @@ void dimm_memory_unplug(DeviceState *dev, 
MemoryHotplugState *hpms,
MemoryRegion *mr)
 {
 DIMMDevice *dimm = DIMM(dev);
+MemoryRegion *backend_mr;
+
+backend_mr = host_memory_backend_get_memory(dimm->hostmem, &error_abort);
 
 numa_unset_mem_node_id(dimm->addr, memory_region_size(mr), dimm->node);
 memory_region_del_subregion(&hpms->mr, mr);
-vmstate_unregister_ram(mr, dev);
+vmstate_unregister_ram(backend_mr, dev);
 }
 
 int qmp_dimm_device_list(Object *obj, void *opaque)
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 18/33] dimm: get mapped memory region from DIMMDeviceClass->get_memory_region

Curretly, the memory region of backed memory is directly mapped to
guest's address space, however, it is not true for nvdimm device

This patch let dimm device realize this fact and use
DIMMDeviceClass->get_memory_region method to get the mapped memory
region

Signed-off-by: Xiao Guangrong 
---
 hw/mem/dimm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 4a63409..498d380 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -377,8 +377,9 @@ static void dimm_get_size(Object *obj, Visitor *v, void 
*opaque,
 int64_t value;
 MemoryRegion *mr;
 DIMMDevice *dimm = DIMM(obj);
+DIMMDeviceClass *ddc = DIMM_GET_CLASS(obj);
 
-mr = host_memory_backend_get_memory(dimm->hostmem, errp);
+mr = ddc->get_memory_region(dimm);
 value = memory_region_size(mr);
 
 visit_type_int(v, &value, name, errp);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/6] KVM: PPC: Book3S HV: Send IPI to host core to wake VCPU

2015-10-29 Thread kbuild test robot

Hi Suresh,

[auto build test ERROR on kvm/linux-next -- if it's inappropriate base, please 
suggest rules for selecting the more suitable base]

url:
https://github.com/0day-ci/linux/commits/Suresh-Warrier/KVM-PPC-Book3S-HV-Optimize-wakeup-VCPU-from-H_IPI/20151030-081329
config: powerpc-defconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   arch/powerpc/kvm/book3s_hv_rm_xics.c: In function 'icp_rm_set_vcpu_irq':
>> arch/powerpc/kvm/book3s_hv_rm_xics.c:142:4: error: implicit declaration of 
>> function 'smp_muxed_ipi_rm_message_pass' 
>> [-Werror=implicit-function-declaration]
   smp_muxed_ipi_rm_message_pass(hcpu,
   ^
   arch/powerpc/kvm/book3s_hv_rm_xics.c:143:7: error: 'PPC_MSG_RM_HOST_ACTION' 
undeclared (first use in this function)
  PPC_MSG_RM_HOST_ACTION);
  ^
   arch/powerpc/kvm/book3s_hv_rm_xics.c:143:7: note: each undeclared identifier 
is reported only once for each function it appears in
   cc1: all warnings being treated as errors

vim +/smp_muxed_ipi_rm_message_pass +142 arch/powerpc/kvm/book3s_hv_rm_xics.c

   136  hcore = -1;
   137  if (kvmppc_host_rm_ops_hv)
   138  hcore = 
find_available_hostcore(XICS_RM_KICK_VCPU);
   139  if (hcore != -1) {
   140  hcpu = hcore << threads_shift;
   141  kvmppc_host_rm_ops_hv->rm_core[hcore].rm_data = 
vcpu;
 > 142  smp_muxed_ipi_rm_message_pass(hcpu,
   143  PPC_MSG_RM_HOST_ACTION);
   144  } else {
   145  this_icp->rm_action |= XICS_RM_KICK_VCPU;

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data

Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-29 Thread Lan Tianyu

On 2015年10月30日 00:17, Alexander Duyck wrote:
> On 10/29/2015 01:33 AM, Lan Tianyu wrote:
>> On 2015年10月29日 14:58, Alexander Duyck wrote:
>>> Your code was having to do a bunch of shuffling in order to get things
>>> set up so that you could bring the interface back up.  I would argue
>>> that it may actually be faster at least on the bring-up to just drop the
>>> old rings and start over since it greatly reduced the complexity and the
>>> amount of device related data that has to be moved.
>> If give up the old ring after migration and keep DMA running before
>> stopping VCPU, it seems we don't need to track Tx/Rx descriptor ring and
>> just make sure that all Rx buffers delivered to stack has been migrated.
>>
>> 1) Dummy write Rx buffer before checking Rx descriptor to ensure packet
>> migrated first.
> 
> Don't dummy write the Rx descriptor.  You should only really need to
> dummy write the Rx buffer and you would do so after checking the
> descriptor, not before.  Otherwise you risk corrupting the Rx buffer
> because it is possible for you to read the Rx buffer, DMA occurs, and
> then you write back the Rx buffer and now you have corrupted the memory.
> 
>> 2) Make a copy of Rx descriptor and then use the copied data to check
>> buffer status. Not use the original descriptor because it won't be
>> migrated and migration may happen between two access of the Rx
>> descriptor.
> 
> Do not just blindly copy the Rx descriptor ring.  That is a recipe for
> disaster.  The problem is DMA has to happen in a very specific order for
> things to function correctly.  The Rx buffer has to be written and then
> the Rx descriptor.  The problem is you will end up getting a read-ahead
> on the Rx descriptor ring regardless of which order you dirty things in.

Sorry, I didn't say clearly.
I meant to copy one Rx descriptor when receive rx irq and handle Rx ring.

Current code in the ixgbevf_clean_rx_irq() checks status of the Rx
descriptor whether its Rx buffer has been populated data and then read
the packet length from Rx descriptor to handle the Rx buffer.

My idea is to do the following three steps when receive Rx buffer in the
ixgbevf_clean_rx_irq().

(1) dummy write the Rx buffer first,
(2) make a copy of its Rx descriptor
(3) Check the buffer status and get length from the copy.

Migration may happen every time.
Happen between (1) and (2). If the Rx buffer has been populated data, VF
driver will not know that on the new machine because the Rx descriptor
isn't migrated. But it's still safe.

Happen between (2) and (3). The copy will be migrated to new machine
and Rx buffer is migrated firstly. If there is data in the Rx buffer,
VF driver still can handle the buffer without migrating Rx descriptor.

The next buffers will be ignored since we don't migrate Rx descriptor
for them. Their status will be not completed on the new machine.

-- 
Best regards
Tianyu Lan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-29 Thread Benjamin Herrenschmidt

On Thu, 2015-10-29 at 11:31 -0700, Andy Lutomirski wrote:
> On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt"
>  wrote:
> > 
> > On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> > > 
> > > > On Power, I generally have 2 IOMMU windows for a device, one at
> > > > the
> > > > bottom is remapped, and is generally used for 32-bit devices
> > > > and the
> > > > one at the top us setup as a bypass
> > > 
> > > So in the normal case of decent 64-bit devices (and not in a VM),
> > > they'll *already* be using the bypass region and have full access
> > > to
> > > all of memory, all of the time? And you have no protection
> > > against
> > > driver and firmware bugs causing stray DMA?
> > 
> > Correct, we chose to do that for performance reasons.
> 
> Could this be mitigated using pools?  I don't know if the net code
> would play along easily.

Possibly, the pools we have already limit the lock contention but we
still have the map/unmap overhead which under a hypervisor can be quite
high. I'm not necessarily against changing the way we do things but it
would have to be backed up with numbers.

Cheers,
Ben.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 0/6] virtio core DMA API conversion

On Thu, Oct 29, 2015 at 6:09 PM, Andy Lutomirski  wrote:
> This switches virtio to use the DMA API unconditionally.  I'm sure
> it breaks things, but it seems to work on x86 using virtio-pci, with
> and without Xen, and using both the modern 1.0 variant and the
> legacy variant.

...

> Andy Lutomirski (5):
>   virtio_ring: Support DMA APIs
>   virtio_pci: Use the DMA API
>   virtio: Add improved queue allocation API
>   virtio_mmio: Use the DMA API
>   virtio_pci: Use the DMA API

Ugh.  The two virtio_pci patches should be squashed together.  I'll do
that for v5, but I'm not going to send it until there's more feedback.

FWIW, I'm collecting this stuff here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=virtio_dma

That branch includes this series (with the squash) and the s390
patches.  I'll keep it up to date, since it seems silly to declare it
stable enough to stop rebasing yet.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 0/6] virtio core DMA API conversion

This switches virtio to use the DMA API unconditionally.  I'm sure
it breaks things, but it seems to work on x86 using virtio-pci, with
and without Xen, and using both the modern 1.0 variant and the
legacy variant.

This appears to work on native and Xen x86_64 using both modern and
legacy virtio-pci.  It also appears to work on arm and arm64.

It definitely won't work as-is on s390x, and I haven't been able to
test Christian's patches because I can't get virtio-ccw to work in
QEMU at all.  I don't know what I'm doing wrong.

It doesn't work on ppc64.  Ben, consider yourself pinged to send me
a patch :)

It doesn't work on sparc64.  I didn't realize at Kernel Summit that
sparc64 has the same problem as ppc64.

DaveM, for background, we're trying to fix virtio to use the DMA
API.  That will require that every platform that uses virtio
supplies valid DMA operations on devices that use virtio_ring.
Unfortunately, QEMU historically ignores the IOMMU on virtio
devices.

On x86, this isn't really a problem.  x86 has a nice way for the
platform to describe which devices are behind an IOMMU, and QEMU
will be adjusted accordingly.  The only thing that will break is a
recently-added experimental mode.

Ben's plan for powerpc is to add a quirk for existing virtio-pci
devices and to eventually update the devicetree stuff to allow QEMU
to tell the guest which devices use the IOMMU.

AFAICT sparc has a similar problem to powerpc.  DaveM, can you come
up with a straightforward way to get sparc's DMA API to work
correctly for virtio-pci devices?

NB: Sadly, the platforms I've successfully tested on don't include any
big-endian platforms, so there could still be lurking endian problems.

Changes from v3:
 - More big-endian fixes.
 - Added better virtio-ring APIs that handle allocation and use them in
   virtio-mmio and virtio-pci.
 - Switch to Michael's virtio-net patch.

Changes from v2:
 - Fix vring_mapping_error incorrect argument

Changes from v1:
 - Fix an endian conversion error causing a BUG to hit.
 - Fix a DMA ordering issue (swiotlb=force works now).
 - Minor cleanups.

Andy Lutomirski (5):
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API
  virtio: Add improved queue allocation API
  virtio_mmio: Use the DMA API
  virtio_pci: Use the DMA API

Michael S. Tsirkin (1):
  virtio-net: Stop doing DMA from the stack

 drivers/net/virtio_net.c   |  34 ++--
 drivers/virtio/Kconfig |   2 +-
 drivers/virtio/virtio_mmio.c   |  67 ++-
 drivers/virtio/virtio_pci_common.h |   6 -
 drivers/virtio/virtio_pci_legacy.c |  42 ++---
 drivers/virtio/virtio_pci_modern.c |  61 ++-
 drivers/virtio/virtio_ring.c   | 348 ++---
 include/linux/virtio.h |  23 ++-
 include/linux/virtio_ring.h|  35 
 tools/virtio/linux/dma-mapping.h   |  17 ++
 10 files changed, 426 insertions(+), 209 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 2/6] virtio_ring: Support DMA APIs

virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers.  This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case.  For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.

With this patch, if enabled, virtfs survives kmemleak and
CONFIG_DMA_API_DEBUG.

Signed-off-by: Andy Lutomirski 
---
 drivers/virtio/Kconfig   |   2 +-
 drivers/virtio/virtio_ring.c | 190 +++
 tools/virtio/linux/dma-mapping.h |  17 
 3 files changed, 172 insertions(+), 37 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index cab9f3f63a38..77590320d44c 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -60,7 +60,7 @@ config VIRTIO_INPUT
 
  config VIRTIO_MMIO
tristate "Platform bus driver for memory mapped virtio devices"
-   depends on HAS_IOMEM
+   depends on HAS_IOMEM && HAS_DMA
select VIRTIO
---help---
 This drivers provides support for memory mapped virtio
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 096b857e7b75..a872eb89587f 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -54,7 +55,14 @@
 #define END_USE(vq)
 #endif
 
-struct vring_virtqueue {
+struct vring_desc_state
+{
+   void *data; /* Data for callback. */
+   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+};
+
+struct vring_virtqueue
+{
struct virtqueue vq;
 
/* Actual memory layout for this queue */
@@ -92,12 +100,71 @@ struct vring_virtqueue {
ktime_t last_add_time;
 #endif
 
-   /* Tokens for callbacks. */
-   void *data[];
+   /* Per-descriptor state. */
+   struct vring_desc_state desc_state[];
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/*
+ * The DMA ops on various arches are rather gnarly right now, and
+ * making all of the arch DMA ops work on the vring device itself
+ * is a mess.  For now, we use the parent device for DMA ops.
+ */
+struct device *vring_dma_dev(const struct vring_virtqueue *vq)
+{
+   return vq->vq.vdev->dev.parent;
+}
+
+/* Map one sg entry. */
+static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
+  struct scatterlist *sg,
+  enum dma_data_direction direction)
+{
+   /*
+* We can't use dma_map_sg, because we don't use scatterlists in
+* the way it expects (we don't guarantee that the scatterlist
+* will exist for the lifetime of the mapping).
+*/
+   return dma_map_page(vring_dma_dev(vq),
+   sg_page(sg), sg->offset, sg->length,
+   direction);
+}
+
+static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
+  void *cpu_addr, size_t size,
+  enum dma_data_direction direction)
+{
+   return dma_map_single(vring_dma_dev(vq),
+ cpu_addr, size, direction);
+}
+
+static void vring_unmap_one(const struct vring_virtqueue *vq,
+   struct vring_desc *desc)
+{
+   u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
 static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 unsigned int total_sg, gfp_t gfp)
 {
@@ -131,7 +198,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
struct vring_virtqueue *vq =

[PATCH v4 4/6] virtio: Add improved queue allocation API

This leaves vring_new_virtqueue alone for compatbility, but it
adds two new improved APIs:

vring_create_virtqueue: Creates a virtqueue backed by automatically
allocated coherent memory.  (Some day it this could be extended to
support non-coherent memory, too, if there ends up being a platform
on which it's worthwhile.)

__vring_new_virtqueue: Creates a virtqueue with a manually-specified
layout.  This should allow mic_virtio to work much more cleanly.

Signed-off-by: Andy Lutomirski 
---
 drivers/virtio/virtio_ring.c | 164 +++
 include/linux/virtio.h   |  23 +-
 include/linux/virtio_ring.h  |  35 +
 3 files changed, 190 insertions(+), 32 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index a872eb89587f..f269e1cba00c 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -91,6 +91,11 @@ struct vring_virtqueue
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
 
+   /* DMA, allocation, and size information */
+   bool we_own_ring;
+   size_t queue_size_in_bytes;
+   dma_addr_t queue_dma_addr;
+
 #ifdef DEBUG
/* They're supposed to lock for us. */
unsigned int in_use;
@@ -821,36 +826,31 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
 }
 EXPORT_SYMBOL_GPL(vring_interrupt);
 
-struct virtqueue *vring_new_virtqueue(unsigned int index,
- unsigned int num,
- unsigned int vring_align,
- struct virtio_device *vdev,
- bool weak_barriers,
- void *pages,
- bool (*notify)(struct virtqueue *),
- void (*callback)(struct virtqueue *),
- const char *name)
+struct virtqueue *__vring_new_virtqueue(unsigned int index,
+   struct vring vring,
+   struct virtio_device *vdev,
+   bool weak_barriers,
+   bool (*notify)(struct virtqueue *),
+   void (*callback)(struct virtqueue *),
+   const char *name)
 {
-   struct vring_virtqueue *vq;
unsigned int i;
+   struct vring_virtqueue *vq;
 
-   /* We assume num is a power of 2. */
-   if (num & (num - 1)) {
-   dev_warn(&vdev->dev, "Bad virtqueue length %u\n", num);
-   return NULL;
-   }
-
-   vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state),
+   vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state),
 GFP_KERNEL);
if (!vq)
return NULL;
 
-   vring_init(&vq->vring, num, pages, vring_align);
+   vq->vring = vring;
vq->vq.callback = callback;
vq->vq.vdev = vdev;
vq->vq.name = name;
-   vq->vq.num_free = num;
+   vq->vq.num_free = vring.num;
vq->vq.index = index;
+   vq->we_own_ring = false;
+   vq->queue_dma_addr = 0;
+   vq->queue_size_in_bytes = 0;
vq->notify = notify;
vq->weak_barriers = weak_barriers;
vq->broken = false;
@@ -871,18 +871,105 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
 
/* Put everything in free lists. */
vq->free_head = 0;
-   for (i = 0; i < num-1; i++)
+   for (i = 0; i < vring.num-1; i++)
vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1);
-   memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state));
+   memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state));
 
return &vq->vq;
 }
+EXPORT_SYMBOL_GPL(__vring_new_virtqueue);
+
+struct virtqueue *vring_create_virtqueue(
+   unsigned int index,
+   unsigned int num,
+   unsigned int vring_align,
+   struct virtio_device *vdev,
+   bool weak_barriers,
+   bool may_reduce_num,
+   bool (*notify)(struct virtqueue *),
+   void (*callback)(struct virtqueue *),
+   const char *name)
+{
+   struct virtqueue *vq;
+   void *queue;
+   dma_addr_t dma_addr;
+   size_t queue_size_in_bytes;
+   struct vring vring;
+
+   /* We assume num is a power of 2. */
+   if (num & (num - 1)) {
+   dev_warn(&vdev->dev, "Bad virtqueue length %u\n", num);
+   return NULL;
+   }
+
+   /* TODO: allocate each queue chunk individually */
+   for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) {
+   queue = dma_zalloc_coherent(
+   vdev->dev.parent, vring_size(num, vring_align),
+   &dma_addr, GFP_KERNEL|__GFP_NOWARN);
+   if (queue)
+

[PATCH v4 3/6] virtio_pci: Use the DMA API

This fixes virtio-pci on platforms and busses that have IOMMUs.  This
will break the experimental QEMU Q35 IOMMU support until QEMU is
fixed.  In exchange, it fixes physical virtio hardware as well as
virtio-pci running under Xen.

We should clean up the virtqueue API to do its own allocation and
teach virtqueue_get_avail and virtqueue_get_used to return DMA
addresses directly.

Signed-off-by: Andy Lutomirski 
---
 drivers/virtio/virtio_pci_common.h |  3 ++-
 drivers/virtio/virtio_pci_legacy.c | 19 +++
 drivers/virtio/virtio_pci_modern.c | 34 --
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h 
b/drivers/virtio/virtio_pci_common.h
index b976d968e793..cd6196b513ad 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -38,8 +38,9 @@ struct virtio_pci_vq_info {
/* the number of entries in the queue */
int num;
 
-   /* the virtual address of the ring queue */
+   /* the ring queue */
void *queue;
+   dma_addr_t queue_dma_addr;  /* bus address */
 
/* the list node for the virtqueues list */
struct list_head node;
diff --git a/drivers/virtio/virtio_pci_legacy.c 
b/drivers/virtio/virtio_pci_legacy.c
index 48bc9797e530..b5293e5f2af4 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct 
virtio_pci_device *vp_dev,
info->msix_vector = msix_vec;
 
size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-   info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
+   info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size,
+ &info->queue_dma_addr,
+ GFP_KERNEL);
if (info->queue == NULL)
return ERR_PTR(-ENOMEM);
 
/* activate the queue */
-   iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+   iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
@@ -169,7 +171,8 @@ out_assign:
vring_del_virtqueue(vq);
 out_activate_queue:
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(&vp_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
return ERR_PTR(err);
 }
 
@@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info)
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(&vp_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
 }
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device 
*vp_dev)
return -ENODEV;
}
 
+   rc = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
+   if (rc)
+   rc = dma_set_mask_and_coherent(&pci_dev->dev,
+   DMA_BIT_MASK(32));
+   if (rc)
+   dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA. 
 Trying to continue, but this might not work.\n");
+
rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");
if (rc)
return rc;
diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index 8e5cf194cc0b..fbe0bd1c4881 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num)
return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES));
 }
 
-static void *alloc_virtqueue_pages(int *num)
+static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev,
+  int *num, dma_addr_t *dma_addr)
 {
void *pages;
 
/* TODO: allocate each queue chunk individually */
for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) {
-   pages = alloc_pages_exact(vring_pci_size(*num),
- GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN);
+   pages = dma_zalloc_coherent(
+   &vp_dev->pci_dev->dev, vring_pci_size(*num),
+   dma_addr, GFP_KERNEL|__GFP_NOWARN);
if (pages)
return pages;
}
@@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num)
return NULL;
 
/* Try to get a single page. You are my only hope! */
-   return alloc_pages_exact(vring_pci_size(*num), GFP_KERNEL|__GFP_ZERO);
+   return dma_zalloc_coherent(
+   &vp_dev->pci_dev->dev,

[PATCH v4 1/6] virtio-net: Stop doing DMA from the stack

From: "Michael S. Tsirkin" 

Once virtio starts using the DMA API, we won't be able to safely DMA
from the stack.  virtio-net does a couple of config DMA requests
from small stack buffers -- switch to using dynamically-allocated
memory.

This should have no effect on any performance-critical code paths.

[I wrote the subject and commit message.  mst wrote the code. --luto]

Signed-off-by: Andy Lutomirski 
signed-off-by: Michael S. Tsirkin 
---
 drivers/net/virtio_net.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d8838dedb7a4..f94ab786088f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -140,6 +140,12 @@ struct virtnet_info {
 
/* CPU hot plug notifier */
struct notifier_block nb;
+
+   /* Control VQ buffers: protected by the rtnl lock */
+   struct virtio_net_ctrl_hdr ctrl_hdr;
+   virtio_net_ctrl_ack ctrl_status;
+   u8 ctrl_promisc;
+   u8 ctrl_allmulti;
 };
 
 struct padded_vnet_hdr {
@@ -976,31 +982,30 @@ static bool virtnet_send_command(struct virtnet_info *vi, 
u8 class, u8 cmd,
 struct scatterlist *out)
 {
struct scatterlist *sgs[4], hdr, stat;
-   struct virtio_net_ctrl_hdr ctrl;
-   virtio_net_ctrl_ack status = ~0;
unsigned out_num = 0, tmp;
 
/* Caller should know better */
BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-   ctrl.class = class;
-   ctrl.cmd = cmd;
+   vi->ctrl_status = ~0;
+   vi->ctrl_hdr.class = class;
+   vi->ctrl_hdr.cmd = cmd;
/* Add header */
-   sg_init_one(&hdr, &ctrl, sizeof(ctrl));
+   sg_init_one(&hdr, &vi->ctrl_hdr, sizeof(vi->ctrl_hdr));
sgs[out_num++] = &hdr;
 
if (out)
sgs[out_num++] = out;
 
/* Add return status. */
-   sg_init_one(&stat, &status, sizeof(status));
+   sg_init_one(&stat, &vi->ctrl_status, sizeof(vi->ctrl_status));
sgs[out_num] = &stat;
 
BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
if (unlikely(!virtqueue_kick(vi->cvq)))
-   return status == VIRTIO_NET_OK;
+   return vi->ctrl_status == VIRTIO_NET_OK;
 
/* Spin for a response, the kick causes an ioport write, trapping
 * into the hypervisor, so the request should be handled immediately.
@@ -1009,7 +1014,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, 
u8 class, u8 cmd,
   !virtqueue_is_broken(vi->cvq))
cpu_relax();
 
-   return status == VIRTIO_NET_OK;
+   return vi->ctrl_status == VIRTIO_NET_OK;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1151,7 +1156,6 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg[2];
-   u8 promisc, allmulti;
struct virtio_net_ctrl_mac *mac_data;
struct netdev_hw_addr *ha;
int uc_count;
@@ -1163,22 +1167,22 @@ static void virtnet_set_rx_mode(struct net_device *dev)
if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
return;
 
-   promisc = ((dev->flags & IFF_PROMISC) != 0);
-   allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+   vi->ctrl_promisc = ((dev->flags & IFF_PROMISC) != 0);
+   vi->ctrl_allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
 
-   sg_init_one(sg, &promisc, sizeof(promisc));
+   sg_init_one(sg, &vi->ctrl_promisc, sizeof(vi->ctrl_promisc));
 
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_PROMISC, sg))
dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
-promisc ? "en" : "dis");
+vi->ctrl_promisc ? "en" : "dis");
 
-   sg_init_one(sg, &allmulti, sizeof(allmulti));
+   sg_init_one(sg, &vi->ctrl_allmulti, sizeof(vi->ctrl_allmulti));
 
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
-allmulti ? "en" : "dis");
+vi->ctrl_allmulti ? "en" : "dis");
 
uc_count = netdev_uc_count(dev);
mc_count = netdev_mc_count(dev);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v4 5/6] virtio_mmio: Use the DMA API

This switches to vring_create_virtqueue, simplifying the driver and
adding DMA API support.

Signed-off-by: Andy Lutomirski 
---
 drivers/virtio/virtio_mmio.c | 67 ++--
 1 file changed, 15 insertions(+), 52 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index f499d9da7237..2b9fab52a3cb 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -99,12 +99,6 @@ struct virtio_mmio_vq_info {
/* the actual virtqueue */
struct virtqueue *vq;
 
-   /* the number of entries in the queue */
-   unsigned int num;
-
-   /* the virtual address of the ring queue */
-   void *queue;
-
/* the list node for the virtqueues list */
struct list_head node;
 };
@@ -322,15 +316,13 @@ static void vm_del_vq(struct virtqueue *vq)
 {
struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vq->vdev);
struct virtio_mmio_vq_info *info = vq->priv;
-   unsigned long flags, size;
+   unsigned long flags;
unsigned int index = vq->index;
 
spin_lock_irqsave(&vm_dev->lock, flags);
list_del(&info->node);
spin_unlock_irqrestore(&vm_dev->lock, flags);
 
-   vring_del_virtqueue(vq);
-
/* Select and deactivate the queue */
writel(index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
if (vm_dev->version == 1) {
@@ -340,8 +332,8 @@ static void vm_del_vq(struct virtqueue *vq)
WARN_ON(readl(vm_dev->base + VIRTIO_MMIO_QUEUE_READY));
}
 
-   size = PAGE_ALIGN(vring_size(info->num, VIRTIO_MMIO_VRING_ALIGN));
-   free_pages_exact(info->queue, size);
+   vring_del_virtqueue(vq);
+
kfree(info);
 }
 
@@ -356,8 +348,6 @@ static void vm_del_vqs(struct virtio_device *vdev)
free_irq(platform_get_irq(vm_dev->pdev, 0), vm_dev);
 }
 
-
-
 static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned 
index,
  void (*callback)(struct virtqueue *vq),
  const char *name)
@@ -365,7 +355,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device 
*vdev, unsigned index,
struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
struct virtio_mmio_vq_info *info;
struct virtqueue *vq;
-   unsigned long flags, size;
+   unsigned long flags;
+   unsigned int num;
int err;
 
if (!name)
@@ -388,66 +379,40 @@ static struct virtqueue *vm_setup_vq(struct virtio_device 
*vdev, unsigned index,
goto error_kmalloc;
}
 
-   /* Allocate pages for the queue - start with a queue as big as
-* possible (limited by maximum size allowed by device), drop down
-* to a minimal size, just big enough to fit descriptor table
-* and two rings (which makes it "alignment_size * 2")
-*/
-   info->num = readl(vm_dev->base + VIRTIO_MMIO_QUEUE_NUM_MAX);
-
-   /* If the device reports a 0 entry queue, we won't be able to
-* use it to perform I/O, and vring_new_virtqueue() can't create
-* empty queues anyway, so don't bother to set up the device.
-*/
-   if (info->num == 0) {
+   num = readl(vm_dev->base + VIRTIO_MMIO_QUEUE_NUM_MAX);
+   if (num == 0) {
err = -ENOENT;
-   goto error_alloc_pages;
-   }
-
-   while (1) {
-   size = PAGE_ALIGN(vring_size(info->num,
-   VIRTIO_MMIO_VRING_ALIGN));
-   /* Did the last iter shrink the queue below minimum size? */
-   if (size < VIRTIO_MMIO_VRING_ALIGN * 2) {
-   err = -ENOMEM;
-   goto error_alloc_pages;
-   }
-
-   info->queue = alloc_pages_exact(size, GFP_KERNEL | __GFP_ZERO);
-   if (info->queue)
-   break;
-
-   info->num /= 2;
+   goto error_new_virtqueue;
}
 
/* Create the vring */
-   vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN, 
vdev,
-true, info->queue, vm_notify, callback, name);
+   vq = vring_create_virtqueue(index, num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+true, true, vm_notify, callback, name);
if (!vq) {
err = -ENOMEM;
goto error_new_virtqueue;
}
 
/* Activate the queue */
-   writel(info->num, vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
+   writel(virtqueue_get_vring_size(vq), vm_dev->base + 
VIRTIO_MMIO_QUEUE_NUM);
if (vm_dev->version == 1) {
writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN);
-   writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
+   writel(virtqueue_get_desc_addr(vq) >> PAGE_SHIFT,
vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
} else {
u6

[PATCH v4 6/6] virtio_pci: Use the DMA API

This switches to vring_create_virtqueue, simplifying the driver and
adding DMA API support.

Signed-off-by: Andy Lutomirski 
---
 drivers/virtio/virtio_pci_common.h |  7 -
 drivers/virtio/virtio_pci_legacy.c | 39 +++-
 drivers/virtio/virtio_pci_modern.c | 61 ++
 3 files changed, 19 insertions(+), 88 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h 
b/drivers/virtio/virtio_pci_common.h
index cd6196b513ad..1a3c689d1b9e 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -35,13 +35,6 @@ struct virtio_pci_vq_info {
/* the actual virtqueue */
struct virtqueue *vq;
 
-   /* the number of entries in the queue */
-   int num;
-
-   /* the ring queue */
-   void *queue;
-   dma_addr_t queue_dma_addr;  /* bus address */
-
/* the list node for the virtqueues list */
struct list_head node;
 
diff --git a/drivers/virtio/virtio_pci_legacy.c 
b/drivers/virtio/virtio_pci_legacy.c
index b5293e5f2af4..8c4e61783441 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -119,7 +119,6 @@ static struct virtqueue *setup_vq(struct virtio_pci_device 
*vp_dev,
  u16 msix_vec)
 {
struct virtqueue *vq;
-   unsigned long size;
u16 num;
int err;
 
@@ -131,29 +130,19 @@ static struct virtqueue *setup_vq(struct 
virtio_pci_device *vp_dev,
if (!num || ioread32(vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN))
return ERR_PTR(-ENOENT);
 
-   info->num = num;
info->msix_vector = msix_vec;
 
-   size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-   info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size,
- &info->queue_dma_addr,
- GFP_KERNEL);
-   if (info->queue == NULL)
+   /* create the vring */
+   vq = vring_create_virtqueue(index, num,
+   VIRTIO_PCI_VRING_ALIGN, &vp_dev->vdev,
+   true, false, vp_notify, callback, name);
+   if (!vq)
return ERR_PTR(-ENOMEM);
 
/* activate the queue */
-   iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+   iowrite32(virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
-   /* create the vring */
-   vq = vring_new_virtqueue(index, info->num,
-VIRTIO_PCI_VRING_ALIGN, &vp_dev->vdev,
-true, info->queue, vp_notify, callback, name);
-   if (!vq) {
-   err = -ENOMEM;
-   goto out_activate_queue;
-   }
-
vq->priv = (void __force *)vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY;
 
if (msix_vec != VIRTIO_MSI_NO_VECTOR) {
@@ -161,18 +150,15 @@ static struct virtqueue *setup_vq(struct 
virtio_pci_device *vp_dev,
msix_vec = ioread16(vp_dev->ioaddr + VIRTIO_MSI_QUEUE_VECTOR);
if (msix_vec == VIRTIO_MSI_NO_VECTOR) {
err = -EBUSY;
-   goto out_assign;
+   goto out_deactivate;
}
}
 
return vq;
 
-out_assign:
-   vring_del_virtqueue(vq);
-out_activate_queue:
+out_deactivate:
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-   dma_free_coherent(&vp_dev->pci_dev->dev, size,
- info->queue, info->queue_dma_addr);
+   vring_del_virtqueue(vq);
return ERR_PTR(err);
 }
 
@@ -180,7 +166,6 @@ static void del_vq(struct virtio_pci_vq_info *info)
 {
struct virtqueue *vq = info->vq;
struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
-   unsigned long size;
 
iowrite16(vq->index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
 
@@ -191,14 +176,10 @@ static void del_vq(struct virtio_pci_vq_info *info)
ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR);
}
 
-   vring_del_virtqueue(vq);
-
/* Select and deactivate the queue */
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
-   size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-   dma_free_coherent(&vp_dev->pci_dev->dev, size,
- info->queue, info->queue_dma_addr);
+   vring_del_virtqueue(vq);
 }
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index fbe0bd1c4881..50b0cd5a501e 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -287,35 +287,6 @@ static u16 vp_config_vector(struct virtio_pci_device 
*vp_dev, u16 vector)
return vp_ioread16(&vp_dev->common->msix_config);
 }
 
-static size_t vring_pci_size(u16 num)
-{
-   /* We only need a cacheline

[PATCH 3/6] KVM: PPC: Book3S HV: kvmppc_host_rm_ops - handle offlining CPUs

The kvmppc_host_rm_ops structure keeps track of which cores are
are in the host by maintaining a bitmask of active/runnable
online CPUs that have not entered the guest. This patch adds
support to manage the bitmask when a CPU is offlined or onlined
in the host.

Signed-off-by: Suresh Warrier 
---
 arch/powerpc/kvm/book3s_hv.c | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 7a62aaa..390af5b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3012,6 +3012,36 @@ static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu 
*vcpu)
 }
 
 #ifdef CONFIG_KVM_XICS
+static int kvmppc_cpu_notify(struct notifier_block *self, unsigned long action,
+   void *hcpu)
+{
+   unsigned long cpu = (long)hcpu;
+
+   switch (action) {
+   case CPU_UP_PREPARE:
+   case CPU_UP_PREPARE_FROZEN:
+   kvmppc_set_host_core(cpu);
+   break;
+
+#ifdef CONFIG_HOTPLUG_CPU
+   case CPU_DEAD:
+   case CPU_DEAD_FROZEN:
+   case CPU_UP_CANCELED:
+   case CPU_UP_CANCELED_FROZEN:
+   kvmppc_clear_host_core(cpu);
+   break;
+#endif
+   default:
+   break;
+   }
+
+   return NOTIFY_OK;
+}
+
+static struct notifier_block kvmppc_cpu_notifier = {
+   .notifier_call = kvmppc_cpu_notify,
+};
+
 /*
  * Allocate a per-core structure for managing state about which cores are
  * running in the host versus the guest and for exchanging data between
@@ -3045,6 +3075,8 @@ void kvmppc_alloc_host_rm_ops(void)
return;
}
 
+   get_online_cpus();
+
for (cpu = 0; cpu < nr_cpu_ids; cpu += threads_per_core) {
if (!cpu_online(cpu))
continue;
@@ -3063,14 +3095,21 @@ void kvmppc_alloc_host_rm_ops(void)
l_ops = (unsigned long) ops;
 
if (cmpxchg64((unsigned long *)&kvmppc_host_rm_ops_hv, 0, l_ops)) {
+   put_online_cpus();
kfree(ops->rm_core);
kfree(ops);
+   return;
}
+
+   register_cpu_notifier(&kvmppc_cpu_notifier);
+
+   put_online_cpus();
 }
 
 void kvmppc_free_host_rm_ops(void)
 {
if (kvmppc_host_rm_ops_hv) {
+   unregister_cpu_notifier(&kvmppc_cpu_notifier);
kfree(kvmppc_host_rm_ops_hv->rm_core);
kfree(kvmppc_host_rm_ops_hv);
kvmppc_host_rm_ops_hv = NULL;
-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 5/6] KVM: PPC: Book3S HV: Send IPI to host core to wake VCPU

This patch adds support to real-mode KVM to search for a core
running in the host partition and send it an IPI message with
VCPU to be woken. This avoids having to switch to the host
partition to complete an H_IPI hypercall when the VCPU which
is the target of the the H_IPI is not loaded (is not running
in the guest).

The patch also includes the support in the IPI handler running
in the host to do the wakeup by calling kvmppc_xics_ipi_action
for the PPC_MSG_RM_HOST_ACTION message.

When a guest is being destroyed, we need to ensure that there
are no pending IPIs waiting to wake up a VCPU before we free
the VCPUs of the guest. This is accomplished by:
- Forces a PPC_MSG_CALL_FUNCTION IPI to be completed by all CPUs
  before freeing any VCPUs in kvm_arch_destroy_vm()
- Any PPC_MSG_RM_HOST_ACTION messages must be executed first
  before any other PPC_MSG_CALL_FUNCTION messages

Signed-off-by: Suresh Warrier 
---
 arch/powerpc/kernel/smp.c| 11 +
 arch/powerpc/kvm/book3s_hv_rm_xics.c | 81 ++--
 arch/powerpc/kvm/powerpc.c   | 10 +
 3 files changed, 99 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8c07bfad..8958c2a 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -280,6 +280,17 @@ irqreturn_t smp_ipi_demux(void)
 
do {
all = xchg(&info->messages, 0);
+#if defined(CONFIG_KVM_XICS) && defined(CONFIG_KVM_BOOK3S_HV_POSSIBLE)
+   /*
+* Must check for PPC_MSG_RM_HOST_ACTION messages
+* before PPC_MSG_CALL_FUNCTION messages because when
+* a VM is destroyed, we call kick_all_cpus_sync()
+* to ensure that any pending PPC_MSG_RM_HOST_ACTION
+* messages have completed before we free any VCPUs.
+*/
+   if (all & IPI_MESSAGE(PPC_MSG_RM_HOST_ACTION))
+   kvmppc_xics_ipi_action();
+#endif
if (all & IPI_MESSAGE(PPC_MSG_CALL_FUNCTION))
generic_smp_call_function_interrupt();
if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c 
b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index 43ffbfe..b8d6403 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -51,11 +51,70 @@ static void ics_rm_check_resend(struct kvmppc_xics *xics,
 
 /* -- ICP routines -- */
 
+/*
+ * We start the search from our current CPU Id in the core map
+ * and go in a circle until we get back to our ID looking for a
+ * core that is running in host context and that hasn't already
+ * been targeted for another rm_host_ops.
+ *
+ * In the future, could consider using a fairer algorithm (one
+ * that distributes the IPIs better)
+ *
+ * Returns -1, if no CPU could be found in the host
+ * Else, returns a CPU Id which has been reserved for use
+ */
+static inline int grab_next_hostcore(int start,
+   struct kvmppc_host_rm_core *rm_core, int max, int action)
+{
+   bool success;
+   int core;
+   union kvmppc_rm_state old, new;
+
+   for (core = start + 1; core < max; core++)  {
+   old = new = READ_ONCE(rm_core[core].rm_state);
+
+   if (!old.in_host || old.rm_action)
+   continue;
+
+   /* Try to grab this host core if not taken already. */
+   new.rm_action = action;
+
+   success = cmpxchg64(&rm_core[core].rm_state.raw,
+   old.raw, new.raw) == old.raw;
+   if (success) {
+   /*
+* Make sure that the store to the rm_action is made
+* visible before we return to caller (and the
+* subsequent store to rm_data) to synchronize with
+* the IPI handler.
+*/
+   smp_wmb();
+   return core;
+   }
+   }
+
+   return -1;
+}
+
+static inline int find_available_hostcore(int action)
+{
+   int core;
+   int my_core = smp_processor_id() >> threads_shift;
+   struct kvmppc_host_rm_core *rm_core = kvmppc_host_rm_ops_hv->rm_core;
+
+   core = grab_next_hostcore(my_core, rm_core, cpu_nr_cores(), action);
+   if (core == -1)
+   core = grab_next_hostcore(core, rm_core, my_core, action);
+
+   return core;
+}
+
 static void icp_rm_set_vcpu_irq(struct kvm_vcpu *vcpu,
struct kvm_vcpu *this_vcpu)
 {
struct kvmppc_icp *this_icp = this_vcpu->arch.icp;
int cpu;
+   int hcore, hcpu;
 
/* Mark the target VCPU as having an interrupt pending */
vcpu->stat.queue_intr++;
@@ -67,11 +126,25 @@ static void icp_rm_set_vcpu_irq(struct kvm_vcpu *vcpu,
return;
}
 
-   /* Check if the co

[PATCH 1/6] KVM: PPC: Book3S HV: Host-side RM data structures

This patch defines the data structures to support the setting up
of host side operations while running in real mode in the guest,
and also the functions to allocate and free it.

The operations are for now limited to virtual XICS operations.
Currently, we have only defined one operation in the data
structure:
 - Wake up a VCPU sleeping in the host when it
   receives a virtual interrupt

The operations are assigned at the core level because PowerKVM
requires that the host run in SMT off mode. For each core,
we will need to manage its state atomically - where the state
is defined by:
1. Is the core running in the host?
2. Is there a Real Mode (RM) operation pending on the host?

Currently, core state is only managed at the whole-core level
even when the system is in split-core mode. This just limits
the number of free or "available" cores in the host to perform
any host-side operations.

The kvmppc_host_rm_core.rm_data allows any data to be passed by
KVM in real mode to the host core along with the operation to
be performed.

The kvmppc_host_rm_ops structure is allocated the very first time
a guest VM is started. Initial core state is also set - all online
cores are in the host. This structure is never deleted, not even
when there are no active guests. However, it needs to be freed
when the module is unloaded because the kvmppc_host_rm_ops_hv
can contain function pointers to kvm-hv.ko functions for the
different supported host operations.

Signed-off-by: Suresh Warrier 
---
 arch/powerpc/include/asm/kvm_ppc.h   | 31 
 arch/powerpc/kvm/book3s_hv.c | 70 
 arch/powerpc/kvm/book3s_hv_builtin.c |  3 ++
 3 files changed, 104 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index c6ef05b..47cd441 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -437,6 +437,8 @@ static inline int kvmppc_xics_enabled(struct kvm_vcpu *vcpu)
 {
return vcpu->arch.irq_type == KVMPPC_IRQ_XICS;
 }
+extern void kvmppc_alloc_host_rm_ops(void);
+extern void kvmppc_free_host_rm_ops(void);
 extern void kvmppc_xics_free_icp(struct kvm_vcpu *vcpu);
 extern int kvmppc_xics_create_icp(struct kvm_vcpu *vcpu, unsigned long server);
 extern int kvm_vm_ioctl_xics_irq(struct kvm *kvm, struct kvm_irq_level *args);
@@ -446,6 +448,8 @@ extern int kvmppc_xics_set_icp(struct kvm_vcpu *vcpu, u64 
icpval);
 extern int kvmppc_xics_connect_vcpu(struct kvm_device *dev,
struct kvm_vcpu *vcpu, u32 cpu);
 #else
+static inline void kvmppc_alloc_host_rm_ops(void) {};
+static inline void kvmppc_free_host_rm_ops(void) {};
 static inline int kvmppc_xics_enabled(struct kvm_vcpu *vcpu)
{ return 0; }
 static inline void kvmppc_xics_free_icp(struct kvm_vcpu *vcpu) { }
@@ -459,6 +463,33 @@ static inline int kvmppc_xics_hcall(struct kvm_vcpu *vcpu, 
u32 cmd)
{ return 0; }
 #endif
 
+/*
+ * Host-side operations we want to set up while running in real
+ * mode in the guest operating on the xics.
+ * Currently only VCPU wakeup is supported.
+ */
+
+union kvmppc_rm_state {
+   unsigned long raw;
+   struct {
+   u32 in_host;
+   u32 rm_action;
+   };
+};
+
+struct kvmppc_host_rm_core {
+   union kvmppc_rm_state rm_state;
+   void *rm_data;
+   char pad[112];
+};
+
+struct kvmppc_host_rm_ops {
+   struct kvmppc_host_rm_core  *rm_core;
+   void(*vcpu_kick)(struct kvm_vcpu *vcpu);
+};
+
+extern struct kvmppc_host_rm_ops *kvmppc_host_rm_ops_hv;
+
 static inline unsigned long kvmppc_get_epr(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_KVM_BOOKE_HV
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9c26c5a..9176e56 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2967,6 +2967,73 @@ static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu 
*vcpu)
goto out_srcu;
 }
 
+#ifdef CONFIG_KVM_XICS
+/*
+ * Allocate a per-core structure for managing state about which cores are
+ * running in the host versus the guest and for exchanging data between
+ * real mode KVM and CPU running in the host.
+ * This is only done for the first VM.
+ * The allocated structure stays even if all VMs have stopped.
+ * It is only freed when the kvm-hv module is unloaded.
+ * It's OK for this routine to fail, we just don't support host
+ * core operations like redirecting H_IPI wakeups.
+ */
+void kvmppc_alloc_host_rm_ops(void)
+{
+   struct kvmppc_host_rm_ops *ops;
+   unsigned long l_ops;
+   int cpu, core;
+   int size;
+
+   /* Not the first time here ? */
+   if (kvmppc_host_rm_ops_hv != NULL)
+   return;
+
+   ops = kzalloc(sizeof(struct kvmppc_host_rm_ops), GFP_KERNEL);
+   if (!ops)
+   return;
+
+   size = cpu_nr_cores() * sizeof(struct kvmppc_host_rm_core);
+   ops->rm_core = kzalloc(size, GFP_KERNEL

[PATCH 4/6] KVM: PPC: Book3S HV: Host side kick VCPU when poked by real-mode KVM

This patch adds the support for the kick VCPU operation for
kvmppc_host_rm_ops. The kvmppc_xics_ipi_action() function
provides the function to be invoked for a host side operation
when poked by the real mode KVM. This is initiated by KVM by
sending an IPI to any free host core.

KVM real mode must set the rm_action to XICS_RM_KICK_VCPU and
rm_data to point to the VCPU to be woken up before sending the IPI.
Note that we have allocated one kvmppc_host_rm_core structure
per core. The above values need to be set in the structure
corresponding to the core to which the IPI will be sent.

Signed-off-by: Suresh Warrier 
---
 arch/powerpc/include/asm/kvm_ppc.h   |  1 +
 arch/powerpc/kvm/book3s_hv.c |  2 ++
 arch/powerpc/kvm/book3s_hv_rm_xics.c | 36 
 3 files changed, 39 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 47cd441..1b93519 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -447,6 +447,7 @@ extern u64 kvmppc_xics_get_icp(struct kvm_vcpu *vcpu);
 extern int kvmppc_xics_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
 extern int kvmppc_xics_connect_vcpu(struct kvm_device *dev,
struct kvm_vcpu *vcpu, u32 cpu);
+extern void kvmppc_xics_ipi_action(void);
 #else
 static inline void kvmppc_alloc_host_rm_ops(void) {};
 static inline void kvmppc_free_host_rm_ops(void) {};
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 390af5b..80b1eb3 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3085,6 +3085,8 @@ void kvmppc_alloc_host_rm_ops(void)
ops->rm_core[core].rm_state.in_host = 1;
}
 
+   ops->vcpu_kick = kvmppc_fast_vcpu_kick_hv;
+
/*
 * Make the contents of the kvmppc_host_rm_ops structure visible
 * to other CPUs before we assign it to the global variable.
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c 
b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index 24f5807..43ffbfe 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "book3s_xics.h"
@@ -623,3 +624,38 @@ int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long 
xirr)
  bail:
return check_too_hard(xics, icp);
 }
+
+/*  --- Non-real mode XICS-related built-in routines ---  */
+
+/**
+ * Host Operations poked by RM KVM
+ */
+static void rm_host_ipi_action(int action, void *data)
+{
+   switch (action) {
+   case XICS_RM_KICK_VCPU:
+   kvmppc_host_rm_ops_hv->vcpu_kick(data);
+   break;
+   default:
+   WARN(1, "Unexpected rm_action=%d data=%p\n", action, data);
+   break;
+   }
+
+}
+
+void kvmppc_xics_ipi_action(void)
+{
+   int core;
+   unsigned int cpu = smp_processor_id();
+   struct kvmppc_host_rm_core *rm_corep;
+
+   core = cpu >> threads_shift;
+   rm_corep = &kvmppc_host_rm_ops_hv->rm_core[core];
+
+   if (rm_corep->rm_data) {
+   rm_host_ipi_action(rm_corep->rm_state.rm_action,
+   rm_corep->rm_data);
+   rm_corep->rm_data = NULL;
+   rm_corep->rm_state.rm_action = 0;
+   }
+}
-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/6] KVM: PPC: Book3S HV: Manage core host state

Update the core host state in kvmppc_host_rm_ops whenever
the primary thread of the core enters the guest or returns
back.

Signed-off-by: Suresh Warrier 
---
 arch/powerpc/kvm/book3s_hv.c | 44 
 1 file changed, 44 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9176e56..7a62aaa 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2261,6 +2261,46 @@ static void post_guest_process(struct kvmppc_vcore *vc, 
bool is_master)
 }
 
 /*
+ * Clear core from the list of active host cores as we are about to
+ * enter the guest. Only do this if it is the primary thread of the
+ * core (not if a subcore) that is entering the guest.
+ */
+static inline void kvmppc_clear_host_core(int cpu)
+{
+   int core;
+
+   if (!kvmppc_host_rm_ops_hv || cpu_thread_in_core(cpu))
+   return;
+   /*
+* Memory barrier can be omitted here as we will do a smp_wmb()
+* later in kvmppc_start_thread and we need ensure that state is
+* visible to other CPUs only after we enter guest.
+*/
+   core = cpu >> threads_shift;
+   kvmppc_host_rm_ops_hv->rm_core[core].rm_state.in_host = 0;
+}
+
+/*
+ * Advertise this core as an active host core since we exited the guest
+ * Only need to do this if it is the primary thread of the core that is
+ * exiting.
+ */
+static inline void kvmppc_set_host_core(int cpu)
+{
+   int core;
+
+   if (!kvmppc_host_rm_ops_hv || cpu_thread_in_core(cpu))
+   return;
+
+   /*
+* Memory barrier can be omitted here because we do a spin_unlock
+* immediately after this which provides the memory barrier.
+*/
+   core = cpu >> threads_shift;
+   kvmppc_host_rm_ops_hv->rm_core[core].rm_state.in_host = 1;
+}
+
+/*
  * Run a set of guest threads on a physical core.
  * Called with vc->lock held.
  */
@@ -2372,6 +2412,8 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore 
*vc)
}
}
 
+   kvmppc_clear_host_core(pcpu);
+
/* Start all the threads */
active = 0;
for (sub = 0; sub < core_info.n_subcores; ++sub) {
@@ -2468,6 +2510,8 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore 
*vc)
kvmppc_ipi_thread(pcpu + i);
}
 
+   kvmppc_set_host_core(pcpu);
+
spin_unlock(&vc->lock);
 
/* make sure updates to secondary vcpu structs are visible now */
-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 6/6] KVM: PPC: Book3S HV: Add tunable to control H_IPI redirection

Redirecting the wakeup of a VCPU from the H_IPI hypercall to
a core running in the host is usually a good idea, most workloads
seemed to benefit. However, in one heavily interrupt-driven SMT1
workload, some regression was observed. This patch adds a kvm_hv
module parameter called h_ipi_redirect to control this feature.

The default value for this tunable is 1 - that is enable the feature.

Signed-off-by: Suresh Warrier 
---
 arch/powerpc/include/asm/kvm_ppc.h   |  1 +
 arch/powerpc/kvm/book3s_hv.c | 11 +++
 arch/powerpc/kvm/book3s_hv_rm_xics.c |  5 -
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 1b93519..29d1442 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -448,6 +448,7 @@ extern int kvmppc_xics_set_icp(struct kvm_vcpu *vcpu, u64 
icpval);
 extern int kvmppc_xics_connect_vcpu(struct kvm_device *dev,
struct kvm_vcpu *vcpu, u32 cpu);
 extern void kvmppc_xics_ipi_action(void);
+extern int h_ipi_redirect;
 #else
 static inline void kvmppc_alloc_host_rm_ops(void) {};
 static inline void kvmppc_free_host_rm_ops(void) {};
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 80b1eb3..20b2598 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -81,6 +81,17 @@ static int target_smt_mode;
 module_param(target_smt_mode, int, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(target_smt_mode, "Target threads per core (0 = max)");
 
+#ifdef CONFIG_KVM_XICS
+static struct kernel_param_ops module_param_ops = {
+   .set = param_set_int,
+   .get = param_get_int,
+};
+
+module_param_cb(h_ipi_redirect, &module_param_ops, &h_ipi_redirect,
+   S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(h_ipi_redirect, "Redirect H_IPI wakeup to a free host core");
+#endif
+
 static void kvmppc_end_cede(struct kvm_vcpu *vcpu);
 static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
 
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c 
b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index b8d6403..b162f41 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -24,6 +24,9 @@
 
 #define DEBUG_PASSUP
 
+int h_ipi_redirect = 1;
+EXPORT_SYMBOL(h_ipi_redirect);
+
 static void icp_rm_deliver_irq(struct kvmppc_xics *xics, struct kvmppc_icp 
*icp,
u32 new_irq);
 
@@ -134,7 +137,7 @@ static void icp_rm_set_vcpu_irq(struct kvm_vcpu *vcpu,
cpu = vcpu->arch.thread_cpu;
if (cpu < 0 || cpu >= nr_cpu_ids) {
hcore = -1;
-   if (kvmppc_host_rm_ops_hv)
+   if (kvmppc_host_rm_ops_hv && h_ipi_redirect)
hcore = find_available_hostcore(XICS_RM_KICK_VCPU);
if (hcore != -1) {
hcpu = hcore << threads_shift;
-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/6] KVM: PPC: Book3S HV: Optimize wakeup VCPU from H_IPI

When the VCPU target of an H_IPI hypercall is not running
in the guest, we need to do a kick VCPU (wake the VCPU thread)
to make it runnable. The real-mode version of the H_IPI hypercall
cannot do this because it involves waking a sleeping thread.
Thus the hcall returns H_TOO_HARD which forces a switch back
to host so that the H_IPI call can be completed in virtual mode.
This has been found to cause a slowdown for many workloads like
YCSB MongoDB, small message networking, etc. 

This patch set optimizes the wakeup of the target VCPU by posting
a free core already running in the host to do the wakeup, thus
avoiding the switch to host and back. It requires maintaining a
bitmask of all the available cores in the system to indicate if
they are in the host or running in some guest. It also requires
the H_IPI hypercall to search for a free host core and send it a
new IPI message PPC_MSG_RM_HOST_ACTION after stashing away some
parameters like the pointer to VCPU for the IPI handler. Locks
are avoided by using atomic operations to save core state, to
find and reserve a core in the host, etc.

Note that it is possible for a guest to be destroyed and its
VCPUs freed before the IPI handler gets to run. This case is
handled by ensuring that any pending PPC_MSG_RM_HOST_ACTION
IPIs are completed before proceeding with freeing the VCPUs.

A tunable h_ipi_redirect is also included in the patch set to
disable the feature. 

This patch set depends upon patches to powerpc to increase the
number of supported IPI messages to 8 and which also defines
the PPC_MSG_RM_HOST_ACTION message.

Suresh Warrier (6):
  KVM: PPC: Book3S HV: Host-side RM data structures
  KVM: PPC: Book3S HV: Manage core host state
  KVM: PPC: Book3S HV: kvmppc_host_rm_ops - handle offlining CPUs
  KVM: PPC: Book3S HV: Host side kick VCPU when poked by real-mode KVM
  KVM: PPC: Book3S HV: Send IPI to host core to wake VCPU
  KVM: PPC: Book3S HV: Add tunable to control H_IPI redirection

 arch/powerpc/include/asm/kvm_ppc.h   |  33 +++
 arch/powerpc/kernel/smp.c|  11 +++
 arch/powerpc/kvm/book3s_hv.c | 166 +++
 arch/powerpc/kvm/book3s_hv_builtin.c |   3 +
 arch/powerpc/kvm/book3s_hv_rm_xics.c | 120 -
 arch/powerpc/kvm/powerpc.c   |  10 +++
 6 files changed, 340 insertions(+), 3 deletions(-)

-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 1/3] virtio_net: Stop doing DMA from the stack

On Wed, Oct 28, 2015 at 12:07 AM, Michael S. Tsirkin  wrote:
> How about this instead? Less code, more robust.
>
> Warning: untested.  If you do like this approach, Tested-by would be
> appreciated.

I like it.

Tested-by: Andy Lutomirski 

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH/RFC 0/4] dma ops and virtio

On Wed, Oct 28, 2015 at 5:04 PM, Christian Borntraeger
 wrote:
> Am 29.10.2015 um 07:22 schrieb Andy Lutomirski:
>> On Tue, Oct 27, 2015 at 3:48 PM, Christian Borntraeger
>>  wrote:
>>> This is an RFC to check if I am on the right track.  There
>>> are some attempts to unify the dma ops (Christoph) as well
>>> as some attempts to make virtio use the dma API (Andy).
>>>
>>> At todays discussion at the kernel summit, we concluded that
>>> we want to use the same code on all platforms, whereever
>>> possible, so having a dummy dma_op might be the easiest
>>> solution to keep virtio-ccw as similar as possible to
>>> virtio-pci. Andy Lutomirski will rework his patchset to
>>> unconditionally use the dma ops.  We will also need a
>>> compatibility quirk for powerpc to bypass the iommu mappings
>>> on older QEMU version (which translates to all versions as
>>> of today) and per device, e.g.  via device tree.  Ben
>>> Herrenschmidt will look into that.
>>
>> The combination of these patches plus my series don't link for me
>> unless I enable PCI.  Would s390 need to select HAS_DMA from VIRTIO or
>> something similar?
>
> Well, actually this is a potential improvement for series. I could just
> make the noop dma ops default for _all_ devices unless it has a per
> device dma_ops (e.g. s390 pci) and the unconditionally set HAS_DMA.
>>
>> Also, is it possible to use QEMU to emulate an s390x?  Even just:
>>
>> qemu-system-s390x -M s390-ccw-virtio
>
> Yes, we have no interactive bios and if no boot device is there is bios
> will load a disabled wait, which will exit qemu.
>
> Make sure to compile your kernel for z900  (processor type) as qemu does not
> handle all things of newer processors.
> You can then do
> qemu-system-s390x -nographic -m 256 -kernel vmlinux -initrd 
>

Progress!  After getting that sort-of-working, I figured out what was
wrong with my earlier command, and I got that working, too.  Now I
get:

qemu-system-s390x -fsdev
local,id=virtfs1,path=/,security_model=none,readonly -device
virtio-9p-ccw,fsdev=virtfs1,mount_tag=/dev/root -M s390-ccw-virtio
-nodefaults -device sclpconsole,chardev=console -parallel none -net
none -echr 1 -serial none -chardev stdio,id=console,signal=off,mux=on
-serial chardev:console -mon chardev=console -vga none -display none
-kernel arch/s390/boot/bzImage -append
'init=/home/luto/devel/virtme/virtme/guest/virtme-init
psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8"
TERM=xterm-256color rootfstype=9p
rootflags=ro,version=9p2000.L,trans=virtio,access=any
raid=noautodetect debug'
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Initializing cgroup subsys cpuacct
Linux version 4.3.0-rc7-00403-ga2b5cd810259-dirty
(l...@amaluto.corp.amacapital.net) (gcc version 5.1.1 20150618 (Red
Hat Cross 5.1.1-3) (GCC) ) #328 SMP Thu Oct 29 15:46:05 PDT 2015
setup: Linux is running under KVM in 64-bit mode
setup: Max memory size: 128MB
Zone ranges:
  DMA  [mem 0x-0x7fff]
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x-0x07ff]
Initmem setup node 0 [mem 0x-0x07ff]
On node 0 totalpages: 32768
  DMA zone: 512 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 32768 pages, LIFO batch:7
PERCPU: Embedded 466 pages/cpu @07605000 s1868032 r8192 d32512 u1908736
pcpu-alloc: s1868032 r8192 d32512 u1908736 alloc=466*4096
pcpu-alloc: [0] 0 [0] 1
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 32256
Kernel command line:
init=/home/luto/devel/virtme/virtme/guest/virtme-init
psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8"
TERM=xterm-256color rootfstype=9p
rootflags=ro,version=9p2000.L,trans=virtio,access=any
raid=noautodetect debug
PID hash table entries: 512 (order: 0, 4096 bytes)
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Memory: 92552K/131072K available (8229K kernel code, 798K rwdata,
3856K rodata, 2384K init, 14382K bss, 38520K reserved, 0K
cma-reserved)
Write protected kernel read-only data: 0x10 - 0xccdfff
SLUB: HWalign=256, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
Running RCU self tests
Hierarchical RCU implementation.
RCU debugfs-based tracing is enabled.
RCU lockdep checking is enabled.
Build-time adjustment of leaf fanout to 64.
RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=2.
RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=2
NR_IRQS:3
clocksource: tod: mask: 0x max_cycles:
0x3b0a9be803b0a9, max_idle_ns: 1805497147909793 ns
console [ttyS1] enabled
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES:  8
... MAX_LOCK_DEPTH:  48
... MAX_LOCKDEP_KEYS:8191
... CLASSHASH_SIZE:  4096
... MAX_LOCKDEP_ENTRIES: 32768
... MAX_LOCKDEP_CHAINS:  65536
... CHAINHASH_SIZE:  32768
 memory us

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-29 Thread David Woodhouse

On Thu, 2015-10-29 at 11:31 -0700, Andy Lutomirski wrote:
> On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt"
>  wrote:
> > 
> > On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> > > 
> > > > On Power, I generally have 2 IOMMU windows for a device, one at
> > > > the
> > > > bottom is remapped, and is generally used for 32-bit devices
> > > > and the
> > > > one at the top us setup as a bypass
> > > 
> > > So in the normal case of decent 64-bit devices (and not in a VM),
> > > they'll *already* be using the bypass region and have full access
> > > to
> > > all of memory, all of the time? And you have no protection
> > > against
> > > driver and firmware bugs causing stray DMA?
> > 
> > Correct, we chose to do that for performance reasons.
> 
> Could this be mitigated using pools?  I don't know if the net code
> would play along easily.

For the receive side, it shouldn't be beyond the wit of man to
introduce an API which allocates *and* DMA-maps a skb. Pass it to
netif_rx() still mapped, with a destructor that just shoves it back in
a pool for re-use.

Doing it for transmit might be a little more complex, but perhaps still
possible.

-- 
dwmw2



smime.p7s
Description: S/MIME cryptographic signature

Re: [RFC PATCH] vfio/type1: Do not support IOMMUs that allow bypass

2015-10-29 Thread Will Deacon

On Thu, Oct 29, 2015 at 06:42:10PM +, Robin Murphy wrote:
> On 29/10/15 18:28, Will Deacon wrote:
> >On Tue, Oct 27, 2015 at 10:00:11AM -0600, Alex Williamson wrote:
> >>On Tue, 2015-10-27 at 15:40 +, Will Deacon wrote:
> >>>On Fri, Oct 16, 2015 at 09:51:22AM -0600, Alex Williamson wrote:
> Would it be possible to add iommu_domain_geometry support to arm-smmu.c?
> In addition to this test to verify that DMA cannot bypass the IOMMU, I'd
> eventually like to pass the aperture information out through the VFIO
> API.  Thanks,
> >>>
> >>>The slight snag here is that we don't know which SMMU in the system the
> >>>domain is attached to at the point when the geometry is queried, so I
> >>>can't give you an upper bound on the aperture. For example, if there is
> >>>an SMMU with a 32-bit input address and another with a 48-bit input
> >>>address.
> >>>
> >>>We could play the same horrible games that we do with the pgsize bitmap,
> >>>and truncate some global aperture everytime we probe an SMMU device, but
> >>>I'd really like to have fewer hacks like that if possible. The root of
> >>>the problem is still that domains are allocated for a bus, rather than
> >>>an IOMMU instance.
> >>
> >>Yes, Intel VT-d has this issue as well.  In theory we can have
> >>heterogeneous IOMMU hardware units (DRHDs) in a system and the upper
> >>bound of the geometry could be diminished if we add a less capable DRHD
> >>into the domain.  I suspect this is more a theoretical problem than a
> >>practical one though as we're typically mixing similar DRHDs and I think
> >>we're still capable of 39-bit addressing in the least capable version
> >>per the spec.
> >>
> >>In any case, I really want to start testing geometry.force_aperture,
> >>even if we're not yet comfortable to expose the IOMMU limits to the
> >>user.  The vfio type1 shouldn't be enabled at all for underlying
> >>hardware that allows DMA bypass.  Thanks,
> >
> >Ok, I'll put it on my list of things to look at under the assumption that
> >the actual aperture limits don't need to be accurate as long as DMA to
> >an arbitrary unmapped address always faults.
> 
> I'm pretty sure we'd only ever set the aperture to the full input address
> range anyway (since we're not a GART-type thing), in which case we should
> only need to worry about unmatched streams that don't hit in a domain at
> all. Doesn't the disable_bypass option already cover that? (FWIW I hacked it
> up for v2 a while back, too[0]).

Well, the "full input address range" is tricky when you have multiple SMMU
instances with different input address sizes. I can do something similar
to the pgsize_bitmap.

I also don't think the disable_bypass option is what we're after -- this
is about devices attached to a VFIO domain that can still mysteriously
bypass the SMMU for some ranges AFAIU (and shouldn't be an issue for ARM).

Will
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] vfio/type1: Do not support IOMMUs that allow bypass

2015-10-29 Thread Robin Murphy

On 29/10/15 18:28, Will Deacon wrote:

On Tue, Oct 27, 2015 at 10:00:11AM -0600, Alex Williamson wrote:

On Tue, 2015-10-27 at 15:40 +, Will Deacon wrote:

On Fri, Oct 16, 2015 at 09:51:22AM -0600, Alex Williamson wrote:

Would it be possible to add iommu_domain_geometry support to arm-smmu.c?
In addition to this test to verify that DMA cannot bypass the IOMMU, I'd
eventually like to pass the aperture information out through the VFIO
API. Thanks,

The slight snag here is that we don't know which SMMU in the system the
domain is attached to at the point when the geometry is queried, so I
can't give you an upper bound on the aperture. For example, if there is
an SMMU with a 32-bit input address and another with a 48-bit input
address.

We could play the same horrible games that we do with the pgsize bitmap,
and truncate some global aperture everytime we probe an SMMU device, but
I'd really like to have fewer hacks like that if possible. The root of
the problem is still that domains are allocated for a bus, rather than
an IOMMU instance.

Yes, Intel VT-d has this issue as well. In theory we can have
heterogeneous IOMMU hardware units (DRHDs) in a system and the upper
bound of the geometry could be diminished if we add a less capable DRHD
into the domain. I suspect this is more a theoretical problem than a
practical one though as we're typically mixing similar DRHDs and I think
we're still capable of 39-bit addressing in the least capable version
per the spec.

In any case, I really want to start testing geometry.force_aperture,
even if we're not yet comfortable to expose the IOMMU limits to the
user. The vfio type1 shouldn't be enabled at all for underlying
hardware that allows DMA bypass. Thanks,

Ok, I'll put it on my list of things to look at under the assumption that
the actual aperture limits don't need to be accurate as long as DMA to
an arbitrary unmapped address always faults.

I'm pretty sure we'd only ever set the aperture to the full input
address range anyway (since we're not a GART-type thing), in which case
we should only need to worry about unmatched streams that don't hit in a
domain at all. Doesn't the disable_bypass option already cover that?
(FWIW I hacked it up for v2 a while back, too[0]).

Robin.

[0]:http://www.linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=23a251189fa3330b799a837bd8eb1023aa2dcea4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt"
 wrote:
>
> On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> >
> > > On Power, I generally have 2 IOMMU windows for a device, one at the
> > > bottom is remapped, and is generally used for 32-bit devices and the
> > > one at the top us setup as a bypass
> >
> > So in the normal case of decent 64-bit devices (and not in a VM),
> > they'll *already* be using the bypass region and have full access to
> > all of memory, all of the time? And you have no protection against
> > driver and firmware bugs causing stray DMA?
>
> Correct, we chose to do that for performance reasons.

Could this be mitigated using pools?  I don't know if the net code
would play along easily.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] vfio/type1: Do not support IOMMUs that allow bypass

2015-10-29 Thread Will Deacon

On Tue, Oct 27, 2015 at 10:00:11AM -0600, Alex Williamson wrote:
> On Tue, 2015-10-27 at 15:40 +, Will Deacon wrote:
> > On Fri, Oct 16, 2015 at 09:51:22AM -0600, Alex Williamson wrote:
> > > Would it be possible to add iommu_domain_geometry support to arm-smmu.c?
> > > In addition to this test to verify that DMA cannot bypass the IOMMU, I'd
> > > eventually like to pass the aperture information out through the VFIO
> > > API.  Thanks,
> > 
> > The slight snag here is that we don't know which SMMU in the system the
> > domain is attached to at the point when the geometry is queried, so I
> > can't give you an upper bound on the aperture. For example, if there is
> > an SMMU with a 32-bit input address and another with a 48-bit input
> > address.
> > 
> > We could play the same horrible games that we do with the pgsize bitmap,
> > and truncate some global aperture everytime we probe an SMMU device, but
> > I'd really like to have fewer hacks like that if possible. The root of
> > the problem is still that domains are allocated for a bus, rather than
> > an IOMMU instance.
> 
> Yes, Intel VT-d has this issue as well.  In theory we can have
> heterogeneous IOMMU hardware units (DRHDs) in a system and the upper
> bound of the geometry could be diminished if we add a less capable DRHD
> into the domain.  I suspect this is more a theoretical problem than a
> practical one though as we're typically mixing similar DRHDs and I think
> we're still capable of 39-bit addressing in the least capable version
> per the spec.
> 
> In any case, I really want to start testing geometry.force_aperture,
> even if we're not yet comfortable to expose the IOMMU limits to the
> user.  The vfio type1 shouldn't be enabled at all for underlying
> hardware that allows DMA bypass.  Thanks,

Ok, I'll put it on my list of things to look at under the assumption that
the actual aperture limits don't need to be accurate as long as DMA to
an arbitrary unmapped address always faults.

Will
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-29 Thread Eric Auger

Current vfio_pgsize_bitmap code hides the supported IOMMU page
sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
does not support PAGE_SIZE page, the alignment check on map/unmap
is done with larger page sizes, if any. This can fail although
mapping could be done with pages smaller than PAGE_SIZE.

This patch modifies vfio_pgsize_bitmap implementation so that,
in case the IOMMU supports page sizes smaller than PAGE_SIZE
we pretend PAGE_SIZE is supported and hide sub-PAGE_SIZE sizes.
That way the user will be able to map/unmap buffers whose size/
start address is aligned with PAGE_SIZE. Pinning code uses that
granularity while iommu driver can use the sub-PAGE_SIZE size
to map the buffer.

Signed-off-by: Eric Auger 
Signed-off-by: Alex Williamson 
Acked-by: Will Deacon 

---

This was tested on AMD Seattle with 64kB page host. ARM MMU 401
currently expose 4kB, 2MB and 1GB page support. With a 64kB page host,
the map/unmap check is done against 2MB. Some alignment check fail
so VFIO_IOMMU_MAP_DMA fail while we could map using 4kB IOMMU page
size.

v1 -> v2:
- correct PAGE_HOST type in comment and commit msg
- Add Will's R-b

RFC -> PATCH v1:
- move all modifications in vfio_pgsize_bitmap following Alex'
  suggestion to expose a fake PAGE_SIZE support
- restore WARN_ON's
---
 drivers/vfio/vfio_iommu_type1.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 57d8c37..59d47cb 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -403,13 +403,26 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
struct vfio_dma *dma)
 static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
 {
struct vfio_domain *domain;
-   unsigned long bitmap = PAGE_MASK;
+   unsigned long bitmap = ULONG_MAX;
 
mutex_lock(&iommu->lock);
list_for_each_entry(domain, &iommu->domain_list, next)
bitmap &= domain->domain->ops->pgsize_bitmap;
mutex_unlock(&iommu->lock);
 
+   /*
+* In case the IOMMU supports page sizes smaller than PAGE_SIZE
+* we pretend PAGE_SIZE is supported and hide sub-PAGE_SIZE sizes.
+* That way the user will be able to map/unmap buffers whose size/
+* start address is aligned with PAGE_SIZE. Pinning code uses that
+* granularity while iommu driver can use the sub-PAGE_SIZE size
+* to map the buffer.
+*/
+   if (bitmap & ~PAGE_MASK) {
+   bitmap &= PAGE_MASK;
+   bitmap |= PAGE_SIZE;
+   }
+
return bitmap;
 }
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-29 Thread Will Deacon

On Thu, Oct 29, 2015 at 01:59:45PM +, Eric Auger wrote:
> Current vfio_pgsize_bitmap code hides the supported IOMMU page
> sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
> does not support PAGE_SIZE page, the alignment check on map/unmap
> is done with larger page sizes, if any. This can fail although
> mapping could be done with pages smaller than PAGE_SIZE.
> 
> This patch modifies vfio_pgsize_bitmap implementation so that,
> in case the IOMMU supports page sizes smaller than PAGE_HOST
> we pretend PAGE_HOST is supported and hide sub-PAGE_HOST sizes.
> That way the user will be able to map/unmap buffers whose size/
> start address is aligned with PAGE_HOST. Pinning code uses that
> granularity while iommu driver can use the sub-PAGE_HOST size
> to map the buffer.
> 
> Signed-off-by: Eric Auger 
> Signed-off-by: Alex Williamson 
> 
> ---
> 
> This was tested on AMD Seattle with 64kB page host. ARM MMU 401
> currently expose 4kB, 2MB and 1GB page support. With a 64kB page host,
> the map/unmap check is done against 2MB. Some alignment check fail
> so VFIO_IOMMU_MAP_DMA fail while we could map using 4kB IOMMU page
> size.
> 
> RFC -> PATCH v1:
> - move all modifications in vfio_pgsize_bitmap following Alex'
>   suggestion to expose a fake PAGE_HOST support
> - restore WARN_ON's
> ---
>  drivers/vfio/vfio_iommu_type1.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 57d8c37..cee504a 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -403,13 +403,26 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
> struct vfio_dma *dma)
>  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  {
>   struct vfio_domain *domain;
> - unsigned long bitmap = PAGE_MASK;
> + unsigned long bitmap = ULONG_MAX;
>  
>   mutex_lock(&iommu->lock);
>   list_for_each_entry(domain, &iommu->domain_list, next)
>   bitmap &= domain->domain->ops->pgsize_bitmap;
>   mutex_unlock(&iommu->lock);
>  
> + /*
> +  * In case the IOMMU supports page sizes smaller than PAGE_HOST
> +  * we pretend PAGE_HOST is supported and hide sub-PAGE_HOST sizes.
> +  * That way the user will be able to map/unmap buffers whose size/
> +  * start address is aligned with PAGE_HOST. Pinning code uses that
> +  * granularity while iommu driver can use the sub-PAGE_HOST size
> +  * to map the buffer.
> +  */
> + if (bitmap & ~PAGE_MASK) {
> + bitmap &= PAGE_MASK;
> + bitmap |= PAGE_SIZE;
> + }
> +

s/PAGE_HOST/PAGE_SIZE/g (in the cover-letter too) and then I think this
looks good:

  Acked-by: Will Deacon 

Will
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] vfio/platform: store mapped memory in region, instead of an on-stack copy

2015-10-29 Thread James Morse

vfio_platform_{read,write}_mmio() call ioremap_nocache() to map
a region of io memory, which they store in struct vfio_platform_region to
be eventually re-used, or unmapped by vfio_platform_regions_cleanup().

These functions receive a copy of their struct vfio_platform_region
argument on the stack - so these mapped areas are always allocated, and
always leaked.

Pass this argument as a pointer instead.

Fixes: 6e3f26456009 "vfio/platform: read and write support for the device fd"
Signed-off-by: James Morse 
---
 drivers/vfio/platform/vfio_platform_common.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index f3b6299..ccf5da5 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -308,17 +308,17 @@ static long vfio_platform_ioctl(void *device_data,
return -ENOTTY;
 }
 
-static ssize_t vfio_platform_read_mmio(struct vfio_platform_region reg,
+static ssize_t vfio_platform_read_mmio(struct vfio_platform_region *reg,
   char __user *buf, size_t count,
   loff_t off)
 {
unsigned int done = 0;
 
-   if (!reg.ioaddr) {
-   reg.ioaddr =
-   ioremap_nocache(reg.addr, reg.size);
+   if (!reg->ioaddr) {
+   reg->ioaddr =
+   ioremap_nocache(reg->addr, reg->size);
 
-   if (!reg.ioaddr)
+   if (!reg->ioaddr)
return -ENOMEM;
}
 
@@ -328,7 +328,7 @@ static ssize_t vfio_platform_read_mmio(struct 
vfio_platform_region reg,
if (count >= 4 && !(off % 4)) {
u32 val;
 
-   val = ioread32(reg.ioaddr + off);
+   val = ioread32(reg->ioaddr + off);
if (copy_to_user(buf, &val, 4))
goto err;
 
@@ -336,7 +336,7 @@ static ssize_t vfio_platform_read_mmio(struct 
vfio_platform_region reg,
} else if (count >= 2 && !(off % 2)) {
u16 val;
 
-   val = ioread16(reg.ioaddr + off);
+   val = ioread16(reg->ioaddr + off);
if (copy_to_user(buf, &val, 2))
goto err;
 
@@ -344,7 +344,7 @@ static ssize_t vfio_platform_read_mmio(struct 
vfio_platform_region reg,
} else {
u8 val;
 
-   val = ioread8(reg.ioaddr + off);
+   val = ioread8(reg->ioaddr + off);
if (copy_to_user(buf, &val, 1))
goto err;
 
@@ -377,7 +377,7 @@ static ssize_t vfio_platform_read(void *device_data, char 
__user *buf,
return -EINVAL;
 
if (vdev->regions[index].type & VFIO_PLATFORM_REGION_TYPE_MMIO)
-   return vfio_platform_read_mmio(vdev->regions[index],
+   return vfio_platform_read_mmio(&vdev->regions[index],
buf, count, off);
else if (vdev->regions[index].type & VFIO_PLATFORM_REGION_TYPE_PIO)
return -EINVAL; /* not implemented */
@@ -385,17 +385,17 @@ static ssize_t vfio_platform_read(void *device_data, char 
__user *buf,
return -EINVAL;
 }
 
-static ssize_t vfio_platform_write_mmio(struct vfio_platform_region reg,
+static ssize_t vfio_platform_write_mmio(struct vfio_platform_region *reg,
const char __user *buf, size_t count,
loff_t off)
 {
unsigned int done = 0;
 
-   if (!reg.ioaddr) {
-   reg.ioaddr =
-   ioremap_nocache(reg.addr, reg.size);
+   if (!reg->ioaddr) {
+   reg->ioaddr =
+   ioremap_nocache(reg->addr, reg->size);
 
-   if (!reg.ioaddr)
+   if (!reg->ioaddr)
return -ENOMEM;
}
 
@@ -407,7 +407,7 @@ static ssize_t vfio_platform_write_mmio(struct 
vfio_platform_region reg,
 
if (copy_from_user(&val, buf, 4))
goto err;
-   iowrite32(val, reg.ioaddr + off);
+   iowrite32(val, reg->ioaddr + off);
 
filled = 4;
} else if (count >= 2 && !(off % 2)) {
@@ -415,7 +415,7 @@ static ssize_t vfio_platform_write_mmio(struct 
vfio_platform_region reg,
 
if (copy_from_user(&val, buf, 2))
goto err;
-   iowrite16(val, reg.ioaddr + off);
+   iowrite16(val, reg->ioaddr + off);
 
filled = 2;
} else {
@@ -423,7 +423,7 @@ static ssize_t vfio_platform_write_mmio(struc

Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-29 Thread Alexander Duyck


On 10/29/2015 01:33 AM, Lan Tianyu wrote:

On 2015年10月29日 14:58, Alexander Duyck wrote:

Your code was having to do a bunch of shuffling in order to get things
set up so that you could bring the interface back up.  I would argue
that it may actually be faster at least on the bring-up to just drop the
old rings and start over since it greatly reduced the complexity and the
amount of device related data that has to be moved.

If give up the old ring after migration and keep DMA running before
stopping VCPU, it seems we don't need to track Tx/Rx descriptor ring and
just make sure that all Rx buffers delivered to stack has been migrated.

1) Dummy write Rx buffer before checking Rx descriptor to ensure packet
migrated first.


Don't dummy write the Rx descriptor.  You should only really need to 
dummy write the Rx buffer and you would do so after checking the 
descriptor, not before.  Otherwise you risk corrupting the Rx buffer 
because it is possible for you to read the Rx buffer, DMA occurs, and 
then you write back the Rx buffer and now you have corrupted the memory.



2) Make a copy of Rx descriptor and then use the copied data to check
buffer status. Not use the original descriptor because it won't be
migrated and migration may happen between two access of the Rx descriptor.


Do not just blindly copy the Rx descriptor ring.  That is a recipe for 
disaster.  The problem is DMA has to happen in a very specific order for 
things to function correctly.  The Rx buffer has to be written and then 
the Rx descriptor.  The problem is you will end up getting a read-ahead 
on the Rx descriptor ring regardless of which order you dirty things in.


The descriptor is only 16 bytes, you can fit 256 of them in a single 
page.  There is a good chance you probably wouldn't be able to migrate 
if you were under heavy network stress, however you could still have 
several buffers written in the time it takes for you to halt the VM and 
migrate the remaining pages.  Those buffers wouldn't be marked as dirty 
but odds are the page the descriptors are in would be.  As such you will 
end up with the descriptors but not the buffers.


The only way you could possibly migrate the descriptors rings cleanly 
would be to have enough knowledge about the layout of things to force 
the descriptor rings to be migrated first followed by all of the 
currently mapped Rx buffers.  In addition you would need to have some 
means of tracking all of the Rx buffers such as an emulated IOMMU as you 
would need to migrate all of them, not just part.  By doing it this way 
you would get the Rx descriptor rings in the earliest state possible and 
would be essentially emulating the Rx buffer writes occurring before the 
Rx descriptor writes.  You would likely have several Rx buffer writes 
that would be discarded in the process as there would be no descriptor 
for them but at least the state of the system would be consistent.


- Alex


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-29 Thread David Woodhouse

On Thu, 2015-10-29 at 11:01 +0200, Michael S. Tsirkin wrote:
> 
> Example: you have a mix of assigned devices and virtio devices. You
> don't trust your assigned device vendor not to corrupt your memory so
> you want to limit the damage your assigned device can do to your
> guest,
> so you use an IOMMU for that.  Thus existing iommu=pt within guest is
> out.
> 
> But you trust your hypervisor (you have no choice anyway),
> and you don't want the overhead of tweaking IOMMU
> on data path for virtio. Thus iommu=on is out too.

That's not at all special for virtio or guest VMs. Even with real
hardware, we might want performance from *some* devices, and security
from others. See the DMA_ATTR_IOMMU_BYPASS which is currently being
discussed.

But of course the easy answer in *your* case it just to ask the
hypervisor not to put the virtio devices behind an IOMMU at all. Which
we were planning to remain the default behaviour.

In all cases, the DMA API shall do the right thing.

-- 
dwmw2

smime.p7s
Description: S/MIME cryptographic signature

Re: [GIT PULL 3/3] KVM: s390: use simple switch statement as multiplexer

2015-10-29 Thread Alexander Graf


> Am 29.10.2015 um 16:08 schrieb Christian Borntraeger :
> 
> We currently do some magic shifting (by exploiting that exit codes
> are always a multiple of 4) and a table lookup to jump into the
> exit handlers. This causes some calculations and checks, just to
> do an potentially expensive function call.
> 
> Changing that to a switch statement gives the compiler the chance
> to inline and dynamically decide between jump tables or inline
> compare and branches. In addition it makes the code more readable.
> 
> bloat-o-meter gives me a small reduction in code size:
> 
> add/remove: 0/7 grow/shrink: 1/1 up/down: 986/-1334 (-348)
> function old new   delta
> kvm_handle_sie_intercept  721058+986
> handle_prog  704 696  -8
> handle_noop   54   - -54
> handle_partial_execution  60   - -60
> intercept_funcs  120   --120
> handle_instruction   198   --198
> handle_validity  210   --210
> handle_stop  316   --316
> handle_external_interrupt368   --368
> 
> Right now my gcc does conditional branches instead of jump tables.
> The inlining seems to give us enough cycles as some micro-benchmarking
> shows minimal improvements, but still in noise.

Awesome. I ended up with the same conclusions on switch vs table lookups in the 
ppc code back in the day.

> 
> Signed-off-by: Christian Borntraeger 
> Reviewed-by: Cornelia Huck 
> ---
> arch/s390/kvm/intercept.c | 42 +-
> 1 file changed, 21 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/s390/kvm/intercept.c b/arch/s390/kvm/intercept.c
> index 7365e8a..b4a5aa1 100644
> --- a/arch/s390/kvm/intercept.c
> +++ b/arch/s390/kvm/intercept.c
> @@ -336,28 +336,28 @@ static int handle_partial_execution(struct kvm_vcpu 
> *vcpu)
>return -EOPNOTSUPP;
> }
> 
> -static const intercept_handler_t intercept_funcs[] = {
> -[0x00 >> 2] = handle_noop,
> -[0x04 >> 2] = handle_instruction,
> -[0x08 >> 2] = handle_prog,
> -[0x10 >> 2] = handle_noop,
> -[0x14 >> 2] = handle_external_interrupt,
> -[0x18 >> 2] = handle_noop,
> -[0x1C >> 2] = kvm_s390_handle_wait,
> -[0x20 >> 2] = handle_validity,
> -[0x28 >> 2] = handle_stop,
> -[0x38 >> 2] = handle_partial_execution,
> -};
> -
> int kvm_handle_sie_intercept(struct kvm_vcpu *vcpu)
> {
> -intercept_handler_t func;
> -u8 code = vcpu->arch.sie_block->icptcode;
> -
> -if (code & 3 || (code >> 2) >= ARRAY_SIZE(intercept_funcs))
> +switch (vcpu->arch.sie_block->icptcode) {
> +case 0x00:
> +case 0x10:
> +case 0x18:

... if you could convert these magic numbers to something more telling however, 
I think readability would improve even more! That can easily be a follow up 
patch though.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[GIT PULL 3/3] KVM: s390: use simple switch statement as multiplexer

We currently do some magic shifting (by exploiting that exit codes
are always a multiple of 4) and a table lookup to jump into the
exit handlers. This causes some calculations and checks, just to
do an potentially expensive function call.

Changing that to a switch statement gives the compiler the chance
to inline and dynamically decide between jump tables or inline
compare and branches. In addition it makes the code more readable.

bloat-o-meter gives me a small reduction in code size:

add/remove: 0/7 grow/shrink: 1/1 up/down: 986/-1334 (-348)
function old new   delta
kvm_handle_sie_intercept  721058+986
handle_prog  704 696  -8
handle_noop   54   - -54
handle_partial_execution  60   - -60
intercept_funcs  120   --120
handle_instruction   198   --198
handle_validity  210   --210
handle_stop  316   --316
handle_external_interrupt368   --368

Right now my gcc does conditional branches instead of jump tables.
The inlining seems to give us enough cycles as some micro-benchmarking
shows minimal improvements, but still in noise.

Signed-off-by: Christian Borntraeger 
Reviewed-by: Cornelia Huck 
---
 arch/s390/kvm/intercept.c | 42 +-
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/s390/kvm/intercept.c b/arch/s390/kvm/intercept.c
index 7365e8a..b4a5aa1 100644
--- a/arch/s390/kvm/intercept.c
+++ b/arch/s390/kvm/intercept.c
@@ -336,28 +336,28 @@ static int handle_partial_execution(struct kvm_vcpu *vcpu)
return -EOPNOTSUPP;
 }
 
-static const intercept_handler_t intercept_funcs[] = {
-   [0x00 >> 2] = handle_noop,
-   [0x04 >> 2] = handle_instruction,
-   [0x08 >> 2] = handle_prog,
-   [0x10 >> 2] = handle_noop,
-   [0x14 >> 2] = handle_external_interrupt,
-   [0x18 >> 2] = handle_noop,
-   [0x1C >> 2] = kvm_s390_handle_wait,
-   [0x20 >> 2] = handle_validity,
-   [0x28 >> 2] = handle_stop,
-   [0x38 >> 2] = handle_partial_execution,
-};
-
 int kvm_handle_sie_intercept(struct kvm_vcpu *vcpu)
 {
-   intercept_handler_t func;
-   u8 code = vcpu->arch.sie_block->icptcode;
-
-   if (code & 3 || (code >> 2) >= ARRAY_SIZE(intercept_funcs))
+   switch (vcpu->arch.sie_block->icptcode) {
+   case 0x00:
+   case 0x10:
+   case 0x18:
+   return handle_noop(vcpu);
+   case 0x04:
+   return handle_instruction(vcpu);
+   case 0x08:
+   return handle_prog(vcpu);
+   case 0x14:
+   return handle_external_interrupt(vcpu);
+   case 0x1c:
+   return kvm_s390_handle_wait(vcpu);
+   case 0x20:
+   return handle_validity(vcpu);
+   case 0x28:
+   return handle_stop(vcpu);
+   case 0x38:
+   return handle_partial_execution(vcpu);
+   default:
return -EOPNOTSUPP;
-   func = intercept_funcs[code >> 2];
-   if (func)
-   return func(vcpu);
-   return -EOPNOTSUPP;
+   }
 }
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[GIT PULL 2/3] KVM: s390: drop useless newline in debugging data

the s390 debug feature does not need newlines. In fact it will
result in empty lines. Get rid of 4 leftovers.

Signed-off-by: Christian Borntraeger 
Acked-by: Cornelia Huck 
---
 arch/s390/kvm/kvm-s390.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 3559617..07a6aa8 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -514,7 +514,7 @@ static int kvm_s390_set_tod_high(struct kvm *kvm, struct 
kvm_device_attr *attr)
 
if (gtod_high != 0)
return -EINVAL;
-   VM_EVENT(kvm, 3, "SET: TOD extension: 0x%x\n", gtod_high);
+   VM_EVENT(kvm, 3, "SET: TOD extension: 0x%x", gtod_high);
 
return 0;
 }
@@ -527,7 +527,7 @@ static int kvm_s390_set_tod_low(struct kvm *kvm, struct 
kvm_device_attr *attr)
return -EFAULT;
 
kvm_s390_set_tod_clock(kvm, gtod);
-   VM_EVENT(kvm, 3, "SET: TOD base: 0x%llx\n", gtod);
+   VM_EVENT(kvm, 3, "SET: TOD base: 0x%llx", gtod);
return 0;
 }
 
@@ -559,7 +559,7 @@ static int kvm_s390_get_tod_high(struct kvm *kvm, struct 
kvm_device_attr *attr)
if (copy_to_user((void __user *)attr->addr, >od_high,
 sizeof(gtod_high)))
return -EFAULT;
-   VM_EVENT(kvm, 3, "QUERY: TOD extension: 0x%x\n", gtod_high);
+   VM_EVENT(kvm, 3, "QUERY: TOD extension: 0x%x", gtod_high);
 
return 0;
 }
@@ -571,7 +571,7 @@ static int kvm_s390_get_tod_low(struct kvm *kvm, struct 
kvm_device_attr *attr)
gtod = kvm_s390_get_tod_clock_fast(kvm);
if (copy_to_user((void __user *)attr->addr, >od, sizeof(gtod)))
return -EFAULT;
-   VM_EVENT(kvm, 3, "QUERY: TOD base: 0x%llx\n", gtod);
+   VM_EVENT(kvm, 3, "QUERY: TOD base: 0x%llx", gtod);
 
return 0;
 }
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[GIT PULL 1/3] KVM: s390: SCA must not cross page boundaries

From: David Hildenbrand 

We seemed to have missed a few corner cases in commit f6c137ff00a4
("KVM: s390: randomize sca address").

The SCA has a maximum size of 2112 bytes. By setting the sca_offset to
some unlucky numbers, we exceed the page.

0x7c0 (1984) -> Fits exactly
0x7d0 (2000) -> 16 bytes out
0x7e0 (2016) -> 32 bytes out
0x7f0 (2032) -> 48 bytes out

One VCPU entry is 32 bytes long.

For the last two cases, we actually write data to the other page.
1. The address of the VCPU.
2. Injection/delivery/clearing of SIGP externall calls via SIGP IF.

Especially the 2. happens regularly. So this could produce two problems:
1. The guest losing/getting external calls.
2. Random memory overwrites in the host.

So this problem happens on every 127 + 128 created VM with 64 VCPUs.

Cc: sta...@vger.kernel.org # v3.15+
Acked-by: Christian Borntraeger 
Signed-off-by: David Hildenbrand 
Signed-off-by: Christian Borntraeger 
---
 arch/s390/kvm/kvm-s390.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 618c854..3559617 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1098,7 +1098,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
if (!kvm->arch.sca)
goto out_err;
spin_lock(&kvm_lock);
-   sca_offset = (sca_offset + 16) & 0x7f0;
+   sca_offset += 16;
+   if (sca_offset + sizeof(struct sca_block) > PAGE_SIZE)
+   sca_offset = 0;
kvm->arch.sca = (struct sca_block *) ((char *) kvm->arch.sca + 
sca_offset);
spin_unlock(&kvm_lock);
 
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[GIT PULL 0/3] KVM: s390: Bugfix and cleanups for kvm/next (4.4)