Re: [PATCH trivial] KVM: s390: Spelling s/intance/instance/

2015-03-10 Thread Christian Borntraeger
Am 09.03.2015 um 21:27 schrieb Geert Uytterhoeven:
> Signed-off-by: Geert Uytterhoeven 
> ---
>  arch/s390/kvm/kvm-s390.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
> index fda3f3146eb636d3..83f32a147d72c0f3 100644
> --- a/arch/s390/kvm/kvm-s390.h
> +++ b/arch/s390/kvm/kvm-s390.h
> @@ -125,7 +125,7 @@ static inline void kvm_s390_set_psw_cc(struct kvm_vcpu 
> *vcpu, unsigned long cc)
>   vcpu->arch.sie_block->gpsw.mask |= cc << 44;
>  }
> 
> -/* test availability of facility in a kvm intance */
> +/* test availability of facility in a kvm instance */
>  static inline int test_kvm_facility(struct kvm *kvm, unsigned long nr)
>  {
>   return __test_facility(nr, kvm->arch.model.fac->mask) &&
> 

Applied, thanks.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2 CPU Conformance Issue in KVM/x86

2015-03-10 Thread Paolo Bonzini


On 09/03/2015 20:50, Avi Kivity wrote:
>>> Is the issue emulating a higher MAXPHYADDR on the guest than is
>>> available on the host? I don't think there's any need to support that.
>> No, indeed.  The only problem is that the failure mode is quite horrible
>> (you get a triple fault, possibly while the guest is running).
> 
> Can't qemu simply check for it?

Yes.  But right now it doesn't even try to do something sensible with
MAXPHYADDR. :/

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2 CPU Conformance Issue in KVM/x86

2015-03-10 Thread Paolo Bonzini


On 09/03/2015 20:49, Avi Kivity wrote:
>>>
>> Yes, and it checked that MAXPHYADDR != 52 before.  If you want to set
>> only one bit, making that bit 51 makes sense anyway for simplicity, so
>> it is still 99.9% academic.  Once processors appear with MAXPHYADDR =
>> 52, the remaining 0.1% will become more relevant.
>>
>> The current limit is IIRC 46 or 48 (on Haswell Xeons).
> 
> It will be interesting to have processors with 52 bits of physical
> address and 48 bits of virtual address. HIGHMEM for x86_64?  Or 5-level
> page tables?

I wonder why Intel chose exactly 52...  HIGHMEM seems more likely than
5-level page tables.  Certainly it wouldn't need hacks like Ingo's 4G-4G.

> 50 bits == 1 PiB.  That's quite an amount of RAM.

Not that 64 TiB is not "quite an amount of RAM". :)

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: svm: make wbinvd faster

2015-03-10 Thread Radim Krčmář
2015-03-09 20:28-0300, Marcelo Tosatti:
> On Fri, Feb 27, 2015 at 06:19:18PM -0600, Joel Schopp wrote:
> > From: David Kaplan 
> > No need to re-decode WBINVD since we know what it is from the intercept.
> > 
> > Signed-off-by: David Kaplan 
> > [extracted from larger unlrelated patch, forward ported, tested]
> > Signed-off-by: Joel Schopp 
> 
> Can't you disable the intercept if need_emulate_wbinvd(vcpu) == false? 

I don't think we want to:  it should be faster to intercept and ignore
than to invalidate all caches.  The exit doesn't affect other physical
cores and costs just about 10(?) L3 cache misses.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: fix to update memslots properly

2015-03-10 Thread Paolo Bonzini

>>> This suggests another fix.  We can change the insertion to use a ">="
>>> comparison, as in your first patch.  Alone it is not correct, but we
>>> only need to take some care and avoid breaking the case of deleting a
>>> memslot.
>>>
>>> It's enough to wrap the second loop (that you patched) with
>>> "if (new->npages)".  In the new->npages == 0 case the first loop has
>>> already set i to the right value, and moving i back would be wrong:
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index f5283438ee05..050974c051b5 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -687,11 +687,23 @@ static void update_memslots(struct kvm_memslots
>>> *slots,
>>>   slots->id_to_index[mslots[i].id] = i;
>>>   i++;
>>>   }
>>> -while (i > 0 &&
>>> -   new->base_gfn > mslots[i - 1].base_gfn) {
>>> -mslots[i] = mslots[i - 1];
>>> -slots->id_to_index[mslots[i].id] = i;
>>> -i--;
>>> +
>>> +/*
>>> + * The ">=" is needed when creating a slot with base_gfn == 0,
>>> + * so that it moves before all those with base_gfn == npages == 0.
>>> + *
>>> + * On the other hand, if new->npages is zero, the above loop has
>>> + * already left i pointing to the beginning of the empty part of
>>> + * mslots, and the ">=" would move the hole backwards in this
>>> + * case---which is wrong.  So skip the loop when deleting a slot.
>>> + */
>>> +if (new->npages) {
>>> +while (i > 0 &&
>>> +   new->base_gfn >= mslots[i - 1].base_gfn) {
>>> +mslots[i] = mslots[i - 1];
>>> +slots->id_to_index[mslots[i].id] = i;
>>> +i--;
>>> +}
>>>   }
>>>
>>>   mslots[i] = *new;
>>>
>>> Paolo
>>
>> Paolo,
>>
>> Can you include a proper changelog for this patch?
>>
> 
> But this is already applied long time ago...

Yes, this is commit efbeec7098eee2b3d2359d0cc24bbba0436e7f21.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Andrey Korolyov
On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov  wrote:
> On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das  wrote:
>> Andrey Korolyov  writes:
>>
>>> On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov  wrote:
 Hello,

 recently I`ve got a couple of shiny new Intel 2620v2s for future
 replacement of the E5-2620v1, but I experienced relatively many events
 with emulation errors, all traces looks simular to the one below. I am
 running qemu-2.1 on x86 on top of 3.10 branch for testing purposes but
 can switch to some other versions if necessary. Most of crashes
 happened during reboot cycle or at the end of ACPI-based shutdown
 action, if this can help. I have zero clues of what can introduce such
 a mess inside same processor family using identical software, as
 2620v1 has no simular problem ever. Please let me know if there can be
 some side measures for making entire story more clear.

 Thanks!

 KVM internal error. Suberror: 2
 extra data[0]: 80d1
 extra data[1]: 8b0d
 EAX=0003 EBX= ECX= EDX=
 ESI= EDI= EBP= ESP=6cd4
 EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT= 000f6e98 0037
 IDT=  03ff
 CR0=0010 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb 
 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
 b8 00 e0 00 00 8e
>>>
>>>
>>> It turns out that those errors are introduced by APICv, which gets
>>> enabled due to different feature set. If anyone is interested in
>>> reproducing/fixing this exactly on 3.10, it takes about one hundred of
>>> migrations/power state changes for an issue to appear, guest OS can be
>>> Linux or Win.
>>
>> Are you able to reproduce this on a more recent upstream kernel as well ?
>>
>> Bandan
>
> I`ll go through test cycle with 3.18 and 2603v2 around tomorrow and
> follow up with any reproduceable results.

Heh.. issue is not triggered on 2603v2 at all, at least I am not able
to hit this. The only difference with 2620v2 except lower frequency is
an Intel Dynamic Acceleration feature. I`d appreciate any testing with
higher CPU models with same or richer feature set. The testing itself
can be done on both generic 3.10 or RH7 kernels, as both of them are
experiencing this issue. I conducted all tests with disabled cstates
so I advise to do the same for a first reproduction step.

Thanks!

model name  : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
stepping: 4
microcode   : 0x416
cpu MHz : 2100.039
cache size  : 15360 KB
siblings: 12
apicid  : 43
initial apicid  : 43
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
flexpriority ept vpid fsgsbase smep erms
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: irq_comm: Add check for RH bit in kvm_set_msi_irq

2015-03-10 Thread Radim Krčmář
2015-03-09 23:05-0600, James Sullivan:
> Hi folks,
> 
> This is a small patch that implements logic to handle the RH bit
> being set in the msi message address register. Currently the DM bit is
> the only thing used to decide irq->dest_mode (logical when DM set,
> physical when unset). Documentation indicates that the DM bit will be
> ignored when the RH bit is unset, and physical destination mode is used
> in this case.

(I found that https://software.intel.com/en-us/forums/topic/23 agrees
 with this definition of "ignored"; and it explains the weird design at
 the bottom.)

> Fixed this to set irq->dest_mode to logical just when both bits are set,
> and physical otherwise.
> 
> Signed-off-by: James Sullivan 
> ---
> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> @@ -97,18 +97,26 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct 
> kvm_lapic *src,
>  static inline void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
>  struct kvm_lapic_irq *irq)
>  {
> + bool phys;
>   trace_kvm_msi_set_irq(e->msi.address_lo, e->msi.data);
> 
>   irq->dest_id = (e->msi.address_lo &
>   MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT;
>   irq->vector = (e->msi.data &
>   MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT;
> - irq->dest_mode = (1 << MSI_ADDR_DEST_MODE_SHIFT) & e->msi.address_lo;
> + /*
> +  * Set dest_mode to logical just in case both the RH and DM
> +  * bits are set, otherwise default to physical.
> +  */
> + phys = ((e->msi.address_lo & (MSI_ADDR_REDIRECTION_LOWPRI |
> + MSI_ADDR_DEST_MODE_LOGICAL)) !=
> + (MSI_ADDR_REDIRECTION_LOWPRI |
> + MSI_ADDR_DEST_MODE_LOGICAL));
> + irq->dest_mode = phys ? 0 : (MSI_ADDR_DEST_MODE_LOGICAL);

(Should be APIC_DEST_LOGICAL.  All works because it is a boolean and we
 only checked for APIC_DEST_PHYSICAL, which is 0.)

>   irq->trig_mode = (1 << MSI_DATA_TRIGGER_SHIFT) & e->msi.data;
>   irq->delivery_mode = e->msi.data & 0x700;
>   irq->level = 1;
>   irq->shorthand = 0;
> - /* TODO Deal with RH bit of MSI message address */

RH bit still isn't deal with -- we do not use lowest-priority-like
delivery in logical destination mode ...

How does DM=1/RH=1 work on real hardware?
(There seem to be interesting corner cases with irq->delivery_mode like
 APIC_DM_NMI.)

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Dr. David Alan Gilbert
* Andrey Korolyov (and...@xdel.ru) wrote:
> On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov  wrote:
> > On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das  wrote:
> >> Andrey Korolyov  writes:
> >>
> >>> On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov  wrote:
>  Hello,
> 
>  recently I`ve got a couple of shiny new Intel 2620v2s for future
>  replacement of the E5-2620v1, but I experienced relatively many events
>  with emulation errors, all traces looks simular to the one below. I am
>  running qemu-2.1 on x86 on top of 3.10 branch for testing purposes but
>  can switch to some other versions if necessary. Most of crashes
>  happened during reboot cycle or at the end of ACPI-based shutdown
>  action, if this can help. I have zero clues of what can introduce such
>  a mess inside same processor family using identical software, as
>  2620v1 has no simular problem ever. Please let me know if there can be
>  some side measures for making entire story more clear.
> 
>  Thanks!
> 
>  KVM internal error. Suberror: 2
>  extra data[0]: 80d1
>  extra data[1]: 8b0d
>  EAX=0003 EBX= ECX= EDX=
>  ESI= EDI= EBP= ESP=6cd4
>  EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
>  ES =   9300
>  CS =f000 000f  9b00
>  SS =   9300
>  DS =   9300
>  FS =   9300
>  GS =   9300
>  LDT=   8200
>  TR =   8b00
>  GDT= 000f6e98 0037
>  IDT=  03ff
>  CR0=0010 CR2= CR3= CR4=
>  DR0= DR1= DR2=
>  DR3=
>  DR6=0ff0 DR7=0400
>  EFER=
>  Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb 
>  10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
>  b8 00 e0 00 00 8e
> >>>
> >>>
> >>> It turns out that those errors are introduced by APICv, which gets
> >>> enabled due to different feature set. If anyone is interested in
> >>> reproducing/fixing this exactly on 3.10, it takes about one hundred of
> >>> migrations/power state changes for an issue to appear, guest OS can be
> >>> Linux or Win.
> >>
> >> Are you able to reproduce this on a more recent upstream kernel as well ?
> >>
> >> Bandan
> >
> > I`ll go through test cycle with 3.18 and 2603v2 around tomorrow and
> > follow up with any reproduceable results.
> 
> Heh.. issue is not triggered on 2603v2 at all, at least I am not able
> to hit this. The only difference with 2620v2 except lower frequency is
> an Intel Dynamic Acceleration feature. I`d appreciate any testing with
> higher CPU models with same or richer feature set. The testing itself
> can be done on both generic 3.10 or RH7 kernels, as both of them are
> experiencing this issue. I conducted all tests with disabled cstates
> so I advise to do the same for a first reproduction step.
> 
> Thanks!
> 
> model name  : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> stepping: 4
> microcode   : 0x416
> cpu MHz : 2100.039
> cache size  : 15360 KB
> siblings: 12
> apicid  : 43
> initial apicid  : 43
> fpu : yes
> fpu_exception   : yes
> cpuid level : 13
> wp  : yes
> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
> rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
> sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
> rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
> flexpriority ept vpid fsgsbase smep erms

I'm seeing something similar; it's very intermittent and generally
happening right at boot of the guest;   I'm running this on qemu
head+my postcopy world (but it's happening right at boot before postcopy
gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
but hey maybe I'm seeing a different bug.

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Andrey Korolyov
On Tue, Mar 10, 2015 at 7:57 PM, Dr. David Alan Gilbert
 wrote:
> * Andrey Korolyov (and...@xdel.ru) wrote:
>> On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov  wrote:
>> > On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das  wrote:
>> >> Andrey Korolyov  writes:
>> >>
>> >>> On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov  wrote:
>>  Hello,
>> 
>>  recently I`ve got a couple of shiny new Intel 2620v2s for future
>>  replacement of the E5-2620v1, but I experienced relatively many events
>>  with emulation errors, all traces looks simular to the one below. I am
>>  running qemu-2.1 on x86 on top of 3.10 branch for testing purposes but
>>  can switch to some other versions if necessary. Most of crashes
>>  happened during reboot cycle or at the end of ACPI-based shutdown
>>  action, if this can help. I have zero clues of what can introduce such
>>  a mess inside same processor family using identical software, as
>>  2620v1 has no simular problem ever. Please let me know if there can be
>>  some side measures for making entire story more clear.
>> 
>>  Thanks!
>> 
>>  KVM internal error. Suberror: 2
>>  extra data[0]: 80d1
>>  extra data[1]: 8b0d
>>  EAX=0003 EBX= ECX= EDX=
>>  ESI= EDI= EBP= ESP=6cd4
>>  EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
>>  ES =   9300
>>  CS =f000 000f  9b00
>>  SS =   9300
>>  DS =   9300
>>  FS =   9300
>>  GS =   9300
>>  LDT=   8200
>>  TR =   8b00
>>  GDT= 000f6e98 0037
>>  IDT=  03ff
>>  CR0=0010 CR2= CR3= CR4=
>>  DR0= DR1= DR2=
>>  DR3=
>>  DR6=0ff0 DR7=0400
>>  EFER=
>>  Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb 
>>  10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
>>  b8 00 e0 00 00 8e
>> >>>
>> >>>
>> >>> It turns out that those errors are introduced by APICv, which gets
>> >>> enabled due to different feature set. If anyone is interested in
>> >>> reproducing/fixing this exactly on 3.10, it takes about one hundred of
>> >>> migrations/power state changes for an issue to appear, guest OS can be
>> >>> Linux or Win.
>> >>
>> >> Are you able to reproduce this on a more recent upstream kernel as well ?
>> >>
>> >> Bandan
>> >
>> > I`ll go through test cycle with 3.18 and 2603v2 around tomorrow and
>> > follow up with any reproduceable results.
>>
>> Heh.. issue is not triggered on 2603v2 at all, at least I am not able
>> to hit this. The only difference with 2620v2 except lower frequency is
>> an Intel Dynamic Acceleration feature. I`d appreciate any testing with
>> higher CPU models with same or richer feature set. The testing itself
>> can be done on both generic 3.10 or RH7 kernels, as both of them are
>> experiencing this issue. I conducted all tests with disabled cstates
>> so I advise to do the same for a first reproduction step.
>>
>> Thanks!
>>
>> model name  : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
>> stepping: 4
>> microcode   : 0x416
>> cpu MHz : 2100.039
>> cache size  : 15360 KB
>> siblings: 12
>> apicid  : 43
>> initial apicid  : 43
>> fpu : yes
>> fpu_exception   : yes
>> cpuid level : 13
>> wp  : yes
>> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
>> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
>> rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
>> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
>> sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
>> rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
>> flexpriority ept vpid fsgsbase smep erms
>
> I'm seeing something similar; it's very intermittent and generally
> happening right at boot of the guest;   I'm running this on qemu
> head+my postcopy world (but it's happening right at boot before postcopy
> gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
> but hey maybe I'm seeing a different bug.
>
> Dave

Yep, looks like we are hitting same bug - two thirds of mine failure
events shot during boot/reboot cycle and approx. one third of events
happened in the middle of runtime. What CPU, v0 or v2 are you using
(in other words, is APICv enabled)?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.h

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Bandan Das
Paolo Bonzini  writes:

> On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
>> I'm seeing something similar; it's very intermittent and generally
>> happening right at boot of the guest;   I'm running this on qemu
>> head+my postcopy world (but it's happening right at boot before postcopy
>> gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
>> but hey maybe I'm seeing a different bug.

Probably a tangent but is the qemu trace identical to what Andrey is seeing ?
>From a cursory look and my limited understanding, it seems his failure is #GP
when executing video bios.

> Same here on 3.16 + Xeon E5 v3 kernel.

I will try to reproduce this on a v2.

Bandan
> Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Andrey Korolyov
On Tue, Mar 10, 2015 at 9:16 PM, Dr. David Alan Gilbert
 wrote:
> * Andrey Korolyov (and...@xdel.ru) wrote:
>> On Tue, Mar 10, 2015 at 7:57 PM, Dr. David Alan Gilbert
>>  wrote:
>> > * Andrey Korolyov (and...@xdel.ru) wrote:
>> >> On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov  wrote:
>> >> > On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das  wrote:
>> >> >> Andrey Korolyov  writes:
>> >> >>
>> >> >>> On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov  
>> >> >>> wrote:
>> >>  Hello,
>> >> 
>> >>  recently I`ve got a couple of shiny new Intel 2620v2s for future
>> >>  replacement of the E5-2620v1, but I experienced relatively many 
>> >>  events
>> >>  with emulation errors, all traces looks simular to the one below. I 
>> >>  am
>> >>  running qemu-2.1 on x86 on top of 3.10 branch for testing purposes 
>> >>  but
>> >>  can switch to some other versions if necessary. Most of crashes
>> >>  happened during reboot cycle or at the end of ACPI-based shutdown
>> >>  action, if this can help. I have zero clues of what can introduce 
>> >>  such
>> >>  a mess inside same processor family using identical software, as
>> >>  2620v1 has no simular problem ever. Please let me know if there can 
>> >>  be
>> >>  some side measures for making entire story more clear.
>> >> 
>> >>  Thanks!
>> >> 
>> >>  KVM internal error. Suberror: 2
>> >>  extra data[0]: 80d1
>> >>  extra data[1]: 8b0d
>> >>  EAX=0003 EBX= ECX= EDX=
>> >>  ESI= EDI= EBP= ESP=6cd4
>> >>  EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> >>  ES =   9300
>> >>  CS =f000 000f  9b00
>> >>  SS =   9300
>> >>  DS =   9300
>> >>  FS =   9300
>> >>  GS =   9300
>> >>  LDT=   8200
>> >>  TR =   8b00
>> >>  GDT= 000f6e98 0037
>> >>  IDT=  03ff
>> >>  CR0=0010 CR2= CR3= CR4=
>> >>  DR0= DR1= DR2=
>> >>  DR3=
>> >>  DR6=0ff0 DR7=0400
>> >>  EFER=
>> >>  Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb 
>> >>  10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
>> >>  b8 00 e0 00 00 8e
>> >> >>>
>> >> >>>
>> >> >>> It turns out that those errors are introduced by APICv, which gets
>> >> >>> enabled due to different feature set. If anyone is interested in
>> >> >>> reproducing/fixing this exactly on 3.10, it takes about one hundred of
>> >> >>> migrations/power state changes for an issue to appear, guest OS can be
>> >> >>> Linux or Win.
>> >> >>
>> >> >> Are you able to reproduce this on a more recent upstream kernel as 
>> >> >> well ?
>> >> >>
>> >> >> Bandan
>> >> >
>> >> > I`ll go through test cycle with 3.18 and 2603v2 around tomorrow and
>> >> > follow up with any reproduceable results.
>> >>
>> >> Heh.. issue is not triggered on 2603v2 at all, at least I am not able
>> >> to hit this. The only difference with 2620v2 except lower frequency is
>> >> an Intel Dynamic Acceleration feature. I`d appreciate any testing with
>> >> higher CPU models with same or richer feature set. The testing itself
>> >> can be done on both generic 3.10 or RH7 kernels, as both of them are
>> >> experiencing this issue. I conducted all tests with disabled cstates
>> >> so I advise to do the same for a first reproduction step.
>> >>
>> >> Thanks!
>> >>
>> >> model name  : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
>> >> stepping: 4
>> >> microcode   : 0x416
>> >> cpu MHz : 2100.039
>> >> cache size  : 15360 KB
>> >> siblings: 12
>> >> apicid  : 43
>> >> initial apicid  : 43
>> >> fpu : yes
>> >> fpu_exception   : yes
>> >> cpuid level : 13
>> >> wp  : yes
>> >> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> >> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
>> >> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
>> >> rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
>> >> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
>> >> sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
>> >> rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
>> >> flexpriority ept vpid fsgsbase smep erms
>> >
>> > I'm seeing something similar; it's very intermittent and generally
>> > happening right at boot of the guest;   I'm running this on qemu
>> > head+my postcopy world (but it's happening right at boot before postcopy
>> > gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Paolo Bonzini


On 10/03/2015 19:21, Bandan Das wrote:
> Paolo Bonzini  writes:
> 
>> On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
>>> I'm seeing something similar; it's very intermittent and generally
>>> happening right at boot of the guest;   I'm running this on qemu
>>> head+my postcopy world (but it's happening right at boot before postcopy
>>> gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
>>> but hey maybe I'm seeing a different bug.
> 
> Probably a tangent but is the qemu trace identical to what Andrey is seeing ?
> From a cursory look and my limited understanding, it seems his failure is #GP
> when executing video bios.
> 
>> Same here on 3.16 + Xeon E5 v3 kernel.
> 
> I will try to reproduce this on a v2.

I see several failures, usually mine have suberror 1.  With a 32-VCPU
guest I can reproduce it roughly half of the time.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Paolo Bonzini


On 10/03/2015 19:16, Dr. David Alan Gilbert wrote:
> KVM internal error. Suberror: 1
> emulation failure
> EAX= EBX= ECX= EDX=000fd2bc
> ESI= EDI= EBP= ESP=
> EIP=000fd2c5 EFL=00010007 [-PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0010   00c09300 DPL=0 DS   [-WA]
> CS =0008   00c09b00 DPL=0 CS32 [-RA]
> SS =0010   00c09300 DPL=0 DS   [-WA]
> DS =0010   00c09300 DPL=0 DS   [-WA]
> FS =0010   00c09300 DPL=0 DS   [-WA]
> GS =0010   00c09300 DPL=0 DS   [-WA]
> LDT=   8200 DPL=0 LDT
> TR =   8b00 DPL=0 TSS32-busy
> GDT= 000f6a80 0037
> IDT= 000f6abe 
> CR0=6011 CR2= CR3= CR4=
> DR0= DR1= DR2= 
> DR3=
> DR6=0ff0 DR7=0400
> EFER=
> Code=66 ba bc d2 0f 00 e9 a2 fe f3 90 f0 0f ba 2d 04 ff fb bf 00 <72> f3 8b 
> 25 00 ff fb bf e8 44 66 ff ff c7 05 04 ff
>  fb bf 00 00 00 00 f4 eb fd fa fc 66 b8
> KVM internal error. Suberror: 1
> emulation failure

This is exactly the same as mine.  I'm using upstream QEMU right now,
but I can try to reproduce using RHEL7's too.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Dr. David Alan Gilbert
* Paolo Bonzini (pbonz...@redhat.com) wrote:
> 
> 
> On 10/03/2015 19:21, Bandan Das wrote:
> > Paolo Bonzini  writes:
> > 
> >> On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
> >>> I'm seeing something similar; it's very intermittent and generally
> >>> happening right at boot of the guest;   I'm running this on qemu
> >>> head+my postcopy world (but it's happening right at boot before postcopy
> >>> gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
> >>> but hey maybe I'm seeing a different bug.
> > 
> > Probably a tangent but is the qemu trace identical to what Andrey is seeing 
> > ?
> > From a cursory look and my limited understanding, it seems his failure is 
> > #GP
> > when executing video bios.
> > 
> >> Same here on 3.16 + Xeon E5 v3 kernel.
> > 
> > I will try to reproduce this on a v2.
> 
> I see several failures, usually mine have suberror 1.  With a 32-VCPU
> guest I can reproduce it roughly half of the time.

Oh yes, that helps:

while true; do (sleep 5; echo -e 
'\001cq\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -machine 
pc-i440fx-2.3,accel=kvm -m 8192 -smp 20 -nographic -device sga; done

gets it under a minute on my other box; Intel(R) Xeon(R) CPU E5-2620 v2 @ 
2.10GHz
(Running a 3.18)

Dave

> Paolo
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Alex Williamson
On Tue, 2015-03-10 at 01:06 +1100, Alexey Kardashevskiy wrote:
> This checks that the TCE table page size is not bigger that the size of
> a page we just pinned and going to put its physical address to the table.
> 
> Otherwise the hardware gets unwanted access to physical memory between
> the end of the actual page and the end of the aligned up TCE page.
> 
> Since compound_order() and compound_head() work correctly on non-huge
> pages, there is no need for additional check whether the page is huge.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v4:
> * s/tce_check_page_size/tce_page_is_contained/
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 22 ++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 756831f..91e7599 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -49,6 +49,22 @@ struct tce_container {
>   bool enabled;
>  };
>  
> +static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> +{
> + unsigned shift;
> +
> + /*
> +  * Check that the TCE table granularity is not bigger than the size of
> +  * a page we just found. Otherwise the hardware can get access to
> +  * a bigger memory chunk that it should.
> +  */
> + shift = PAGE_SHIFT + compound_order(compound_head(page));
> + if (shift >= page_shift)
> + return true;
> +
> + return false;

nit, simplified:

return (PAGE_SHIFT + compound_order(compound_head(page) >= page_shift);

> +}
> +
>  static int tce_iommu_enable(struct tce_container *container)
>  {
>   int ret = 0;
> @@ -197,6 +213,12 @@ static long tce_iommu_build(struct tce_container 
> *container,
>   ret = -EFAULT;
>   break;
>   }
> +
> + if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> + ret = -EPERM;
> + break;
> + }
> +
>   hva = (unsigned long) page_address(page) +
>   (tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
>  



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Dr. David Alan Gilbert
* Paolo Bonzini (pbonz...@redhat.com) wrote:
> 
> 
> On 10/03/2015 19:21, Bandan Das wrote:
> > Paolo Bonzini  writes:
> > 
> >> On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
> >>> I'm seeing something similar; it's very intermittent and generally
> >>> happening right at boot of the guest;   I'm running this on qemu
> >>> head+my postcopy world (but it's happening right at boot before postcopy
> >>> gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
> >>> but hey maybe I'm seeing a different bug.
> > 
> > Probably a tangent but is the qemu trace identical to what Andrey is seeing 
> > ?
> > From a cursory look and my limited understanding, it seems his failure is 
> > #GP
> > when executing video bios.
> > 
> >> Same here on 3.16 + Xeon E5 v3 kernel.
> > 
> > I will try to reproduce this on a v2.
> 
> I see several failures, usually mine have suberror 1.  With a 32-VCPU
> guest I can reproduce it roughly half of the time.
> 
> Paolo

while true; do (sleep 5; echo -e 
'\001cq\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -machine 
pc-i440fx-2.0,accel=kvm -m 1024 -smp 128 -nographic -device sga 2>&1 | tee 
/tmp/qemu.op; grep "internal error" /tmp/qemu.op -q && break; done

(and leave about 2mins of runs before declaring good)

bad: cd2946607b42636d6c8cf6dbf94bce0273507b17
bad: 041ccc922ee474693a2869d4e3b59e920c739bc0
bad: 2559db069628981bfdc90637fac5bf1b4f4e8ef5
bad: 21f5826a04d38e19488f917e1eef22751490c769
good:e95d24ff40c77fbfd71396834a2eb99375f8bcc4
good: 7781a492fa5a2eff53d06b25b93f0186ad3226c9
good: c3edd62851098e6417786193ed9e9341781fcf57
good: c5c6d7f81a6950d8e32a3b5a0bafd37bfa5a8e88
good: 73104fd399c6778112f64fe0d439319f24508d9a
good: 92013cf8ca10adafec9a92deb5df993e7df22cb9
good: 4478aa768ccefcc5b234c23d035435fd71b932f6
good: 2.2.0

[root@virtlab413 qemu-world3]# git bisect bad
21f5826a04d38e19488f917e1eef22751490c769 is the first bad commit
commit 21f5826a04d38e19488f917e1eef22751490c769
Author: Gerd Hoffmann 
Date:   Thu Feb 19 09:33:03 2015 +0100

seabios: update to 1.8.0 release

'git shortlog 8936dbb2..4c59f5d8' for seabios repo:

David Woodhouse (4):
  Update EFI_COMPATIBILITY16_TABLE to match 0.98 spec update
  build: use -m16 where available instead of asm(".code16gcc")
  romlayout: Use .code16 not .code16gcc
  vgabios: Use .code16 not .code16gcc

Gerd Hoffmann (2):
  add scripts/tarball.sh
  build: set LC_ALL=C

Hannes Reinecke (1):
  megasas: read addional PCI I/O bar

Ian Campbell (1):
  romlayout: Use "rep ; nop" not "rep nop".

Kevin O'Connor (139):
  vgabios: Return from handle_1011() if handler found.
  edd: Move EDD get drive parameters (int 1348) logic from disk.c to 
block.c.
  edd: Use sectors==-1 to detect removable media.
  edd: Separate out ATA and virtio specific parts of fill_edd().
  cdemu: store internal cdemu fields in standard "el-torito" spec 
format.
  Move cdemu call interface and disk_ret helper code to disk.c.
  smm: Replace SMI assembler code with C code.
  smm: Use a C struct to define the layout of the SMM area.
  smp: Replace QEMU SMP init assembler code with C; run only in 32bit 
mode.
  Don't enable thread preemption during S3 resume vga option rom 
execution.
  Remove old Bochs bios fixed address string at 0xfff00.
  Move most of the VAR16FIXED() defs to misc.c.
  build: Avoid absolute paths during "whole-program" compiling.
  Make sure handle_smi() and handle_smp() are compiled out if not 
enabled.
  Remove the TODO file.
  Abstract reset call (and possible 16bit mode switch) into reset() 
function.
  build: Remove unused function getSectionsStart() from layoutrom.py.
  build: Extract section visiting logic in layoutrom.py.
  build: Refactor layoutrom.py gc() function.
  build: Use customized entry point for each type of build.
  build: Refactor findInit() function.
  build: Rework getRelocs() to use a hash instead of categories in 
layoutrom.py
  build: Keep segmented sections separate until final link step.
  build: Use fileid instead of category to write sections in 
layoutrom.py.
  build: Only export needed fields in LayoutInfo in layoutrom.py.
  build: Get fixed address variables from 32bit compile pass (not 16bit)
  build: Minor - fix comments referring to old tools/ directory.
  xhci: Update the times for usb command timeouts.
  ehci: Update usb command timeouts to use usb_xfer_time()
  uhci: Update usb command timeouts to use usb_xfer_time()
  ohci: Update usb command timeouts to use usb_xfer_time()
  vgabios: Fix broken build resulting from e5749978.
  boot: Change ":rom%d" boot order rom instance to ":rom%x"
  Minor - remove stray tab from 

Re: 2 CPU Conformance Issue in KVM/x86

2015-03-10 Thread Avi Kivity

On 03/10/2015 12:47 PM, Paolo Bonzini wrote:


On 09/03/2015 20:49, Avi Kivity wrote:

Yes, and it checked that MAXPHYADDR != 52 before.  If you want to set
only one bit, making that bit 51 makes sense anyway for simplicity, so
it is still 99.9% academic.  Once processors appear with MAXPHYADDR =
52, the remaining 0.1% will become more relevant.

The current limit is IIRC 46 or 48 (on Haswell Xeons).

It will be interesting to have processors with 52 bits of physical
address and 48 bits of virtual address. HIGHMEM for x86_64?  Or 5-level
page tables?

I wonder why Intel chose exactly 52...  HIGHMEM seems more likely than
5-level page tables.  Certainly it wouldn't need hacks like Ingo's 4G-4G.


My bitcoins are on 5-level page tables.  HIGHMEM is too much pain.


50 bits == 1 PiB.  That's quite an amount of RAM.

Not that 64 TiB is not "quite an amount of RAM". :)




Depends on how many browser tabs you have open, I guess.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: svm: make wbinvd faster

2015-03-10 Thread Marcelo Tosatti
On Tue, Mar 10, 2015 at 12:01:31PM +0100, Radim Krčmář wrote:
> 2015-03-09 20:28-0300, Marcelo Tosatti:
> > On Fri, Feb 27, 2015 at 06:19:18PM -0600, Joel Schopp wrote:
> > > From: David Kaplan 
> > > No need to re-decode WBINVD since we know what it is from the intercept.
> > > 
> > > Signed-off-by: David Kaplan 
> > > [extracted from larger unlrelated patch, forward ported, tested]
> > > Signed-off-by: Joel Schopp 
> > 
> > Can't you disable the intercept if need_emulate_wbinvd(vcpu) == false? 
> 
> I don't think we want to:  it should be faster to intercept and ignore
> than to invalidate all caches.  The exit doesn't affect other physical
> cores and costs just about 10(?) L3 cache misses.

Yes, right.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: irq_comm: Add check for RH bit in kvm_set_msi_irq

2015-03-10 Thread James Sullivan
On 03/10/2015 08:47 AM, Radim Krčmář wrote:
...
>> +/*
>> + * Set dest_mode to logical just in case both the RH and DM
>> + * bits are set, otherwise default to physical.
>> + */
>> +phys = ((e->msi.address_lo & (MSI_ADDR_REDIRECTION_LOWPRI |
>> +MSI_ADDR_DEST_MODE_LOGICAL)) !=
>> +(MSI_ADDR_REDIRECTION_LOWPRI |
>> +MSI_ADDR_DEST_MODE_LOGICAL));
>> +irq->dest_mode = phys ? 0 : (MSI_ADDR_DEST_MODE_LOGICAL);
> 
> (Should be APIC_DEST_LOGICAL.  All works because it is a boolean and we
>  only checked for APIC_DEST_PHYSICAL, which is 0.)
> 

Thank you, that should be as follows:

irq->dest_mode = phys ? (APIC_DEST_PHYSICAL) : (APIC_DEST_LOGICAL);

?

>>  irq->trig_mode = (1 << MSI_DATA_TRIGGER_SHIFT) & e->msi.data;
>>  irq->delivery_mode = e->msi.data & 0x700;
>>  irq->level = 1;
>>  irq->shorthand = 0;
>> -/* TODO Deal with RH bit of MSI message address */
> 
> RH bit still isn't deal with -- we do not use lowest-priority-like
> delivery in logical destination mode ...

Just want to make sure I understand this comment-
Isn't low-pri delivery mode used in kvm_irq_delivery_to_apic_fast
when irq->dest_mode > APIC_DEST_PHYSICAL (ie, logical)? (See below.)

Do you mean that this patch will interfere with this? As long as
irq->dest_mode is still APIC_DEST_LOGICAL this shouldn't change.


bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map)
{
.
if (irq->dest_mode == APIC_DEST_PHYSICAL) {
if (irq->dest_id >= ARRAY_SIZE(map->phys_map))
goto out;

dst = &map->phys_map[irq->dest_id];
} else {
u32 mda = irq->dest_id << (32 - map->ldr_bits);
u16 cid = apic_cluster_id(map, mda);

if (cid >= ARRAY_SIZE(map->logical_map))
goto out;

dst = map->logical_map[cid];

bitmap = apic_logical_id(map, mda);

if (irq->delivery_mode == APIC_DM_LOWEST) {
int l = -1;
for_each_set_bit(i, &bitmap, 16) {
if (!dst[i])
continue;
if (l < 0)
l = i;
else if (kvm_apic_compare_prio(dst[i]->vcpu, 
dst[l]->vcpu) < 0)
l = i;
}

bitmap = (l >= 0) ? 1 << l : 0;
}
}
.
}

> How does DM=1/RH=1 work on real hardware?
> (There seem to be interesting corner cases with irq->delivery_mode like
>  APIC_DM_NMI.)
> 

The IA32 manual says that if DM=1/RH=1, we operate in logical destination mode
similarly to other IPIs. I don't believe this patch introduces any invalid
settings listed in section 10-21, Vol. 3, so this shouldn't create any 
weirdness.

Thanks
-James
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Alexey Kardashevskiy

On 03/11/2015 06:56 AM, Alex Williamson wrote:

On Tue, 2015-03-10 at 01:06 +1100, Alexey Kardashevskiy wrote:

This checks that the TCE table page size is not bigger that the size of
a page we just pinned and going to put its physical address to the table.

Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned up TCE page.

Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for additional check whether the page is huge.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v4:
* s/tce_check_page_size/tce_page_is_contained/
---
  drivers/vfio/vfio_iommu_spapr_tce.c | 22 ++
  1 file changed, 22 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 756831f..91e7599 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -49,6 +49,22 @@ struct tce_container {
bool enabled;
  };

+static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+{
+   unsigned shift;
+
+   /*
+* Check that the TCE table granularity is not bigger than the size of
+* a page we just found. Otherwise the hardware can get access to
+* a bigger memory chunk that it should.
+*/
+   shift = PAGE_SHIFT + compound_order(compound_head(page));
+   if (shift >= page_shift)
+   return true;
+
+   return false;


nit, simplified:

return (PAGE_SHIFT + compound_order(compound_head(page) >= page_shift);


This won't be "bool" though. This will (I'll do this)

shift = PAGE_SHIFT + compound_order(compound_head(page));
return (shift >= page_shift);







+}
+
  static int tce_iommu_enable(struct tce_container *container)
  {
int ret = 0;
@@ -197,6 +213,12 @@ static long tce_iommu_build(struct tce_container 
*container,
ret = -EFAULT;
break;
}
+
+   if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+   ret = -EPERM;
+   break;
+   }
+
hva = (unsigned long) page_address(page) +
(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);








--
Alexey
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Alex Williamson
On Wed, 2015-03-11 at 09:57 +1100, Alexey Kardashevskiy wrote:
> On 03/11/2015 06:56 AM, Alex Williamson wrote:
> > On Tue, 2015-03-10 at 01:06 +1100, Alexey Kardashevskiy wrote:
> >> This checks that the TCE table page size is not bigger that the size of
> >> a page we just pinned and going to put its physical address to the table.
> >>
> >> Otherwise the hardware gets unwanted access to physical memory between
> >> the end of the actual page and the end of the aligned up TCE page.
> >>
> >> Since compound_order() and compound_head() work correctly on non-huge
> >> pages, there is no need for additional check whether the page is huge.
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >> Changes:
> >> v4:
> >> * s/tce_check_page_size/tce_page_is_contained/
> >> ---
> >>   drivers/vfio/vfio_iommu_spapr_tce.c | 22 ++
> >>   1 file changed, 22 insertions(+)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> >> b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> index 756831f..91e7599 100644
> >> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> >> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> >> @@ -49,6 +49,22 @@ struct tce_container {
> >>bool enabled;
> >>   };
> >>
> >> +static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> >> +{
> >> +  unsigned shift;
> >> +
> >> +  /*
> >> +   * Check that the TCE table granularity is not bigger than the size of
> >> +   * a page we just found. Otherwise the hardware can get access to
> >> +   * a bigger memory chunk that it should.
> >> +   */
> >> +  shift = PAGE_SHIFT + compound_order(compound_head(page));
> >> +  if (shift >= page_shift)
> >> +  return true;
> >> +
> >> +  return false;
> >
> > nit, simplified:
> >
> > return (PAGE_SHIFT + compound_order(compound_head(page) >= page_shift);
> 
> This won't be "bool" though.

Yes, it will.

>  This will (I'll do this)
> 
> shift = PAGE_SHIFT + compound_order(compound_head(page));
> return (shift >= page_shift);
> 
> 
> 
> 
> >
> >> +}
> >> +
> >>   static int tce_iommu_enable(struct tce_container *container)
> >>   {
> >>int ret = 0;
> >> @@ -197,6 +213,12 @@ static long tce_iommu_build(struct tce_container 
> >> *container,
> >>ret = -EFAULT;
> >>break;
> >>}
> >> +
> >> +  if (!tce_page_is_contained(page, tbl->it_page_shift)) {
> >> +  ret = -EPERM;
> >> +  break;
> >> +  }
> >> +
> >>hva = (unsigned long) page_address(page) +
> >>(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);
> >>
> >
> >
> >
> 
> 



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Benjamin Herrenschmidt
On Tue, 2015-03-10 at 17:03 -0600, Alex Williamson wrote:
> > > return (PAGE_SHIFT + compound_order(compound_head(page) >= page_shift);
> > 
> > This won't be "bool" though.
> 
> Yes, it will.

Don't you have your parenthesis in the wrong place, Alex ? :-)

> >  This will (I'll do this)
> > 
> > shift = PAGE_SHIFT + compound_order(compound_head(page));
> > return (shift >= page_shift);


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Alex Williamson
On Wed, 2015-03-11 at 10:14 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2015-03-10 at 17:03 -0600, Alex Williamson wrote:
> > > > return (PAGE_SHIFT + compound_order(compound_head(page) >= page_shift);
> > > 
> > > This won't be "bool" though.
> > 
> > Yes, it will.
> 
> Don't you have your parenthesis in the wrong place, Alex ? :-)

Probably, but the compiler will warn about that.

> > >  This will (I'll do this)
> > > 
> > > shift = PAGE_SHIFT + compound_order(compound_head(page));
> > > return (shift >= page_shift);
> 
> 



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 07/29] vfio: powerpc/spapr: Moving pinning/unpinning to helpers

2015-03-10 Thread Alex Williamson
On Tue, 2015-03-10 at 01:07 +1100, Alexey Kardashevskiy wrote:
> This is a pretty mechanical patch to make next patches simpler.
> 
> New tce_iommu_unuse_page() helper does put_page() now but it might skip
> that after the memory registering patch applied.
> 
> As we are here, this removes unnecessary checks for a value returned
> by pfn_to_page() as it cannot possibly return NULL.
> 
> This moves tce_iommu_disable() later to let tce_iommu_clear() know if
> the container has been enabled because if it has not been, then
> put_page() must not be called on TCEs from the TCE table. This situation
> is not yet possible but it will after KVM acceleration patchset is
> applied.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 70 
> -
>  1 file changed, 54 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index d3ab34f..ca396e5 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -204,7 +204,6 @@ static void tce_iommu_release(void *iommu_data)
>   struct iommu_table *tbl = container->tbl;
>  
>   WARN_ON(tbl && !tbl->it_group);
> - tce_iommu_disable(container);
>  
>   if (tbl) {
>   tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> @@ -212,63 +211,102 @@ static void tce_iommu_release(void *iommu_data)
>   if (tbl->it_group)
>   tce_iommu_detach_group(iommu_data, tbl->it_group);
>   }
> +
> + tce_iommu_disable(container);
> +
>   mutex_destroy(&container->lock);
>  
>   kfree(container);
>  }
>  
> +static void tce_iommu_unuse_page(struct tce_container *container,
> + unsigned long oldtce)
> +{
> + struct page *page;
> +
> + if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE)))
> + return;
> +
> + /*
> +  * VFIO cannot map/unmap when a container is not enabled so
> +  * we would not need this check but KVM could map/unmap and if
> +  * this happened, we must not put pages as KVM does not get them as
> +  * it expects memory pre-registation to do this part.
> +  */
> + if (!container->enabled)
> + return;
> +
> + page = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);
> +
> + if (oldtce & TCE_PCI_WRITE)
> + SetPageDirty(page);
> +
> + put_page(page);
> +}
> +
>  static int tce_iommu_clear(struct tce_container *container,
>   struct iommu_table *tbl,
>   unsigned long entry, unsigned long pages)
>  {
>   unsigned long oldtce;
> - struct page *page;
>  
>   for ( ; pages; --pages, ++entry) {
>   oldtce = iommu_clear_tce(tbl, entry);
>   if (!oldtce)
>   continue;
>  
> - page = pfn_to_page(oldtce >> PAGE_SHIFT);
> - WARN_ON(!page);
> - if (page) {
> - if (oldtce & TCE_PCI_WRITE)
> - SetPageDirty(page);
> - put_page(page);
> - }
> + tce_iommu_unuse_page(container, (unsigned long) __va(oldtce));
>   }
>  
>   return 0;
>  }
>  
> +static unsigned long tce_get_hva(struct tce_container *container,
> + unsigned page_shift, unsigned long tce)
> +{
> + long ret;
> + struct page *page = NULL;
> + unsigned long hva;
> + enum dma_data_direction direction = iommu_tce_direction(tce);
> +
> + ret = get_user_pages_fast(tce & PAGE_MASK, 1,
> + direction != DMA_TO_DEVICE, &page);
> + if (unlikely(ret != 1))
> + return -1;
> +
> + hva = (unsigned long) page_address(page);
> +
> + return hva;
> +}


It's a bit crude to return -1 for an unsigned long function.  You might
want to later think about cleaning this up to return int with a proper
error code and return hva via a pointer.  We don't really need to store
'ret' here either.

> +
>  static long tce_iommu_build(struct tce_container *container,
>   struct iommu_table *tbl,
>   unsigned long entry, unsigned long tce, unsigned long pages)
>  {
>   long i, ret = 0;
> - struct page *page = NULL;
> + struct page *page;
>   unsigned long hva;
>   enum dma_data_direction direction = iommu_tce_direction(tce);
>  
>   for (i = 0; i < pages; ++i) {
> - ret = get_user_pages_fast(tce & PAGE_MASK, 1,
> - direction != DMA_TO_DEVICE, &page);
> - if (unlikely(ret != 1)) {
> + hva = tce_get_hva(container, tbl->it_page_shift, tce);
> + if (hva == -1) {
>   ret = -EFAULT;
>   break;
>   }
>  
> + page = pfn_to_page(__pa(hva) >> PAGE_SHIFT);
>   if (!tce_page_is_contained(page, tbl->it_page_shift)) {
>   ret = -EPERM;
> 

Re: [PATCH v5 03/29] vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size

2015-03-10 Thread Alexey Kardashevskiy

On 03/11/2015 10:03 AM, Alex Williamson wrote:

On Wed, 2015-03-11 at 09:57 +1100, Alexey Kardashevskiy wrote:

On 03/11/2015 06:56 AM, Alex Williamson wrote:

On Tue, 2015-03-10 at 01:06 +1100, Alexey Kardashevskiy wrote:

This checks that the TCE table page size is not bigger that the size of
a page we just pinned and going to put its physical address to the table.

Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned up TCE page.

Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for additional check whether the page is huge.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v4:
* s/tce_check_page_size/tce_page_is_contained/
---
   drivers/vfio/vfio_iommu_spapr_tce.c | 22 ++
   1 file changed, 22 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 756831f..91e7599 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -49,6 +49,22 @@ struct tce_container {
bool enabled;
   };

+static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+{
+   unsigned shift;
+
+   /*
+* Check that the TCE table granularity is not bigger than the size of
+* a page we just found. Otherwise the hardware can get access to
+* a bigger memory chunk that it should.
+*/
+   shift = PAGE_SHIFT + compound_order(compound_head(page));
+   if (shift >= page_shift)
+   return true;
+
+   return false;


nit, simplified:

return (PAGE_SHIFT + compound_order(compound_head(page) >= page_shift);


This won't be "bool" though.


Yes, it will.


Ah, misread as "... - page_shift". And you missed one bracket :)





  This will (I'll do this)

shift = PAGE_SHIFT + compound_order(compound_head(page));
return (shift >= page_shift);







+}
+
   static int tce_iommu_enable(struct tce_container *container)
   {
int ret = 0;
@@ -197,6 +213,12 @@ static long tce_iommu_build(struct tce_container 
*container,
ret = -EFAULT;
break;
}
+
+   if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+   ret = -EPERM;
+   break;
+   }
+
hva = (unsigned long) page_address(page) +
(tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK);















--
Alexey
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/2] kvm: x86: kvm_emulate_*

2015-03-10 Thread Marcelo Tosatti
On Mon, Mar 02, 2015 at 01:43:24PM -0600, Joel Schopp wrote:
> Review comments from v1 that used kvm_emulate_wbinvd() pointed out that 
> kvm_emulate_* was inconsistant in using skipping, while kvm_emulate() always
> skips.  The first patch cleans up the existing use while the second patch
> adds use of the updated version of kvm_emulate_wbinvd() in svm
> 
> Changes since v2:
>   * fixed email subject line on series short description
>   * renamed kvm_emulate_halt_noskip() to kvm_vcpu_halt()
>   * added header declaration for kvm_vcpu_halt()
>   * squashed blank line 
> ---
> 
> David Kaplan (1):
>   x86: svm: make wbinvd faster
> 
> Joel Schopp (1):
>   kvm: x86: make kvm_emulate_* consistant
> 
> 
>  arch/x86/include/asm/kvm_host.h |1 +
>  arch/x86/kvm/svm.c  |   10 +++---
>  arch/x86/kvm/vmx.c  |9 +++--
>  arch/x86/kvm/x86.c  |   23 ---
>  4 files changed, 31 insertions(+), 12 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 26/29] vfio: powerpc/spapr: Define v2 IOMMU

2015-03-10 Thread Alex Williamson
On Tue, 2015-03-10 at 01:07 +1100, Alexey Kardashevskiy wrote:
> The existing IOMMU code takes/releases ownership over the existing IOMMU
> tables created by the platform code, i.e. the tables remain in memory
> all the time. Also, the existing IOMMU requires VFIO_IOMMU_ENABLE call to
> start working as that's where we check the rlimit for locked pages.
> 
> New IOMMU will take over the IOMMU group completely which means the IOMMU
> tables created by the platform code are going to be disposed and VFIO
> will create its own tables. Also, with the DMA memory pre-registration
> feature, the userspace will not need to call VFIO_IOMMU_ENABLE as
> the locked pages accounting will be done by
> VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
> 
> In order to inform the userspace that VFIO supports new capabilities,
> this adds a new SPAPR TCE IOMMU v2 type.

Shouldn't we block some of these new features on v1?  Otherwise v1 now
supports them too and by the property of "we don't break userspace" v1
and v2 are now one-in-the-same.  You'd need to support everything
through both versions.  If that's what you intend to do, maybe this
should be a capability rather than an IOMMU type.

> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 3 ++-
>  include/uapi/linux/vfio.h   | 2 ++
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index e572c28..d665ddc 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -404,7 +404,7 @@ static void *tce_iommu_open(unsigned long arg)
>  {
>   struct tce_container *container;
>  
> - if (arg != VFIO_SPAPR_TCE_IOMMU) {
> + if ((arg != VFIO_SPAPR_TCE_IOMMU) && (arg != VFIO_SPAPR_TCE_v2_IOMMU)) {
>   pr_err("tce_vfio: Wrong IOMMU type\n");
>   return ERR_PTR(-EINVAL);
>   }
> @@ -588,6 +588,7 @@ static long tce_iommu_ioctl(void *iommu_data,
>   case VFIO_CHECK_EXTENSION:
>   switch (arg) {
>   case VFIO_SPAPR_TCE_IOMMU:
> + case VFIO_SPAPR_TCE_v2_IOMMU:
>   ret = 1;
>   break;
>   default:
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index b17e120..fbc5286 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -36,6 +36,8 @@
>  /* Two-stage IOMMU */
>  #define VFIO_TYPE1_NESTING_IOMMU 6   /* Implies v2 */
>  
> +#define VFIO_SPAPR_TCE_v2_IOMMU  7
> +
>  /*
>   * The IOCTL interface is designed for extensibility by embedding the
>   * structure length (argsz) and flags into structures passed between



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 27/29] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership

2015-03-10 Thread Alex Williamson
On Tue, 2015-03-10 at 01:07 +1100, Alexey Kardashevskiy wrote:
> Before the IOMMU user would take control over the IOMMU table belonging to
> a specific IOMMU group. This approach did not allow sharing tables between
> IOMMU groups attached to the same container.
> 
> This introduces a new IOMMU ownership flavour when the user can not
> just control the existing IOMMU table but remove/create tables on demand.
> If an IOMMU supports a set_ownership() callback, that lets the user have
> full control over the IOMMU group. When the ownership is taken,
> the platform code removes all the windows so the caller must create them.
> Before returning the ownership back to the platform code, the user
> has to unprogram and remove all the tables it created.

We have no ability to enforce that requirement on the user.  VFIO needs
to do the cleanup if the user fails to.

> Old-style ownership is still supported allowing VFIO to run on older
> P5IOC2 and IODA IO controllers.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 30 +++---
>  drivers/vfio/vfio_iommu_spapr_tce.c   | 51 
> ---
>  2 files changed, 66 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 07857c4..afb6906 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1620,11 +1620,33 @@ static void pnv_ioda2_set_ownership(struct 
> iommu_table_group *table_group,
>  {
>   struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
>   table_group);
> - if (enable)
> - iommu_take_ownership(table_group);
> - else
> - iommu_release_ownership(table_group);
> + if (enable) {
> + pnv_pci_ioda2_unset_window(&pe->table_group, 0);
> + pnv_pci_free_table(&pe->table_group.tables[0]);
> + } else {
> + struct iommu_table *tbl = &pe->table_group.tables[0];
> + int64_t rc;
>  
> + rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
> + IOMMU_PAGE_SHIFT_4K,
> + pe->phb->ioda.m32_pci_base,
> + POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
> + if (rc) {
> + pe_err(pe, "Failed to create 32-bit TCE table, err %ld",
> + rc);
> + return;
> + }
> +
> + iommu_init_table(tbl, pe->phb->hose->node);
> +
> + rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
> + if (rc) {
> + pe_err(pe, "Failed to configure 32-bit TCE table, err 
> %ld\n",
> + rc);
> + pnv_pci_free_table(tbl);
> + return;
> + }
> + }
>   pnv_pci_ioda2_set_bypass(pe, !enable);
>  }
>  
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index d665ddc..3bc0645 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -426,18 +426,11 @@ static int tce_iommu_clear(struct tce_container 
> *container,
>  static void tce_iommu_release(void *iommu_data)
>  {
>   struct tce_container *container = iommu_data;
> - struct iommu_table *tbl;
> - struct iommu_table_group *table_group;
>  
>   WARN_ON(container->grp);
>  
> - if (container->grp) {
> - table_group = iommu_group_get_iommudata(container->grp);
> - tbl = &table_group->tables[0];
> - tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
> -
> + if (container->grp)
>   tce_iommu_detach_group(iommu_data, container->grp);
> - }
>  
>   tce_mem_unregister_all(container);
>   tce_iommu_disable(container);
> @@ -826,14 +819,24 @@ static int tce_iommu_attach_group(void *iommu_data,
>  
>   if (!table_group->ops || !table_group->ops->set_ownership) {
>   ret = iommu_take_ownership(table_group);
> + } else if (!table_group->ops->create_table ||
> + !table_group->ops->set_window) {
> + WARN_ON_ONCE(1);
> + ret = -EFAULT;
>   } else {
>   /*
>* Disable iommu bypass, otherwise the user can DMA to all of
>* our physical memory via the bypass window instead of just
>* the pages that has been explicitly mapped into the iommu
>*/
> + struct iommu_table tbltmp = { 0 }, *tbl = &tbltmp;
> +
>   table_group->ops->set_ownership(table_group, true);
> - ret = 0;
> + ret = table_group->ops->create_table(table_group, 0,
> + IOMMU_PAGE_SHIFT_4K,
> + table_group->

Re: [PATCH] kvm: move advertising of KVM_CAP_IRQFD to common code

2015-03-10 Thread Marcelo Tosatti
On Thu, Mar 05, 2015 at 11:54:46AM +0100, Paolo Bonzini wrote:
> POWER supports irqfds but forgot to advertise them.  Some userspace does
> not check for the capability, but others check it---thus they work on
> x86 and s390 but not POWER.
> 
> To avoid that other architectures in the future make the same mistake, let
> common code handle KVM_CAP_IRQFD the same way as KVM_CAP_IRQFD_RESAMPLE.
> 
> Reported-by: Greg Kurz 
> Cc: sta...@vger.kernel.org
> Fixes: 297e21053a52f060944e9f0de4c64fad9bcd72fc
> Signed-off-by: Paolo Bonzini 
> ---
>   Marcelo, please apply this for 4.0.

Applied, thanks.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 27/29] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership

2015-03-10 Thread Alexey Kardashevskiy

On 03/11/2015 11:09 AM, Alex Williamson wrote:

On Tue, 2015-03-10 at 01:07 +1100, Alexey Kardashevskiy wrote:

Before the IOMMU user would take control over the IOMMU table belonging to
a specific IOMMU group. This approach did not allow sharing tables between
IOMMU groups attached to the same container.

This introduces a new IOMMU ownership flavour when the user can not
just control the existing IOMMU table but remove/create tables on demand.
If an IOMMU supports a set_ownership() callback, that lets the user have
full control over the IOMMU group. When the ownership is taken,
the platform code removes all the windows so the caller must create them.
Before returning the ownership back to the platform code, the user
has to unprogram and remove all the tables it created.


We have no ability to enforce that requirement on the user.  VFIO needs
to do the cleanup if the user fails to.


Ah. Wrong commit log, VFIO always does cleanup (as I end my guests by c-a-x 
most of the time :) ), will fix it.






Old-style ownership is still supported allowing VFIO to run on older
P5IOC2 and IODA IO controllers.

Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/platforms/powernv/pci-ioda.c | 30 +++---
  drivers/vfio/vfio_iommu_spapr_tce.c   | 51 ---
  2 files changed, 66 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 07857c4..afb6906 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1620,11 +1620,33 @@ static void pnv_ioda2_set_ownership(struct 
iommu_table_group *table_group,
  {
struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
table_group);
-   if (enable)
-   iommu_take_ownership(table_group);
-   else
-   iommu_release_ownership(table_group);
+   if (enable) {
+   pnv_pci_ioda2_unset_window(&pe->table_group, 0);
+   pnv_pci_free_table(&pe->table_group.tables[0]);
+   } else {
+   struct iommu_table *tbl = &pe->table_group.tables[0];
+   int64_t rc;

+   rc = pnv_pci_ioda2_create_table(&pe->table_group, 0,
+   IOMMU_PAGE_SHIFT_4K,
+   pe->phb->ioda.m32_pci_base,
+   POWERNV_IOMMU_DEFAULT_LEVELS, tbl);
+   if (rc) {
+   pe_err(pe, "Failed to create 32-bit TCE table, err %ld",
+   rc);
+   return;
+   }
+
+   iommu_init_table(tbl, pe->phb->hose->node);
+
+   rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
+   if (rc) {
+   pe_err(pe, "Failed to configure 32-bit TCE table, err 
%ld\n",
+   rc);
+   pnv_pci_free_table(tbl);
+   return;
+   }
+   }
pnv_pci_ioda2_set_bypass(pe, !enable);
  }

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index d665ddc..3bc0645 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -426,18 +426,11 @@ static int tce_iommu_clear(struct tce_container 
*container,
  static void tce_iommu_release(void *iommu_data)
  {
struct tce_container *container = iommu_data;
-   struct iommu_table *tbl;
-   struct iommu_table_group *table_group;

WARN_ON(container->grp);

-   if (container->grp) {
-   table_group = iommu_group_get_iommudata(container->grp);
-   tbl = &table_group->tables[0];
-   tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
-
+   if (container->grp)
tce_iommu_detach_group(iommu_data, container->grp);
-   }

tce_mem_unregister_all(container);
tce_iommu_disable(container);
@@ -826,14 +819,24 @@ static int tce_iommu_attach_group(void *iommu_data,

if (!table_group->ops || !table_group->ops->set_ownership) {
ret = iommu_take_ownership(table_group);
+   } else if (!table_group->ops->create_table ||
+   !table_group->ops->set_window) {
+   WARN_ON_ONCE(1);
+   ret = -EFAULT;
} else {
/*
 * Disable iommu bypass, otherwise the user can DMA to all of
 * our physical memory via the bypass window instead of just
 * the pages that has been explicitly mapped into the iommu
 */
+   struct iommu_table tbltmp = { 0 }, *tbl = &tbltmp;
+
table_group->ops->set_ownership(table_group, true);
-   ret = 0;
+   ret = table_group->ops->create_table(table_group, 0,
+

CfP 10th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '15)

2015-03-10 Thread VHPC 15
=
CALL FOR PAPERS

10th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '15)

held in conjunction with Euro-Par 2015, August 24-28, Vienna, Austria

(Springer LNCS)
=

Date: August 25, 2015
Workshop URL: http://vhpc.org

Paper Submission Deadline: May 22, 2015


CALL FOR PAPERS

Virtualization technologies constitute a key enabling factor for
flexible resource
management in modern data centers, cloud environments, and increasingly in
HPC as well. Providers need to dynamically manage complex infrastructures in a
seamless fashion for varying workloads and hosted applications, independently of
the customers deploying software or users submitting highly dynamic and
heterogeneous workloads. Thanks to virtualization, we have the ability to manage
vast computing and networking resources dynamically and close to the marginal
cost of providing the services, which is unprecedented in the history
of scientific
and commercial computing.

Various virtualization technologies contribute to the overall picture
in different
ways: machine virtualization, with its capability to enable
consolidation of multiple
under-utilized servers with heterogeneous software and operating systems (OSes),
and its capability to live-migrate a fully operating virtual machine
(VM) with a very
short downtime, enables novel and dynamic ways to manage physical servers;
OS-level virtualization, with its capability to isolate multiple user-space
environments and to allow for their co-existence within the same OS kernel,
promises to provide many of the advantages of machine virtualization with high
levels of responsiveness and performance; I/O Virtualization allows physical
network adapters to take traffic from multiple VMs; network
virtualization, with its
capability to create logical network overlays that are independent of the
underlying physical topology and IP addressing, provides the fundamental
ground on top of which evolved network services can be realized with an
unprecedented level of dynamicity and flexibility; These technologies
have to be inter-mixed and integrated in an intelligent way, to support
workloads that are increasingly demanding in terms of absolute performance,
responsiveness and interactivity, and have to respect well-specified Service-
Level Agreements (SLAs), as needed for industrial-grade provided services.
Indeed, among emerging and increasingly interesting application domains
for virtualization, we can find big-data application workloads in cloud
infrastructures, interactive and real-time multimedia services in the cloud,
including real-time big-data streaming platforms such as used in real-time
analytics supporting nowadays a plethora of application domains. Distributed
cloud infrastructures promise to offer unprecedented responsiveness levels for
hosted applications, but that is only possible if the underlying virtualization
technologies can overcome most of the latency impairments typical of current
virtualized infrastructures (e.g., far worse tail-latency).

The Workshop on Virtualization in High-Performance Cloud Computing (VHPC)
aims to bring together researchers and industrial practitioners facing
the challenges
posed by virtualization in order to foster discussion, collaboration,
mutual exchange
of knowledge and experience, enabling research to ultimately provide novel
solutions for virtualized computing systems of tomorrow.

The workshop will be one day in length, composed of 20 min paper presentations,
each followed by 10 min discussion sections, and lightning talks, limited to 5
minutes. Presentations may be accompanied by interactive demonstrations.

TOPICS

Topics of interest include, but are not limited to:

- Virtualization in supercomputing environments, HPC clusters, cloud
HPC and grids
- Optimizations of virtual machine monitor platforms, hypervisors and
OS-level virtualization
- Hypervisor and network virtualization QoS and SLAs
- Cloud based network and system management for SDN and NFV
- Management, deployment and monitoring of virtualized environments
- Performance measurement, modelling and monitoring of
virtualized/cloud workloads
- Programming models for virtualized environments
- Cloud reliability, fault-tolerance, high-availability and security
- Heterogeneous virtualized environments, virtualized accelerators,
GPUs and co-processors
- Optimized communication libraries/protocols in the cloud and for HPC
in the cloud
- Topology management and optimization for distributed virtualized applications
- Cluster provisioning in the cloud and cloud bursting
- Adaptation of emerging HPC technologies (high performance networks,
RDMA, etc..)
- I/O and storage virtualization, virtualization aware file systems
- Job scheduling/control/policy in virtualized environments
- Checkpointing and migration of VM-based large compute jobs
- Cloud frameworks and APIs
- E

Re: [PATCH v5 29/29] vfio: powerpc/spapr: Support Dynamic DMA windows

2015-03-10 Thread Alex Williamson
On Tue, 2015-03-10 at 01:07 +1100, Alexey Kardashevskiy wrote:
> This adds create/remove window ioctls to create and remove DMA windows.
> sPAPR defines a Dynamic DMA windows capability which allows
> para-virtualized guests to create additional DMA windows on a PCI bus.
> The existing linux kernels use this new window to map the entire guest
> memory and switch to the direct DMA operations saving time on map/unmap
> requests which would normally happen in a big amounts.
> 
> This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
> VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
> Up to 2 windows are supported now by the hardware and by this driver.
> 
> This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
> information such as a number of supported windows and maximum number
> levels of TCE tables.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v4:
> * moved code to tce_iommu_create_window()/tce_iommu_remove_window()
> helpers
> * added docs
> ---
>  Documentation/vfio.txt  |  19 +
>  arch/powerpc/include/asm/iommu.h|   2 +-
>  drivers/vfio/vfio_iommu_spapr_tce.c | 165 
> +++-
>  include/uapi/linux/vfio.h   |  24 +-
>  4 files changed, 207 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index 791e85c..61ce393 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -446,6 +446,25 @@ the memory block.
>  The user space is not expected to call these often and the block descriptors
>  are stored in a linked list in the kernel.
>  
> +6) sPAPR specification allows guests to have an ddditional DMA window(s) on
> +a PCI bus with a variable page size. Two ioctls have been added to support
> +this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
> +The platform has to support the functionality or error will be returned to
> +the userspace. The existing hardware supports up to 2 DMA windows, one is
> +2GB long, uses 4K pages and called "default 32bit window"; the other can
> +be as big as entire RAM, use different page size, it is optional - guests
> +create those in run-time if the guest driver supports 64bit DMA.
> +
> +VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
> +a number of TCE table levels (if a TCE table is going to be big enough and
> +the kernel may not be able to allocate enough of physicall contiguous 
> memory).
> +It creates a new window in the available slot and returns the bus address 
> where
> +the new window starts. Due to hardware limitation, the user space cannot 
> choose
> +the location of DMA windows.
> +
> +VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
> +and removes it.
> +
>  
> ---
>  
>  [1] VFIO was originally an acronym for "Virtual Function I/O" in its
> diff --git a/arch/powerpc/include/asm/iommu.h 
> b/arch/powerpc/include/asm/iommu.h
> index 04f72ac..de82b61 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -138,7 +138,7 @@ extern void iommu_free_table(struct iommu_table *tbl, 
> const char *node_name);
>  extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
>   int nid);
>  
> -#define IOMMU_TABLE_GROUP_MAX_TABLES 1
> +#define IOMMU_TABLE_GROUP_MAX_TABLES 2
>  
>  struct iommu_table_group;
>  
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 3a0b5fe..7aa4141b 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -96,6 +96,7 @@ struct tce_container {
>   struct list_head mem_list;
>   struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
>   struct list_head group_list;
> + bool v2;
>  };
>  
>  struct tce_iommu_group {
> @@ -333,6 +334,20 @@ static struct iommu_table *spapr_tce_find_table(
>   return ret;
>  }
>  
> +static int spapr_tce_find_free_table(struct tce_container *container)
> +{
> + int i;
> +
> + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
> + struct iommu_table *tbl = &container->tables[i];
> +
> + if (!tbl->it_size)
> + return i;
> + }
> +
> + return -1;


Why not use a real errno here?

> +}
> +
>  static int tce_iommu_enable(struct tce_container *container)
>  {
>   int ret = 0;
> @@ -432,6 +447,8 @@ static void *tce_iommu_open(unsigned long arg)
>   INIT_LIST_HEAD_RCU(&container->mem_list);
>   INIT_LIST_HEAD_RCU(&container->group_list);
>  
> + container->v2 = arg == VFIO_SPAPR_TCE_v2_IOMMU;
> +

Ah, here v2 actually provides some enforced differentiation, right?  ...
oh wait, nobody ever uses this.
>   return container;
>  }
>  
> @@ -605,11 +622,90 @@ static long tce_iommu_build(struct tce_container 
> *container,
>   return ret;

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Bandan Das
"Dr. David Alan Gilbert"  writes:

> * Paolo Bonzini (pbonz...@redhat.com) wrote:
>> 
>> 
>> On 10/03/2015 19:21, Bandan Das wrote:
>> > Paolo Bonzini  writes:
>> > 
>> >> On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
>> >>> I'm seeing something similar; it's very intermittent and generally
>> >>> happening right at boot of the guest;   I'm running this on qemu
>> >>> head+my postcopy world (but it's happening right at boot before postcopy
>> >>> gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
>> >>> but hey maybe I'm seeing a different bug.
>> > 
>> > Probably a tangent but is the qemu trace identical to what Andrey is 
>> > seeing ?
>> > From a cursory look and my limited understanding, it seems his failure is 
>> > #GP
>> > when executing video bios.
>> > 
>> >> Same here on 3.16 + Xeon E5 v3 kernel.
>> > 
>> > I will try to reproduce this on a v2.
>> 
>> I see several failures, usually mine have suberror 1.  With a 32-VCPU
>> guest I can reproduce it roughly half of the time.
>> 
>> Paolo
>
> while true; do (sleep 5; echo -e 
> '\001cq\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -machine 
> pc-i440fx-2.0,accel=kvm -m 1024 -smp 128 -nographic -device sga 2>&1 | tee 
> /tmp/qemu.op; grep "internal error" /tmp/qemu.op -q && break; done
>
> (and leave about 2mins of runs before declaring good)
>
> bad: cd2946607b42636d6c8cf6dbf94bce0273507b17
> bad: 041ccc922ee474693a2869d4e3b59e920c739bc0
> bad: 2559db069628981bfdc90637fac5bf1b4f4e8ef5
> bad: 21f5826a04d38e19488f917e1eef22751490c769
> good:e95d24ff40c77fbfd71396834a2eb99375f8bcc4
> good: 7781a492fa5a2eff53d06b25b93f0186ad3226c9
> good: c3edd62851098e6417786193ed9e9341781fcf57
> good: c5c6d7f81a6950d8e32a3b5a0bafd37bfa5a8e88
> good: 73104fd399c6778112f64fe0d439319f24508d9a
> good: 92013cf8ca10adafec9a92deb5df993e7df22cb9
> good: 4478aa768ccefcc5b234c23d035435fd71b932f6
> good: 2.2.0
>
> [root@virtlab413 qemu-world3]# git bisect bad
> 21f5826a04d38e19488f917e1eef22751490c769 is the first bad commit

I can reproduce this on E5-2620 v2 with  David's "while true" test.
(The emulation failure I mean, not the suberror 2 that Andrey is seeing)
The commit that seems to have introduced this is -

commit 0673b7870063a3affbad9046fb6d385a4e734c19
Author: Kevin O'Connor 
Date:   Sat May 24 10:49:50 2014 -0400

smp: Replace QEMU SMP init assembler code with C; run only in 32bit mode.

Change the multi-processor init code to trampoline into 32bit mode on
each of the additional processors.  Implement an atomic lock so that
each processor performs its initialization serially.

I am not sure what in that change could cause this though..
Also, in my testing, "unrestricted_guest=0" avoids the failure.

> commit 21f5826a04d38e19488f917e1eef22751490c769
> Author: Gerd Hoffmann 
> Date:   Thu Feb 19 09:33:03 2015 +0100
>
> seabios: update to 1.8.0 release
> 
> 'git shortlog 8936dbb2..4c59f5d8' for seabios repo:
> 
> David Woodhouse (4):
>   Update EFI_COMPATIBILITY16_TABLE to match 0.98 spec update
>   build: use -m16 where available instead of asm(".code16gcc")
>   romlayout: Use .code16 not .code16gcc
>   vgabios: Use .code16 not .code16gcc
> 
> Gerd Hoffmann (2):
>   add scripts/tarball.sh
>   build: set LC_ALL=C
> 
> Hannes Reinecke (1):
>   megasas: read addional PCI I/O bar
> 
> Ian Campbell (1):
>   romlayout: Use "rep ; nop" not "rep nop".
> 
> Kevin O'Connor (139):
>   vgabios: Return from handle_1011() if handler found.
>   edd: Move EDD get drive parameters (int 1348) logic from disk.c to 
> block.c.
>   edd: Use sectors==-1 to detect removable media.
>   edd: Separate out ATA and virtio specific parts of fill_edd().
>   cdemu: store internal cdemu fields in standard "el-torito" spec 
> format.
>   Move cdemu call interface and disk_ret helper code to disk.c.
>   smm: Replace SMI assembler code with C code.
>   smm: Use a C struct to define the layout of the SMM area.
>   smp: Replace QEMU SMP init assembler code with C; run only in 32bit 
> mode.
>   Don't enable thread preemption during S3 resume vga option rom 
> execution.
>   Remove old Bochs bios fixed address string at 0xfff00.
>   Move most of the VAR16FIXED() defs to misc.c.
>   build: Avoid absolute paths during "whole-program" compiling.
>   Make sure handle_smi() and handle_smp() are compiled out if not 
> enabled.
>   Remove the TODO file.
>   Abstract reset call (and possible 16bit mode switch) into reset() 
> function.
>   build: Remove unused function getSectionsStart() from layoutrom.py.
>   build: Extract section visiting logic in layoutrom.py.
>   build: Refactor layoutrom.py gc() function.
>   build: Use customized entry point for each type of build.
>   build: Refact

[PATCH 2/2] drivers/vfio: Support EEH error injection

2015-03-10 Thread Gavin Shan
The patch adds one more EEH sub-command (VFIO_EEH_PE_INJECT_ERR)
to inject the specified EEH error, which is represented by
(struct vfio_eeh_pe_err), to the indicated PE for testing purpose.

Signed-off-by: Gavin Shan 
---
 Documentation/vfio.txt| 47 ++-
 drivers/vfio/vfio_spapr_eeh.c | 14 +
 include/uapi/linux/vfio.h | 34 ++-
 3 files changed, 80 insertions(+), 15 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..2e7f736 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -328,7 +328,13 @@ So 4 additional ioctls have been added:
 
 The code flow from the example above should be slightly changed:
 
-   struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };
+   struct vfio_eeh_pe_op *pe_op;
+   struct vfio_eeh_pe_err *pe_err;
+
+   pe_op = malloc(sizeof(*pe_op) + sizeof(*pe_err));
+   pe_err = (void *)pe_op + sizeof(*pe_op);
+   pe_op->argsz = sizeof(*pe_op) + sizeof(*pe_err);
+   pe_op->flags = 0;
 
.
/* Add the group to the container */
@@ -367,8 +373,8 @@ The code flow from the example above should be slightly 
changed:
ioctl(container, VFIO_CHECK_EXTENSION, VFIO_EEH);
 
/* Enable the EEH functionality on the device */
-   pe_op.op = VFIO_EEH_PE_ENABLE;
-   ioctl(container, VFIO_EEH_PE_OP, &pe_op);
+   pe_op->op = VFIO_EEH_PE_ENABLE;
+   ioctl(container, VFIO_EEH_PE_OP, pe_op);
 
/* You're suggested to create additional data struct to represent
 * PE, and put child devices belonging to same IOMMU group to the
@@ -376,8 +382,8 @@ The code flow from the example above should be slightly 
changed:
 */
 
/* Check the PE's state and make sure it's in functional state */
-   pe_op.op = VFIO_EEH_PE_GET_STATE;
-   ioctl(container, VFIO_EEH_PE_OP, &pe_op);
+   pe_op->op = VFIO_EEH_PE_GET_STATE;
+   ioctl(container, VFIO_EEH_PE_OP, pe_op);
 
/* Save device state using pci_save_state().
 * EEH should be enabled on the specified device.
@@ -385,11 +391,24 @@ The code flow from the example above should be slightly 
changed:
 

 
+   /* Inject EEH error, which is expected to be caused by 32-bits
+* config load.
+*/
+   pe_err->type = VFIO_EEH_ERR_TYPE_32;
+   pe_err->func = VFIO_EEH_ERR_FUNC_LD_CFG_ADDR;
+   pe_err->addr = 0ul;
+   pe_err->mask = 0ul;
+   pe_op->op = VFIO_EEH_PE_INJECT_ERR;
+   ioctl(container, VFIO_EEH_PE_OP, pe_op);
+
+   
+
/* When 0xFF's returned from reading PCI config space or IO BARs
 * of the PCI device. Check the PE's state to see if that has been
 * frozen.
 */
-   ioctl(container, VFIO_EEH_PE_OP, &pe_op);
+   pe_op->op = VFIO_EEH_PE_GET_STATE;
+   ioctl(container, VFIO_EEH_PE_OP, pe_op);
 
/* Waiting for pending PCI transactions to be completed and don't
 * produce any more PCI traffic from/to the affected PE until
@@ -400,22 +419,22 @@ The code flow from the example above should be slightly 
changed:
 * standard part of PCI config space, AER registers are dumped
 * as logs for further analysis.
 */
-   pe_op.op = VFIO_EEH_PE_UNFREEZE_IO;
-   ioctl(container, VFIO_EEH_PE_OP, &pe_op);
+   pe_op->op = VFIO_EEH_PE_UNFREEZE_IO;
+   ioctl(container, VFIO_EEH_PE_OP, pe_op);
 
/*
 * Issue PE reset: hot or fundamental reset. Usually, hot reset
 * is enough. However, the firmware of some PCI adapters would
 * require fundamental reset.
 */
-   pe_op.op = VFIO_EEH_PE_RESET_HOT;
-   ioctl(container, VFIO_EEH_PE_OP, &pe_op);
-   pe_op.op = VFIO_EEH_PE_RESET_DEACTIVATE;
-   ioctl(container, VFIO_EEH_PE_OP, &pe_op);
+   pe_op->op = VFIO_EEH_PE_RESET_HOT;
+   ioctl(container, VFIO_EEH_PE_OP, pe_op);
+   pe_op->op = VFIO_EEH_PE_RESET_DEACTIVATE;
+   ioctl(container, VFIO_EEH_PE_OP, pe_op);
 
/* Configure the PCI bridges for the affected PE */
-   pe_op.op = VFIO_EEH_PE_CONFIGURE;
-   ioctl(container, VFIO_EEH_PE_OP, &pe_op);
+   pe_op->op = VFIO_EEH_PE_CONFIGURE;
+   ioctl(container, VFIO_EEH_PE_OP, pe_op);
 
/* Restored state we saved at initialization time. pci_restore_state()
 * is good enough as an example.
diff --git a/drivers/vfio/vfio_spapr_eeh.c b/drivers/vfio/vfio_spapr_eeh.c
index 5fa42db..4f1ebc1 100644
--- a/drivers/vfio/vfio_spapr_eeh.c
+++ b/drivers/vfio/vfio_spapr_eeh.c
@@ -85,6 +85,20 @@ long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
case VFIO_EEH_PE_CONFIGURE:
ret = eeh_pe_configure(pe);
break;
+   case VFIO_EEH_PE_INJECT_ERR: {
+   struct vfio_eeh_pe_err err;
+
+   if (op.argsz < minsz + size

[PATCH 1/2] powerpc/eeh: Introduce eeh_pe_inject_err()

2015-03-10 Thread Gavin Shan
The patch defines PCI error types and functions in eeh.h and
exports function eeh_pe_inject_err(), which will be called by
VFIO driver to inject the specified PCI error to the indicated
PE for testing purpose.

Signed-off-by: Gavin Shan 
---
 arch/powerpc/include/asm/eeh.h | 24 
 arch/powerpc/kernel/eeh.c  | 63 ++
 2 files changed, 87 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 9de87ce..eb20c62 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -196,6 +196,28 @@ enum {
 #define EEH_RESET_COMPLETE 4   /* PHB complete reset   */
 #define EEH_LOG_TEMP   1   /* EEH temporary error log  */
 #define EEH_LOG_PERM   2   /* EEH permanent error log  */
+#define EEH_ERR_TYPE_320   /* 32-bits PCI error*/
+#define EEH_ERR_TYPE_641   /* 64-bits PCI error*/
+#define EEH_ERR_FUNC_LD_MEM_ADDR   0   /* Memory load  */
+#define EEH_ERR_FUNC_LD_MEM_DATA   1
+#define EEH_ERR_FUNC_LD_IO_ADDR2   /* IO load  */
+#define EEH_ERR_FUNC_LD_IO_DATA3
+#define EEH_ERR_FUNC_LD_CFG_ADDR   4   /* Config load  */
+#define EEH_ERR_FUNC_LD_CFG_DATA   5
+#define EEH_ERR_FUNC_ST_MEM_ADDR   6   /* Memory store */
+#define EEH_ERR_FUNC_ST_MEM_DATA   7
+#define EEH_ERR_FUNC_ST_IO_ADDR8   /* IO store */
+#define EEH_ERR_FUNC_ST_IO_DATA9
+#define EEH_ERR_FUNC_ST_CFG_ADDR   10  /* Config store */
+#define EEH_ERR_FUNC_ST_CFG_DATA   11
+#define EEH_ERR_FUNC_DMA_RD_ADDR   12  /* DMA read */
+#define EEH_ERR_FUNC_DMA_RD_DATA   13
+#define EEH_ERR_FUNC_DMA_RD_MASTER 14
+#define EEH_ERR_FUNC_DMA_RD_TARGET 15
+#define EEH_ERR_FUNC_DMA_WR_ADDR   16  /* DMA write*/
+#define EEH_ERR_FUNC_DMA_WR_DATA   17
+#define EEH_ERR_FUNC_DMA_WR_MASTER 18
+#define EEH_ERR_FUNC_DMA_WR_TARGET 19
 
 struct eeh_ops {
char *name;
@@ -296,6 +318,8 @@ int eeh_pe_set_option(struct eeh_pe *pe, int option);
 int eeh_pe_get_state(struct eeh_pe *pe);
 int eeh_pe_reset(struct eeh_pe *pe, int option);
 int eeh_pe_configure(struct eeh_pe *pe);
+int eeh_pe_inject_err(struct eeh_pe *pe, int type, int func,
+ unsigned long addr, unsigned long mask);
 
 /**
  * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 60a0f15ce..dbab1a4 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1657,6 +1657,69 @@ int eeh_pe_configure(struct eeh_pe *pe)
 }
 EXPORT_SYMBOL_GPL(eeh_pe_configure);
 
+/**
+ * eeh_pe_inject_err - Injecting the specified PCI error to the indicated PE
+ * @pe: the indicated PE
+ * @type: error type
+ * @function: error function
+ * @addr: address
+ * @mask: address mask
+ *
+ * The routine is called to inject the specified PCI error, which
+ * is determined by @type and @function, to the indicated PE for
+ * testing purpose.
+ */
+int eeh_pe_inject_err(struct eeh_pe *pe, int type, int func,
+ unsigned long addr, unsigned long mask)
+{
+   /* Invalid PE ? */
+   if (!pe)
+   return -ENODEV;
+
+   /* Unsupported operation ? */
+   if (!eeh_ops || !eeh_ops->err_inject)
+   return -ENOENT;
+
+   /* Check on PCI error type */
+   switch (type) {
+   case EEH_ERR_TYPE_32:
+   case EEH_ERR_TYPE_64:
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   /* Check on PCI error function */
+   switch (func) {
+   case EEH_ERR_FUNC_LD_MEM_ADDR:
+   case EEH_ERR_FUNC_LD_MEM_DATA:
+   case EEH_ERR_FUNC_LD_IO_ADDR:
+   case EEH_ERR_FUNC_LD_IO_DATA:
+   case EEH_ERR_FUNC_LD_CFG_ADDR:
+   case EEH_ERR_FUNC_LD_CFG_DATA:
+   case EEH_ERR_FUNC_ST_MEM_ADDR:
+   case EEH_ERR_FUNC_ST_MEM_DATA:
+   case EEH_ERR_FUNC_ST_IO_ADDR:
+   case EEH_ERR_FUNC_ST_IO_DATA:
+   case EEH_ERR_FUNC_ST_CFG_ADDR:
+   case EEH_ERR_FUNC_ST_CFG_DATA:
+   case EEH_ERR_FUNC_DMA_RD_ADDR:
+   case EEH_ERR_FUNC_DMA_RD_DATA:
+   case EEH_ERR_FUNC_DMA_RD_MASTER:
+   case EEH_ERR_FUNC_DMA_RD_TARGET:
+   case EEH_ERR_FUNC_DMA_WR_ADDR:
+   case EEH_ERR_FUNC_DMA_WR_DATA:
+   case EEH_ERR_FUNC_DMA_WR_MASTER:
+   case EEH_ERR_FUNC_DMA_WR_TARGET:
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return eeh_ops->err_inject(pe, type, func, addr, mask);
+}
+EXPORT_SYMBOL_GPL(eeh_pe_inject_err);
+
 static int proc_eeh_show(struct seq_file *m, void *v)
 {
if (!eeh_enabled()) {
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo