[RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Pavel Fedin
 Hello!

 Since c2f58514cfb374d5368c9da945f1765cd48eb0da ("arm/arm64: KVM: vgic: Check 
for
!irqchip_in_kernel() when mapping resources") we can run qemu with 
kernel_irqchip=off option. The
only remaining problem is how to handle CP15 timer in this case.
 I have applied my old experimental patches to kernel 4.2.2. Using them, i can 
run 'virt' machine
with software-emulated GICv2 on GICv3 hardware without v2 backwards 
compatibility. Also, i studied
4.3 code, and it seems to be feasible to add this support to 4.3 and on.
 The only main question now is how would we do it. We can actually use two 
possible approaches:

 1. If there's no in-kernel irqchip, we revert to older timer behavior 
(ARCH_TIMER_CTRL_IT_MASK
hack), and forward the timer IRQ to userspace using new KVM_EXIT_IRQ return 
code. Timer IRQ has to
be treated as edge-triggered in this case. This is the approach which my 
current implementation
uses.

 2. Another possible approach, based on how device tree binding is handled by 
Linux. It is possible
to remove virtual timer IRQ from the device tree, in this case the kernel 
reverts to using physical
timer. When running under hypervisor, accesses to physical CP15 timer are 
trapped into HYP,
therefore we can forward them to userspace using new exit code, something like 
KVM_EXIT_REG_ACCESS.
In this case the timer would be also emulated by the userspace, which is 
slower, but allows better
emulation. Also, this could be used in order to run some other guests which 
just expect physical
timer to be there.

 Both approaches have their own limitations, but anyway this is much better 
than nothing. What do
you think, and which approach do you like more?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2 09/15] KVM: arm64: implement basic ITS register handlers

2015-10-02 Thread Pavel Fedin
 Hello!

 Long time has passed, but i started working on live migration of this thing, 
and found some more
problems.

> @@ -117,9 +305,26 @@ int vits_init(struct kvm *kvm)
>   struct vgic_dist *dist = &kvm->arch.vgic;
>   struct vgic_its *its = &dist->its;
> 
> + dist->pendbaser = kmalloc(sizeof(u64) * dist->nr_cpus, GFP_KERNEL);
> + if (!dist->pendbaser)
> + return -ENOMEM;
> +
>   spin_lock_init(&its->lock);
> 
>   its->enabled = false;
> 
>   return -ENXIO;
>  }
> +

 vits_init() allocates table for per-CPU pendbaser values. However, it is 
called from within
vgicv3_map_resources(), which is in turn called upon first vCPU run. This is 
too late, because in
case of live migration we would first want to set up all registers from within 
the userspace. But,
when i start doing this, i crash in handle_mmio_pendbaser_redist(), because of 
dist->pendbaser being
NULL.
 The solution is to split the function up. I moved vgic_register_kvm_io_dev() 
(introduced by later
patch) to vits_map_resources(), which is now called where vits_init() 
originally was. My new
vits_init() (which is made reentrant by checking for dist->pendbaser != NULL) 
is now called from
within two places:
 a) vits_map_resources()
 b) handle_mmio_pendbaser_redist()

 Therefore, all allocations happen either on first vCPU run, or on first 
PENDBASER access, whatever
comes first. An alternative is to do allocations during 
KVM_DEV_ARM_VGIC_CTRL_INIT.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCHv2] arm: Fail on unknown subtest

2015-10-02 Thread Andrew Jones
On Thu, Oct 01, 2015 at 03:46:25PM -0400, Christopher Covington wrote:
> Signed-off-by: Christopher Covington 
> ---
>  arm/selftest.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arm/selftest.c b/arm/selftest.c
> index fc9ec60..f4a5030 100644
> --- a/arm/selftest.c
> +++ b/arm/selftest.c
> @@ -376,6 +376,9 @@ int main(int argc, char **argv)
>   cpumask_set_cpu(0, &smp_reported);
>   while (!cpumask_full(&smp_reported))
>   cpu_relax();
> + } else {
> + printf("Unknown subtest\n");
> + abort();
>   }
>  
>   return report_summary();
> -- 
> Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
> 
> 

Reviewed-by: Andrew Jones 

and applied to https://github.com/rhdrjones/kvm-unit-tests/tree/staging

Thanks,
drew
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Paolo Bonzini


On 02/10/2015 09:28, Pavel Fedin wrote:
>  2. Another possible approach, based on how device tree binding is handled by 
> Linux. It is possible
> to remove virtual timer IRQ from the device tree, in this case the kernel 
> reverts to using physical
> timer. When running under hypervisor, accesses to physical CP15 timer are 
> trapped into HYP,
> therefore we can forward them to userspace using new exit code, something 
> like KVM_EXIT_REG_ACCESS.
> In this case the timer would be also emulated by the userspace, which is 
> slower, but allows better
> emulation. Also, this could be used in order to run some other guests which 
> just expect physical
> timer to be there.
> 
>  Both approaches have their own limitations, but anyway this is much better 
> than nothing. What do
> you think, and which approach do you like more?

I like the latter.  But I guess one could even do both?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Pavel Fedin
 Hello!

> I like the latter.  But I guess one could even do both?

 You know, you are right, they absolutely do not contradict each other.
 Patch set for #1 is quite not big, if you are interested, and nobody votes 
strongly against, i can
rebase it to 4.3 and post.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2 01/15] KVM: arm/arm64: VGIC: don't track used LRs in the distributor

2015-10-02 Thread Pavel Fedin
 Hello! One more concern.

> Currently we track which IRQ has been mapped to which VGIC list
> register and also have to synchronize both. We used to do this
> to hold some extra state (for instance the active bit).
> It turns out that this extra state in the LRs is no longer needed and
> this extra tracking causes some pain later.

 Returning to the beginning, can you explain, what kind of pain exactly does it 
bring?
 For some reasons here i had to keep all this tracking mechanism, and modified 
your patchset. I see
no significant problems, except memory usage. I have to allocate 
vgic_irq_lr_map large enough to
hold indexes up to 16384, and indexes from dist->nr_irqs to 8192 appear to be 
not used.
 Since the map itself is actually used only for piggy-back optimization, it is 
possible to easily
get rid of it using for_each_set_bit(lr, vgic_cpu->lr_used, vgic->nr_lr) 
iteration instead. The rest
of mechanism will work as it is, there's no need to remove the state tracking 
bitmap and
optimization itself.
 I am currently testing this approach and preparing my alternative patch for 
upstreaming.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Peter Maydell
On 2 October 2015 at 10:30, Paolo Bonzini  wrote:
>
>
> On 02/10/2015 09:28, Pavel Fedin wrote:
>>  2. Another possible approach, based on how device tree binding is handled 
>> by Linux. It is possible
>> to remove virtual timer IRQ from the device tree, in this case the kernel 
>> reverts to using physical
>> timer. When running under hypervisor, accesses to physical CP15 timer are 
>> trapped into HYP,
>> therefore we can forward them to userspace using new exit code, something 
>> like KVM_EXIT_REG_ACCESS.
>> In this case the timer would be also emulated by the userspace, which is 
>> slower, but allows better
>> emulation. Also, this could be used in order to run some other guests which 
>> just expect physical
>> timer to be there.
>>
>>  Both approaches have their own limitations, but anyway this is much better 
>> than nothing. What do
>> you think, and which approach do you like more?
>
> I like the latter.  But I guess one could even do both?

I definitely dislike the latter -- userspace ends up having to
emulate part of the CPU even though that CPU support is really
there in hardware. Also it requires us to edit the device tree,
which means it won't work at all on boards other than 'virt'
where we use the kernel's device tree rather than creating our
own. Better for the kernel to forward the timer
interrupts back out to userspace's irq controller.

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH] arm: Add PMU test

2015-10-02 Thread Andrew Jones
On Thu, Oct 01, 2015 at 03:47:21PM -0400, Christopher Covington wrote:
> Beginning with just a read of the control register, add plumbing
> for testing the ARM Performance Monitors Unit (PMU).
> 
> Signed-off-by: Christopher Covington 

Hi Christopher,

PMU tests are a great idea. I think Wei Huang has started working
on them too. I've CC'ed him so you two can work out how to proceed
without duplication of effort.

For kvm-unit-tests patches please make sure the kvm-unit-tests flag
is in the PATCH tag. You may want to run

  git config format.subjectprefix "kvm-unit-tests PATCH"

in your git repo.


> ---
>  arm/pmu.c| 31 +++
>  arm/unittests.cfg|  5 +
>  config/config-arm-common.mak |  4 +++-
>  3 files changed, 39 insertions(+), 1 deletion(-)
>  create mode 100644 arm/pmu.c
> 
> diff --git a/arm/pmu.c b/arm/pmu.c
> new file mode 100644
> index 000..b1e3c7a
> --- /dev/null
> +++ b/arm/pmu.c
> @@ -0,0 +1,31 @@
> +#include "libcflat.h"
> +
> +union pmcr_el0 {
> + struct {
> + unsigned int implementor:8;
> + unsigned int identification_code:8;
> + unsigned int num_counters:5;
> + unsigned int zeros:4;
> + unsigned int cycle_counter_long:1;
> + unsigned int cycle_counter_disable_when_prohibited:1;
> + unsigned int event_counter_export:1;
> + unsigned int cycle_counter_clock_divider:1;
> + unsigned int cycle_counter_reset:1;
> + unsigned int event_counter_reset:1;
> + unsigned int enable:1;
> + } split;

I'd use an anonymous struct here. And, since I prefer to avoid needing
to remember a data structure is a union, I'd also make the union anonymous
and wrap it inside a struct, i.e.

struct pmu_data {
union {
uint32_t pmcr_el0;
struct {
...
};
};
};

> + uint32_t full;
> +};
> +
> +int main()
> +{
> + union pmcr_el0 pmcr;
> +
> + asm volatile("mrs %0, pmcr_el0\n" : "=r" (pmcr));
> +
> + printf("PMU implementor: 0x%x\n", pmcr.split.implementor);
> + printf("Identification code: 0x%x\n", pmcr.split.identification_code);
> + printf("Event counters:  %d\n", pmcr.split.num_counters);
> +
> + return report_summary();
> +}
> diff --git a/arm/unittests.cfg b/arm/unittests.cfg
> index e068a0c..d3deb6a 100644
> --- a/arm/unittests.cfg
> +++ b/arm/unittests.cfg
> @@ -35,3 +35,8 @@ file = selftest.flat
>  smp = `getconf _NPROCESSORS_CONF`
>  extra_params = -append 'smp'
>  groups = selftest
> +
> +# Test PMU support
> +[pmu]
> +file = pmu.flat
> +groups = pmu
> diff --git a/config/config-arm-common.mak b/config/config-arm-common.mak
> index 698555d..b34d04c 100644
> --- a/config/config-arm-common.mak
> +++ b/config/config-arm-common.mak
> @@ -11,7 +11,8 @@ endif
>  
>  tests-common = \
>   $(TEST_DIR)/selftest.flat \
> - $(TEST_DIR)/spinlock-test.flat
> + $(TEST_DIR)/spinlock-test.flat \
> + $(TEST_DIR)/pmu.flat
>  
>  all: test_cases
>  
> @@ -70,3 +71,4 @@ test_cases: $(generated_files) $(tests-common) $(tests)
>  
>  $(TEST_DIR)/selftest.elf: $(cstart.o) $(TEST_DIR)/selftest.o
>  $(TEST_DIR)/spinlock-test.elf: $(cstart.o) $(TEST_DIR)/spinlock-test.o
> +$(TEST_DIR)/pmu.elf: $(cstart.o) $(TEST_DIR)/pmu.o
> -- 
> Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
> 
>

Thanks,
drew 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Paolo Bonzini


On 02/10/2015 11:58, Peter Maydell wrote:
> On 2 October 2015 at 10:30, Paolo Bonzini  wrote:
>>
>>
>> On 02/10/2015 09:28, Pavel Fedin wrote:
>>>  2. Another possible approach, based on how device tree binding is handled 
>>> by Linux. It is possible
>>> to remove virtual timer IRQ from the device tree, in this case the kernel 
>>> reverts to using physical
>>> timer. When running under hypervisor, accesses to physical CP15 timer are 
>>> trapped into HYP,
>>> therefore we can forward them to userspace using new exit code, something 
>>> like KVM_EXIT_REG_ACCESS.
>>> In this case the timer would be also emulated by the userspace, which is 
>>> slower, but allows better
>>> emulation. Also, this could be used in order to run some other guests which 
>>> just expect physical
>>> timer to be there.
>>>
>>>  Both approaches have their own limitations, but anyway this is much better 
>>> than nothing. What do
>>> you think, and which approach do you like more?
>>
>> I like the latter.  But I guess one could even do both?
> 
> I definitely dislike the latter -- userspace ends up having to
> emulate part of the CPU even though that CPU support is really
> there in hardware. Also it requires us to edit the device tree,
> which means it won't work at all on boards other than 'virt'
> where we use the kernel's device tree rather than creating our
> own. Better for the kernel to forward the timer
> interrupts back out to userspace's irq controller.

How do boards other than 'virt' work when emulated without KVM?  It must
be possible to emulate the physical timer in QEMU.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Peter Maydell
On 2 October 2015 at 11:05, Paolo Bonzini  wrote:
> On 02/10/2015 11:58, Peter Maydell wrote:
>> I definitely dislike the latter -- userspace ends up having to
>> emulate part of the CPU even though that CPU support is really
>> there in hardware. Also it requires us to edit the device tree,
>> which means it won't work at all on boards other than 'virt'
>> where we use the kernel's device tree rather than creating our
>> own. Better for the kernel to forward the timer
>> interrupts back out to userspace's irq controller.
>
> How do boards other than 'virt' work when emulated without KVM?  It must
> be possible to emulate the physical timer in QEMU.

Without KVM is easy -- we emulate the physical timer as just
one of the parts of the emulated CPU. With KVM, we don't emulate
the CPU at all. We don't try to handle a "half TCG half KVM" setup.

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Paolo Bonzini


On 02/10/2015 12:16, Peter Maydell wrote:
> On 2 October 2015 at 11:05, Paolo Bonzini  wrote:
>> On 02/10/2015 11:58, Peter Maydell wrote:
>>> I definitely dislike the latter -- userspace ends up having to
>>> emulate part of the CPU even though that CPU support is really
>>> there in hardware. Also it requires us to edit the device tree,
>>> which means it won't work at all on boards other than 'virt'
>>> where we use the kernel's device tree rather than creating our
>>> own. Better for the kernel to forward the timer
>>> interrupts back out to userspace's irq controller.
>>
>> How do boards other than 'virt' work when emulated without KVM?  It must
>> be possible to emulate the physical timer in QEMU.
> 
> Without KVM is easy -- we emulate the physical timer as just
> one of the parts of the emulated CPU. With KVM, we don't emulate
> the CPU at all. We don't try to handle a "half TCG half KVM" setup.

I mean in the device tree.  Does the boot loader realize it's under a
hypervisor, and provide different device trees to the kernel?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Pavel Fedin
 Hello!

> How do boards other than 'virt' work when emulated without KVM?

 Under "emulated without KVM" we normally suppose full software emulation 
(known in qemu as TCG). In this case CPU code is interpreted, and KVM is not 
used at all.
 Some boards (like vexpress) have own platform-specific timer, so they can work 
under KVM without CP15 timer. In order to make it working one should either 
disable CP15 timer in guest .config or remove the node from device tree.
 CP15 timer never worked in this configuration (KVM without irqchip).

> It must be possible to emulate the physical timer in QEMU.

 Of course it is, and with TCG it works. But we currently cannot use it with 
KVM because KVM assumes that it emulates all CP15 (or system) registers by 
itself, and never hands any of them outside.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Marc Zyngier
On Fri, 2 Oct 2015 10:28:42 +0300
Pavel Fedin  wrote:

[...]

>  1. If there's no in-kernel irqchip, we revert to older timer
> behavior (ARCH_TIMER_CTRL_IT_MASK hack), and forward the timer IRQ to
> userspace using new KVM_EXIT_IRQ return code. Timer IRQ has to be
> treated as edge-triggered in this case. This is the approach which my
> current implementation uses.

No way I'm putting this hack back in. We're working hard enough to make
KVM architecture compliant (making the timer LEVEL triggered), and this
hack prevents legitimate guests from running properly. So it is gone,
never to return.

>  2. Another possible approach, based on how device tree binding is
> handled by Linux. It is possible to remove virtual timer IRQ from the
> device tree, in this case the kernel reverts to using physical timer.
> When running under hypervisor, accesses to physical CP15 timer are
> trapped into HYP, therefore we can forward them to userspace using
> new exit code, something like KVM_EXIT_REG_ACCESS. In this case the
> timer would be also emulated by the userspace, which is slower, but
> allows better emulation. Also, this could be used in order to run
> some other guests which just expect physical timer to be there.

A few things:

- How do you find out that your guest is Linux?
- How do you ensure that the guest will not require the virtual timer?
- Who is going to maintain this new ABI that would exist for the sole
  benefit of broken hardware?
- What is wrong with exposing another, memory mapped timer and remove
  the architected timer entirely?

My guess is that this particular ball of hacks will just bitrot as soon
as your product has shipped, because the world will have moved on and
ditched/reworked the broken HW.

So let me put it another way. The only way I look into this is when we
have this particular platform fully supported in mainline.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Pavel Fedin
 Hello!

> I mean in the device tree.  Does the boot loader realize it's under a
> hypervisor, and provide different device trees to the kernel?

 This has nothing to do with the device tree. Device tree is static. The logic 
sits inside Linux kernel itself:
 a) If we are running in HYP mode (we are host) - use physical timer
 b) If there's no virtual timer IRQ specified - use physical timer
 c) Otherwise - use virtual vimer.

 On systems without virtualization support virtual timer is simply aliased to 
physical timer, this is architectural requirement. Therefore this logic just 
works everywhere.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Peter Maydell
On 2 October 2015 at 11:18, Paolo Bonzini  wrote:
>
>
> On 02/10/2015 12:16, Peter Maydell wrote:
>> On 2 October 2015 at 11:05, Paolo Bonzini  wrote:
>>> On 02/10/2015 11:58, Peter Maydell wrote:
 I definitely dislike the latter -- userspace ends up having to
 emulate part of the CPU even though that CPU support is really
 there in hardware. Also it requires us to edit the device tree,
 which means it won't work at all on boards other than 'virt'
 where we use the kernel's device tree rather than creating our
 own. Better for the kernel to forward the timer
 interrupts back out to userspace's irq controller.
>>>
>>> How do boards other than 'virt' work when emulated without KVM?  It must
>>> be possible to emulate the physical timer in QEMU.
>>
>> Without KVM is easy -- we emulate the physical timer as just
>> one of the parts of the emulated CPU. With KVM, we don't emulate
>> the CPU at all. We don't try to handle a "half TCG half KVM" setup.
>
> I mean in the device tree.  Does the boot loader realize it's under a
> hypervisor, and provide different device trees to the kernel?

No. The device tree says "you have both physical and virtual
timers", the guest kernel always prefers the virtual timer, and
this works for both KVM and TCG.

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/15] KVM: arm/arm64: VGIC: don't track used LRs in the distributor

2015-10-02 Thread Andre Przywara
Hi Pavel,

On 02/10/15 10:55, Pavel Fedin wrote:
>  Hello! One more concern.
> 
>> Currently we track which IRQ has been mapped to which VGIC list
>> register and also have to synchronize both. We used to do this
>> to hold some extra state (for instance the active bit).
>> It turns out that this extra state in the LRs is no longer needed and
>> this extra tracking causes some pain later.
> 
>  Returning to the beginning, can you explain, what kind of pain exactly does 
> it bring?
>  For some reasons here i had to keep all this tracking mechanism, and 
> modified your patchset. I see
> no significant problems, except memory usage. I have to allocate 
> vgic_irq_lr_map large enough to
> hold indexes up to 16384, and indexes from dist->nr_irqs to 8192 appear to be 
> not used.

Yes, this was the main problem I was concerned about. Actually not so
much about memory usage really, but more about the idea of pushing the
concept of bitmaps beyond the point where it is a reasonable data
structure to use.
I briefly thought about extending the bitmaps, but that sounded like a
hack to me.
For instance how did you come up with that 16384? LPIs could actually be
much bigger (in fact the emulation currently support up to 64k).
In my opinion removing that tracking mechanism is actually a good idea.
Most implementations actually have only _four_ list registers and since
we keep them shadowed in our KVM data structure, iterating over all of
them is not really painful.

>  Since the map itself is actually used only for piggy-back optimization, it 
> is possible to easily
> get rid of it using for_each_set_bit(lr, vgic_cpu->lr_used, vgic->nr_lr) 
> iteration instead.

Can't you use the ELRSR bitmap instead? The idea of lr_used sounds like
a moot optimization to me.

Cheers,
Andre.

> The rest
> of mechanism will work as it is, there's no need to remove the state tracking 
> bitmap and
> optimization itself.
>  I am currently testing this approach and preparing my alternative patch for 
> upstreaming.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Pavel Fedin
 Hello!

> - How do you find out that your guest is Linux?

 Yes, i agree. No way. And it's not host kernel's job anyway.

> - How do you ensure that the guest will not require the virtual timer?

 Agree again. Except device tree, no way. If the guest has its own hardcoded 
assumptions, it's
doomed. (*)

> - Who is going to maintain this new ABI that would exist for the sole
>   benefit of broken hardware?

 The ABI is simple, just one more exit code. And it's not only for broken 
hardware. It is quite
legitimate thing to have GICv3 without v2 backwards compatibility.

> - What is wrong with exposing another, memory mapped timer and remove
>   the architected timer entirely?

 Well... qemu guys don't want to remove CP15, i already proposed it, and it was 
very simple thing to
do. :) But, again, this is also device-tree-based, the same concern applies as 
to (*).

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2 01/15] KVM: arm/arm64: VGIC: don't track used LRs in the distributor

2015-10-02 Thread Pavel Fedin
 Hello!

> For instance how did you come up with that 16384? LPIs could actually be
> much bigger (in fact the emulation currently support up to 64k).

 Well, read "at least 16384". I actually don't remember the exact value you've 
put in.

> >  Since the map itself is actually used only for piggy-back optimization, it 
> > is possible to
> easily
> > get rid of it using for_each_set_bit(lr, vgic_cpu->lr_used, vgic->nr_lr) 
> > iteration instead.
> 
> Can't you use the ELRSR bitmap instead? The idea of lr_used sounds like
> a moot optimization to me.

 I'll take a look. You seem to be right, lr_used became a direct (well, 
inverted) copy of elrsr
after full elrsr synchronization was introduced long time ago. It's just 
current code relying on
lr_used everywhere.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2 01/15] KVM: arm/arm64: VGIC: don't track used LRs in the distributor

2015-10-02 Thread Pavel Fedin
 Hello!

> Can't you use the ELRSR bitmap instead? The idea of lr_used sounds like
> a moot optimization to me.

 This perfectly works on 4.2, but will break HW interrupt forwarding on 4.3. If 
you look at 4.3
__kvm_vgic_sync_hwstate(), you'll notice that for HW interrupts lr_used and 
elrsr_ptr will diverge
at this point, and this function actually brings them into sync. And it relies 
on lr_used for the
loop to operate correctly (no idea why we use "for" loop here with extra check 
instead of
"for_each_set_bit(lr, vgic_cpu->lr_used, vgic->nr_lr)", looks stupid to me).

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 01/15] KVM: arm/arm64: VGIC: don't track used LRs in the distributor

2015-10-02 Thread Andre Przywara
Hi Pavel,

On 02/10/15 13:39, Pavel Fedin wrote:
>  Hello!
> 
>> Can't you use the ELRSR bitmap instead? The idea of lr_used sounds like
>> a moot optimization to me.
> 
>  This perfectly works on 4.2, but will break HW interrupt forwarding on 4.3. 
> If you look at 4.3
> __kvm_vgic_sync_hwstate(), you'll notice that for HW interrupts lr_used and 
> elrsr_ptr will diverge
> at this point, and this function actually brings them into sync. And it 
> relies on lr_used for the
> loop to operate correctly (no idea why we use "for" loop here with extra 
> check instead of
> "for_each_set_bit(lr, vgic_cpu->lr_used, vgic->nr_lr)", looks stupid to me).

I know, because I have reworked my patch lately to work on top of 4.3-rc
and Christoffer's timer rework series. I have something which "works
now"(TM), but wanted to wait for Christoffer's respin to send out.
I will send you this new version this as a sneak preview in private,
maybe that helps.

Cheers,
Andre.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [edk2] KVM: MTRR: fix memory type handling if MTRR is completely disabled

2015-10-02 Thread Janusz
W dniu 01.10.2015 o 16:18, Paolo Bonzini pisze:
>
> On 01/10/2015 16:12, Janusz wrote:
>> Now, I can also add, that the problem is only when I allow VM to use
>> more than one core, so with option  for example:
>> -smp 8,cores=4,threads=2,sockets=1 and other combinations like -smp
>> 4,threads=1 its not working, and without it I am always running VM
>> without problems
>>
>> Any ideas what can it be? or any idea what would help to find out what
>> is causing this?
> I am going to send a revert of the patch tomorrow.
>
> Paolo
Thanks, but revert patch doesn't help, so something else is wrong here
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] KVM: arm/arm64: Clean up some obsolete code

2015-10-02 Thread Pavel Fedin
Current KVM code has lots of old redundancies, which can be cleaned up.
This patchset is actually a better alternative to
http://www.spinics.net/lists/arm-kernel/msg430726.html, which allows to
keep piggy-backed LRs. The idea is based on the fact that our code also
maintains LR state in elrsr, and this information is enough to track LR
usage.

This patchset is made against linux-next of 02.10.2015. Thanks to Andre
for pointing out some 4.3 specifics.

Pavel Fedin (2):
  Optimize away redundant LR tracking
  Merge vgic_set_lr() and vgic_sync_lr_elrsr()

 include/kvm/arm_vgic.h |   7 
 virt/kvm/arm/vgic-v2.c |   5 ---
 virt/kvm/arm/vgic-v3.c |   5 ---
 virt/kvm/arm/vgic.c| 104 +++--
 4 files changed, 32 insertions(+), 89 deletions(-)

-- 
2.4.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] KVM: arm/arm64: Optimize away redundant LR tracking

2015-10-02 Thread Pavel Fedin
Currently we use vgic_irq_lr_map in order to track which LRs hold which
IRQs, and lr_used bitmap in order to track which LRs are used or free.

vgic_irq_lr_map is actually used only for piggy-back optimization, and
can be easily replaced by iteration over lr_used. This is good because in
future, when LPI support is introduced, number of IRQs will grow up to at
least 16384, while numbers from 1024 to 8192 are never going to be used.
This would be a huge memory waste.

In its turn, lr_used is also completely redundant since
ae705930fca6322600690df9dc1c7d0516145a93 ("arm/arm64: KVM: Keep elrsr/aisr
in sync with software model"), because together with lr_used we also update
elrsr. This allows to easily replace lr_used with elrsr, inverting all
conditions (because in elrsr '1' means 'free').

Signed-off-by: Pavel Fedin 
---
 include/kvm/arm_vgic.h |  6 
 virt/kvm/arm/vgic.c| 74 +++---
 2 files changed, 28 insertions(+), 52 deletions(-)

diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 4e14dac..d908028 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -296,9 +296,6 @@ struct vgic_v3_cpu_if {
 };
 
 struct vgic_cpu {
-   /* per IRQ to LR mapping */
-   u8  *vgic_irq_lr_map;
-
/* Pending/active/both interrupts on this VCPU */
DECLARE_BITMAP( pending_percpu, VGIC_NR_PRIVATE_IRQS);
DECLARE_BITMAP( active_percpu, VGIC_NR_PRIVATE_IRQS);
@@ -309,9 +306,6 @@ struct vgic_cpu {
unsigned long   *active_shared;
unsigned long   *pend_act_shared;
 
-   /* Bitmap of used/free list registers */
-   DECLARE_BITMAP( lr_used, VGIC_V2_MAX_LRS);
-
/* Number of list registers on this CPU */
int nr_lr;
 
diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 6bd1c9b..2f4d25a 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -102,9 +102,10 @@
 #include "vgic.h"
 
 static void vgic_retire_disabled_irqs(struct kvm_vcpu *vcpu);
-static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu);
+static void vgic_retire_lr(int lr_nr, struct kvm_vcpu *vcpu);
 static struct vgic_lr vgic_get_lr(const struct kvm_vcpu *vcpu, int lr);
 static void vgic_set_lr(struct kvm_vcpu *vcpu, int lr, struct vgic_lr lr_desc);
+static u64 vgic_get_elrsr(struct kvm_vcpu *vcpu);
 static struct irq_phys_map *vgic_irq_map_search(struct kvm_vcpu *vcpu,
int virt_irq);
 
@@ -683,9 +684,11 @@ bool vgic_handle_cfg_reg(u32 *reg, struct kvm_exit_mmio 
*mmio,
 void vgic_unqueue_irqs(struct kvm_vcpu *vcpu)
 {
struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
+   u64 elrsr = vgic_get_elrsr(vcpu);
+   unsigned long *elrsr_ptr = u64_to_bitmask(&elrsr);
int i;
 
-   for_each_set_bit(i, vgic_cpu->lr_used, vgic_cpu->nr_lr) {
+   for_each_clear_bit(i, elrsr_ptr, vgic_cpu->nr_lr) {
struct vgic_lr lr = vgic_get_lr(vcpu, i);
 
/*
@@ -728,7 +731,7 @@ void vgic_unqueue_irqs(struct kvm_vcpu *vcpu)
 * Mark the LR as free for other use.
 */
BUG_ON(lr.state & LR_STATE_MASK);
-   vgic_retire_lr(i, lr.irq, vcpu);
+   vgic_retire_lr(i, vcpu);
vgic_irq_clear_queued(vcpu, lr.irq);
 
/* Finally update the VGIC state. */
@@ -1087,15 +1090,12 @@ static inline void vgic_enable(struct kvm_vcpu *vcpu)
vgic_ops->enable(vcpu);
 }
 
-static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu)
+static void vgic_retire_lr(int lr_nr, struct kvm_vcpu *vcpu)
 {
-   struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
struct vgic_lr vlr = vgic_get_lr(vcpu, lr_nr);
 
vlr.state = 0;
vgic_set_lr(vcpu, lr_nr, vlr);
-   clear_bit(lr_nr, vgic_cpu->lr_used);
-   vgic_cpu->vgic_irq_lr_map[irq] = LR_EMPTY;
vgic_sync_lr_elrsr(vcpu, lr_nr, vlr);
 }
 
@@ -1110,14 +1110,15 @@ static void vgic_retire_lr(int lr_nr, int irq, struct 
kvm_vcpu *vcpu)
  */
 static void vgic_retire_disabled_irqs(struct kvm_vcpu *vcpu)
 {
-   struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
+   u64 elrsr = vgic_get_elrsr(vcpu);
+   unsigned long *elrsr_ptr = u64_to_bitmask(&elrsr);
int lr;
 
-   for_each_set_bit(lr, vgic_cpu->lr_used, vgic->nr_lr) {
+   for_each_clear_bit(lr, elrsr_ptr, vgic->nr_lr) {
struct vgic_lr vlr = vgic_get_lr(vcpu, lr);
 
if (!vgic_irq_is_enabled(vcpu, vlr.irq)) {
-   vgic_retire_lr(lr, vlr.irq, vcpu);
+   vgic_retire_lr(lr, vcpu);
if (vgic_irq_is_queued(vcpu, vlr.irq))
vgic_irq_clear_queued(vcpu, vlr.irq);
}
@@ -1169,8 +1170,9 @@ static void vgic_queue_irq_to_lr(struct kvm_vcpu *vcpu, 
int irq,
  */
 bool vgic_queue_irq(struct kvm_vcpu *vcpu, u8 sgi_source_id, int irq)
 {
-   

[PATCH 2/2] KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()

2015-10-02 Thread Pavel Fedin
Now we see that vgic_set_lr() and vgic_sync_lr_elrsr() are always used
together. Merge them into one function, saving from second vgic_ops
dereferencing every time.

Additionally, remove unnecessary vgic_set_lr() in vgic_unqueue_irqs(),
because the following vgic_retire_lr() will reset lr.state to zero
anyway.

Signed-off-by: Pavel Fedin 
---
 include/kvm/arm_vgic.h |  1 -
 virt/kvm/arm/vgic-v2.c |  5 -
 virt/kvm/arm/vgic-v3.c |  5 -
 virt/kvm/arm/vgic.c| 30 --
 4 files changed, 4 insertions(+), 37 deletions(-)

diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index d908028..ab5d242 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -112,7 +112,6 @@ struct vgic_vmcr {
 struct vgic_ops {
struct vgic_lr  (*get_lr)(const struct kvm_vcpu *, int);
void(*set_lr)(struct kvm_vcpu *, int, struct vgic_lr);
-   void(*sync_lr_elrsr)(struct kvm_vcpu *, int, struct vgic_lr);
u64 (*get_elrsr)(const struct kvm_vcpu *vcpu);
u64 (*get_eisr)(const struct kvm_vcpu *vcpu);
void(*clear_eisr)(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/arm/vgic-v2.c b/virt/kvm/arm/vgic-v2.c
index 8d7b04d..f9d8da5 100644
--- a/virt/kvm/arm/vgic-v2.c
+++ b/virt/kvm/arm/vgic-v2.c
@@ -79,11 +79,7 @@ static void vgic_v2_set_lr(struct kvm_vcpu *vcpu, int lr,
lr_val |= (lr_desc.source << GICH_LR_PHYSID_CPUID_SHIFT);
 
vcpu->arch.vgic_cpu.vgic_v2.vgic_lr[lr] = lr_val;
-}
 
-static void vgic_v2_sync_lr_elrsr(struct kvm_vcpu *vcpu, int lr,
- struct vgic_lr lr_desc)
-{
if (!(lr_desc.state & LR_STATE_MASK))
vcpu->arch.vgic_cpu.vgic_v2.vgic_elrsr |= (1ULL << lr);
else
@@ -166,7 +162,6 @@ static void vgic_v2_enable(struct kvm_vcpu *vcpu)
 static const struct vgic_ops vgic_v2_ops = {
.get_lr = vgic_v2_get_lr,
.set_lr = vgic_v2_set_lr,
-   .sync_lr_elrsr  = vgic_v2_sync_lr_elrsr,
.get_elrsr  = vgic_v2_get_elrsr,
.get_eisr   = vgic_v2_get_eisr,
.clear_eisr = vgic_v2_clear_eisr,
diff --git a/virt/kvm/arm/vgic-v3.c b/virt/kvm/arm/vgic-v3.c
index 7dd5d62..75f6d91 100644
--- a/virt/kvm/arm/vgic-v3.c
+++ b/virt/kvm/arm/vgic-v3.c
@@ -112,11 +112,7 @@ static void vgic_v3_set_lr(struct kvm_vcpu *vcpu, int lr,
}
 
vcpu->arch.vgic_cpu.vgic_v3.vgic_lr[LR_INDEX(lr)] = lr_val;
-}
 
-static void vgic_v3_sync_lr_elrsr(struct kvm_vcpu *vcpu, int lr,
- struct vgic_lr lr_desc)
-{
if (!(lr_desc.state & LR_STATE_MASK))
vcpu->arch.vgic_cpu.vgic_v3.vgic_elrsr |= (1U << lr);
else
@@ -211,7 +207,6 @@ static void vgic_v3_enable(struct kvm_vcpu *vcpu)
 static const struct vgic_ops vgic_v3_ops = {
.get_lr = vgic_v3_get_lr,
.set_lr = vgic_v3_set_lr,
-   .sync_lr_elrsr  = vgic_v3_sync_lr_elrsr,
.get_elrsr  = vgic_v3_get_elrsr,
.get_eisr   = vgic_v3_get_eisr,
.clear_eisr = vgic_v3_clear_eisr,
diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 2f4d25a..7e164eb 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -709,10 +709,8 @@ void vgic_unqueue_irqs(struct kvm_vcpu *vcpu)
 * interrupt then move the active state to the
 * distributor tracking bit.
 */
-   if (lr.state & LR_STATE_ACTIVE) {
+   if (lr.state & LR_STATE_ACTIVE)
vgic_irq_set_active(vcpu, lr.irq);
-   lr.state &= ~LR_STATE_ACTIVE;
-   }
 
/*
 * Reestablish the pending state on the distributor and the
@@ -720,17 +718,12 @@ void vgic_unqueue_irqs(struct kvm_vcpu *vcpu)
 * is fine, then we are only setting a few bits that were
 * already set.
 */
-   if (lr.state & LR_STATE_PENDING) {
+   if (lr.state & LR_STATE_PENDING)
vgic_dist_irq_set_pending(vcpu, lr.irq);
-   lr.state &= ~LR_STATE_PENDING;
-   }
-
-   vgic_set_lr(vcpu, i, lr);
 
/*
 * Mark the LR as free for other use.
 */
-   BUG_ON(lr.state & LR_STATE_MASK);
vgic_retire_lr(i, vcpu);
vgic_irq_clear_queued(vcpu, lr.irq);
 
@@ -1039,12 +1032,6 @@ static void vgic_set_lr(struct kvm_vcpu *vcpu, int lr,
vgic_ops->set_lr(vcpu, lr, vlr);
 }
 
-static void vgic_sync_lr_elrsr(struct kvm_vcpu *vcpu, int lr,
-  struct vgic_lr vlr)
-{
-   vgic_ops->sync_lr_elrsr(vcpu, lr, vlr);
-}
-
 static inline u64 vgic_get_elrsr(struct kvm_vcpu *vcpu)
 {
return vgic_ops->get_elrsr(vcpu);
@@ -1096,7 +1083,6 @@ static v

Re: [PATCH v3 3/8] arm/arm64: KVM: vgic: Factor out level irq processing on guest exit

2015-10-02 Thread Andre Przywara
Hi Christoffer,

On 29/09/15 15:49, Christoffer Dall wrote:
> Currently vgic_process_maintenance() processes dealing with a completed
> level-triggered interrupt directly, but we are soon going to reuse this
> logic for level-triggered mapped interrupts with the HW bit set, so
> move this logic into a separate static function.
> 
> Probably the most scary part of this commit is convincing yourself that
> the current flow is safe compared to the old one.  In the following I
> try to list the changes and why they are harmless:
> 
>   Move vgic_irq_clear_queued after kvm_notify_acked_irq:
> Harmless because the only potential effect of clearing the queued
> flag wrt.  kvm_set_irq is that vgic_update_irq_pending does not set
> the pending bit on the emulated CPU interface or in the
> pending_on_cpu bitmask if the function is called with level=1.
> However, the point of kvm_notify_acked_irq is to call kvm_set_irq
> with level=0, and we set the queued flag again in
> __kvm_vgic_sync_hwstate later on if the level is stil high.
> 
>   Move vgic_set_lr before kvm_notify_acked_irq:
> Also, harmless because the LR are cpu-local operations and
> kvm_notify_acked only affects the dist
> 
>   Move vgic_dist_irq_clear_soft_pend after kvm_notify_acked_irq:
> Also harmless because it's just a bit which is cleared and altering
> the line state does not affect this bit.

Mmmh, kvm_set_irq(level=0) will eventually execute (in
vgic_update_irq_pending()):

vgic_dist_irq_clear_level(vcpu, irq_num);
if (!vgic_dist_irq_soft_pend(vcpu, irq_num))
vgic_dist_irq_clear_pending(vcpu, irq_num);

So with the former code we would clear the (dist) pending bit if
soft_pend was set before, while with the newer code we wouldn't.
Is this just still working because Linux guests will never set the
soft_pend bit? Or is this safe because will always clear the pending bit
anyway later on? (my brain is too much jellyfish by now to still work
this dependency out)
Or what do I miss here?

> 
> Reviewed-by: Eric Auger 
> Reviewed-by: Marc Zyngier 
> Signed-off-by: Christoffer Dall 
> ---
>  virt/kvm/arm/vgic.c | 88 
> ++---
>  1 file changed, 50 insertions(+), 38 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index 6bd1c9b..fe0e5db 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -1322,12 +1322,56 @@ epilog:
>   }
>  }
>  
> +static int process_level_irq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr 
> vlr)
> +{
> + int level_pending = 0;

Why is this an int and not a bool? Also see below ...

> +
> + vlr.state = 0;
> + vlr.hwirq = 0;
> + vgic_set_lr(vcpu, lr, vlr);
> +
> + /*
> +  * If the IRQ was EOIed (called from vgic_process_maintenance) or it
> +  * went from active to non-active (called from vgic_sync_hwirq) it was
> +  * also ACKed and we we therefore assume we can clear the soft pending
> +  * state (should it had been set) for this interrupt.
> +  *
> +  * Note: if the IRQ soft pending state was set after the IRQ was
> +  * acked, it actually shouldn't be cleared, but we have no way of
> +  * knowing that unless we start trapping ACKs when the soft-pending
> +  * state is set.
> +  */
> + vgic_dist_irq_clear_soft_pend(vcpu, vlr.irq);
> +
> + /*
> +  * Tell the gic to start sampling the line of this interrupt again.
> +  */
> + vgic_irq_clear_queued(vcpu, vlr.irq);
> +
> + /* Any additional pending interrupt? */
> + if (vgic_dist_irq_get_level(vcpu, vlr.irq)) {
> + vgic_cpu_irq_set(vcpu, vlr.irq);
> + level_pending = 1;
> + } else {
> + vgic_dist_irq_clear_pending(vcpu, vlr.irq);
> + vgic_cpu_irq_clear(vcpu, vlr.irq);
> + }
> +
> + /*
> +  * Despite being EOIed, the LR may not have
> +  * been marked as empty.
> +  */
> + vgic_sync_lr_elrsr(vcpu, lr, vlr);
> +
> + return level_pending;
> +}
> +
>  static bool vgic_process_maintenance(struct kvm_vcpu *vcpu)
>  {
>   u32 status = vgic_get_interrupt_status(vcpu);
>   struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
> - bool level_pending = false;
>   struct kvm *kvm = vcpu->kvm;
> + int level_pending = 0;

Why this change here? Even after 8/8 I don't see any use of values
outside of true/false.

Cheers,
Andre.

>  
>   kvm_debug("STATUS = %08x\n", status);
>  
> @@ -1342,54 +1386,22 @@ static bool vgic_process_maintenance(struct kvm_vcpu 
> *vcpu)
>  
>   for_each_set_bit(lr, eisr_ptr, vgic->nr_lr) {
>   struct vgic_lr vlr = vgic_get_lr(vcpu, lr);
> - WARN_ON(vgic_irq_is_edge(vcpu, vlr.irq));
>  
> - spin_lock(&dist->lock);
> - vgic_irq_clear_queued(vcpu, vlr.irq);
> + WARN_ON(vgic_irq_is_edge(vcpu, vlr.irq));
> 

Re: [PATCH v3 4/8] arm/arm64: KVM: Implement GICD_ICFGR as RO for PPIs

2015-10-02 Thread Andre Przywara
Hi Christoffer,

On 29/09/15 15:49, Christoffer Dall wrote:
> The GICD_ICFGR allows the bits for the SGIs and PPIs to be read only.
> We currently simulate this behavior by writing a hardcoded value to the
> register for the SGIs and PPIs on every write of these bits to the
> register (ignoring what the guest actually wrote), and by writing the
> same value as the reset value to the register.
> 
> This is a bit counter-intuitive, as the register is RO for these bits,
> and we can just implement it that way, allowing us to control the value
> of the bits purely in the reset code.
> 
> Reviewed-by: Marc Zyngier 
> Signed-off-by: Christoffer Dall 
> ---
>  virt/kvm/arm/vgic.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index fe0e5db..e606f78 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -655,7 +655,7 @@ bool vgic_handle_cfg_reg(u32 *reg, struct kvm_exit_mmio 
> *mmio,
>   ACCESS_READ_VALUE | ACCESS_WRITE_VALUE);
>   if (mmio->is_write) {
>   if (offset < 8) {
> - *reg = ~0U; /* Force PPIs/SGIs to 1 */
> + /* Ignore writes to read-only SGI and PPI bits */
>   return false;
>   }

Nit: Isn't this now violating kernel coding style because of a single
statement not needing braces? Maybe move the comment in front of the
if-statement to make this more obvious?

Other than that:

Reviewed-by: Andre Przywara 

Cheers,
Andre.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 5/8] arm/arm64: KVM: Use appropriate define in VGIC reset code

2015-10-02 Thread Andre Przywara
Hi Christoffer,

On 29/09/15 15:49, Christoffer Dall wrote:
> We currently initialize the SGIs to be enabled in the VGIC code, but we
> use the VGIC_NR_PPIS define for this purpose, instead of the the more
> natural VGIC_NR_SGIS.  Change this slightly confusing use of the
> defines.
> 
> Note: This should have no functional change, as both names are defined
> to the number 16.
> 
> Acked-by: Marc Zyngier 
> Signed-off-by: Christoffer Dall 
> ---
>  virt/kvm/arm/vgic.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index e606f78..9ed8d53 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -2109,7 +2109,7 @@ int vgic_init(struct kvm *kvm)
>   }
>  
>   for (i = 0; i < dist->nr_irqs; i++) {
> - if (i < VGIC_NR_PPIS)
> + if (i < VGIC_NR_SGIS)
>   vgic_bitmap_set_irq_val(&dist->irq_enabled,
>   vcpu->vcpu_id, i, 1);
>   if (i < VGIC_NR_PRIVATE_IRQS)
> 

While the patch itself is a good catch, I wonder why we iterate over all
IRQs here if we only do something for private IRQs? Can you fix that on
the way as well?
Oh, and while you are at it: ;-)
A comments like: "Set all private IRQs to be edge-triggered and enable
all SGIs." sounds useful to me.

Cheers,
Andre.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Pavel Fedin
 Hello!

> So let me put it another way. The only way I look into this is when we
> have this particular platform fully supported in mainline.

 I am sorry for possible misunderstanding. Please give me one more minute to 
defend myself...
 So far, we are not putting back timer disable hack. So, i'd like to clarify 
some things about
variant 2. From kernel's point of view, this is not a hack, but pure feature 
enhancement. The idea
is to add KVM API which would allow to emulate system registers, unhandled by 
KVM, in userspace.
Currently KVM just prints error message about unhandled system register access 
and feeds guest with
"illegal instruction" exception. What i actually propose is to add an API which 
would allow to
handle these things in userspace. This will even be architecture-agnostic, and 
it can be useful for
emulating absolutely any future peripherials which could use system register 
(or coprocessor, on
ARM32) interface.
 The rest of timer-related stuff would be needed to be implemented in 
userspace, and this would have
absolutely nothing to do with kernel. By the way, this would also allow to run 
under KVM legitimate
guests which for some reason expect both timers (are there any? RTOS?)
 So, what is your final word? Would you consider this improvement?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: Allow the Hyper-V vendor ID to be specified

2015-10-02 Thread Alex Williamson
According to Microsoft documentation, the signature in the standard
hypervisor CPUID leaf at 0x4000 identifies the Vendor ID and is
for reporting and diagnostic purposes only.  We can therefore allow
the user to change it to whatever they want, within the 12 character
limit.  Add a new hyperv-vendor-id option to the -cpu flag to allow
for this, ex:

 -cpu host,hv_time,hv_vendor_id=KeenlyKVM

Link: http://msdn.microsoft.com/library/windows/hardware/hh975392
Signed-off-by: Alex Williamson 
---
 target-i386/cpu-qom.h |1 +
 target-i386/cpu.c |1 +
 target-i386/kvm.c |   14 +-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu-qom.h b/target-i386/cpu-qom.h
index c35b624..6c1eaaa 100644
--- a/target-i386/cpu-qom.h
+++ b/target-i386/cpu-qom.h
@@ -88,6 +88,7 @@ typedef struct X86CPU {
 bool hyperv_vapic;
 bool hyperv_relaxed_timing;
 int hyperv_spinlock_attempts;
+char *hyperv_vendor_id;
 bool hyperv_time;
 bool hyperv_crash;
 bool check_cpuid;
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index bd411b9..101c405 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -3122,6 +3122,7 @@ static Property x86_cpu_properties[] = {
 DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
 DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
 DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
+DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
 DEFINE_PROP_END_OF_LIST()
 };
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 7b0ba17..85aa612 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -489,7 +489,19 @@ int kvm_arch_init_vcpu(CPUState *cs)
 if (hyperv_enabled(cpu)) {
 c = &cpuid_data.entries[cpuid_i++];
 c->function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
-memcpy(signature, "Microsoft Hv", 12);
+if (!cpu->hyperv_vendor_id) {
+memcpy(signature, "Microsoft Hv", 12);
+} else {
+size_t len = strlen(cpu->hyperv_vendor_id);
+
+if (len > 12) {
+fprintf(stderr,
+"hyperv-vendor-id too long, limited to 12 charaters");
+abort();
+}
+memset(signature, 0, 12);
+memcpy(signature, cpu->hyperv_vendor_id, len);
+}
 c->eax = HYPERV_CPUID_MIN;
 c->ebx = signature[0];
 c->ecx = signature[1];

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/15] arm64: Add page size to the kernel image header

2015-10-02 Thread Catalin Marinas
On Tue, Sep 15, 2015 at 04:41:18PM +0100, Suzuki K. Poulose wrote:
> From: Ard Biesheuvel 
> 
> This patch adds the page size to the arm64 kernel image header
> so that one can infer the PAGESIZE used by the kernel. This will
> be helpful to diagnose failures to boot the kernel with page size
> not supported by the CPU.
> 
> Signed-off-by: Ard Biesheuvel 

This patch needs you signed-off-by as well since you are posting it. And
IIRC I acked it as well, I'll check.

If you are fine with adding your signed-of-by, I can add it to the patch
when applying (unless I see other issues with the series).

-- 
Catalin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[kvm-unit-tests PATCHv2] arm: Add PMU test

2015-10-02 Thread Christopher Covington
Add test the ARM Performance Monitors Unit (PMU). The informational
fields from the control register are printed, but not checked, and
the number of cycles it takes to run a known-instruction-count loop
is printed, but not checked. Once QEMU is fixed, we can at least
begin to check for IPC == 1 when using -icount.

Signed-off-by: Christopher Covington 
---
 arm/pmu.c   | 89 +
 arm/unittests.cfg   | 11 ++
 config/config-arm64.mak |  4 ++-
 3 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100644 arm/pmu.c

diff --git a/arm/pmu.c b/arm/pmu.c
new file mode 100644
index 000..f724c2c
--- /dev/null
+++ b/arm/pmu.c
@@ -0,0 +1,89 @@
+/*
+ * Test the ARM Performance Monitors Unit (PMU).
+ *
+ * Copyright 2015 The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License version 2.1 and
+ * only version 2.1 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
+ * for more details.
+ */
+#include "libcflat.h"
+
+struct pmu_data {
+   union {
+   uint32_t pmcr_el0;
+   struct {
+   unsigned int enable:1;
+   unsigned int event_counter_reset:1;
+   unsigned int cycle_counter_reset:1;
+   unsigned int cycle_counter_clock_divider:1;
+   unsigned int event_counter_export:1;
+   unsigned int cycle_counter_disable_when_prohibited:1;
+   unsigned int cycle_counter_long:1;
+   unsigned int zeros:4;
+   unsigned int num_counters:5;
+   unsigned int identification_code:8;
+   unsigned int implementor:8;
+   };
+   };
+};
+
+/* Execute a known number of guest instructions. Only odd instruction counts
+ * greater than or equal to 3 are supported. The control register (PMCR) is
+ * initialized with the provided value (allowing for example for the cycle
+ * counter or eventer count to be reset if needed). After the known instruction
+ * count loop, zero is written to the PMCR to disable counting, allowing the
+ * cycle counter or event counters to be read as needed at a later time.
+ */
+static void measure_instrs(int len, struct pmu_data pmcr)
+{
+   int i = (len - 1) / 2;
+
+   if (len < 3 || ((len - 1) % 2))
+   abort();
+
+   asm volatile(
+   "msr pmcr_el0, %[pmcr]\n"
+   "1: subs %[i], %[i], #1\n"
+   "b.gt 1b\n"
+   "msr pmcr_el0, xzr"
+   : [i] "+r" (i) : [pmcr] "r" (pmcr) : "cc");
+}
+
+int main()
+{
+   struct pmu_data pmcr;
+   const int samples = 10;
+
+   asm volatile("mrs %0, pmcr_el0" : "=r" (pmcr));
+
+   printf("PMU implementor: %c\n", pmcr.implementor);
+   printf("Identification code: 0x%x\n", pmcr.identification_code);
+   printf("Event counters:  %d\n", pmcr.num_counters);
+
+   pmcr.cycle_counter_reset = 1;
+   pmcr.enable = 1;
+
+   printf("\ninstructions : cycles0 cycles1 ...\n");
+
+   for (int i = 3; i < 300; i += 32) {
+   int avg, sum = 0;
+   printf("%d :", i);
+   for (int j = 0; j < samples; j++) {
+   int val;
+   measure_instrs(i, pmcr);
+   asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+   sum += val;
+   printf(" %d", val);
+   }
+   avg = sum / samples;
+   printf(" sum=%d avg=%d avg_ipc=%d avg_cpi=%d\n", sum, avg, i / 
avg, avg / i);
+   }
+
+   return report_summary();
+}
diff --git a/arm/unittests.cfg b/arm/unittests.cfg
index e068a0c..b3b7ff4 100644
--- a/arm/unittests.cfg
+++ b/arm/unittests.cfg
@@ -35,3 +35,14 @@ file = selftest.flat
 smp = `getconf _NPROCESSORS_CONF`
 extra_params = -append 'smp'
 groups = selftest
+
+# Test PMU support without -icount
+[pmu]
+file = pmu.flat
+groups = pmu
+
+# Test PMU support with -icount
+[pmu-icount]
+file = pmu.flat
+groups = pmu
+extra_params = -icount 0
diff --git a/config/config-arm64.mak b/config/config-arm64.mak
index d61b703..140b611 100644
--- a/config/config-arm64.mak
+++ b/config/config-arm64.mak
@@ -12,9 +12,11 @@ cflatobjs += lib/arm64/processor.o
 cflatobjs += lib/arm64/spinlock.o
 
 # arm64 specific tests
-tests =
+tests = $(TEST_DIR)/pmu.flat
 
 include config/config-arm-common.mak
 
 arch_clean: arm_clean
$(RM) lib/arm64/.*.d
+
+$(TEST_DIR)/pmu.elf: $(cstart.o) $(TEST_DIR)/pmu.o
-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation

Re: [PATCH 09/15] arm64: Add page size to the kernel image header

2015-10-02 Thread Catalin Marinas
On Fri, Oct 02, 2015 at 04:49:01PM +0100, Catalin Marinas wrote:
> On Tue, Sep 15, 2015 at 04:41:18PM +0100, Suzuki K. Poulose wrote:
> > From: Ard Biesheuvel 
> > 
> > This patch adds the page size to the arm64 kernel image header
> > so that one can infer the PAGESIZE used by the kernel. This will
> > be helpful to diagnose failures to boot the kernel with page size
> > not supported by the CPU.
> > 
> > Signed-off-by: Ard Biesheuvel 
> 
> This patch needs you signed-off-by as well since you are posting it. And
> IIRC I acked it as well, I'll check.
> 
> If you are fine with adding your signed-of-by, I can add it to the patch
> when applying (unless I see other issues with the series).

Actually, I just realised that the kvm patches don't have any acks, so
I'm not taking this series yet. You may want to repost once you have all
the acks in place.

-- 
Catalin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/15] arm64: Add page size to the kernel image header

2015-10-02 Thread Marc Zyngier
On 02/10/15 17:31, Catalin Marinas wrote:
> On Fri, Oct 02, 2015 at 04:49:01PM +0100, Catalin Marinas wrote:
>> On Tue, Sep 15, 2015 at 04:41:18PM +0100, Suzuki K. Poulose wrote:
>>> From: Ard Biesheuvel 
>>>
>>> This patch adds the page size to the arm64 kernel image header
>>> so that one can infer the PAGESIZE used by the kernel. This will
>>> be helpful to diagnose failures to boot the kernel with page size
>>> not supported by the CPU.
>>>
>>> Signed-off-by: Ard Biesheuvel 
>>
>> This patch needs you signed-off-by as well since you are posting it. And
>> IIRC I acked it as well, I'll check.
>>
>> If you are fine with adding your signed-of-by, I can add it to the patch
>> when applying (unless I see other issues with the series).
> 
> Actually, I just realised that the kvm patches don't have any acks, so
> I'm not taking this series yet. You may want to repost once you have all
> the acks in place.
> 

I'm in the middle of it. I should be done next week (I may have said the
same thing two weeks ago...).

M.
-- 
Jazz is not dead. It just smells funny...
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 8/8] arm/arm64: KVM: Support edge-triggered forwarded interrupts

2015-10-02 Thread Andre Przywara
On 29/09/15 15:49, Christoffer Dall wrote:
> We mark edge-triggered interrupts with the HW bit set as queued to
> prevent the VGIC code from injecting LRs with both the Active and
> Pending bits set at the same time while also setting the HW bit,
> because the hardware does not support this.
> 
> However, this means that we must also clear the queued flag when we sync
> back a LR where the state on the physical distributor went from active
> to inactive because the guest deactivated the interrupt.  At this point
> we must also check if the interrupt is pending on the distributor, and
> tell the VGIC to queue it again if it is.
> 
> Since these actions on the sync path are extremely close to those for
> level-triggered interrupts, rename process_level_irq to
> process_queued_irq, allowing it to cater for both cases.
> 
> Signed-off-by: Christoffer Dall 


> ---
>  virt/kvm/arm/vgic.c | 40 ++--
>  1 file changed, 22 insertions(+), 18 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index 53548f1..f3e76e5 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -1322,13 +1322,10 @@ epilog:
>   }
>  }
>  
> -static int process_level_irq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr 
> vlr)
> +static int process_queued_irq(struct kvm_vcpu *vcpu,
> +int lr, struct vgic_lr vlr)
>  {
> - int level_pending = 0;
> -
> - vlr.state = 0;
> - vlr.hwirq = 0;
> - vgic_set_lr(vcpu, lr, vlr);
> + int pending = 0;

As I mentioned in my reply to 3/8 already: shouldn't this be "bool"?

>  
>   /*
>* If the IRQ was EOIed (called from vgic_process_maintenance) or it
> @@ -1344,26 +1341,35 @@ static int process_level_irq(struct kvm_vcpu *vcpu, 
> int lr, struct vgic_lr vlr)
>   vgic_dist_irq_clear_soft_pend(vcpu, vlr.irq);
>  
>   /*
> -  * Tell the gic to start sampling the line of this interrupt again.
> +  * Tell the gic to start sampling this interrupt again.
>*/
>   vgic_irq_clear_queued(vcpu, vlr.irq);
>  
>   /* Any additional pending interrupt? */
> - if (vgic_dist_irq_get_level(vcpu, vlr.irq)) {
> - vgic_cpu_irq_set(vcpu, vlr.irq);
> - level_pending = 1;
> + if (vgic_irq_is_edge(vcpu, vlr.irq)) {
> + BUG_ON(!(vlr.state & LR_HW));

Is that really needed here? I don't see how this function would fail if
called on a non-mapped IRQ. Also the two current callers would always
fulfil this requirement:
- vgic_process_maintenance() already has a WARN_ON(vgic_irq_is_edge)
- vgic_sync_irq() returns early if it's not a mapped IRQ

Removing this would also allow to pass "int irq" instead of "struct
vgic_lr vlr".

Just an idea, though and not a show-stopper.

Other than that it looks good to me.

Cheers,
Andre.

> + pending = vgic_dist_irq_is_pending(vcpu, vlr.irq);
>   } else {
> - vgic_dist_irq_clear_pending(vcpu, vlr.irq);
> - vgic_cpu_irq_clear(vcpu, vlr.irq);
> + if (vgic_dist_irq_get_level(vcpu, vlr.irq)) {
> + vgic_cpu_irq_set(vcpu, vlr.irq);
> + pending = 1;
> + } else {
> + vgic_dist_irq_clear_pending(vcpu, vlr.irq);
> + vgic_cpu_irq_clear(vcpu, vlr.irq);
> + }
>   }
>  
>   /*
>* Despite being EOIed, the LR may not have
>* been marked as empty.
>*/
> + vlr.state = 0;
> + vlr.hwirq = 0;
> + vgic_set_lr(vcpu, lr, vlr);
> +
>   vgic_sync_lr_elrsr(vcpu, lr, vlr);
>  
> - return level_pending;
> + return pending;
>  }
>  
>  static bool vgic_process_maintenance(struct kvm_vcpu *vcpu)
> @@ -1400,7 +1406,7 @@ static bool vgic_process_maintenance(struct kvm_vcpu 
> *vcpu)
>vlr.irq - VGIC_NR_PRIVATE_IRQS);
>  
>   spin_lock(&dist->lock);
> - level_pending |= process_level_irq(vcpu, lr, vlr);
> + level_pending |= process_queued_irq(vcpu, lr, vlr);
>   spin_unlock(&dist->lock);
>   }
>   }
> @@ -1422,7 +1428,7 @@ static bool vgic_process_maintenance(struct kvm_vcpu 
> *vcpu)
>  /*
>   * Save the physical active state, and reset it to inactive.
>   *
> - * Return true if there's a pending level triggered interrupt line to queue.
> + * Return true if there's a pending forwarded interrupt to queue.
>   */
>  static bool vgic_sync_hwirq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr 
> vlr)
>  {
> @@ -1458,10 +1464,8 @@ static bool vgic_sync_hwirq(struct kvm_vcpu *vcpu, int 
> lr, struct vgic_lr vlr)
>   return false;
>   }
>  
> - /* Mapped edge-triggered interrupts not yet supported. */
> - WARN_ON(vgic_irq_is_edge(vcpu, vlr.irq));
>   spin_lock(&dist->lock);
> - level_pending = process_level_irq(vcpu, lr, vlr);
> + level_pending = process_queue

Re: linux-next: Tree for Oct 2 (kvm)

2015-10-02 Thread Randy Dunlap
On 10/01/15 22:27, Stephen Rothwell wrote:
> Hi all,
> 
> There will be no linux-next release on Monday.
> 
> Changes since 20151001:
> 

on x86_64:

ERROR: "irq_set_vcpu_affinity" [arch/x86/kvm/kvm-intel.ko] undefined!


Full randconfig file is attached.


-- 
~Randy
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 4.3.0-rc3 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx 
-fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 
-fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
CONFIG_KERNEL_LZO=y
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
CONFIG_KDBUS=y
# CONFIG_CROSS_MEMORY_ATTACH is not set
# CONFIG_FHANDLE is not set
CONFIG_USELIB=y
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_DEBUG=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set

#
# RCU Subsystem
#
CONFIG_TINY_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_KTHREAD_PRIO=0
# CONFIG_RCU_EXPEDITE_BOOT is not set
# CONFIG_BUILD_BIN2C is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_SUPPORTS_INT128=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_FREEZER is not set
# CONFIG_CGROUP_PIDS is not set
# CONFIG_CGROUP_DEVICE is not set
CONFIG_CPUSETS=y
# CONFIG_PROC_PID_CPUSET is not set
CONFIG_CGROUP_CPUACCT=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
# CONFIG_MEMCG_SWAP is not set
CONFIG_CGROUP_HUGETLB=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
CONFIG_RT_GROUP_SCHED=y
CONFIG_BLK_CGROUP=y
CONFIG_DEBUG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_SCHED_AUTOGROUP=y
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
# CONFIG_BLK_DEV_INITRD is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
# CONFIG_LTO_MENU is not set
CONFIG_ANON_INODES=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BPF=y
CONFIG_EXPERT=y
# CONFIG_MULTIUSER is not set
# CONFIG_SGETMASK_SYSCALL is not set
# CONFIG_SYSFS_SYSCALL is not set
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_PRINTK=y
# CONFIG_BUG is not set
# CONFIG_PCSPKR_PLATFORM is not set
# CONFIG_BASE_FULL is not set
# CONFIG_FUTEX is not set
# CONFIG_EPOLL is not set
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_BPF_SYSCALL=y
CONFIG_SHMEM=y
# CONFIG_AIO is not set
CONFIG_ADVISE_SYSCALLS=y
# CONFIG_USERFAULTFD is not

Re: PCI passthrough problem

2015-10-02 Thread Phil (list)
On Thu, 2015-10-01 at 21:00 -0600, Alex Williamson wrote:
> On Thu, 2015-10-01 at 22:38 -0400, Phil (list) wrote:
> > On Thu, 2015-10-01 at 08:32 -0400, Mauricio Tavares wrote:
> > > On Thu, Oct 1, 2015 at 3:27 AM, Phil (list) 
> > > wrote:
> > > > If this isn't the right place to ask, any pointers to the
> > > > correct
> > > > place
> > > > are appreciated...
> > > > 
> > > > I'm trying to see if I can get PCI passthrough working for a
> > > > video
> > > > capture card (Hauppauge Colossus 1x PCIe) under a Windows XP
> > > > guest
> > > > (32
> > > > -bit).  Things appear to be somewhat working (Windows is seeing
> > > > the
> > > > device, the drivers successfully installed, and device manager
> > > > indicates everything is working) however when I fire up the
> > > > capture
> > > > application, it is not able to find the device despite Windows
> > > > recognizing it (no errors, it just doesn't 'see' any installed
> > > > capture
> > > > devices).  There is also a secondary capture/viewer application
> > > > that
> > > > won't even install due to not being able to find a capture
> > > > card. 
> > > >  Since
> > > > that wasn't the behavior when running it natively under
> > > > Windows,
> > > > I'm
> > > > assuming that the issue is related to PCI passthrough but it's
> > > > difficult to be certain since I'm not seeing any errors beyond
> > > > the
> > > > capture applications not being able to find the device.
> > > > 
> > >   I think you need to find out if the problem follows the
> > > program,
> > > the card, or the passthrough thingie. For instance, is there any
> > > other
> > > program you can run to see if it sees the card? If you can't
> > > think of
> > > anything, you could run a, say, ubuntu/fedora livecd (start you
> > > vm
> > > client and tell it to boot from iso) and see if it can see and
> > > use
> > > the
> > > card.
> > > 
> > 
> > I only have the two capture apps that came with the card as I don't
> > really use Windows for much other than this card anymore.
> > 
> > To try to verify that everything is fine from a hardware / Windows
> > driver standpoint: I took a spare drive and performed a bare metal
> > Win
> > XP install, installed the drivers, and then the capture software
> > (i.e.
> > the same sequence and software versions as I used in the VM) and
> > everything works properly (i.e. both capture applications were able
> > to
> > detect and use the capture card as expected).   Other than using a
> > different hard drive, all other system hardware was identical.  So
> > that
> > would seem to rule out everything from the hardware through to the
> > Windows applications and leave it back in the realm of kvm/PCI
> > passthrough.
> > 
> > Unfortunately, no Linux drivers exist for this card (i.e. the
> > reason
> > I'm attempting to use it under Windows in a VM) so any other Linux
> > distro would have about the same level of support in that it would
> > recognize that the PCI card exists but then not be able to do
> > anything
> > with it.  If you're thinking that there is a problem with version
> > of
> > kvm in Debian, I would be open to trying another distro if that
> > would
> > help troubleshoot it.  I'm also reasonably comfortable navigating
> > around kvm, it's the PCI passthrough functionality that is new to
> > me.
> 
> Are you using vfio to do the device assignment or legacy KVM device
> assignment.  If the latter, try the former.

vfio

>   Since you're using a 32-bit
> Windows guest, what CPU model are you exposing to the VM?  Windows
> can
> be rather particular about enabling MSI for devices if the processor
> model seen by the VM is too old (does the device support MSI?).  '
> -cpu
> host' might help or "host-passthrough" if using libvirt.

I am passing through the host CPU (Sandy Bridge... which is about a
decade newer than Win XP though.)  If I'm reading the lspci output
correctly, the card supports MSI but it is not enabled (which seems to
make sense as your comment indicates that this would be something that
Windows, not Linux, would enable?):

02:00.0 Multimedia controller: ViXS Systems, Inc. Device 3000
Subsystem: Hauppauge computer works Inc. Device d180
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
SERR-You can look
> in /proc/interrupts on the host and see if you're getting interrupts
> (non-zero count on at least one of the CPUs for the interrupt
> associated
> with the device).

It appears to be getting interrupts but I assume there's no way to tell
which device (usb1 or vfio-intx) they are from:
 16:654  
 6309  0  0  0  0  0  0
  IR-IO-APIC  16-fasteoi   ehci_hcd:usb1, vfio-intx(:02:00.0)

>   If instead the device is using INTx interrupts,
> interrupt masking might be broken.

Am I correct in interpreting  this statement along with the a

Re: [RFC] Handling CP15 timer without in-kernel irqchip

2015-10-02 Thread Christoffer Dall
On Fri, Oct 02, 2015 at 05:54:53PM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > So let me put it another way. The only way I look into this is when we
> > have this particular platform fully supported in mainline.
> 
>  I am sorry for possible misunderstanding. Please give me one more minute to 
> defend myself...
>  So far, we are not putting back timer disable hack. So, i'd like to clarify 
> some things about
> variant 2. From kernel's point of view, this is not a hack, but pure feature 
> enhancement. The idea
> is to add KVM API which would allow to emulate system registers, unhandled by 
> KVM, in userspace.
> Currently KVM just prints error message about unhandled system register 
> access and feeds guest with
> "illegal instruction" exception. What i actually propose is to add an API 
> which would allow to
> handle these things in userspace. This will even be architecture-agnostic, 
> and it can be useful for
> emulating absolutely any future peripherials which could use system register 
> (or coprocessor, on
> ARM32) interface.
>  The rest of timer-related stuff would be needed to be implemented in 
> userspace, and this would have
> absolutely nothing to do with kernel.

Except that we now have an ABI that we need to maintain, userspace
functionality we need to maintain, we need to ensure that people running
VMs don't accidentally use userspace timers with poor performance
without noticigin, and there's a high chance that this whole thing will
bit-rot; we've seen it happen before.

I appreciate your enthusiasm on this topic, but we are already swamped
with a lot of other important things to do, and this is just not very
high on the radar at the moment.

> By the way, this would also allow to run under KVM legitimate
> guests which for some reason expect both timers (are there any? RTOS?)
>  So, what is your final word? Would you consider this improvement?
> 
I don't want to deny it flat out, and it's hard to say not having seen
the code, the ABI definition, and hear what the userspace maintainers
think about this.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 3/8] arm/arm64: KVM: vgic: Factor out level irq processing on guest exit

2015-10-02 Thread Christoffer Dall
On Fri, Oct 02, 2015 at 03:52:42PM +0100, Andre Przywara wrote:
> Hi Christoffer,
> 
> On 29/09/15 15:49, Christoffer Dall wrote:
> > Currently vgic_process_maintenance() processes dealing with a completed
> > level-triggered interrupt directly, but we are soon going to reuse this
> > logic for level-triggered mapped interrupts with the HW bit set, so
> > move this logic into a separate static function.
> > 
> > Probably the most scary part of this commit is convincing yourself that
> > the current flow is safe compared to the old one.  In the following I
> > try to list the changes and why they are harmless:
> > 
> >   Move vgic_irq_clear_queued after kvm_notify_acked_irq:
> > Harmless because the only potential effect of clearing the queued
> > flag wrt.  kvm_set_irq is that vgic_update_irq_pending does not set
> > the pending bit on the emulated CPU interface or in the
> > pending_on_cpu bitmask if the function is called with level=1.
> > However, the point of kvm_notify_acked_irq is to call kvm_set_irq
> > with level=0, and we set the queued flag again in
> > __kvm_vgic_sync_hwstate later on if the level is stil high.
> > 
> >   Move vgic_set_lr before kvm_notify_acked_irq:
> > Also, harmless because the LR are cpu-local operations and
> > kvm_notify_acked only affects the dist
> > 
> >   Move vgic_dist_irq_clear_soft_pend after kvm_notify_acked_irq:
> > Also harmless because it's just a bit which is cleared and altering
> > the line state does not affect this bit.
> 
> Mmmh, kvm_set_irq(level=0) will eventually execute (in
> vgic_update_irq_pending()):
> 
>   vgic_dist_irq_clear_level(vcpu, irq_num);
>   if (!vgic_dist_irq_soft_pend(vcpu, irq_num))
>   vgic_dist_irq_clear_pending(vcpu, irq_num);
> 
> So with the former code we would clear the (dist) pending bit if
> soft_pend was set before, while with the newer code we wouldn't.
> Is this just still working because Linux guests will never set the
> soft_pend bit? Or is this safe because will always clear the pending bit
> anyway later on? (my brain is too much jellyfish by now to still work
> this dependency out)
> Or what do I miss here?
> 

you're right, I need to add a check for the level state and clear the
pnding bit in the vgic_dist_irq_clear_soft_pend() function as well.

> > 
> > Reviewed-by: Eric Auger 
> > Reviewed-by: Marc Zyngier 
> > Signed-off-by: Christoffer Dall 
> > ---
> >  virt/kvm/arm/vgic.c | 88 
> > ++---
> >  1 file changed, 50 insertions(+), 38 deletions(-)
> > 
> > diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> > index 6bd1c9b..fe0e5db 100644
> > --- a/virt/kvm/arm/vgic.c
> > +++ b/virt/kvm/arm/vgic.c
> > @@ -1322,12 +1322,56 @@ epilog:
> > }
> >  }
> >  
> > +static int process_level_irq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr 
> > vlr)
> > +{
> > +   int level_pending = 0;
> 
> Why is this an int and not a bool? Also see below ...
> 

because I apply a bitwise or operation in the caller, and I wasn't sure
if this was strictly kosher to do that on a bool, so I Googled it, and
found some reports of that going wrong on certain compilers, so I
figured better safe than sorry.

I couldn't easily dig up that resource again though.

> > +
> > +   vlr.state = 0;
> > +   vlr.hwirq = 0;
> > +   vgic_set_lr(vcpu, lr, vlr);
> > +
> > +   /*
> > +* If the IRQ was EOIed (called from vgic_process_maintenance) or it
> > +* went from active to non-active (called from vgic_sync_hwirq) it was
> > +* also ACKed and we we therefore assume we can clear the soft pending
> > +* state (should it had been set) for this interrupt.
> > +*
> > +* Note: if the IRQ soft pending state was set after the IRQ was
> > +* acked, it actually shouldn't be cleared, but we have no way of
> > +* knowing that unless we start trapping ACKs when the soft-pending
> > +* state is set.
> > +*/
> > +   vgic_dist_irq_clear_soft_pend(vcpu, vlr.irq);
> > +
> > +   /*
> > +* Tell the gic to start sampling the line of this interrupt again.
> > +*/
> > +   vgic_irq_clear_queued(vcpu, vlr.irq);
> > +
> > +   /* Any additional pending interrupt? */
> > +   if (vgic_dist_irq_get_level(vcpu, vlr.irq)) {
> > +   vgic_cpu_irq_set(vcpu, vlr.irq);
> > +   level_pending = 1;
> > +   } else {
> > +   vgic_dist_irq_clear_pending(vcpu, vlr.irq);
> > +   vgic_cpu_irq_clear(vcpu, vlr.irq);
> > +   }
> > +
> > +   /*
> > +* Despite being EOIed, the LR may not have
> > +* been marked as empty.
> > +*/
> > +   vgic_sync_lr_elrsr(vcpu, lr, vlr);
> > +
> > +   return level_pending;
> > +}
> > +
> >  static bool vgic_process_maintenance(struct kvm_vcpu *vcpu)
> >  {
> > u32 status = vgic_get_interrupt_status(vcpu);
> > struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
> > -   bool level_pending = false;
> > struct kvm *kvm = vcpu->kvm;
> > +   int level_pending = 0;
> 

Re: [PATCH v3 4/8] arm/arm64: KVM: Implement GICD_ICFGR as RO for PPIs

2015-10-02 Thread Christoffer Dall
On Fri, Oct 02, 2015 at 03:51:50PM +0100, Andre Przywara wrote:
> Hi Christoffer,
> 
> On 29/09/15 15:49, Christoffer Dall wrote:
> > The GICD_ICFGR allows the bits for the SGIs and PPIs to be read only.
> > We currently simulate this behavior by writing a hardcoded value to the
> > register for the SGIs and PPIs on every write of these bits to the
> > register (ignoring what the guest actually wrote), and by writing the
> > same value as the reset value to the register.
> > 
> > This is a bit counter-intuitive, as the register is RO for these bits,
> > and we can just implement it that way, allowing us to control the value
> > of the bits purely in the reset code.
> > 
> > Reviewed-by: Marc Zyngier 
> > Signed-off-by: Christoffer Dall 
> > ---
> >  virt/kvm/arm/vgic.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> > index fe0e5db..e606f78 100644
> > --- a/virt/kvm/arm/vgic.c
> > +++ b/virt/kvm/arm/vgic.c
> > @@ -655,7 +655,7 @@ bool vgic_handle_cfg_reg(u32 *reg, struct kvm_exit_mmio 
> > *mmio,
> > ACCESS_READ_VALUE | ACCESS_WRITE_VALUE);
> > if (mmio->is_write) {
> > if (offset < 8) {
> > -   *reg = ~0U; /* Force PPIs/SGIs to 1 */
> > +   /* Ignore writes to read-only SGI and PPI bits */
> > return false;
> > }
> 
> Nit: Isn't this now violating kernel coding style because of a single
> statement not needing braces? Maybe move the comment in front of the
> if-statement to make this more obvious?
> 
sure

> Other than that:
> 
> Reviewed-by: Andre Przywara 
> 
Thanks,
-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 8/8] arm/arm64: KVM: Support edge-triggered forwarded interrupts

2015-10-02 Thread Christoffer Dall
On Fri, Oct 02, 2015 at 06:18:03PM +0100, Andre Przywara wrote:
> On 29/09/15 15:49, Christoffer Dall wrote:
> > We mark edge-triggered interrupts with the HW bit set as queued to
> > prevent the VGIC code from injecting LRs with both the Active and
> > Pending bits set at the same time while also setting the HW bit,
> > because the hardware does not support this.
> > 
> > However, this means that we must also clear the queued flag when we sync
> > back a LR where the state on the physical distributor went from active
> > to inactive because the guest deactivated the interrupt.  At this point
> > we must also check if the interrupt is pending on the distributor, and
> > tell the VGIC to queue it again if it is.
> > 
> > Since these actions on the sync path are extremely close to those for
> > level-triggered interrupts, rename process_level_irq to
> > process_queued_irq, allowing it to cater for both cases.
> > 
> > Signed-off-by: Christoffer Dall 
> 
> 
> > ---
> >  virt/kvm/arm/vgic.c | 40 ++--
> >  1 file changed, 22 insertions(+), 18 deletions(-)
> > 
> > diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> > index 53548f1..f3e76e5 100644
> > --- a/virt/kvm/arm/vgic.c
> > +++ b/virt/kvm/arm/vgic.c
> > @@ -1322,13 +1322,10 @@ epilog:
> > }
> >  }
> >  
> > -static int process_level_irq(struct kvm_vcpu *vcpu, int lr, struct vgic_lr 
> > vlr)
> > +static int process_queued_irq(struct kvm_vcpu *vcpu,
> > +  int lr, struct vgic_lr vlr)
> >  {
> > -   int level_pending = 0;
> > -
> > -   vlr.state = 0;
> > -   vlr.hwirq = 0;
> > -   vgic_set_lr(vcpu, lr, vlr);
> > +   int pending = 0;
> 
> As I mentioned in my reply to 3/8 already: shouldn't this be "bool"?
> 
> >  
> > /*
> >  * If the IRQ was EOIed (called from vgic_process_maintenance) or it
> > @@ -1344,26 +1341,35 @@ static int process_level_irq(struct kvm_vcpu *vcpu, 
> > int lr, struct vgic_lr vlr)
> > vgic_dist_irq_clear_soft_pend(vcpu, vlr.irq);
> >  
> > /*
> > -* Tell the gic to start sampling the line of this interrupt again.
> > +* Tell the gic to start sampling this interrupt again.
> >  */
> > vgic_irq_clear_queued(vcpu, vlr.irq);
> >  
> > /* Any additional pending interrupt? */
> > -   if (vgic_dist_irq_get_level(vcpu, vlr.irq)) {
> > -   vgic_cpu_irq_set(vcpu, vlr.irq);
> > -   level_pending = 1;
> > +   if (vgic_irq_is_edge(vcpu, vlr.irq)) {
> > +   BUG_ON(!(vlr.state & LR_HW));
> 
> Is that really needed here?

Are BUG_ON statements every 'really needed' ?

> I don't see how this function would fail if
> called on a non-mapped IRQ. Also the two current callers would always
> fulfil this requirement:
> - vgic_process_maintenance() already has a WARN_ON(vgic_irq_is_edge)
> - vgic_sync_irq() returns early if it's not a mapped IRQ

yes, that's why it's a BUG_ON, don't ever do this;)

> 
> Removing this would also allow to pass "int irq" instead of "struct
> vgic_lr vlr".
> 
> Just an idea, though and not a show-stopper.

I don't see a problem with having it there and I put it there because I
wanted to protect us (developers) against accidentally writing a patch
that reuses this function for a non-forwarded edge-triggered interrupt,
which is not supposed to ever happen.

> 
> Other than that it looks good to me.
> 
Thanks,
-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 6/6] arm-smmu: Allow to set iommu mapping for MSI

2015-10-02 Thread Alex Williamson
On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> Finally ARM SMMU declare that iommu-mapping for MSI-pages are not
> set automatically and it should be set explicitly.
> 
> Signed-off-by: Bharat Bhushan 
> ---
>  drivers/iommu/arm-smmu.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index a3956fb..9d37e72 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -1401,13 +1401,18 @@ static int arm_smmu_domain_get_attr(struct 
> iommu_domain *domain,
>   enum iommu_attr attr, void *data)
>  {
>   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> + struct iommu_domain_msi_maps *msi_maps;
>  
>   switch (attr) {
>   case DOMAIN_ATTR_NESTING:
>   *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
>   return 0;
>   case DOMAIN_ATTR_MSI_MAPPING:
> - /* Dummy handling added */
> + msi_maps = data;
> +
> + msi_maps->automap = false;
> + msi_maps->override_automap = true;
> +
>   return 0;
>   default:
>   return -ENODEV;

In previous discussions I understood one of the problems you were trying
to solve was having a limited number of MSI banks and while you may be
able to get isolated MSI banks for some number of users, it wasn't
unlimited and sharing may be required.  I don't see any of that
addressed in this series.

Also, the management of reserved IOVAs vs MSI addresses looks really
dubious to me.  How does your platform pick an MSI address and what are
we breaking by covertly changing it?  We seem to be masking over at the
VFIO level, where there should be lower level interfaces doing the right
thing when we configure MSI on the device.

The problem of reporting "automap" base address isn't addressed more
than leaving some unused field in iommu_domain_msi_maps.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 4/6] vfio: Add interface to iommu-map/unmap MSI pages

2015-10-02 Thread Alex Williamson
On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> For MSI interrupts to work for a pass-through devices we need
> to have mapping of msi-pages in iommu. Now on some platforms
> (like x86) does this msi-pages mapping happens magically and in these
> case they chooses an iova which they somehow know that it will never
> overlap with guest memory. But this magic iova selection
> may not be always true for all platform (like PowerPC and ARM64).
> 
> Also on x86 platform, there is no problem as long as running a x86-guest
> on x86-host but there can be issues when running a non-x86 guest on
> x86 host or other userspace applications like (I think ODP/DPDK).
> As in these cases there can be chances that it overlaps with guest
> memory mapping.

Wow, it's amazing anything works... smoke and mirrors.

> This patch add interface to iommu-map and iommu-unmap msi-pages at
> reserved iova chosen by userspace.
> 
> Signed-off-by: Bharat Bhushan 
> ---
>  drivers/vfio/vfio.c |  52 +++
>  drivers/vfio/vfio_iommu_type1.c | 111 
> 
>  include/linux/vfio.h|   9 +++-
>  3 files changed, 171 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 2fb29df..a817d2d 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -605,6 +605,58 @@ static int vfio_iommu_group_notifier(struct 
> notifier_block *nb,
>   return NOTIFY_OK;
>  }
>  
> +int vfio_device_map_msi(struct vfio_device *device, uint64_t msi_addr,
> + uint32_t size, uint64_t *msi_iova)
> +{
> + struct vfio_container *container = device->group->container;
> + struct vfio_iommu_driver *driver;
> + int ret;
> +
> + /* Validate address and size */
> + if (!msi_addr || !size || !msi_iova)
> + return -EINVAL;
> +
> + down_read(&container->group_lock);
> +
> + driver = container->iommu_driver;
> + if (!driver || !driver->ops || !driver->ops->msi_map) {
> + up_read(&container->group_lock);
> + return -EINVAL;
> + }
> +
> + ret = driver->ops->msi_map(container->iommu_data,
> +msi_addr, size, msi_iova);
> +
> + up_read(&container->group_lock);
> + return ret;
> +}
> +
> +int vfio_device_unmap_msi(struct vfio_device *device, uint64_t msi_iova,
> +   uint64_t size)
> +{
> + struct vfio_container *container = device->group->container;
> + struct vfio_iommu_driver *driver;
> + int ret;
> +
> + /* Validate address and size */
> + if (!msi_iova || !size)
> + return -EINVAL;
> +
> + down_read(&container->group_lock);
> +
> + driver = container->iommu_driver;
> + if (!driver || !driver->ops || !driver->ops->msi_unmap) {
> + up_read(&container->group_lock);
> + return -EINVAL;
> + }
> +
> + ret = driver->ops->msi_unmap(container->iommu_data,
> +  msi_iova, size);
> +
> + up_read(&container->group_lock);
> + return ret;
> +}
> +
>  /**
>   * VFIO driver API
>   */
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 3315fb6..ab376c2 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1003,12 +1003,34 @@ out_free:
>   return ret;
>  }
>  
> +static void vfio_iommu_unmap_all_reserved_regions(struct vfio_iommu *iommu)
> +{
> + struct vfio_resvd_region *region;
> + struct vfio_domain *d;
> +
> + list_for_each_entry(region, &iommu->reserved_iova_list, next) {
> + list_for_each_entry(d, &iommu->domain_list, next) {
> + if (!region->map_paddr)
> + continue;
> +
> + if (!iommu_iova_to_phys(d->domain, region->iova))
> + continue;
> +
> + iommu_unmap(d->domain, region->iova, PAGE_SIZE);

PAGE_SIZE?  Why not region->size?

> + region->map_paddr = 0;
> + cond_resched();
> + }
> + }
> +}
> +
>  static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>  {
>   struct rb_node *node;
>  
>   while ((node = rb_first(&iommu->dma_list)))
>   vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
> +
> + vfio_iommu_unmap_all_reserved_regions(iommu);
>  }
>  
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
> @@ -1048,6 +1070,93 @@ done:
>   mutex_unlock(&iommu->lock);
>  }
>  
> +static int vfio_iommu_type1_msi_map(void *iommu_data, uint64_t msi_addr,
> + uint64_t size, uint64_t *msi_iova)
> +{
> + struct vfio_iommu *iommu = iommu_data;
> + struct vfio_resvd_region *region;
> + int ret;
> +
> + mutex_lock(&iommu->lock);
> +
> + /* Do not try ceate iommu-mapping if msi reconfig not allowed */
> + if (!iommu->allow_m

Re: [RFC PATCH 2/6] iommu: Add interface to get msi-pages mapping attributes

2015-10-02 Thread Alex Williamson
[really ought to consider cc'ing the iommu list]

On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> This APIs return the capability of automatically mapping msi-pages
> in iommu with some magic iova. Which is what currently most of
> iommu's does and is the default behaviour.
> 
> Further API returns whether iommu allows the user to define different
> iova for mai-page mapping for the domain. This is required when a msi
> capable device is directly assigned to user-space/VM and user-space/VM
> need to define a non-overlapping (from other dma-able address space)
> iova for msi-pages mapping in iommu.
> 
> This patch just define the interface and follow up patches will
> extend this interface.

This is backwards, generally you want to add the infrastructure and only
expose it once all the pieces are in place for it to work.  For
instance, patch 1/6 exposes a new userspace interface for vfio that
doesn't do anything yet.  How does the user know if it's there, *and*
works?

> Signed-off-by: Bharat Bhushan 
> ---
>  drivers/iommu/arm-smmu.c|  3 +++
>  drivers/iommu/fsl_pamu_domain.c |  3 +++
>  drivers/iommu/iommu.c   | 14 ++
>  include/linux/iommu.h   |  9 -
>  4 files changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index 66a803b..a3956fb 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -1406,6 +1406,9 @@ static int arm_smmu_domain_get_attr(struct iommu_domain 
> *domain,
>   case DOMAIN_ATTR_NESTING:
>   *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
>   return 0;
> + case DOMAIN_ATTR_MSI_MAPPING:
> + /* Dummy handling added */
> + return 0;
>   default:
>   return -ENODEV;
>   }
> diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
> index 1d45293..9a94430 100644
> --- a/drivers/iommu/fsl_pamu_domain.c
> +++ b/drivers/iommu/fsl_pamu_domain.c
> @@ -856,6 +856,9 @@ static int fsl_pamu_get_domain_attr(struct iommu_domain 
> *domain,
>   case DOMAIN_ATTR_FSL_PAMUV1:
>   *(int *)data = DOMAIN_ATTR_FSL_PAMUV1;
>   break;
> + case DOMAIN_ATTR_MSI_MAPPING:
> + /* Dummy handling added */
> + break;
>   default:
>   pr_debug("Unsupported attribute type\n");
>   ret = -EINVAL;
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index d4f527e..16c2eab 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1216,6 +1216,7 @@ int iommu_domain_get_attr(struct iommu_domain *domain,
>   bool *paging;
>   int ret = 0;
>   u32 *count;
> + struct iommu_domain_msi_maps *msi_maps;
>  
>   switch (attr) {
>   case DOMAIN_ATTR_GEOMETRY:
> @@ -1236,6 +1237,19 @@ int iommu_domain_get_attr(struct iommu_domain *domain,
>   ret = -ENODEV;
>  
>   break;
> + case DOMAIN_ATTR_MSI_MAPPING:
> + msi_maps = data;
> +
> + /* Default MSI-pages are magically mapped with some iova and
> +  * do now allow to configure with different iova.
> +  */
> + msi_maps->automap = true;
> + msi_maps->override_automap = false;

There's no magic.  I think what you're trying to express is the
difference between platforms that support MSI within the IOMMU IOVA
space and thus need explicit IOMMU mappings vs platforms where MSI
mappings either bypass the IOMMU entirely or are setup implicitly with
interrupt remapping support.

Why does it make sense to impose any sort of defaults?  If the IOMMU
driver doesn't tell us what to do, I don't think we want to assume
anything.

> +
> + if (domain->ops->domain_get_attr)
> + ret = domain->ops->domain_get_attr(domain, attr, data);
> +
> + break;
>   default:
>   if (!domain->ops->domain_get_attr)
>   return -EINVAL;
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 0546b87..6d49f3f 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -83,6 +83,13 @@ struct iommu_domain {
>   struct iommu_domain_geometry geometry;
>  };
>  
> +struct iommu_domain_msi_maps {
> + dma_addr_t base_address;
> + dma_addr_t size;

size_t?

> + bool automap;
> + bool override_automap;
> +};
> +
>  enum iommu_cap {
>   IOMMU_CAP_CACHE_COHERENCY,  /* IOMMU can enforce cache coherent DMA
>  transactions */
> @@ -111,6 +118,7 @@ enum iommu_attr {
>   DOMAIN_ATTR_FSL_PAMU_ENABLE,
>   DOMAIN_ATTR_FSL_PAMUV1,
>   DOMAIN_ATTR_NESTING,/* two stages of translation */
> + DOMAIN_ATTR_MSI_MAPPING, /* Provides MSIs mapping in iommu */
>   DOMAIN_ATTR_MAX,
>  };
>  
> @@ -167,7 +175,6 @@ struct iommu_ops {
>   int (*domain_set_windows)(struct iommu_doma

Re: [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state

2015-10-02 Thread Alex Williamson
On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> This patch allows the user-space to know whether msi-pages
> are automatically mapped with some magic iova or not.
> 
> Even if the msi-pages are automatically mapped, still user-space
> wants to over-ride the automatic iova selection for msi-mapping.
> For this user-space need to know whether it is allowed to change
> the automatic mapping or not and this API provides this mechanism.
> Follow up patches will provide how to over-ride this.
> 
> Signed-off-by: Bharat Bhushan 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 32 
>  include/uapi/linux/vfio.h   |  3 +++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index fa5d3e4..3315fb6 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -59,6 +59,7 @@ struct vfio_iommu {
>   struct rb_root  dma_list;
>   boolv2;
>   boolnesting;
> + boolallow_msi_reconfig;
>   struct list_headreserved_iova_list;
>  };
>  
> @@ -1117,6 +1118,23 @@ static int vfio_domains_have_iommu_cache(struct 
> vfio_iommu *iommu)
>   return ret;
>  }
>  
> +static
> +int vfio_domains_get_msi_maps(struct vfio_iommu *iommu,
> +   struct iommu_domain_msi_maps *msi_maps)
> +{
> + struct vfio_domain *d;
> + int ret;
> +
> + mutex_lock(&iommu->lock);
> + /* All domains have same msi-automap property, pick first */
> + d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
> + ret = iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_MAPPING,
> + msi_maps);
> + mutex_unlock(&iommu->lock);
> +
> + return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  unsigned int cmd, unsigned long arg)
>  {
> @@ -1138,6 +1156,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>   }
>   } else if (cmd == VFIO_IOMMU_GET_INFO) {
>   struct vfio_iommu_type1_info info;
> + struct iommu_domain_msi_maps msi_maps;
> + int ret;
>  
>   minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
>  
> @@ -1149,6 +1169,18 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>   info.flags = 0;
>  
> + ret = vfio_domains_get_msi_maps(iommu, &msi_maps);
> + if (ret)
> + return ret;

And now ioctl(VFIO_IOMMU_GET_INFO) no longer works for any IOMMU
implementing domain_get_attr but not supporting DOMAIN_ATTR_MSI_MAPPING.

> +
> + if (msi_maps.override_automap) {
> + info.flags |= VFIO_IOMMU_INFO_MSI_ALLOW_RECONFIG;
> + iommu->allow_msi_reconfig = true;
> + }
> +
> + if (msi_maps.automap)
> + info.flags |= VFIO_IOMMU_INFO_MSI_AUTOMAP;
> +
>   info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
>  
>   return copy_to_user((void __user *)arg, &info, minsz);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1abd1a9..9998f6e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -391,6 +391,9 @@ struct vfio_iommu_type1_info {
>   __u32   argsz;
>   __u32   flags;
>  #define VFIO_IOMMU_INFO_PGSIZES (1 << 0) /* supported page sizes info */
> +#define VFIO_IOMMU_INFO_MSI_AUTOMAP (1 << 1) /* MSI pages are auto-mapped
> +in iommu */
> +#define VFIO_IOMMU_INFO_MSI_ALLOW_RECONFIG (1 << 2) /* Allows reconfig 
> automap*/
>   __u64   iova_pgsizes;   /* Bitmap of supported page sizes */
>  };
>  

Once again, exposing interfaces to the user before they actually do
anything is backwards.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 5/6] vfio-pci: Create iommu mapping for msi interrupt

2015-10-02 Thread Alex Williamson
On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> An MSI-address is allocated and programmed in pcie device
> during interrupt configuration. Now for a pass-through device,
> try to create the iommu mapping for this allocted/programmed
> msi-address.  If the iommu mapping is created and the msi
> address programmed in the pcie device is different from
> msi-iova as per iommu programming then reconfigure the pci
> device to use msi-iova as msi address.
> 
> Signed-off-by: Bharat Bhushan 
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c | 36 ++--
>  1 file changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> b/drivers/vfio/pci/vfio_pci_intrs.c
> index 1f577b4..c9690af 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -312,13 +312,23 @@ static int vfio_msi_set_vector_signal(struct 
> vfio_pci_device *vdev,
>   int irq = msix ? vdev->msix[vector].vector : pdev->irq + vector;
>   char *name = msix ? "vfio-msix" : "vfio-msi";
>   struct eventfd_ctx *trigger;
> + struct msi_msg msg;
> + struct vfio_device *device;
> + uint64_t msi_addr, msi_iova;
>   int ret;
>  
>   if (vector >= vdev->num_ctx)
>   return -EINVAL;
>  
> + device = vfio_device_get_from_dev(&pdev->dev);

Have you looked at this function?  I don't think we want to be doing
that every time we want to poke the interrupt configuration.  Also note
that IOMMU mappings don't operate on devices, but groups, so maybe we
want to pass the group.

> + if (device == NULL)
> + return -EINVAL;

This would be a legitimate BUG_ON(!device)

> +
>   if (vdev->ctx[vector].trigger) {
>   free_irq(irq, vdev->ctx[vector].trigger);
> + get_cached_msi_msg(irq, &msg);
> + msi_iova = ((u64)msg.address_hi << 32) | msg.address_lo;
> + vfio_device_unmap_msi(device, msi_iova, PAGE_SIZE);
>   kfree(vdev->ctx[vector].name);
>   eventfd_ctx_put(vdev->ctx[vector].trigger);
>   vdev->ctx[vector].trigger = NULL;
> @@ -346,12 +356,11 @@ static int vfio_msi_set_vector_signal(struct 
> vfio_pci_device *vdev,
>* cached value of the message prior to enabling.
>*/
>   if (msix) {
> - struct msi_msg msg;
> -
>   get_cached_msi_msg(irq, &msg);
>   pci_write_msi_msg(irq, &msg);
>   }
>  
> +

gratuitous newline

>   ret = request_irq(irq, vfio_msihandler, 0,
> vdev->ctx[vector].name, trigger);
>   if (ret) {
> @@ -360,6 +369,29 @@ static int vfio_msi_set_vector_signal(struct 
> vfio_pci_device *vdev,
>   return ret;
>   }
>  
> + /* Re-program the new-iova in pci-device in case there is
> +  * different iommu-mapping created for programmed msi-address.
> +  */
> + get_cached_msi_msg(irq, &msg);
> + msi_iova = 0;
> + msi_addr = (u64)(msg.address_hi) << 32 | (u64)(msg.address_lo);
> + ret = vfio_device_map_msi(device, msi_addr, PAGE_SIZE, &msi_iova);
> + if (ret) {
> + free_irq(irq, vdev->ctx[vector].trigger);
> + kfree(vdev->ctx[vector].name);
> + eventfd_ctx_put(trigger);
> + return ret;
> + }
> +
> + /* Reprogram only if iommu-mapped iova is different from msi-address */
> + if (msi_iova && (msi_iova != msi_addr)) {
> + msg.address_hi = (u32)(msi_iova >> 32);
> + /* Keep Lower bits from original msi message address */
> + msg.address_lo &= PAGE_MASK;
> + msg.address_lo |= (u32)(msi_iova & 0x);

Seems like you're making some assumptions here that are dependent on the
architecture and maybe the platform.

> + pci_write_msi_msg(irq, &msg);
> + }
> +
>   vdev->ctx[vector].trigger = trigger;
>  
>   return 0;



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region

2015-10-02 Thread Alex Williamson
On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> This Patch adds the VFIO APIs to add and remove reserved iova
> regions. The reserved iova region can be used for mapping some
> specific physical address in iommu.
> 
> Currently we are planning to use this interface for adding iova
> regions for creating iommu of msi-pages. But the API are designed
> for future extension where some other physical address can be mapped.
> 
> Signed-off-by: Bharat Bhushan 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 201 
> +++-
>  include/uapi/linux/vfio.h   |  43 +
>  2 files changed, 243 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 57d8c37..fa5d3e4 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -59,6 +59,7 @@ struct vfio_iommu {
>   struct rb_root  dma_list;
>   boolv2;
>   boolnesting;
> + struct list_headreserved_iova_list;

This alignment leads to poor packing in the structure, put it above the
bools.

>  };
>  
>  struct vfio_domain {
> @@ -77,6 +78,15 @@ struct vfio_dma {
>   int prot;   /* IOMMU_READ/WRITE */
>  };
>  
> +struct vfio_resvd_region {
> + dma_addr_t  iova;
> + size_t  size;
> + int prot;   /* IOMMU_READ/WRITE */
> + int refcount;   /* ref count of mappings */
> + uint64_tmap_paddr;  /* Mapped Physical Address */

phys_addr_t

> + struct list_head next;
> +};
> +
>  struct vfio_group {
>   struct iommu_group  *iommu_group;
>   struct list_headnext;
> @@ -106,6 +116,38 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu 
> *iommu,
>   return NULL;
>  }
>  
> +/* This function must be called with iommu->lock held */
> +static bool vfio_overlap_with_resvd_region(struct vfio_iommu *iommu,
> +dma_addr_t start, size_t size)
> +{
> + struct vfio_resvd_region *region;
> +
> + list_for_each_entry(region, &iommu->reserved_iova_list, next) {
> + if (region->iova < start)
> + return (start - region->iova < region->size);
> + else if (start < region->iova)
> + return (region->iova - start < size);

<= on both of the return lines?

> +
> + return (region->size > 0 && size > 0);
> + }
> +
> + return false;
> +}
> +
> +/* This function must be called with iommu->lock held */
> +static
> +struct vfio_resvd_region *vfio_find_resvd_region(struct vfio_iommu *iommu,
> +  dma_addr_t start, size_t size)
> +{
> + struct vfio_resvd_region *region;
> +
> + list_for_each_entry(region, &iommu->reserved_iova_list, next)
> + if (region->iova == start && region->size == size)
> + return region;
> +
> + return NULL;
> +}
> +
>  static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
>  {
>   struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
> @@ -580,7 +622,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  
>   mutex_lock(&iommu->lock);
>  
> - if (vfio_find_dma(iommu, iova, size)) {
> + if (vfio_find_dma(iommu, iova, size) ||
> + vfio_overlap_with_resvd_region(iommu, iova, size)) {
>   mutex_unlock(&iommu->lock);
>   return -EEXIST;
>   }
> @@ -626,6 +669,127 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>   return ret;
>  }
>  
> +/* This function must be called with iommu->lock held */
> +static
> +int vfio_iommu_resvd_region_del(struct vfio_iommu *iommu,
> + dma_addr_t iova, size_t size, int prot)
> +{
> + struct vfio_resvd_region *res_region;

Have some consistency in naming, just use "region".
> +
> + res_region = vfio_find_resvd_region(iommu, iova, size);
> + /* Region should not be mapped in iommu */
> + if (res_region == NULL || res_region->map_paddr)
> + return -EINVAL;

Are these two separate errors?  !region is -EINVAL, but being mapped is
-EBUSY.

> +
> + list_del(&res_region->next);
> + kfree(res_region);
> + return 0;
> +}
> +
> +/* This function must be called with iommu->lock held */
> +static int vfio_iommu_resvd_region_add(struct vfio_iommu *iommu,
> +dma_addr_t iova, size_t size, int prot)
> +{
> + struct vfio_resvd_region *res_region;
> +
> + /* Check overlap with with dma maping and reserved regions */
> + if (vfio_find_dma(iommu, iova, size) ||
> + vfio_find_resvd_region(iommu, iova, size))
> + return -EEXIST;
> +
> + res_region = kzalloc(sizeof(*res_region), GFP_KERNEL);
> + if (res_region == NULL)
> + return -ENOMEM;
> +

Re: [RFC PATCH v3] os-android: Add support to android platform

2015-10-02 Thread Houcheng Lin
2015-09-28 19:40 GMT+08:00 Paolo Bonzini :
>
>
> On 24/09/2015 15:21, Houcheng Lin wrote:
>> +if [ "$android" = "yes" ] ; then
>> +  LIBS="-lglib-2.0 -lgthread-2.0 -lz -lpixman-1 -lintl -liconv -lc $LIBS"
>> +  libs_qga="-lglib-2.0 -lgthread-2.0 -lz -lpixman-1 -lintl -liconv -lc"
>> +fi
>
> This change should not be necessary.
>
>> +#define getdtablesize qemu_getdtablesize
>
> Please instead replace all occurrences of getdtablesize with
> qemu_getdtablesize.
>
>>
>> +#ifdef CONFIG_ANDROID
>> +#include "sysemu/os-android.h"
>> +#endif
>> +
>
> Please replace this with
>
> #include 
>
> #ifndef IOV_MAX
> #define IOV_MAX 1024
> #endif
>
> and get rid of os-android.h.
>
>>
>> +#if defined(CONFIG_ANDROID)
>> +char pty_buf[PATH_MAX];
>> +#define ptsname(fd) pty_buf
>> +#endif
>>  const char *slave;
>>  int mfd = -1, sfd = -1;
>>
>> @@ -67,17 +72,21 @@ static int openpty(int *amaster, int *aslave, char *name,
>>
>>  if (grantpt(mfd) == -1 || unlockpt(mfd) == -1)
>>  goto err;
>> -
>> +#if defined(CONFIG_ANDROID)
>> +if (ptsname_r(mfd, pty_buf, PATH_MAX) < 0)
>> +goto err;
>> +#endif
>>  if ((slave = ptsname(mfd)) == NULL)
>>  goto err;
>>
>
>
> Better:
>
> #if defined(CONFIG_ANDROID)
> char slave[PATH_MAX];
> #else
> const char *slave;
> #endif
>
> ...
>
> #if defined(CONFIG_ANDROID)
> if (ptsname_r(mfd, slave, PATH_MAX) < 0)
> goto err;
> #else
> if ((slave = ptsname(mfd)) == NULL)
> goto err;
> #endif

Hi Paolo,
Okay and I will send the patch with these modifications soon. Thanks!

-- 
Best regards,
Houcheng Lin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v4] os-android: Add support to android platform

2015-10-02 Thread Houcheng Lin
Hi,

The v4 patch uses qemu_getdtablesize(), remove os-android.h and
use a local buffer when calling ptsname_r(). This patch is based
on version: 007e620a7576e4ce2ea6955541e87d8ae8ed32ae.

---

Building QEMU on android reqiures android NDK r10 cross-compilation
toolchain with following changes:

- install qemu dependent libraries
- upgrade bionic libc
- add mssing includes, scsi/sg.h, in toolchain
- make a symbolic link x86_64-linux-android-pkg-config that links
  to host's pkg-config.

Then, configure and build static linked qemu by commands:
  $ export SYSROOT="/opt/android-toolchain64/sysroot"
  $ PKG_CONFIG_LIBDIR=/opt/android-toolchain64/sysroot/usr/lib/pkgconfig \
./configure \
--cross-prefix=x86_64-linux-android- --enable-kvm \
--enable-trace-backend=nop --disable-fdt --target-list=x86_64-softmmu \
--disable-spice --disable-vhost-net --disable-libiscsi \
--audio-drv-list="" --disable-gtk --disable-gnutls \
--disable-libnfs --disable-glusterfs --disable-libssh2 \
--disable-seccomp --disable-usb-redir --disable-libusb \
--disable-guest-agent --static
  $ make -j4

For dynamic build, you can skip the "static" option during configure, copy
dependent so files into android and add "." into LD_LIBRARY_PATH.

How to prepare your cross-compilation toolcahin
---

1. Download NDK r10, install toolchain from NDK and build the following 
libraries and install in your toolchain sysroot:
libiconv-1.14
gettext-0.19
libffi-3.0.12
glib-2.34.3
libpng-1.2.52
pixman-0.30

2. Download AOSP and apply this patch I made to support posix lockf()
https://android-review.googlesource.com/#/c/172150/

3. Build bionic C and update your toolchain's libc.a and libc.so.

4. Copy kernel header file, scsi/sg.h into toolchain's 
sysroot/usr/includes/scsi/

5. Update these header files in your toolchain to prevent compilation warning,
   includes:
unistd.h for lockf() and related define
sys/uio.h for pread() and pwrite()
signal.h for sigtimedwait()

Signed-off-by: Houcheng Lin 
---
 configure  | 14 --
 default-configs/pci.mak|  2 +-
 hw/i386/kvm/pci-assign.c   |  1 -
 hw/xenpv/xen_domainbuild.c |  2 +-
 include/qemu/osdep.h   | 10 ++
 kvm-all.c  |  4 
 slirp/misc.c   |  2 +-
 tests/Makefile |  2 ++
 util/oslib-posix.c | 16 
 util/qemu-openpty.c| 21 -
 10 files changed, 63 insertions(+), 11 deletions(-)

diff --git a/configure b/configure
index d7c24cd..cda88c1 100755
--- a/configure
+++ b/configure
@@ -567,7 +567,6 @@ fi
 
 # host *BSD for user mode
 HOST_VARIANT_DIR=""
-
 case $targetos in
 CYGWIN*)
   mingw32="yes"
@@ -693,6 +692,13 @@ Haiku)
   vhost_net="yes"
   vhost_scsi="yes"
   QEMU_INCLUDES="-I\$(SRC_PATH)/linux-headers -I$(pwd)/linux-headers 
$QEMU_INCLUDES"
+  case $cross_prefix in
+*android*)
+  android="yes"
+;;
+*)
+;;
+  esac
 ;;
 esac
 
@@ -3791,7 +3797,7 @@ elif compile_prog "" "$pthread_lib -lrt" ; then
 fi
 
 if test "$darwin" != "yes" -a "$mingw32" != "yes" -a "$solaris" != yes -a \
-"$aix" != "yes" -a "$haiku" != "yes" ; then
+"$aix" != "yes" -a "$haiku" != "yes" -a "$android" != "yes" ; then
 libs_softmmu="-lutil $libs_softmmu"
 fi
 
@@ -4737,6 +4743,10 @@ if test "$linux" = "yes" ; then
   echo "CONFIG_LINUX=y" >> $config_host_mak
 fi
 
+if test "$android" = "yes" ; then
+  echo "CONFIG_ANDROID=y" >> $config_host_mak
+fi
+
 if test "$darwin" = "yes" ; then
   echo "CONFIG_DARWIN=y" >> $config_host_mak
 fi
diff --git a/default-configs/pci.mak b/default-configs/pci.mak
index 7e10903..e76dd41 100644
--- a/default-configs/pci.mak
+++ b/default-configs/pci.mak
@@ -35,5 +35,5 @@ CONFIG_SDHCI=y
 CONFIG_EDU=y
 CONFIG_VGA=y
 CONFIG_VGA_PCI=y
-CONFIG_IVSHMEM=$(CONFIG_KVM)
+CONFIG_IVSHMEM=$(call land,$(call lnot,$(CONFIG_ANDROID)),$(CONFIG_KVM))
 CONFIG_ROCKER=y
diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c
index b1beaa6..44beee3 100644
--- a/hw/i386/kvm/pci-assign.c
+++ b/hw/i386/kvm/pci-assign.c
@@ -22,7 +22,6 @@
  */
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/hw/xenpv/xen_domainbuild.c b/hw/xenpv/xen_domainbuild.c
index c0ab753..480ee87 100644
--- a/hw/xenpv/xen_domainbuild.c
+++ b/hw/xenpv/xen_domainbuild.c
@@ -170,7 +170,7 @@ static int xen_domain_watcher(void)
 
 /* close all file handles, except stdio/out/err,
  * our watch pipe and the xen interface handle */
-n = getdtablesize();
+n = qemu_getdtablesize();
 for (i = 3; i < n; i++) {
 if (i == fd[0])
 continue;
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index ab3c876..9e26d10 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -74,6 +74