Re: [PATCH] powerpc/mpic: Remove WHOAMI readback after EOI

2015-05-13 Thread Purcareata Bogdan

Ping?

On 24.03.2015 12:43, Bogdan Purcareata wrote:

After previous discussions regarding the subject [1][2], there's no clear
explanation or reason why the call was needed in the first place. The sensible
argument is some sort of synchronization between the CPU and the MPIC, which
hasn't been pointed out precisely and is no longer required (at least on BookE
platforms).

The benefit of this change is saving a MMIO trap per interrupt when running in a
KVM guest.

[1] https://patchwork.ozlabs.org/patch/429098/
[2] https://patchwork.ozlabs.org/patch/433557/

Signed-off-by: Bogdan Purcareata 
---
  arch/powerpc/sysdev/mpic.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c
index bbfbbf2..045e72a9 100644
--- a/arch/powerpc/sysdev/mpic.c
+++ b/arch/powerpc/sysdev/mpic.c
@@ -655,7 +655,6 @@ static inline struct mpic * mpic_from_irq_data(struct 
irq_data *d)
  static inline void mpic_eoi(struct mpic *mpic)
  {
mpic_cpu_write(MPIC_INFO(CPU_EOI), 0);
-   (void)mpic_cpu_read(MPIC_INFO(CPU_WHOAMI));
  }

  /*


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-04-26 Thread Purcareata Bogdan

On 24.04.2015 00:26, Scott Wood wrote:

On Thu, 2015-04-23 at 15:31 +0300, Purcareata Bogdan wrote:

On 23.04.2015 03:30, Scott Wood wrote:

On Wed, 2015-04-22 at 15:06 +0300, Purcareata Bogdan wrote:

On 21.04.2015 03:52, Scott Wood wrote:

On Mon, 2015-04-20 at 13:53 +0300, Purcareata Bogdan wrote:

There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner
function is kvmppc_set_epr, which is a static inline. Removing the static inline
yields a compiler crash (Segmentation fault (core dumped) -
scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed),
but that's a different story, so I just let it be for now. Point is the time may
include other work after the lock has been released, but before the function
actually returned. I noticed this was the case for .kvm_set_msi, which could
work up to 90 ms, not actually under the lock. This made me change what I'm
looking at.


kvm_set_msi does pretty much nothing outside the lock -- I suspect
you're measuring an interrupt that happened as soon as the lock was
released.


That's exactly right. I've seen things like a timer interrupt occuring right
after the spinlock_irqrestore, but before kvm_set_msi actually returned.

[...]


Or perhaps a different stress scenario involving a lot of VCPUs
and external interrupts?


You could instrument the MPIC code to find out how many loop iterations
you maxed out on, and compare that to the theoretical maximum.


Numbers are pretty low, and I'll try to explain based on my observations.

The problematic section in openpic_update_irq is this [1], since it loops
through all VCPUs, and IRQ_local_pipe further calls IRQ_check, which loops
through all pending interrupts for a VCPU [2].

The guest interfaces are virtio-vhostnet, which are based on MSI
(/proc/interrupts in guest shows they are MSI). For external interrupts to the
guest, the irq_source destmask is currently 0, and last_cpu is 0 (unitialized),
so [1] will go on and deliver the interrupt directly and unicast (no VCPUs 
loop).

I activated the pr_debugs in arch/powerpc/kvm/mpic.c, to see how many interrupts
are actually pending for the destination VCPU. At most, there were 3 interrupts
- n_IRQ = {224,225,226} - even for 24 flows of ping flood. I understand that
guest virtio interrupts are cascaded over 1 or a couple of shared MSI 
interrupts.

So worst case, in this scenario, was checking the priorities for 3 pending
interrupts for 1 VCPU. Something like this (some of my prints included):

[61010.582033] openpic_update_irq: destmask 1 last_cpu 0
[61010.582034] openpic_update_irq: Only one CPU is allowed to receive this IRQ
[61010.582036] IRQ_local_pipe: IRQ 224 active 0 was 1
[61010.582037] IRQ_check: irq 226 set ivpr_pr=8 pr=-1
[61010.582038] IRQ_check: irq 225 set ivpr_pr=8 pr=-1
[61010.582039] IRQ_check: irq 224 set ivpr_pr=8 pr=-1

It would be really helpful to get your comments regarding whether these are
realistical number for everyday use, or they are relevant only to this
particular scenario.


RT isn't about "realistic numbers for everyday use".  It's about worst
cases.


- Can these interrupts be used in directed delivery, so that the destination
mask can include multiple VCPUs?


The Freescale MPIC does not support multiple destinations for most
interrupts, but the (non-FSL-specific) emulation code appears to allow
it.


   The MPIC manual states that timer and IPI
interrupts are supported for directed delivery, altough I'm not sure how much of
this is used in the emulation. I know that kvmppc uses the decrementer outside
of the MPIC.

- How are virtio interrupts cascaded over the shared MSI interrupts?
/proc/device-tree/soc@e000/msi@41600/interrupts in the guest shows 8 values
- 224 - 231 - so at most there might be 8 pending interrupts in IRQ_check, is
that correct?


It looks like that's currently the case, but actual hardware supports
more than that, so it's possible (albeit unlikely any time soon) that
the emulation eventually does as well.

But it's possible to have interrupts other than MSIs...


Right.

So given that the raw spinlock conversion is not suitable for all the scenarios
supported by the OpenPIC emulation, is it ok that my next step would be to send
a patch containing both the raw spinlock conversion and a mandatory disable of
the in-kernel MPIC? This is actually the last conclusion we came up with some
time ago, but I guess it was good to get some more insight on how things
actually work (at least for me).


Fine with me.  Have you given any thought to ways to restructure the
code to eliminate the problem?


My first thought would be to create a separate lock for each VCPU pending 
interrupts queue, so that we make the whole openpic_irq_update more granular. 
However, this is just a very preliminary thought. Before I can come up with 
anything worthy of consideration, I must read the OpenPIC specification a

Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-04-23 Thread Purcareata Bogdan

On 23.04.2015 03:30, Scott Wood wrote:

On Wed, 2015-04-22 at 15:06 +0300, Purcareata Bogdan wrote:

On 21.04.2015 03:52, Scott Wood wrote:

On Mon, 2015-04-20 at 13:53 +0300, Purcareata Bogdan wrote:

There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner
function is kvmppc_set_epr, which is a static inline. Removing the static inline
yields a compiler crash (Segmentation fault (core dumped) -
scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed),
but that's a different story, so I just let it be for now. Point is the time may
include other work after the lock has been released, but before the function
actually returned. I noticed this was the case for .kvm_set_msi, which could
work up to 90 ms, not actually under the lock. This made me change what I'm
looking at.


kvm_set_msi does pretty much nothing outside the lock -- I suspect
you're measuring an interrupt that happened as soon as the lock was
released.


That's exactly right. I've seen things like a timer interrupt occuring right
after the spinlock_irqrestore, but before kvm_set_msi actually returned.

[...]


   Or perhaps a different stress scenario involving a lot of VCPUs
and external interrupts?


You could instrument the MPIC code to find out how many loop iterations
you maxed out on, and compare that to the theoretical maximum.


Numbers are pretty low, and I'll try to explain based on my observations.

The problematic section in openpic_update_irq is this [1], since it loops
through all VCPUs, and IRQ_local_pipe further calls IRQ_check, which loops
through all pending interrupts for a VCPU [2].

The guest interfaces are virtio-vhostnet, which are based on MSI
(/proc/interrupts in guest shows they are MSI). For external interrupts to the
guest, the irq_source destmask is currently 0, and last_cpu is 0 (unitialized),
so [1] will go on and deliver the interrupt directly and unicast (no VCPUs 
loop).

I activated the pr_debugs in arch/powerpc/kvm/mpic.c, to see how many interrupts
are actually pending for the destination VCPU. At most, there were 3 interrupts
- n_IRQ = {224,225,226} - even for 24 flows of ping flood. I understand that
guest virtio interrupts are cascaded over 1 or a couple of shared MSI 
interrupts.

So worst case, in this scenario, was checking the priorities for 3 pending
interrupts for 1 VCPU. Something like this (some of my prints included):

[61010.582033] openpic_update_irq: destmask 1 last_cpu 0
[61010.582034] openpic_update_irq: Only one CPU is allowed to receive this IRQ
[61010.582036] IRQ_local_pipe: IRQ 224 active 0 was 1
[61010.582037] IRQ_check: irq 226 set ivpr_pr=8 pr=-1
[61010.582038] IRQ_check: irq 225 set ivpr_pr=8 pr=-1
[61010.582039] IRQ_check: irq 224 set ivpr_pr=8 pr=-1

It would be really helpful to get your comments regarding whether these are
realistical number for everyday use, or they are relevant only to this
particular scenario.


RT isn't about "realistic numbers for everyday use".  It's about worst
cases.


- Can these interrupts be used in directed delivery, so that the destination
mask can include multiple VCPUs?


The Freescale MPIC does not support multiple destinations for most
interrupts, but the (non-FSL-specific) emulation code appears to allow
it.


  The MPIC manual states that timer and IPI
interrupts are supported for directed delivery, altough I'm not sure how much of
this is used in the emulation. I know that kvmppc uses the decrementer outside
of the MPIC.

- How are virtio interrupts cascaded over the shared MSI interrupts?
/proc/device-tree/soc@e000/msi@41600/interrupts in the guest shows 8 values
- 224 - 231 - so at most there might be 8 pending interrupts in IRQ_check, is
that correct?


It looks like that's currently the case, but actual hardware supports
more than that, so it's possible (albeit unlikely any time soon) that
the emulation eventually does as well.

But it's possible to have interrupts other than MSIs...


Right.

So given that the raw spinlock conversion is not suitable for all the scenarios 
supported by the OpenPIC emulation, is it ok that my next step would be to send 
a patch containing both the raw spinlock conversion and a mandatory disable of 
the in-kernel MPIC? This is actually the last conclusion we came up with some 
time ago, but I guess it was good to get some more insight on how things 
actually work (at least for me).


Thanks,
Bogdan P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-04-22 Thread Purcareata Bogdan

On 21.04.2015 03:52, Scott Wood wrote:

On Mon, 2015-04-20 at 13:53 +0300, Purcareata Bogdan wrote:

There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner
function is kvmppc_set_epr, which is a static inline. Removing the static inline
yields a compiler crash (Segmentation fault (core dumped) -
scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed),
but that's a different story, so I just let it be for now. Point is the time may
include other work after the lock has been released, but before the function
actually returned. I noticed this was the case for .kvm_set_msi, which could
work up to 90 ms, not actually under the lock. This made me change what I'm
looking at.


kvm_set_msi does pretty much nothing outside the lock -- I suspect
you're measuring an interrupt that happened as soon as the lock was
released.


That's exactly right. I've seen things like a timer interrupt occuring right 
after the spinlock_irqrestore, but before kvm_set_msi actually returned.


[...]


  Or perhaps a different stress scenario involving a lot of VCPUs
and external interrupts?


You could instrument the MPIC code to find out how many loop iterations
you maxed out on, and compare that to the theoretical maximum.


Numbers are pretty low, and I'll try to explain based on my observations.

The problematic section in openpic_update_irq is this [1], since it loops 
through all VCPUs, and IRQ_local_pipe further calls IRQ_check, which loops 
through all pending interrupts for a VCPU [2].


The guest interfaces are virtio-vhostnet, which are based on MSI 
(/proc/interrupts in guest shows they are MSI). For external interrupts to the 
guest, the irq_source destmask is currently 0, and last_cpu is 0 (unitialized), 
so [1] will go on and deliver the interrupt directly and unicast (no VCPUs loop).


I activated the pr_debugs in arch/powerpc/kvm/mpic.c, to see how many interrupts 
are actually pending for the destination VCPU. At most, there were 3 interrupts 
- n_IRQ = {224,225,226} - even for 24 flows of ping flood. I understand that 
guest virtio interrupts are cascaded over 1 or a couple of shared MSI interrupts.


So worst case, in this scenario, was checking the priorities for 3 pending 
interrupts for 1 VCPU. Something like this (some of my prints included):


[61010.582033] openpic_update_irq: destmask 1 last_cpu 0
[61010.582034] openpic_update_irq: Only one CPU is allowed to receive this IRQ
[61010.582036] IRQ_local_pipe: IRQ 224 active 0 was 1
[61010.582037] IRQ_check: irq 226 set ivpr_pr=8 pr=-1
[61010.582038] IRQ_check: irq 225 set ivpr_pr=8 pr=-1
[61010.582039] IRQ_check: irq 224 set ivpr_pr=8 pr=-1

It would be really helpful to get your comments regarding whether these are 
realistical number for everyday use, or they are relevant only to this 
particular scenario.


- Can these interrupts be used in directed delivery, so that the destination 
mask can include multiple VCPUs? The MPIC manual states that timer and IPI 
interrupts are supported for directed delivery, altough I'm not sure how much of 
this is used in the emulation. I know that kvmppc uses the decrementer outside 
of the MPIC.


- How are virtio interrupts cascaded over the shared MSI interrupts? 
/proc/device-tree/soc@e000/msi@41600/interrupts in the guest shows 8 values 
- 224 - 231 - so at most there might be 8 pending interrupts in IRQ_check, is 
that correct?


Looking forward to your feedback.

[1] http://lxr.free-electrons.com/source/arch/powerpc/kvm/mpic.c#L454
[2] http://lxr.free-electrons.com/source/arch/powerpc/kvm/mpic.c#L303
[3] 
https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/F27971551C9EED8E8525774A0048770A/$file/mpic_db_05_16_2011.pdf


Best regards,
Bogdan P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-04-20 Thread Purcareata Bogdan

On 10.04.2015 02:53, Scott Wood wrote:

On Thu, 2015-04-09 at 10:44 +0300, Purcareata Bogdan wrote:

So at this point I was getting kinda frustrated so I decided to measure
the time spend in kvm_mpic_write and kvm_mpic_read. I assumed these were
the main entry points in the in-kernel MPIC and were basically executed
while holding the spinlock. The scenario was the same - 24 VCPUs guest,
with 24 virtio+vhost interfaces, only this time I ran 24 ping flood
threads to another board instead of netperf. I assumed this would impose
a heavier stress.

The latencies look pretty ok, around 1-2 us on average, with the max
shown below:

.kvm_mpic_read  14.560
.kvm_mpic_write 12.608

Those are also microseconds. This was run for about 15 mins.


What about other entry points such as kvm_set_msi() and
kvmppc_mpic_set_epr()?


Thanks for the pointers! I redid the measurements, this time for the functions 
run with the openpic lock down:


.kvm_mpic_read_internal (.kvm_mpic_read)1.664
.kvmppc_mpic_set_epr6.880
.kvm_mpic_write_internal (.kvm_mpic_write)  7.840
.openpic_msi_write (.kvm_set_msi)   10.560

Same scenario, 15 mins, numbers are microseconds.

There was a weird situation for .kvmppc_mpic_set_epr - its corresponding inner 
function is kvmppc_set_epr, which is a static inline. Removing the static inline 
yields a compiler crash (Segmentation fault (core dumped) - 
scripts/Makefile.build:441: recipe for target 'arch/powerpc/kvm/kvm.o' failed), 
but that's a different story, so I just let it be for now. Point is the time may 
include other work after the lock has been released, but before the function 
actually returned. I noticed this was the case for .kvm_set_msi, which could 
work up to 90 ms, not actually under the lock. This made me change what I'm 
looking at.


So far it looks pretty decent. Are there any other MPIC entry points worthy of 
investigation? Or perhaps a different stress scenario involving a lot of VCPUs 
and external interrupts?


Thanks,
Bogdan P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-04-09 Thread Purcareata Bogdan

On 04.04.2015 00:26, Scott Wood wrote:

On Fri, 2015-04-03 at 11:07 +0300, Purcareata Bogdan wrote:

On 03.04.2015 02:11, Scott Wood wrote:

On Fri, 2015-03-27 at 19:07 +0200, Purcareata Bogdan wrote:

On 27.02.2015 03:05, Scott Wood wrote:

On Thu, 2015-02-26 at 14:31 +0100, Sebastian Andrzej Siewior wrote:

On 02/26/2015 02:02 PM, Paolo Bonzini wrote:



On 24/02/2015 00:27, Scott Wood wrote:

This isn't a host PIC driver.  It's guest PIC emulation, some of which
is indeed not suitable for a rawlock (in particular, openpic_update_irq
which loops on the number of vcpus, with a loop body that calls
IRQ_check() which loops over all pending IRQs).


The question is what behavior is wanted of code that isn't quite
RT-ready.  What is preferred, bugs or bad latency?

If the answer is bad latency (which can be avoided simply by not running
KVM on a RT kernel in production), patch 1 can be applied.  If the

can be applied *but* makes no difference if applied or not.


answer is bugs, patch 1 is not upstream material.

I myself prefer to have bad latency; if something takes a spinlock in
atomic context, that spinlock should be raw.  If it hurts (latency),
don't do it (use the affected code).


The problem, that is fixed by this s/spin_lock/raw_spin_lock/, exists
only in -RT. There is no change upstream. In general we fix such things
in -RT first and forward the patches upstream if possible. This convert
thingy would be possible.
Bug fixing comes before latency no matter if RT or not. Converting
every lock into a rawlock is not always the answer.
Last thing I read from Scott is that he is not entirely sure if this is
the right approach or not and patch #1 was not acked-by him either.

So for now I wait for Scott's feedback and maybe a backtrace :)


Obviously leaving it in a buggy state is not what we want -- but I lean
towards a short term "fix" of putting "depends on !PREEMPT_RT" on the
in-kernel MPIC emulation (which is itself just an optimization -- you
can still use KVM without it).  This way people don't enable it with RT
without being aware of the issue, and there's more of an incentive to
fix it properly.

I'll let Bogdan supply the backtrace.


So about the backtrace. Wasn't really sure how to "catch" this, so what
I did was to start a 24 VCPUs guest on a 24 CPU board, and in the guest
run 24 netperf flows with an external back to back board of the same
kind. I assumed this would provide the sufficient VCPUs and external
interrupt to expose an alleged culprit.

With regards to measuring the latency, I thought of using ftrace,
specifically the preemptirqsoff latency histogram. Unfortunately, I
wasn't able to capture any major differences between running a guest
with in-kernel MPIC emulation (with the openpic raw_spinlock_conversion
applied) vs. no in-kernel MPIC emulation. Function profiling
(trace_stat) shows that in the second case there's a far greater time
spent in kvm_handle_exit (100x), but overall, the maximum latencies for
preemptirqsoff don't look that much different.

Here are the max numbers (preemptirqsoff) for the 24 CPUs, on the host
RT Linux, sorted in descending order, expressed in microseconds:

In-kernel MPIC  QEMU MPIC
39755105


What are you measuring?  Latency in the host, or in the guest?


This is in the host kernel.


Those are terrible numbers in both cases.  Can you use those tracing
tools to find out what the code path is for QEMU MPIC?


After more careful inspection, I noticed that those big-big numbers
(couple of milliseconds) are isolated cases, and in fact 99.99% of those
latencies top to somewhere around 800us. I also have a feeling that the
isolated huge latencies might have something to do with
enabling/disabling tracing, since those numbers don't come up at all in
the actual trace output, only in the latency histogram. From what I
know, there are two separate mechanisms - the function tracer and the
latency histogram.

Now, about that max 800us - there are 2 things that are enabled by
default, and can cause bad latency:
1. scheduler load balancing - latency can top to up to 800us (as seen in
the trace output).
2. RT throttling - which calls sched_rt_period_timer, which cycles
through the runqueues of all CPUs - latency can top to 600us.

I'm mentioning these since the trace output for the max preemptirqsoff
period was always "stolen" by these activities, basically hiding
anything related to the kvm in-kernel openpic.

I disabled both of them, and now the max preemptirqsoff trace shows a
transition between a vhost thread and the qemu process, involving a
timer and external interrupt (do_irq), which you can see at the end of
this e-mail. Not much particularly related to the kvm openpic (but
perhaps I'm not able to understand it entirely). The trace for QEMU
MPIC looks pretty much the same.

So at this point I was getting kinda f

Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-04-03 Thread Purcareata Bogdan

On 03.04.2015 02:11, Scott Wood wrote:

On Fri, 2015-03-27 at 19:07 +0200, Purcareata Bogdan wrote:

On 27.02.2015 03:05, Scott Wood wrote:

On Thu, 2015-02-26 at 14:31 +0100, Sebastian Andrzej Siewior wrote:

On 02/26/2015 02:02 PM, Paolo Bonzini wrote:



On 24/02/2015 00:27, Scott Wood wrote:

This isn't a host PIC driver.  It's guest PIC emulation, some of which
is indeed not suitable for a rawlock (in particular, openpic_update_irq
which loops on the number of vcpus, with a loop body that calls
IRQ_check() which loops over all pending IRQs).


The question is what behavior is wanted of code that isn't quite
RT-ready.  What is preferred, bugs or bad latency?

If the answer is bad latency (which can be avoided simply by not running
KVM on a RT kernel in production), patch 1 can be applied.  If the

can be applied *but* makes no difference if applied or not.


answer is bugs, patch 1 is not upstream material.

I myself prefer to have bad latency; if something takes a spinlock in
atomic context, that spinlock should be raw.  If it hurts (latency),
don't do it (use the affected code).


The problem, that is fixed by this s/spin_lock/raw_spin_lock/, exists
only in -RT. There is no change upstream. In general we fix such things
in -RT first and forward the patches upstream if possible. This convert
thingy would be possible.
Bug fixing comes before latency no matter if RT or not. Converting
every lock into a rawlock is not always the answer.
Last thing I read from Scott is that he is not entirely sure if this is
the right approach or not and patch #1 was not acked-by him either.

So for now I wait for Scott's feedback and maybe a backtrace :)


Obviously leaving it in a buggy state is not what we want -- but I lean
towards a short term "fix" of putting "depends on !PREEMPT_RT" on the
in-kernel MPIC emulation (which is itself just an optimization -- you
can still use KVM without it).  This way people don't enable it with RT
without being aware of the issue, and there's more of an incentive to
fix it properly.

I'll let Bogdan supply the backtrace.


So about the backtrace. Wasn't really sure how to "catch" this, so what
I did was to start a 24 VCPUs guest on a 24 CPU board, and in the guest
run 24 netperf flows with an external back to back board of the same
kind. I assumed this would provide the sufficient VCPUs and external
interrupt to expose an alleged culprit.

With regards to measuring the latency, I thought of using ftrace,
specifically the preemptirqsoff latency histogram. Unfortunately, I
wasn't able to capture any major differences between running a guest
with in-kernel MPIC emulation (with the openpic raw_spinlock_conversion
applied) vs. no in-kernel MPIC emulation. Function profiling
(trace_stat) shows that in the second case there's a far greater time
spent in kvm_handle_exit (100x), but overall, the maximum latencies for
preemptirqsoff don't look that much different.

Here are the max numbers (preemptirqsoff) for the 24 CPUs, on the host
RT Linux, sorted in descending order, expressed in microseconds:

In-kernel MPIC  QEMU MPIC
39755105


What are you measuring?  Latency in the host, or in the guest?


This is in the host kernel. It's the maximum continuous period of time 
when both interrupts and preemption were disabled on the host kernel 
(basically making it unresponsive). This has been tracked while the 
guest was running with high prio, with 24 VCPUs, and in the guest there 
were 24 netperf flows - so a lot of VCPUs and a lot of external 
interrupts - for about 15 minutes.


Bogdan P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-03-27 Thread Purcareata Bogdan

On 27.02.2015 03:05, Scott Wood wrote:

On Thu, 2015-02-26 at 14:31 +0100, Sebastian Andrzej Siewior wrote:

On 02/26/2015 02:02 PM, Paolo Bonzini wrote:



On 24/02/2015 00:27, Scott Wood wrote:

This isn't a host PIC driver.  It's guest PIC emulation, some of which
is indeed not suitable for a rawlock (in particular, openpic_update_irq
which loops on the number of vcpus, with a loop body that calls
IRQ_check() which loops over all pending IRQs).


The question is what behavior is wanted of code that isn't quite
RT-ready.  What is preferred, bugs or bad latency?

If the answer is bad latency (which can be avoided simply by not running
KVM on a RT kernel in production), patch 1 can be applied.  If the

can be applied *but* makes no difference if applied or not.


answer is bugs, patch 1 is not upstream material.

I myself prefer to have bad latency; if something takes a spinlock in
atomic context, that spinlock should be raw.  If it hurts (latency),
don't do it (use the affected code).


The problem, that is fixed by this s/spin_lock/raw_spin_lock/, exists
only in -RT. There is no change upstream. In general we fix such things
in -RT first and forward the patches upstream if possible. This convert
thingy would be possible.
Bug fixing comes before latency no matter if RT or not. Converting
every lock into a rawlock is not always the answer.
Last thing I read from Scott is that he is not entirely sure if this is
the right approach or not and patch #1 was not acked-by him either.

So for now I wait for Scott's feedback and maybe a backtrace :)


Obviously leaving it in a buggy state is not what we want -- but I lean
towards a short term "fix" of putting "depends on !PREEMPT_RT" on the
in-kernel MPIC emulation (which is itself just an optimization -- you
can still use KVM without it).  This way people don't enable it with RT
without being aware of the issue, and there's more of an incentive to
fix it properly.

I'll let Bogdan supply the backtrace.


So about the backtrace. Wasn't really sure how to "catch" this, so what 
I did was to start a 24 VCPUs guest on a 24 CPU board, and in the guest 
run 24 netperf flows with an external back to back board of the same 
kind. I assumed this would provide the sufficient VCPUs and external 
interrupt to expose an alleged culprit.


With regards to measuring the latency, I thought of using ftrace, 
specifically the preemptirqsoff latency histogram. Unfortunately, I 
wasn't able to capture any major differences between running a guest 
with in-kernel MPIC emulation (with the openpic raw_spinlock_conversion 
applied) vs. no in-kernel MPIC emulation. Function profiling 
(trace_stat) shows that in the second case there's a far greater time 
spent in kvm_handle_exit (100x), but overall, the maximum latencies for 
preemptirqsoff don't look that much different.


Here are the max numbers (preemptirqsoff) for the 24 CPUs, on the host 
RT Linux, sorted in descending order, expressed in microseconds:


In-kernel MPIC  QEMU MPIC
39755105
20793972
13033557
11061725
447 907
423 853
362 723
343 182
260 121
133 116
131 116
118 115
116 114
114 114
114 114
114 99
113 99
103 98
98  98
95  97
87  96
83  83
83  82
80  81

I'm not sure if this captures openpic behavior or just scheduler behavior.

Anyways, I'm pro adding the openpic raw_spinlock conversion along with 
disabling the in-kernel MPIC emulation for upstream. But just wanted to 
catch up with this last request from a while ago.


Do you think it would be better to just submit the new patch or should I 
do some further testing? Do you have any suggestions regarding what else 
I should look at / how to test?


Thank you,
Bogdan P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4 0/3] powerpc: Enable seccomp filter support

2015-03-23 Thread Purcareata Bogdan

On 27.02.2015 22:54, Benjamin Herrenschmidt wrote:

On Fri, 2015-02-27 at 09:28 +0200, Purcareata Bogdan wrote:

Ping?


What is the ping for ?

Ben.


Hello Ben,

I just wanted to check with you what's the current status of these 
patches. I noticed in patchwork [1][2][3] that the patches are marked as 
non-applicable.


As of today, I cloned Michael Ellerman's tree [4], applied the patches 
on the master branch, compiled and tested. Tests pass both with the 
libseccomp regression suite and my LXC tests.


Is there a specific tree I should send them against, or on another 
mailing list? Is there any other reason the patches are not applicable?


[1] https://patchwork.ozlabs.org/patch/440827/
[2] https://patchwork.ozlabs.org/patch/440828/
[3] https://patchwork.ozlabs.org/patch/440829/
[4] git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux.git

Thank you,
Bogdan P.


On 18.02.2015 10:16, Bogdan Purcareata wrote:

Add the missing pieces in order to enable SECCOMP_FILTER on PowerPC
architectures, and enable this support.

Testing has been pursued using libseccomp with the latest ppc support patches
[1][2], on Freescale platforms for both ppc and ppc64. Support on ppc64le has
also been tested, courtesy of Mike Strosaker.

[1] https://groups.google.com/forum/#!topic/libseccomp/oz42LfMDsxg
[2] https://groups.google.com/forum/#!topic/libseccomp/TQWfCt_nD7c

v4:
- rebased on top of 3.19

v3:
- keep setting ENOSYS in syscall entry assembly when syscall tracing is disabled

v2:
- move setting ENOSYS from syscall entry assembly to do_syscall_trace_enter

Bogdan Purcareata (3):
powerpc: Don't force ENOSYS as error on syscall fail
powerpc: Relax secure computing on syscall entry trace
powerpc: Enable HAVE_ARCH_SECCOMP_FILTER

   arch/powerpc/Kconfig   | 1 +
   arch/powerpc/kernel/entry_32.S | 7 ++-
   arch/powerpc/kernel/entry_64.S | 5 +++--
   arch/powerpc/kernel/ptrace.c   | 8 ++--
   4 files changed, 16 insertions(+), 5 deletions(-)





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4 0/3] powerpc: Enable seccomp filter support

2015-03-09 Thread Purcareata Bogdan

On 27.02.2015 22:54, Benjamin Herrenschmidt wrote:

On Fri, 2015-02-27 at 09:28 +0200, Purcareata Bogdan wrote:

Ping?


What is the ping for ?

Ben.


Making sure the patches are not lost on the mailing lists :) Didn't 
receive any feedback on v4 and just wanted to check if there's anything 
more I can do.


Thank you,
Bogdan P.


On 18.02.2015 10:16, Bogdan Purcareata wrote:

Add the missing pieces in order to enable SECCOMP_FILTER on PowerPC
architectures, and enable this support.

Testing has been pursued using libseccomp with the latest ppc support patches
[1][2], on Freescale platforms for both ppc and ppc64. Support on ppc64le has
also been tested, courtesy of Mike Strosaker.

[1] https://groups.google.com/forum/#!topic/libseccomp/oz42LfMDsxg
[2] https://groups.google.com/forum/#!topic/libseccomp/TQWfCt_nD7c

v4:
- rebased on top of 3.19

v3:
- keep setting ENOSYS in syscall entry assembly when syscall tracing is disabled

v2:
- move setting ENOSYS from syscall entry assembly to do_syscall_trace_enter

Bogdan Purcareata (3):
powerpc: Don't force ENOSYS as error on syscall fail
powerpc: Relax secure computing on syscall entry trace
powerpc: Enable HAVE_ARCH_SECCOMP_FILTER

   arch/powerpc/Kconfig   | 1 +
   arch/powerpc/kernel/entry_32.S | 7 ++-
   arch/powerpc/kernel/entry_64.S | 5 +++--
   arch/powerpc/kernel/ptrace.c   | 8 ++--
   4 files changed, 16 insertions(+), 5 deletions(-)





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4 0/3] powerpc: Enable seccomp filter support

2015-02-26 Thread Purcareata Bogdan

Ping?

On 18.02.2015 10:16, Bogdan Purcareata wrote:

Add the missing pieces in order to enable SECCOMP_FILTER on PowerPC
architectures, and enable this support.

Testing has been pursued using libseccomp with the latest ppc support patches
[1][2], on Freescale platforms for both ppc and ppc64. Support on ppc64le has
also been tested, courtesy of Mike Strosaker.

[1] https://groups.google.com/forum/#!topic/libseccomp/oz42LfMDsxg
[2] https://groups.google.com/forum/#!topic/libseccomp/TQWfCt_nD7c

v4:
- rebased on top of 3.19

v3:
- keep setting ENOSYS in syscall entry assembly when syscall tracing is disabled

v2:
- move setting ENOSYS from syscall entry assembly to do_syscall_trace_enter

Bogdan Purcareata (3):
   powerpc: Don't force ENOSYS as error on syscall fail
   powerpc: Relax secure computing on syscall entry trace
   powerpc: Enable HAVE_ARCH_SECCOMP_FILTER

  arch/powerpc/Kconfig   | 1 +
  arch/powerpc/kernel/entry_32.S | 7 ++-
  arch/powerpc/kernel/entry_64.S | 5 +++--
  arch/powerpc/kernel/ptrace.c   | 8 ++--
  4 files changed, 16 insertions(+), 5 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-02-23 Thread Purcareata Bogdan

On 20.02.2015 17:17, Sebastian Andrzej Siewior wrote:

On 02/20/2015 04:10 PM, Paolo Bonzini wrote:

On 20/02/2015 16:06, Sebastian Andrzej Siewior wrote:

On 02/20/2015 03:57 PM, Paolo Bonzini wrote:



Yes, but large latencies just mean the code has to be rewritten (x86
doesn't anymore do event injection in an atomic regions for example).
Until it is, using raw_spin_lock is correct.


It does not sound like it. It sounds more like disabling interrupts to
get things run faster and then limit it on a different corner to not
blow up everything.


"This patchset enables running KVM SMP guests with external interrupts
on an underlying RT-enabled Linux. Previous to this patch, a guest with
in-kernel MPIC emulation could easily panic the kernel due to preemption
when delivering IPIs and external interrupts, because of the openpic
spinlock becoming a sleeping mutex on PREEMPT_RT_FULL Linux".


Max latencies was decreased "Max latency (us)  7062" and that
is why this is done? For 8 us and possible DoS in case there are too
many cpus?


My understanding is that:

1) netperf can get you a BUG KVM, and raw_spinlock fixes that


Actually, it's not just netperf. The bug triggers in the following 
scenarios:
- running CPU intensive task (while true; do date; done) in SMP guest 
(even with 2 VCPUs)

- running netperf in guest
- running cyclictest in SMP guest


May I please see a backtrace with context tracking which states where
the interrupts / preemption gets disabled and where the lock was taken?


Will do, I will get back to you as soon as I have it available. I will 
try and capture it using function trace.



I'm not totally against this patch I just want to make sure this is not
a blind raw conversation to shup up the warning the kernel throws.


2) cyclictest did not trigger the BUG, and you can also get reduced
latency from using raw_spinlock.

I think we agree that (2) is not a factor in accepting the patch.

good :)



Paolo


Sebastian


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-02-22 Thread Purcareata Bogdan

On 20.02.2015 17:06, Sebastian Andrzej Siewior wrote:

On 02/20/2015 03:57 PM, Paolo Bonzini wrote:



On 20/02/2015 15:54, Sebastian Andrzej Siewior wrote:

Usually you see "scheduling while atomic" on -RT and convert them to
raw locks if it is appropriate.

Bogdan wrote in 2/2 that he needs to limit the number of CPUs in oder
not cause a DoS and large latencies in the host. I haven't seen an
answer to my why question. Because if the conversation leads to
large latencies in the host then it does not look right.

Each host PIC has a rawlock and does mostly just mask/unmask and the
raw lock makes sure the value written is not mixed up due to
preemption.
This hardly increase latencies because the "locked" path is very short.
If this conversation leads to higher latencies then the locked path is
too long and hardly suitable to become a rawlock.


Yes, but large latencies just mean the code has to be rewritten (x86
doesn't anymore do event injection in an atomic regions for example).
Until it is, using raw_spin_lock is correct.


It does not sound like it. It sounds more like disabling interrupts to
get things run faster and then limit it on a different corner to not
blow up everything.
Max latencies was decreased "Max latency (us)  7062" and that
is why this is done? For 8 us and possible DoS in case there are too
many cpus?


The main reason for this patch was to enable KVM guests to run on RT 
hosts in certain scenarios, such as delivering external interrupts to 
the guest and the guest being SMP. The cyclictest measurements were just 
a "sanity check" to make sure the latencies don't get messed up too bad, 
albeit in a light scenario (guest with 1 VCPU), for a use case when the 
guest is not SMP and doesn't have any external interrupts delivered. 
This latter scenario works even without the kvm openpic being a 
raw_spinlock.


Previous to this patch, KVM was indeed blowing up on guest_enter [1], 
and making the openpic lock a raw_spinlock fixes that, without causing 
too much cyclictest damage when the guest doesn't have many VCPUs. I had 
a discussion with Scott Wood a while ago regarding delivering external 
interrupts to the guest, and he mentioned that the correct solution was 
to rework the entire interrupt delivery mechanism into multiple lock 
domains, minimize the code on the EPR path and the locking involved. 
Until that can be achieved, converting the openpic lock to a 
raw_spinlock would be acceptable, as long as we keep the number of guest 
VCPUs small, so as to not cause big host latencies.


[1] http://lxr.free-electrons.com/source/include/linux/kvm_host.h#L762

Best regards,
Bogdan P.


Paolo



Sebastian


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] powerpc/kvm: Enable running guests on RT Linux

2015-02-22 Thread Purcareata Bogdan

On 20.02.2015 16:54, Sebastian Andrzej Siewior wrote:

On 02/20/2015 03:12 PM, Paolo Bonzini wrote:

Thomas, what is the usual approach for patches like this? Do you take
them into your rt tree or should they get integrated to upstream?


Patch 1 is definitely suitable for upstream, that's the reason why we
have raw_spin_lock vs. raw_spin_unlock.


raw_spin_lock were introduced in c2f21ce2e31286a0a32 ("locking:
Implement new raw_spinlock). They are used in context which runs with
IRQs off - especially on -RT. This includes usually interrupt
controllers and related core-code pieces.

Usually you see "scheduling while atomic" on -RT and convert them to
raw locks if it is appropriate.

Bogdan wrote in 2/2 that he needs to limit the number of CPUs in oder
not cause a DoS and large latencies in the host. I haven't seen an
answer to my why question. Because if the conversation leads to
large latencies in the host then it does not look right.


What I did notice were bad cyclictest results, when run in a guest with 
24 VCPUs. There were 24 netperf flows running in the guest. The max 
cyclictest latencies got up to 15ms in the guest, however I haven't 
captured any host side information related to preemptirqs off statistics.


What I was planning to do in the past days was to rerun the test and 
come up with the host preemptirqs off disabled statistics (mainly the 
max latency), so I could have a more reliable argument. I haven't had 
the time nor the setup to do that yet, and will come back with this as 
soon as I have them available.



Each host PIC has a rawlock and does mostly just mask/unmask and the
raw lock makes sure the value written is not mixed up due to
preemption.
This hardly increase latencies because the "locked" path is very short.
If this conversation leads to higher latencies then the locked path is
too long and hardly suitable to become a rawlock.


From my understanding, the kvm openpic emulation code does more than 
just that - it requires to be atomic with interrupt delivery. This might 
mean the bad cyclictest max latencies visible from the guest side 
(15ms), may also have a correspondent to how much time that raw spinlock 
is taken, leading to an unresponsive host.


Best regards,
Bogdan P.


Paolo



Sebastian


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] KVM: PPC: Convert openpic lock to raw_spinlock

2015-02-18 Thread Purcareata Bogdan

On 17.02.2015 19:59, Sebastian Andrzej Siewior wrote:

* Sebastian Andrzej Siewior | 2015-02-17 18:53:17 [+0100]:


* Purcareata Bogdan | 2015-02-17 14:27:44 [+0200]:


Ping?

On 02.02.2015 11:35, Purcareata Bogdan wrote:

Ping?


No body?

bah! That mutt thing is too fast.

The raw conversation looks sane and could go upstream. This other chunk:

|+#if defined(CONFIG_PREEMPT_RT_FULL) && defined(CONFIG_KVM_MPIC)
|+/* Limit the number of vcpus due to in-kernel mpic concurrency */
|+#define KVM_MAX_VCPUS  4
|+#define KVM_MAX_VCORES 4
|+#else
| #define KVM_MAX_VCPUS  NR_CPUS
| #define KVM_MAX_VCORES NR_CPUS
|+#endif

should be a separate patch. Please repost including ppc ml.


Thanks! Will send a patchset separating these 2 functional changes - the 
openpic raw_spinlock for upstream ppc (since it doesn't bring any 
changes anyway), and the MAX_VCPUS limitation for the RT tree.



This remains of my multiple-MSI patch which someone other posted a while
ago. What happend to it?


I'm not aware of this patch, could you give more details, please?

Bogdan P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 1/3] powerpc: Don't force ENOSYS as error on syscall fail

2015-02-17 Thread Purcareata Bogdan

On 18.02.2015 05:01, Mike Strosaker wrote:

This patch failed to build using pseries_le_defconfig.  With the change
noted below in entry_64.S, the build succeeded and seccomp mode 2 worked
correctly.

Best,
Mike Strosaker

On 02/13/2015 02:22 AM, Bogdan Purcareata wrote:

In certain scenarios - e.g. seccomp filtering with ERRNO as default action -
the system call fails for other reasons than the syscall not being available.
The seccomp filter can be configured to store a user-defined error code on
return from a blacklisted syscall. Don't always set ENOSYS on
do_syscall_trace_enter failure.

Delegate setting ENOSYS in case of failure, where appropriate, to
do_syscall_trace_enter.

v3:
- keep setting ENOSYS in the syscall entry assembly for scenarios without
syscall tracing

v2:
- move setting ENOSYS as errno from the syscall entry assembly to
do_syscall_trace_enter, only in the specific case

Signed-off-by: Bogdan Purcareata 
---
   arch/powerpc/kernel/entry_32.S | 7 ++-
   arch/powerpc/kernel/entry_64.S | 5 +++--
   arch/powerpc/kernel/ptrace.c   | 4 +++-
   3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 46fc0f4..b2f88cd 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -333,12 +333,12 @@ _GLOBAL(DoSyscall)
lwz r11,TI_FLAGS(r10)
andi.   r11,r11,_TIF_SYSCALL_DOTRACE
bne-syscall_dotrace
-syscall_dotrace_cont:
cmplwi  0,r0,NR_syscalls
lis r10,sys_call_table@h
ori r10,r10,sys_call_table@l
slwir0,r0,2
bge-66f
+syscall_dotrace_cont:
lwzxr10,r10,r0  /* Fetch system call handler [ptr] */
mtlrr10
addir9,r1,STACK_FRAME_OVERHEAD
@@ -457,6 +457,11 @@ syscall_dotrace:
lwz r7,GPR7(r1)
lwz r8,GPR8(r1)
REST_NVGPRS(r1)
+   cmplwi  0,r0,NR_syscalls
+   lis r10,sys_call_table@h
+   ori r10,r10,sys_call_table@l
+   slwir0,r0,2
+   bge-ret_from_syscall
b   syscall_dotrace_cont

   syscall_exit_work:
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index d180caf2..0d22fa8 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -144,7 +144,6 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
ld  r10,TI_FLAGS(r11)
andi.   r11,r10,_TIF_SYSCALL_DOTRACE
bne syscall_dotrace
-.Lsyscall_dotrace_cont:
cmpldi  0,r0,NR_syscalls
bge-syscall_enosys

@@ -253,7 +252,9 @@ syscall_dotrace:
addir9,r1,STACK_FRAME_OVERHEAD
CURRENT_THREAD_INFO(r10, r1)
ld  r10,TI_FLAGS(r10)
-   b   .Lsyscall_dotrace_cont
+   cmpldi  0,r0,NR_syscalls
+   bge-syscall_exit


Shouldn't this be .Lsyscall_exit?


Thanks for testing and spotting this! The kernel I tested with didn't 
have syscall_exit converted to a local label (commit 
4c3b21686111e0ac6018469dacbc5549f9915cf8).


Will resend with this change.

Bogdan P.


+   b   system_call

   syscall_enosys:
li  r3,-ENOSYS
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index f21897b..2edae06 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -1775,13 +1775,15 @@ long do_syscall_trace_enter(struct pt_regs *regs)
secure_computing_strict(regs->gpr[0]);

if (test_thread_flag(TIF_SYSCALL_TRACE) &&
-   tracehook_report_syscall_entry(regs))
+   tracehook_report_syscall_entry(regs)) {
/*
 * Tracing decided this syscall should not happen.
 * We'll return a bogus call number to get an ENOSYS
 * error, but leave the original number in regs->gpr[0].
 */
ret = -1L;
+   syscall_set_return_value(current, regs, ENOSYS, 0);
+   }

if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
trace_sys_enter(regs, regs->gpr[0]);




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] KVM: PPC: Convert openpic lock to raw_spinlock

2015-02-17 Thread Purcareata Bogdan

Ping?

On 02.02.2015 11:35, Purcareata Bogdan wrote:

Ping?

On 22.01.2015 11:39, Bogdan Purcareata wrote:

This patch enables running intensive I/O workloads, e.g. netperf, in a guest
deployed on a RT host. It also enable guests to be SMP.

The openpic spinlock becomes a sleeping mutex on a RT system. This no longer
guarantees that EPR is atomic with exception delivery. The guest VCPU thread
fails due to a BUG_ON(preemptible()) when running netperf.

In order to make the kvmppc_mpic_set_epr() call safe on RT from non-atomic
context, convert the openpic lock to a raw_spinlock. A similar approach can
be seen for x86 platforms in the following commit [1].

Here are some comparative cyclitest measurements run inside a high priority RT
guest run on a RT host. The guest has 1 VCPU and the test has been run for 15
minutes. The guest runs ~750 hackbench processes as background stress.

spinlock  raw_spinlock
Min latency (us)  4 4
Avg latency (us)  1519
Max latency (us)  7062

Due to the introduction of the raw_spinlock, guests with a high number of VCPUs
may induce great latencies on the underlying RT Linux system (e.g. cyclictest
reports latencies of ~15ms for guests with 24 VCPUs). This can be further
aggravated by sending a lot of external interrupts to the guest. A malicious app
can abuse this scenario, causing a DoS of the host Linux. Until the KVM openpic
code is refactored to use finer lock granularity, impose a limitation on the
number of VCPUs a guest can have when running on a PREEMPT_RT_FULL system with
KVM_MPIC emulation.

Sent against v3.14-rt branch of
git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

[1] https://lkml.org/lkml/2010/1/11/289

Signed-off-by: Bogdan Purcareata 
[ add KVM_MAX_VCPUS limitation ]
Signed-off-by: Mihai Caraman 
Reviewed-by: Scott Wood 
---
   arch/powerpc/include/asm/kvm_host.h |  6 +
   arch/powerpc/kvm/mpic.c | 44 
++---
   2 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 1eaea2d..5ae38c5 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -35,8 +35,14 @@
   #include 
   #include 

+#if defined(CONFIG_PREEMPT_RT_FULL) && defined(CONFIG_KVM_MPIC)
+/* Limit the number of vcpus due to in-kernel mpic concurrency */
+#define KVM_MAX_VCPUS  4
+#define KVM_MAX_VCORES 4
+#else
   #define KVM_MAX_VCPUSNR_CPUS
   #define KVM_MAX_VCORES   NR_CPUS
+#endif
   #define KVM_USER_MEM_SLOTS 32
   #define KVM_MEM_SLOTS_NUM KVM_USER_MEM_SLOTS

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index efbd996..b9802a3 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -194,7 +194,7 @@ struct openpic {
int num_mmio_regions;

gpa_t reg_base;
-   spinlock_t lock;
+   raw_spinlock_t lock;

/* Behavior control */
struct fsl_mpic_info *fsl;
@@ -1105,9 +1105,9 @@ static int openpic_cpu_write_internal(void *opaque, gpa_t 
addr,
mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
}

-   spin_unlock(&opp->lock);
+   raw_spin_unlock(&opp->lock);
kvm_notify_acked_irq(opp->kvm, 0, notify_eoi);
-   spin_lock(&opp->lock);
+   raw_spin_lock(&opp->lock);

break;
}
@@ -1182,12 +1182,12 @@ void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
int cpu = vcpu->arch.irq_cpu_id;
unsigned long flags;

-   spin_lock_irqsave(&opp->lock, flags);
+   raw_spin_lock_irqsave(&opp->lock, flags);

if ((opp->gcr & opp->mpic_mode_mask) == GCR_MODE_PROXY)
kvmppc_set_epr(vcpu, openpic_iack(opp, &opp->dst[cpu], cpu));

-   spin_unlock_irqrestore(&opp->lock, flags);
+   raw_spin_unlock_irqrestore(&opp->lock, flags);
   }

   static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
@@ -1387,9 +1387,9 @@ static int kvm_mpic_read(struct kvm_io_device *this, 
gpa_t addr,
return -EINVAL;
}

-   spin_lock_irq(&opp->lock);
+   raw_spin_lock_irq(&opp->lock);
ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, &u.val);
-   spin_unlock_irq(&opp->lock);
+   raw_spin_unlock_irq(&opp->lock);

/*
 * Technically only 32-bit accesses are allowed, but be nice to
@@ -1427,10 +1427,10 @@ static int kvm_mpic_write(struct kvm_io_device *this, 
gpa_t addr,
return -EOPNOTSUPP;
}

-   spin_lock_irq(&opp->lock);
+   raw_spin_lock_irq(&opp->lock);
ret = kvm_mpic_write_internal(opp, addr - opp->reg_base,
  *(const u32 *)ptr);
-   spin_unlock_i

Re: [PATCH v2 1/3] powerpc: Don't force ENOSYS as error on syscall fail

2015-02-12 Thread Purcareata Bogdan

On 12.02.2015 07:24, Michael Ellerman wrote:

On Wed, 2015-02-11 at 08:36 +, Bogdan Purcareata wrote:

In certain scenarios - e.g. seccomp filtering with ERRNO as default action -
the system call fails for other reasons than the syscall not being available.
The seccomp filter can be configured to store a user-defined error code on
return from a blacklisted syscall. Don't always set ENOSYS on
do_syscall_trace_enter failure.

v2:
- move setting ENOSYS as errno from the syscall entry assembly to
   do_syscall_trace_enter, only in the specific case



diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 194e46d..0111e04 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -269,7 +269,6 @@ syscall_dotrace:
b   .Lsyscall_dotrace_cont

  syscall_enosys:
-   li  r3,-ENOSYS
b   syscall_exit



This still looks wrong to me.

On 64 bit we do:

CURRENT_THREAD_INFO(r11, r1)
ld  r10,TI_FLAGS(r11)
andi.   r11,r10,_TIF_SYSCALL_DOTRACE
bne syscall_dotrace
.Lsyscall_dotrace_cont:
cmpldi  0,r0,NR_syscalls
bge-syscall_enosys
...

syscall_enosys:
li  r3,-ENOSYS
b   .Lsyscall_exit


Your patch removes the load of ENOSYS.

Which means if we're not doing syscall tracing, and we get an out-of-bounds
syscall number, we'll return with something random on r3. Won't we?


Thanks for pointing this out, you are absolutely right. Perhaps this is 
a fix for the issue - on 64 bit:


ld  r10,TI_FLAGS(r11)
andi.   r11,r10,_TIF_SYSCALL_T_OR_A
bne syscall_dotrace
-.Lsyscall_dotrace_cont:
cmpldi  0,r0,NR_syscalls
bge-syscall_enosys

system_call:
...

syscall_enosys:
li  r3,-ENOSYS
b   .Lsyscall_exit
...

syscall_dotrace:
...
addir9,r1,STACK_FRAME_OVERHEAD
CURRENT_THREAD_INFO(r10, r1)
ld  r10,TI_FLAGS(r10)
-   b   .Lsyscall_dotrace_cont
+   cmpldi  0,r0,NR_syscalls
+   bge syscall_exit
+   b   system_call

So basically I leave the code for syscall_enosys unchanged, but I keep 
using it only when not doing syscall tracing. When doing syscall 
tracing, I'm assuming do_syscall_trace_enter will take care of setting 
the errno, and should it return an invalid syscall number, go directly 
to syscall_exit.



The 32-bit code looks more or less similar, although the label has a different
name.


Same thing for 32-bit:

_GLOBAL(DoSyscall)
lwz r11,TI_FLAGS(r10)
andi.   r11,r11,_TIF_SYSCALL_T_OR_A
bne-syscall_dotrace
-syscall_dotrace_cont:
cmplwi  0,r0,NR_syscalls
lis r10,sys_call_table@h
ori r10,r10,sys_call_table@l
slwir0,r0,2
bge 66f
+syscall_dotrace_cont:
lwzxr10,r10,r0  /* Fetch system call handler [ptr] */
mtlrr10
addir9,r1,STACK_FRAME_OVERHEAD
...

66: li  r3,-ENOSYS
b   ret_from_syscall
...

syscall_dotrace:
lwz r7,GPR7(r1)
lwz r8,GPR8(r1)
REST_NVGPRS(r1)
+   cmplwi  0,r0,NR_syscalls
+   lis r10,sys_call_table@h
+   ori r10,r10,sys_call_table@l
+   slwir0,r0,2
+   bge-ret_from_syscall
b   syscall_dotrace_cont

However I must admit that I don't like duplicating those 4 lines of code 
associated with verifying the syscall number. I can't think of any 
better way to do this. I also thought about leaving this check in one 
place, and then branch differently according to _TIF_SYSCALL_T_OR_A. Do 
you think that would be a better approach?


Thank you,
Bogdan P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 1/3] powerpc: Don't force ENOSYS as error on syscall fail

2015-02-11 Thread Purcareata Bogdan

On 11.02.2015 05:04, Michael Ellerman wrote:

On Mon, 2015-02-09 at 07:55 +, Bogdan Purcareata wrote:

In certain scenarios - e.g. seccomp filtering with ERRNO as default action -
the system call fails for other reasons than the syscall not being available.
The seccomp filter can be configured to store a user-defined error code on
return from a blacklisted syscall.

The RFC is this: are there currently any user-space scenarios where it is
required that the system call return ENOSYS as error code on failure, no matter
the circumstances? I don't want to break userspace requirements. I have not
added code to force this error code in situations different than
secure_computing failure, in order to keep overhead at a minimum.

Signed-off-by: Bogdan Purcareata 
---
  arch/powerpc/kernel/entry_32.S | 3 ++-
  arch/powerpc/kernel/entry_64.S | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 59848e5..52e48dd 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -425,7 +425,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_NEED_PAIRED_STWCX)
b   1b
  #endif  /* CONFIG_44x */

-66:li  r3,-ENOSYS
+66:
+#  li  r3,-ENOSYS
b   ret_from_syscall

.globl  ret_from_fork
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index e6bfe8e..80db02e 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -269,7 +269,7 @@ syscall_dotrace:
b   .Lsyscall_dotrace_cont

  syscall_enosys:
-   li  r3,-ENOSYS
+#  li  r3,-ENOSYS
b   syscall_exit


So what happens if you call this with a syscall number that's out of bounds?


As far as my current understanding goes, the call will return with -1 
with a errno that's undefined (or I've not seen it be defined anywhere).


I've thought more about this, and I guess the best option would be to 
move setting -ENOSYS as errno from the syscall entry assembly to 
do_syscall_trace_enter (as opposed to eliminating it at all). I was a 
little reluctant to do this at first in order to keep overhead to a 
minimum, but it's certainly not an option to change behavior if the 
syscall number is out of bounds.


v2 to come shortly.

Thanks,
Bogdan P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 1/3] powerpc: Don't force ENOSYS as error on syscall fail

2015-02-09 Thread Purcareata Bogdan
Obvious mistake on my behalf to send the patch with lines commented out. 
I will fix it in v2.


On 09.02.2015 09:55, Bogdan Purcareata wrote:

In certain scenarios - e.g. seccomp filtering with ERRNO as default action -
the system call fails for other reasons than the syscall not being available.
The seccomp filter can be configured to store a user-defined error code on
return from a blacklisted syscall.

The RFC is this: are there currently any user-space scenarios where it is
required that the system call return ENOSYS as error code on failure, no matter
the circumstances? I don't want to break userspace requirements. I have not
added code to force this error code in situations different than
secure_computing failure, in order to keep overhead at a minimum.

Signed-off-by: Bogdan Purcareata 
---
  arch/powerpc/kernel/entry_32.S | 3 ++-
  arch/powerpc/kernel/entry_64.S | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 59848e5..52e48dd 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -425,7 +425,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_NEED_PAIRED_STWCX)
b   1b
  #endif  /* CONFIG_44x */

-66:li  r3,-ENOSYS
+66:
+#  li  r3,-ENOSYS
b   ret_from_syscall

.globl  ret_from_fork
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index e6bfe8e..80db02e 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -269,7 +269,7 @@ syscall_dotrace:
b   .Lsyscall_dotrace_cont

  syscall_enosys:
-   li  r3,-ENOSYS
+#  li  r3,-ENOSYS
b   syscall_exit

  syscall_exit_work:


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] KVM: PPC: Convert openpic lock to raw_spinlock

2015-02-02 Thread Purcareata Bogdan

Ping?

On 22.01.2015 11:39, Bogdan Purcareata wrote:

This patch enables running intensive I/O workloads, e.g. netperf, in a guest
deployed on a RT host. It also enable guests to be SMP.

The openpic spinlock becomes a sleeping mutex on a RT system. This no longer
guarantees that EPR is atomic with exception delivery. The guest VCPU thread
fails due to a BUG_ON(preemptible()) when running netperf.

In order to make the kvmppc_mpic_set_epr() call safe on RT from non-atomic
context, convert the openpic lock to a raw_spinlock. A similar approach can
be seen for x86 platforms in the following commit [1].

Here are some comparative cyclitest measurements run inside a high priority RT
guest run on a RT host. The guest has 1 VCPU and the test has been run for 15
minutes. The guest runs ~750 hackbench processes as background stress.

   spinlock  raw_spinlock
Min latency (us)  4 4
Avg latency (us)  1519
Max latency (us)  7062

Due to the introduction of the raw_spinlock, guests with a high number of VCPUs
may induce great latencies on the underlying RT Linux system (e.g. cyclictest
reports latencies of ~15ms for guests with 24 VCPUs). This can be further
aggravated by sending a lot of external interrupts to the guest. A malicious app
can abuse this scenario, causing a DoS of the host Linux. Until the KVM openpic
code is refactored to use finer lock granularity, impose a limitation on the
number of VCPUs a guest can have when running on a PREEMPT_RT_FULL system with
KVM_MPIC emulation.

Sent against v3.14-rt branch of
git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

[1] https://lkml.org/lkml/2010/1/11/289

Signed-off-by: Bogdan Purcareata 
[ add KVM_MAX_VCPUS limitation ]
Signed-off-by: Mihai Caraman 
Reviewed-by: Scott Wood 
---
  arch/powerpc/include/asm/kvm_host.h |  6 +
  arch/powerpc/kvm/mpic.c | 44 ++---
  2 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 1eaea2d..5ae38c5 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -35,8 +35,14 @@
  #include 
  #include 

+#if defined(CONFIG_PREEMPT_RT_FULL) && defined(CONFIG_KVM_MPIC)
+/* Limit the number of vcpus due to in-kernel mpic concurrency */
+#define KVM_MAX_VCPUS  4
+#define KVM_MAX_VCORES 4
+#else
  #define KVM_MAX_VCPUS NR_CPUS
  #define KVM_MAX_VCORESNR_CPUS
+#endif
  #define KVM_USER_MEM_SLOTS 32
  #define KVM_MEM_SLOTS_NUM KVM_USER_MEM_SLOTS

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index efbd996..b9802a3 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -194,7 +194,7 @@ struct openpic {
int num_mmio_regions;

gpa_t reg_base;
-   spinlock_t lock;
+   raw_spinlock_t lock;

/* Behavior control */
struct fsl_mpic_info *fsl;
@@ -1105,9 +1105,9 @@ static int openpic_cpu_write_internal(void *opaque, gpa_t 
addr,
mpic_irq_raise(opp, dst, ILR_INTTGT_INT);
}

-   spin_unlock(&opp->lock);
+   raw_spin_unlock(&opp->lock);
kvm_notify_acked_irq(opp->kvm, 0, notify_eoi);
-   spin_lock(&opp->lock);
+   raw_spin_lock(&opp->lock);

break;
}
@@ -1182,12 +1182,12 @@ void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
int cpu = vcpu->arch.irq_cpu_id;
unsigned long flags;

-   spin_lock_irqsave(&opp->lock, flags);
+   raw_spin_lock_irqsave(&opp->lock, flags);

if ((opp->gcr & opp->mpic_mode_mask) == GCR_MODE_PROXY)
kvmppc_set_epr(vcpu, openpic_iack(opp, &opp->dst[cpu], cpu));

-   spin_unlock_irqrestore(&opp->lock, flags);
+   raw_spin_unlock_irqrestore(&opp->lock, flags);
  }

  static int openpic_cpu_read_internal(void *opaque, gpa_t addr,
@@ -1387,9 +1387,9 @@ static int kvm_mpic_read(struct kvm_io_device *this, 
gpa_t addr,
return -EINVAL;
}

-   spin_lock_irq(&opp->lock);
+   raw_spin_lock_irq(&opp->lock);
ret = kvm_mpic_read_internal(opp, addr - opp->reg_base, &u.val);
-   spin_unlock_irq(&opp->lock);
+   raw_spin_unlock_irq(&opp->lock);

/*
 * Technically only 32-bit accesses are allowed, but be nice to
@@ -1427,10 +1427,10 @@ static int kvm_mpic_write(struct kvm_io_device *this, 
gpa_t addr,
return -EOPNOTSUPP;
}

-   spin_lock_irq(&opp->lock);
+   raw_spin_lock_irq(&opp->lock);
ret = kvm_mpic_write_internal(opp, addr - opp->reg_base,
  *(const u32 *)ptr);
-   spin_unlock_irq(&opp->lock);
+   raw_spin_unlock_irq(&opp->lock);

pr_debug("%s: addr %llx ret %d val %x\n",
 __func__, addr, ret, *(const u32 *)ptr);
@@ -1501,14 +1501,14 @@ static int access_reg(str

Re: [PATCH] powerpc/mpic: Add DT option to skip readback after EOI

2015-01-21 Thread Purcareata Bogdan

On 14.01.2015 19:58, Scott Wood wrote:

On Wed, 2015-01-14 at 11:57 +, Bogdan Purcareata wrote:

The readback is necessary in order to handle PCI posted
writes,


That is unclear.


I'm going to try an venture a different explanation here.

I found a good explanation in Writing PCI Drivers [1] as to why it's 
necessary that a read is performed before returning from an EOI write:


"A good example of such a case is when a driver’s Interrupt Service 
Routine (ISR) is dealing with the Interrupt Request Register (IRR) on a 
card. Clearing a bit in the register indicates that the interrupt has 
been serviced. This is done by posting a write to the register. If the 
driver posts this write and exits its ISR, it could get interrupted 
again immediately because the write hadn’t yet reached the bit in the 
IRR to tell it to stop trying to interrupt. One solution to this 
potential problem is to make sure to read back the value in the IRR 
before exiting from the ISR. Most drivers do this anyway so they can 
handle multiple interrupts in the same ISR visit."


This is designed to work for legacy line interrupts. I understand MSI 
has its way to order writes on the bus, and also it won't send another 
MSI, signaling a new interrupt, unless there's a state change in the IRR 
on the card. When using line interrupts, the line stays asserted as long 
as IRR is set, so the MPIC could trigger an additional interrupt if the 
EOI write is posted and the ISR bit is cleared.





  or when the MPIC is handling interrupts in a loop
(ppc_md.get_irq).


I'm questioning this one as well -- if reading WHOAMI is sufficient to
sync with the EOI, why wouldn't reading INTACK work as well?


I think I'm going to drop this part of the description. However, I don't 
think reading INTACK is good, since it has some additional side effects 
- interrupt pending register is cleared for edge-sensitive interrupts, 
the in-service register is updated, and the int output signal from the 
MPIC is negated. This is available for Freescale SoCs. This also 
signifies entering the processing cycle of a new interrupt.


Normally, the readback should be done on the EOI register - this is 
stated in the OpenPIC specification and actually was the initial 
implementation of the driver. However, this can't be done since, at 
least for Freescale SoCs, the EOI register is write only.


However, in the light of the above explanation related to PCI posted 
writes, I don't see how performing a readback on the WHOAMI register, 
which is specific to the MPIC, would have any effect in synchronizing 
with the PCI device registers. I guess it's related to the bus 
interconnect on the SoC.



  Newer MPIC versions don't require this readback. Leave the option
configurable using a device tree entry.

This saves a MMIO trap per interrupt.

Signed-off-by: Scott Wood 
Signed-off-by: Bogdan Purcareata 


You should note changes (e.g. with a [] note between signoffs) if it
differs from the original...


Will do, sorry for omitting this.



-Scott


[1] http://h21007.www2.hp.com/portal/download/files/unprot/ddk/ddg/chap8.pdf

Best regards,
Bogdan P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC] PPC: MPIC: necessary readback after EOI?

2015-01-05 Thread Purcareata Bogdan

Hello,

While doing some performance testing of a KVM guest on a PPC platform, I 
noticed that there's a read of the CPU_WHOAMI register after each MPIC 
EOI [1]. This has been present since the initial implementation of the 
MPIC driver [2]. In a KVM virtualized environment, this results in an 
additional kvm_exit.


Is the read back necessary? Is it used to provide some sort of 
synchronization mechanism, making sure that nothing else is executed 
until the EOI write is finished? I eliminated the mpic_cpu_read call and 
run the kernel on hardware and noticed no anomaly, however I am not sure 
of all the implications and race conditions it might lead to.


I was curious why the mpic_cpu_read(MPIC_INFO(CPU_WHOAMI)) was there in 
the first place and if it's still needed. If it's still required, I 
guess a better approach is to eliminate the call only if the kernel is 
running on the KVM guest side, where the MPIC is emulated and no longer 
requires a readback.


Thank you,
Bogdan P.

[1] http://lxr.free-electrons.com/source/arch/powerpc/sysdev/mpic.c#L659
[2] https://lkml.org/lkml/2004/10/22/483
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/