Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-25 Thread Peter Maydell
On 23 June 2015 at 19:15, Peter Maydell peter.mayd...@linaro.org wrote:
 On 23 June 2015 at 08:31, Frederic Konrad fred.kon...@greensocs.com wrote:
 The normal boot with -smp 4 and an SMP 4 guest is slow and becomes a lot
 faster when I enable the window (which has timer callbacks and refreshes the
 screen regularly).

 Is it just overall slow, or does it appear to hang? I have an
 interesting effect where *with* this patch an -smp 3 or 4 guest
 boot seems to hang between "SCSI subsystem initialized" and
 "Switched to clocksource arch_sys_counter"...

At least part of what is happening here seems to be that we're
falling into a similar flavour of stall to the original test case:
in the SMP 3 setup, CPU #2 is in the busy-loop of Linux's
multi_cpu_stop() function, and it can sit in that loop for an entire
second. We should never let a single CPU grab execution for that
long when doing TCG round-robin...

-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-23 Thread Peter Maydell
On 23 June 2015 at 08:31, Frederic Konrad fred.kon...@greensocs.com wrote:
 The normal boot with -smp 4 and an SMP 4 guest is slow and becomes a lot
 faster when I enable the window (which has timer callbacks and refreshes the
 screen regularly).

Is it just overall slow, or does it appear to hang? I have an
interesting effect where *with* this patch an -smp 3 or 4 guest
boot seems to hang between "SCSI subsystem initialized" and
"Switched to clocksource arch_sys_counter"...

Weirdly, you can make it recover from that hang if there's a
GTK window and you wave the mouse over it, even if that window
is only showing the QEMU monitor, not a guest graphics window.

thanks
-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-23 Thread Frederic Konrad

On 23/06/2015 10:09, Peter Maydell wrote:

On 23 June 2015 at 08:31, Frederic Konrad fred.kon...@greensocs.com wrote:

Can you send me a complete diff?

The link I pointed you at in the other thread has the complete diff:
http://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg03824.html

-- PMM
Oops, sorry, I saw other pieces of the patch in the thread; that's why I wasn't
sure.


It doesn't seem to fix the problem anyway.

Thanks,
Fred



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-23 Thread Peter Maydell
On 23 June 2015 at 08:31, Frederic Konrad fred.kon...@greensocs.com wrote:

 Can you send me a complete diff?

The link I pointed you at in the other thread has the complete diff:
http://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg03824.html

-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-23 Thread Frederic Konrad

On 19/06/2015 17:53, Peter Maydell wrote:

On 16 June 2015 at 12:53, Peter Maydell peter.mayd...@linaro.org wrote:

In particular I think the
'do cpu_exit if one CPU triggers an interrupt on another'
approach is probably good, but I need to investigate why
it isn't working on your test programs without that extra
'level &&' condition first...

I've figured out what's happening here, and it's an accidental
artefact of our GIC implementation. What happens is:

  * cpu 0 does an IPI, which turns into "raise IRQ line on cpu 1"
  * arm_cpu_set_irq logic causes us to cpu_exit() cpu 0
  * cpu 1 does then run; however pretty early on it does a read
    on the GIC to acknowledge the interrupt
  * this causes the function gic_update() to run, which recalculates
    the current state and sets CPU interrupt lines accordingly;
    among other things this results in an unnecessary but harmless
    call to arm_cpu_set_irq(CPU #0, irq, 0)
  * without the 'level &&' clause in the conditional, that causes
    us to cpu_exit() cpu 1
  * we then start running cpu 0 again, which is pointless, and
    since there's no further irq traffic we don't yield until cpu 0
    reaches the end of its timeslice

So basically without the level check we do make 0 yield to 1
as it should, but we then spuriously yield back to 0 again
pretty much immediately.

Next up: see if it gives us a perf improvement on Linux guests...

-- PMM


Hi,

Can you send me a complete diff?

I might have reproduced the same bug during MTTCG speed comparison.
http://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg05704.html

The normal boot with -smp 4 and an SMP 4 guest is slow and becomes a lot
faster when I enable the window (which has timer callbacks and refreshes the
screen regularly).

Thanks,
Fred



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-19 Thread Peter Maydell
On 16 June 2015 at 12:53, Peter Maydell peter.mayd...@linaro.org wrote:
 In particular I think the
 'do cpu_exit if one CPU triggers an interrupt on another'
 approach is probably good, but I need to investigate why
 it isn't working on your test programs without that extra
 'level &&' condition first...

I've figured out what's happening here, and it's an accidental
artefact of our GIC implementation. What happens is:

 * cpu 0 does an IPI, which turns into "raise IRQ line on cpu 1"
 * arm_cpu_set_irq logic causes us to cpu_exit() cpu 0
 * cpu 1 does then run; however pretty early on it does a read
   on the GIC to acknowledge the interrupt
 * this causes the function gic_update() to run, which recalculates
   the current state and sets CPU interrupt lines accordingly;
   among other things this results in an unnecessary but harmless
   call to arm_cpu_set_irq(CPU #0, irq, 0)
 * without the 'level &&' clause in the conditional, that causes
   us to cpu_exit() cpu 1
 * we then start running cpu 0 again, which is pointless, and
   since there's no further irq traffic we don't yield until cpu 0
   reaches the end of its timeslice

So basically without the level check we do make 0 yield to 1
as it should, but we then spuriously yield back to 0 again
pretty much immediately.
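(For reference, the guarded form under discussion is the one from Alex's
follow-up patch in this thread, i.e. roughly:

    if (level && current_cpu && current_cpu != cs) {
        cpu_exit(current_cpu);
    }

so the currently-running CPU is only kicked when a line is actually being
raised, not when gic_update() lowers it again.)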

Next up: see if it gives us a perf improvement on Linux guests...

-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-19 Thread Paolo Bonzini


On 12/06/2015 18:38, Alex Züpke wrote:
   CPU #0                     CPU #1
   ======                     ======
   ... other stuff ...        WFI (wait for interrupt, like x86 HLT)
   send SGI in MPCore
   polls for completeness
                              time passes ...
   polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...

Shouldn't CPU#0 do a WFE here?  That would work too.

Considering that sooner or later we'll have true multithreaded
emulation, putting in a hack doesn't sound like a great prospect.

Paolo



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-19 Thread Peter Maydell
On 19 June 2015 at 17:57, Paolo Bonzini pbonz...@redhat.com wrote:


 On 12/06/2015 18:38, Alex Züpke wrote:
   CPU #0                     CPU #1
   ======                     ======
   ... other stuff ...        WFI (wait for interrupt, like x86 HLT)
   send SGI in MPCore
   polls for completeness
                              time passes ...
   polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...

 Shouldn't CPU#0 do a WFE here?  That would work too.

You can do this with SEV/WFE, yes, but you don't have to,
and in fact Linux doesn't currently:
http://lxr.free-electrons.com/source/kernel/smp.c#L108
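
A minimal sketch of that SEV/WFE pattern in guest code (illustrative only,
assuming ARMv7 bare metal and GCC inline asm; the helper and variable names
are invented for the example):

#include <stdint.h>

static volatile uint32_t ipi_done;     /* set by the target CPU's IPI handler */

static inline void cpu_wfe(void) { __asm__ volatile("wfe" ::: "memory"); }
static inline void cpu_sev(void) { __asm__ volatile("dsb\n\tsev" ::: "memory"); }

/* Target CPU: acknowledge the IPI and wake any CPU sleeping in WFE. */
void ipi_handler(void)
{
    ipi_done = 1;
    cpu_sev();
}

/* Sending CPU: wait for completion without spinning at full speed. */
void smp_cross_call_wait(void)
{
    while (!ipi_done) {
        cpu_wfe();                     /* sleep until an event (SEV) arrives */
    }
}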

 Considering that sooner or later we'll have true multithreaded
 emulation, putting in a hack doesn't sound like a great prospect.

I'd bet on later rather than sooner, especially if
you want multithreaded on all host architectures.

My not-very-scientific testing of time for a 2xSMP
32-bit Linux guest to boot to userspace shell and
shutdown again suggests it does help: 32.531 secs
vs 34.148 secs. The without-patch version seems more
prone to occasionally stalling so much that the boot time
goes up to 45 seconds, too...

-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-16 Thread Alex Züpke
On 16.06.2015 at 13:53, Peter Maydell wrote:
 On 16 June 2015 at 12:11, Alex Züpke alexander.zue...@hs-rm.de wrote:
 But the startup is not my problem, it's the later parts.
 
 But it was my problem because it meant your test case wasn't
 functional :-)
 
 I added the WFE to the initial lock. Here are two new tests, both are now 
 3178 bytes in size:
 http://www.cs.hs-rm.de/~zuepke/qemu/ipi.elf
 http://www.cs.hs-rm.de/~zuepke/qemu/ipi_yield.elf

 Both start on my machine. The IPI ping-pong starts after the
 first timer interrupt after 1s. The problem is that IPIs are
 delivered only once a second after the timer interrupts QEMU's
 main loop.
 
 Thanks. These test cases work for me, and I can repro the
 same behaviour you see.

OK, I'm glad that you can trigger my bug.

 I intend to investigate why we're not at least timeslicing
 between the two CPUs at a faster rate than when there's
 another timer interrupt.

Probably there is no other way of time slicing in QEMU ... every OS uses some
kind of timer interrupt, so it's not necessary.
And even Linux's tickless kernel doesn't run into this issue because it uses
SEV/WFE properly.

 Something else: Existing ARM CPUs so far do not use hyper-threading,
 but have real physical cores. In contrast, QEMU is an extremely
 coarse-grained hyper-threading architecture, so existing legacy
 code that was written with physical cores in mind will trigger
 timing bugs in synchronization primitives, especially code
 originally written for the ARM11 MPCore like mine, which lacks WFE/SEV.
 If we consider QEMU as a platform to run legacy code, doesn't it
 make sense to address these issues?
 
 In general QEMU's approach is more "run correct code reasonably
 fast" rather than "run buggy code the same way the hardware
 would" or "identify bugs in buggy code". There's certainly
 scope for heuristics for making our timeslicing approach less
 obtrusive, but we need to understand the underlying behaviour
 first (and check it doesn't accidentally slow down other
 common workloads in the process). In particular I think the
 'do cpu_exit if one CPU triggers an interrupt on another'
 approach is probably good, but I need to investigate why
 it isn't working on your test programs without that extra
 'level &&' condition first...
 
 thanks
 -- PMM

OK, thank you Peter!


Best regards
Alex




Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-16 Thread Peter Maydell
On 16 June 2015 at 12:11, Alex Züpke alexander.zue...@hs-rm.de wrote:
 But the startup is not my problem, it's the later parts.

But it was my problem because it meant your test case wasn't
functional :-)

 I added the WFE to the initial lock. Here are two new tests, both are now 
 3178 bytes in size:
 http://www.cs.hs-rm.de/~zuepke/qemu/ipi.elf
 http://www.cs.hs-rm.de/~zuepke/qemu/ipi_yield.elf

 Both start on my machine. The IPI ping-pong starts after the
 first timer interrupt after 1s. The problem is that IPIs are
 delivered only once a second after the timer interrupts QEMU's
 main loop.

Thanks. These test cases work for me, and I can repro the
same behaviour you see.

I intend to investigate why we're not at least timeslicing
between the two CPUs at a faster rate than when there's
another timer interrupt.

 Something else: Existing ARM CPUs so far do not use hyper-threading,
 but have real physical cores. In contrast, QEMU is an extremely
 coarse-grained hyper-threading architecture, so existing legacy
 code that was written with physical cores in mind will trigger
 timing bugs in synchronization primitives, especially code
 originally written for the ARM11 MPCore like mine, which lacks WFE/SEV.
 If we consider QEMU as a platform to run legacy code, doesn't it
 make sense to address these issues?

In general QEMU's approach is more "run correct code reasonably
fast" rather than "run buggy code the same way the hardware
would" or "identify bugs in buggy code". There's certainly
scope for heuristics for making our timeslicing approach less
obtrusive, but we need to understand the underlying behaviour
first (and check it doesn't accidentally slow down other
common workloads in the process). In particular I think the
'do cpu_exit if one CPU triggers an interrupt on another'
approach is probably good, but I need to investigate why
it isn't working on your test programs without that extra
'level &&' condition first...

thanks
-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-16 Thread Peter Maydell
On 15 June 2015 at 21:03, Alex Zuepke alexander.zue...@hs-rm.de wrote:
 On 15.06.2015 at 20:58, Peter Maydell wrote:
 I'm beginning to suspect that your guest code has a race
 condition in it, such that if the other CPU runs at a
 point you weren't expecting it to then you end up
 deadlocking or otherwise running into a bug in your guest.

 In particular, I see the emulation getting stuck even without
 this patch to arm_cpu_set_irq().

 Yes, it's a bug, sorry for that. I removed too much code to get a simple
 testcase. It's stuck in the first spinlock where CPU#1 is waiting for CPU#0
 to initialize the rest of the system, and I need to WFE or YIELD here as
 well.

 But this is showing the original problem again: the emulation gets stuck
 spinning on CPU #1 forever, because the main loop doesn't switch to CPU #0
 voluntarily. Just press a key on the console/emulated serial line to trigger
 an event to QEMU's main loop, and the testcase should continue.

Pressing a key does not unwedge the test case for me.

-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-16 Thread Peter Maydell
On 16 June 2015 at 11:33, Peter Maydell peter.mayd...@linaro.org wrote:
 Pressing a key does not unwedge the test case for me.

Looking at the logs, this seems to be expected given what
the guest code does with CPU #1: (the below is edited logs,
created with a hacky patch I have that annotates the debug
logs with CPU numbers):

CPU #1: Trace 0x7f2d67afa000 [8100] _start
 # we start
CPU #1: Trace 0x7f2d67afc060 [841c] main_cpu1
 # we correctly figured out we're CPU 1
CPU #1: Trace 0x7f2d67afc220 [8448] main_cpu1
 # we took the branch to 8448
CPU #1: Trace 0x7f2d67afc220 [8448] main_cpu1
 # 8000448 is a branch-to-self, so here we stay

CPU #1 never bothered to enable its GICC cpu interface,
so it will never receive interrupts and will never get
out of this tight loop.
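
(Purely for illustration, bringing the CPU interface up only takes a couple of
MMIO writes. The sketch below assumes the vexpress-a15 memory map with the GIC
CPU interface at 0x2c002000; the base address and the GICv2 register offsets
here are assumptions, not taken from the test binary:)

#include <stdint.h>

#define GICC_BASE   0x2c002000u                          /* assumed board address */
#define GICC_CTLR   (*(volatile uint32_t *)(GICC_BASE + 0x00))
#define GICC_PMR    (*(volatile uint32_t *)(GICC_BASE + 0x04))

static void gic_cpu_interface_enable(void)
{
    GICC_PMR  = 0xff;   /* accept all interrupt priorities */
    GICC_CTLR = 0x01;   /* enable signalling of interrupts to this CPU */
}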

We get here because CPU #1 has got through main_cpu1
to the point of testing your 'release' variable before
CPU #0 has got through main_cpu0 far enough to set it
to 1, so it still has the zero in it that it has on
system startup. If scheduling happened to mean that
CPU #0 ran further through main_cpu0 before CPU #1
ran, we wouldn't end up in this situation -- you have a
race condition, as I suggested.

The log shows we're sat with CPU#0 fruitlessly looping
on a variable in memory, and CPU#1 in this endless loop.

PS: QEMU doesn't care, but your binary seems to be entirely
devoid of barrier instructions, which is likely to cause
you problems on real hardware.
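
(For illustration only: the kind of barrier pairing the 'release' handshake
would need on real hardware, assuming ARMv7 and GCC inline asm; the function
names are invented for the example:)

static inline void dmb(void) { __asm__ volatile("dmb" ::: "memory"); }

/* CPU #0: publish the initialised state, then set the flag. */
void release_secondary(volatile unsigned int *release)
{
    dmb();              /* make prior initialisation visible first */
    *release = 1;
}

/* CPU #1: wait for the flag, then observe the initialised state. */
void wait_for_release(volatile unsigned int *release)
{
    while (!*release) {
        /* spin (a WFE- or YIELD-based wait would go here) */
    }
    dmb();              /* order the flag read before subsequent loads */
}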

thanks
-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-16 Thread Alex Züpke
Hi Peter,

On 16.06.2015 at 12:59, Peter Maydell wrote:
 On 16 June 2015 at 11:33, Peter Maydell peter.mayd...@linaro.org wrote:
 Pressing a key does not unwedge the test case for me.
 
 Looking at the logs, this seems to be expected given what
 the guest code does with CPU #1: (the below is edited logs,
 created with a hacky patch I have that annotates the debug
 logs with CPU numbers):
 
 CPU #1: Trace 0x7f2d67afa000 [8100] _start
  # we start
 CPU #1: Trace 0x7f2d67afc060 [841c] main_cpu1
  # we correctly figured out we're CPU 1
 CPU #1: Trace 0x7f2d67afc220 [8448] main_cpu1
  # we took the branch to 8448
 CPU #1: Trace 0x7f2d67afc220 [8448] main_cpu1
  # 8000448 is a branch-to-self, so here we stay
 
 CPU #1 never bothered to enable its GICC cpu interface,
 so it will never receive interrupts and will never get
 out of this tight loop.

Yes. CPU#1 is stuck in the initial spinlock which lacks WFE.

 We get here because CPU #1 has got through main_cpu1
 to the point of testing your 'release' variable before
 CPU #0 has got through main_cpu0 far enough to set it
 to 1, so it still has the zero in it that it has on
 system startup. If scheduling happened to mean that
 CPU #0 ran further through main_cpu0 before CPU #1
 ran, we wouldn't end up in this situation -- you have a
 race condition, as I suggested.
 
 The log shows we're sat with CPU#0 fruitlessly looping
 on a variable in memory, and CPU#1 in this endless loop.

I know that the startup has a race because I removed too much code from the
original project.
But the startup is not my problem, it's the later parts.

I added the WFE to the initial lock. Here are two new tests, both are now 3178 
bytes in size:
http://www.cs.hs-rm.de/~zuepke/qemu/ipi.elf
http://www.cs.hs-rm.de/~zuepke/qemu/ipi_yield.elf

Both start on my machine. The IPI ping-pong starts after the first timer 
interrupt after 1s.
The problem is that IPIs are delivered only once a second after the timer 
interrupts QEMU's main loop.


 PS: QEMU doesn't care, but your binary seems to be entirely
 devoid of barrier instructions, which is likely to cause
 you problems on real hardware.
 
 thanks
 -- PMM

Yes, I trimmed down my code to the bare minimum to handle IPIs on QEMU only. It 
lacks barriers, cache handling and has bogus baudrate settings.


Something else: Existing ARM CPUs so far do not use hyper-threading, but have
real physical cores.
In contrast, QEMU is an extremely coarse-grained hyper-threading architecture,
so existing legacy code that was written with physical cores in mind will
trigger timing bugs in synchronization primitives, especially code
originally written for the ARM11 MPCore like mine, which lacks WFE/SEV.
If we consider QEMU as a platform to run legacy code, doesn't it make sense to
address these issues?


Best regards
Alex




Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Alex Züpke
On 15.06.2015 at 16:51, Peter Maydell wrote:
 On 15 June 2015 at 15:44, Alex Züpke alexander.zue...@hs-rm.de wrote:
 On 12.06.2015 at 20:03, Peter Maydell wrote:
 Probably the best approach would be to have something in
 arm_cpu_set_irq() which says if we are CPU X and we've
 just caused an interrupt to be set for CPU Y, then we
 should ourselves yield back to the main loop.

 Something like this, maybe, though I have done no more testing
 than checking it doesn't actively break kernel booting :-)


 Thanks! One more check for level is needed to get it to work:
 
 What happens without that? It's reasonable to have it,
 but extra cpu_exit()s shouldn't cause a problem beyond
 being a bit inefficient...

The emulation gets stuck, for whatever reason I don't understand.
I checked whether something similar is done on other architectures and found
that a level check is used there as well; see for example cpu_request_exit()
in hw/ppc/prep.c:
  static void cpu_request_exit(void *opaque, int irq, int level)
  {
      CPUState *cpu = current_cpu;

      if (cpu && level) {
          cpu_exit(cpu);
      }
  }

But probably this is used for something completely unrelated.

 It would be interesting to know if this helps Linux as well
 as your custom OS. (I don't know whether a CPU #0 polls
 approach is bad on hardware too; the other option would be
 to have CPU #1 IPI back in the other direction if 0 needed
 to wait for a response.)
 
 -- PMM

IIRC, Linux TLB shootdown on x86 once used such a scheme, but I don't know if 
they changed it.

I'd say that an IPI+poll pattern is used quite often in the tricky parts of a 
kernel, like kernel debugging.



Here's a simple IPI tester sending IPIs from CPU #0 to CPU #1 in an endless 
loop.
The IPIs are delayed until the timer interrupt triggers the main loop.

http://www.cs.hs-rm.de/~zuepke/qemu/ipi.elf
3174 bytes, md5sum 8d73890a60cd9b24a4f9139509b580e2

Run testcase:
$ qemu-system-arm -M vexpress-a15 -smp 2 -kernel ipi.elf -nographic

The testcase prints the following on the serial console without the patch:

  +--- CPU 0 came up
  |+-- CPU 0 initialization completed
  || + CPU 0 timer interrupt, 1 HZ
  || |
  vv v
  0!1T.T.T.T.T.T.T.
    ^ ^
    | |
    | +-- CPU 1 received an IPI
    + CPU 1 came up


Expected testcase output with patch:

  0!1T.<hundreds of dots>.T...

So: more dots == more IPIs handled between two timer interrupts T ...



Best regards
Alex




Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Alex Züpke
Hi Peter,

On 12.06.2015 at 20:03, Peter Maydell wrote:
 On 12 June 2015 at 17:38, Alex Züpke alexander.zue...@hs-rm.de wrote:
 Hi,

 I'm benchmarking some IPI (== inter-processor-interrupt) synchronization 
 stuff of my custom kernel on QEMU ARM (qemu-system-arm -M vexpress-a15 -smp 
 2) and ran into the following problem: pending IPIs are delayed until the 
 QEMU main loop receives an event (for example the timer interrupt expires or 
 I press a key on the console).

 The following timing diagram tries to show this:

   CPU #0                     CPU #1
   ======                     ======
   ... other stuff ...        WFI (wait for interrupt, like x86 HLT)
   send SGI in MPCore
   polls for completeness
                              time passes ...
   polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...

                              timer interrupt expires
                              now QEMU switches to CPU #1
                                receives IPI
                                signals completeness
                                WFI
                              QEMU switches to CPU #0
   polling done
   process timer interrupt
   ...
 
 Right. The problem is that we don't have any way of telling
 that CPU 0 is just sat busy waiting for CPU 1.
 
 It works as expected (I get thousands of IPIs per second now), but
 it does not feel right, so is there a better way to improve the
 responsiveness of IPI handling in QEMU?
 
 Probably the best approach would be to have something in
 arm_cpu_set_irq() which says if we are CPU X and we've
 just caused an interrupt to be set for CPU Y, then we
 should ourselves yield back to the main loop.
 
 Something like this, maybe, though I have done no more testing
 than checking it doesn't actively break kernel booting :-)


Thanks! One more check for level is needed to get it to work:

--- a/target-arm/cpu.c
+++ b/target-arm/cpu.c
@@ -325,6 +325,18 @@ static void arm_cpu_set_irq(void *opaque, int irq, int level)
     default:
         hw_error("arm_cpu_set_irq: Bad interrupt line %d\n", irq);
     }
+
+    /* If we are currently executing code for CPU X, and this
+     * CPU we've just triggered an interrupt on is CPU Y, then
+     * make CPU X yield control back to the main loop at the
+     * end of the TB it's currently executing.
+     * This avoids problems where the interrupt was an IPI
+     * and CPU X would otherwise sit busy looping for the rest
+     * of its timeslice because Y hasn't had a chance to run.
+     */
+    if (level && current_cpu && current_cpu != cs) {
+        cpu_exit(current_cpu);
+    }
 }
 
 static void arm_cpu_kvm_set_irq(void *opaque, int irq, int level)


Do you need a testcase for this?



Best regards
Alex




Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Alex Züpke
On 15.06.2015 at 17:18, Peter Maydell wrote:
 On 15 June 2015 at 16:07, Alex Züpke alexander.zue...@hs-rm.de wrote:
 On 15.06.2015 at 17:04, Peter Maydell wrote:
 On 12 June 2015 at 17:38, Alex Züpke alexander.zue...@hs-rm.de wrote:
 Hi,

 I'm benchmarking some IPI (== inter-processor-interrupt) synchronization 
 stuff of my custom kernel on QEMU ARM (qemu-system-arm -M vexpress-a15 
 -smp 2) and ran into the following problem: pending IPIs are delayed until 
 the QEMU main loop receives an event (for example the timer interrupt 
 expires or I press a key on the console).

 The following timing diagram tries to show this:

   CPU #0                     CPU #1
   ======                     ======
   ... other stuff ...        WFI (wait for interrupt, like x86 HLT)
   send SGI in MPCore
   polls for completeness
                              time passes ...
   polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...

                              timer interrupt expires
                              now QEMU switches to CPU #1
                                receives IPI
                                signals completeness
                                WFI
                              QEMU switches to CPU #0
   polling done
   process timer interrupt
   ...

 Does your polling loop have a YIELD insn in it? We (and hardware)
 can use that as a hint that you're busy-looping and we should
 try doing something else. (QEMU doesn't implement that for A32/T32
 yet, but we should; we already do on A64.)

 Yes, I should be yielding here, but SEV isn't implemented.
 Probably the notification should be done there as well.
 
 YIELD isn't related to SEV -- it's just a generic "hey, I'm
 polling" hint. We NOP SEV, and make WFE be a "yield this CPU's
 timeslice" event, which is architecturally valid, and sufficient
 for this situation anyway.
 
 -- PMM
 

Ah, YIELD, thanks for the hint. I never used it because all the CPUs I got my
hands on were physical ones so far ...


So this is the way to go:

--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -4084,6 +4084,7 @@ static void gen_nop_hint(DisasContext *s, int val)
         gen_set_pc_im(s, s->pc);
         s->is_jmp = DISAS_WFI;
         break;
+    case 1: /* yield */
     case 2: /* wfe */
         gen_set_pc_im(s, s->pc);
         s->is_jmp = DISAS_WFE;

Thanks
Alex



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Peter Maydell
On 12 June 2015 at 17:38, Alex Züpke alexander.zue...@hs-rm.de wrote:
 Hi,

 I'm benchmarking some IPI (== inter-processor-interrupt) synchronization 
 stuff of my custom kernel on QEMU ARM (qemu-system-arm -M vexpress-a15 -smp 
 2) and ran into the following problem: pending IPIs are delayed until the 
 QEMU main loop receives an event (for example the timer interrupt expires or 
 I press a key on the console).

 The following timing diagram tries to show this:

   CPU #0                     CPU #1
   ======                     ======
   ... other stuff ...        WFI (wait for interrupt, like x86 HLT)
   send SGI in MPCore
   polls for completeness
                              time passes ...
   polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...

                              timer interrupt expires
                              now QEMU switches to CPU #1
                                receives IPI
                                signals completeness
                                WFI
                              QEMU switches to CPU #0
   polling done
   process timer interrupt
   ...

Does your polling loop have a YIELD insn in it? We (and hardware)
can use that as a hint that you're busy-looping and we should
try doing something else. (QEMU doesn't implement that for A32/T32
yet, but we should; we already do on A64.)
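
(A minimal sketch of such a polling loop in guest code, assuming ARMv7 and GCC
inline asm; the names are invented for the example:)

static inline void cpu_yield(void)
{
    __asm__ volatile("yield" ::: "memory");
}

/* Spin until the other CPU signals completion, hinting on every
 * iteration that we are only busy-waiting. */
static void wait_for_ipi_ack(volatile unsigned int *done)
{
    while (!*done) {
        cpu_yield();
    }
}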

-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Peter Maydell
On 15 June 2015 at 16:07, Alex Züpke alexander.zue...@hs-rm.de wrote:
 On 15.06.2015 at 17:04, Peter Maydell wrote:
 On 12 June 2015 at 17:38, Alex Züpke alexander.zue...@hs-rm.de wrote:
 Hi,

 I'm benchmarking some IPI (== inter-processor-interrupt) synchronization 
 stuff of my custom kernel on QEMU ARM (qemu-system-arm -M vexpress-a15 -smp 
 2) and ran into the following problem: pending IPIs are delayed until the 
 QEMU main loop receives an event (for example the timer interrupt expires 
 or I press a key on the console).

 The following timing diagram tries to show this:

   CPU #0                     CPU #1
   ======                     ======
   ... other stuff ...        WFI (wait for interrupt, like x86 HLT)
   send SGI in MPCore
   polls for completeness
                              time passes ...
   polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...

                              timer interrupt expires
                              now QEMU switches to CPU #1
                                receives IPI
                                signals completeness
                                WFI
                              QEMU switches to CPU #0
   polling done
   process timer interrupt
   ...

 Does your polling loop have a YIELD insn in it? We (and hardware)
 can use that as a hint that you're busy-looping and we should
 try doing something else. (QEMU doesn't implement that for A32/T32
 yet, but we should; we already do on A64.)

 Yes, I should be yielding here, but SEV isn't implemented.
 Probably the notification should be done there as well.

YIELD isn't related to SEV -- it's just a generic "hey, I'm
polling" hint. We NOP SEV, and make WFE be a "yield this CPU's
timeslice" event, which is architecturally valid, and sufficient
for this situation anyway.

-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Peter Maydell
On 15 June 2015 at 15:44, Alex Züpke alexander.zue...@hs-rm.de wrote:
 On 12.06.2015 at 20:03, Peter Maydell wrote:
 Probably the best approach would be to have something in
 arm_cpu_set_irq() which says if we are CPU X and we've
 just caused an interrupt to be set for CPU Y, then we
 should ourselves yield back to the main loop.

 Something like this, maybe, though I have done no more testing
 than checking it doesn't actively break kernel booting :-)


 Thanks! One more check for level is needed to get it to work:

What happens without that? It's reasonable to have it,
but extra cpu_exit()s shouldn't cause a problem beyond
being a bit inefficient...

It would be interesting to know if this helps Linux as well
as your custom OS. (I don't know whether a "CPU #0 polls"
approach is bad on hardware too; the other option would be
to have CPU #1 IPI back in the other direction if 0 needed
to wait for a response.)

-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Alex Züpke
On 15.06.2015 at 17:04, Peter Maydell wrote:
 On 12 June 2015 at 17:38, Alex Züpke alexander.zue...@hs-rm.de wrote:
 Hi,

 I'm benchmarking some IPI (== inter-processor-interrupt) synchronization 
 stuff of my custom kernel on QEMU ARM (qemu-system-arm -M vexpress-a15 -smp 
 2) and ran into the following problem: pending IPIs are delayed until the 
 QEMU main loop receives an event (for example the timer interrupt expires or 
 I press a key on the console).

 The following timing diagram tries to show this:

   CPU #0                     CPU #1
   ======                     ======
   ... other stuff ...        WFI (wait for interrupt, like x86 HLT)
   send SGI in MPCore
   polls for completeness
                              time passes ...
   polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...

                              timer interrupt expires
                              now QEMU switches to CPU #1
                                receives IPI
                                signals completeness
                                WFI
                              QEMU switches to CPU #0
   polling done
   process timer interrupt
   ...
 
 Does your polling loop have a YIELD insn in it? We (and hardware)
 can use that as a hint that you're busy-looping and we should
 try doing something else. (QEMU doesn't implement that for A32/T32
 yet, but we should; we already do on A64.)

Yes, I should be yielding here, but SEV isn't implemented.
Probably the notification should be done there as well.


Best regards
Alex




Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Peter Maydell
On 15 June 2015 at 16:36, Alex Züpke alexander.zue...@hs-rm.de wrote:
 So this is the way to go:

 --- a/target-arm/translate.c
 +++ b/target-arm/translate.c
 @@ -4084,6 +4084,7 @@ static void gen_nop_hint(DisasContext *s, int val)
          gen_set_pc_im(s, s->pc);
          s->is_jmp = DISAS_WFI;
          break;
 +    case 1: /* yield */
      case 2: /* wfe */
          gen_set_pc_im(s, s->pc);
          s->is_jmp = DISAS_WFE;

Actually I want to split out the yield code path from the wfe
one, because some day we may actually implement WFE as WFE,
at which point WFE has some trap-to-EL2 logic that YIELD
doesn't. I was about to write a patch to do that...
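
(A rough sketch of the shape such a split might take, assuming a hypothetical
DISAS_YIELD disposition that is handled like DISAS_WFE but without any future
trap-to-EL2 logic; the names here are illustrative, not the actual patch:)

--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ static void gen_nop_hint(DisasContext *s, int val)
+    case 1: /* yield */
+        gen_set_pc_im(s, s->pc);
+        s->is_jmp = DISAS_YIELD;   /* assumed new disposition: give up the
+                                      rest of this CPU's timeslice only */
+        break;
     case 2: /* wfe */
         gen_set_pc_im(s, s->pc);
         s->is_jmp = DISAS_WFE;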

(If you plan to run your custom OS under a hypervisor you
might prefer SEV/WFE over YIELD, because then if your custom OS
is under heavy load the hypervisor has a chance to swap this
vcpu out and run some other one.)

thanks
-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Peter Maydell
On 15 June 2015 at 16:05, Alex Züpke alexander.zue...@hs-rm.de wrote:
 On 15.06.2015 at 16:51, Peter Maydell wrote:
 On 15 June 2015 at 15:44, Alex Züpke alexander.zue...@hs-rm.de wrote:
 On 12.06.2015 at 20:03, Peter Maydell wrote:
 Probably the best approach would be to have something in
 arm_cpu_set_irq() which says if we are CPU X and we've
 just caused an interrupt to be set for CPU Y, then we
 should ourselves yield back to the main loop.

 Something like this, maybe, though I have done no more testing
 than checking it doesn't actively break kernel booting :-)


 Thanks! One more check for level is needed to get it to work:

 What happens without that? It's reasonable to have it,
 but extra cpu_exit()s shouldn't cause a problem beyond
 being a bit inefficient...

 The emulation gets stuck, for whatever reason I don't understand.

I'm beginning to suspect that your guest code has a race
condition in it, such that if the other CPU runs at a
point you weren't expecting it to then you end up
deadlocking or otherwise running into a bug in your guest.

In particular, I see the emulation getting stuck even without
this patch to arm_cpu_set_irq().

-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Peter Maydell
On 15 June 2015 at 16:05, Alex Züpke alexander.zue...@hs-rm.de wrote:
 Here's a simple IPI tester sending IPIs from CPU #0 to CPU #1 in an endless 
 loop.
 The IPIs are delayed until the timer interrupt triggers the main loop.

 http://www.cs.hs-rm.de/~zuepke/qemu/ipi.elf
 3174 bytes, md5sum 8d73890a60cd9b24a4f9139509b580e2

 Run testcase:
 $ qemu-system-arm -M vexpress-a15 -smp 2 -kernel ipi.elf -nographic

 The testcase prints the following on the serial console without the patch:

   +--- CPU 0 came up
   |+-- CPU 0 initialization completed
   || + CPU 0 timer interrupt, 1 HZ
   || |
   vv v
   0!1T.T.T.T.T.T.T.
     ^ ^
     | |
     | +-- CPU 1 received an IPI
     + CPU 1 came up


 Expected testcase output with patch:

   0!1T.<hundreds of dots>.T...

 So: more dots == more IPIs handled between two timer interrupts T ...

For me this test case (without any IPI related patches)
just prints "0!TT" (or sometimes "0!T") and then hangs.
The yield test binary does the same thing.

-- PMM



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Alex Züpke
On 15.06.2015 at 17:49, Peter Maydell wrote:
 On 15 June 2015 at 16:36, Alex Züpke alexander.zue...@hs-rm.de wrote:
 So this is the way to go:

 --- a/target-arm/translate.c
 +++ b/target-arm/translate.c
 @@ -4084,6 +4084,7 @@ static void gen_nop_hint(DisasContext *s, int val)
          gen_set_pc_im(s, s->pc);
          s->is_jmp = DISAS_WFI;
          break;
 +    case 1: /* yield */
      case 2: /* wfe */
          gen_set_pc_im(s, s->pc);
          s->is_jmp = DISAS_WFE;
 
 Actually I want to split out the yield code path from the wfe
 one, because some day we may actually implement WFE as WFE,
 at which point WFE has some trap-to-EL2 logic that YIELD
 doesn't. I was about to write a patch to do that...

OK.

Either the cpu_exit-after-sending-IPI patch or the YIELD patch would fix my
issue, but I think the YIELD one fits better.

I updated my testcase to YIELD during polling:
http://www.cs.hs-rm.de/~zuepke/qemu/ipi_yield.elf
3174 bytes, md5sum e74897e6b6d70f472db9e9d657780035


 (If you plan to run your custom OS under a hypervisor you
 might prefer SEV/WFE over YIELD, because then if your custom OS
 is under heavy load the hypervisor has a chance to swap this
 vcpu out and run some other one.)
 
 thanks
 -- PMM
 

Thanks for the hint!



Best regards
Alex



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Alex Zuepke

On 15.06.2015 at 20:58, Peter Maydell wrote:

On 15 June 2015 at 16:05, Alex Züpke alexander.zue...@hs-rm.de wrote:

On 15.06.2015 at 16:51, Peter Maydell wrote:

On 15 June 2015 at 15:44, Alex Züpke alexander.zue...@hs-rm.de wrote:

On 12.06.2015 at 20:03, Peter Maydell wrote:

Probably the best approach would be to have something in
arm_cpu_set_irq() which says if we are CPU X and we've
just caused an interrupt to be set for CPU Y, then we
should ourselves yield back to the main loop.

Something like this, maybe, though I have done no more testing
than checking it doesn't actively break kernel booting :-)



Thanks! One more check for level is needed to get it to work:


What happens without that? It's reasonable to have it,
but extra cpu_exit()s shouldn't cause a problem beyond
being a bit inefficient...


The emulation gets stuck, for whatever reason I don't understand.


I'm beginning to suspect that your guest code has a race
condition in it, such that if the other CPU runs at a
point you weren't expecting it to then you end up
deadlocking or otherwise running into a bug in your guest.

In particular, I see the emulation getting stuck even without
this patch to arm_cpu_set_irq().

-- PMM


Yes, it's a bug, sorry for that. I removed too much code to get a simple 
testcase. It's stuck in the first spinlock where CPU#1 is waiting for 
CPU#0 to initialize the rest of the system, and I need to WFE or YIELD 
here as well.


But this is showing the original problem again: the emulation gets
stuck spinning on CPU #1 forever, because the main loop doesn't switch 
to CPU #0 voluntarily. Just press a key on the console/emulated serial 
line to trigger an event to QEMU's main loop, and the testcase should 
continue.



Best regards
Alex



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-15 Thread Peter Crosthwaite
On Mon, Jun 15, 2015 at 8:49 AM, Peter Maydell peter.mayd...@linaro.org wrote:
 On 15 June 2015 at 16:36, Alex Züpke alexander.zue...@hs-rm.de wrote:
 So this is the way to go:

 --- a/target-arm/translate.c
 +++ b/target-arm/translate.c
 @@ -4084,6 +4084,7 @@ static void gen_nop_hint(DisasContext *s, int val)
          gen_set_pc_im(s, s->pc);
          s->is_jmp = DISAS_WFI;
          break;
 +    case 1: /* yield */
      case 2: /* wfe */
          gen_set_pc_im(s, s->pc);
          s->is_jmp = DISAS_WFE;

 Actually I want to split out the yield code path from the wfe
 one, because some day we may actually implement WFE as WFE,
 at which point WFE has some trap-to-EL2 logic that YIELD
 doesn't. I was about to write a patch to do that...


If anything, the existing WFE code is really the yield semantic, so the
correct way is probably to rename the WFE helper as yield and add a
comment that WFE is approximated as yield until that second code path
exists.

Regards,
Peter

 (If you plan to run your custom OS under a hypervisor you
 might prefer SEV/WFE over YIELD, because then if your custom OS
 is under heavy load the hypervisor has a chance to swap this
 vcpu out and run some other one.)

 thanks
 -- PMM




[Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-12 Thread Alex Züpke
Hi,

I'm benchmarking some IPI (== inter-processor-interrupt) synchronization stuff 
of my custom kernel on QEMU ARM (qemu-system-arm -M vexpress-a15 -smp 2) and 
ran into the following problem: pending IPIs are delayed until the QEMU main 
loop receives an event (for example the timer interrupt expires or I press a 
key on the console).

The following timing diagram tries to show this:

  CPU #0                     CPU #1
  ======                     ======
  ... other stuff ...        WFI (wait for interrupt, like x86 HLT)
  send SGI in MPCore
  polls for completeness
                             time passes ...
  polls ...
                             ... and passes ...
  still polls ...
                             ... and passes ...
  still polls ...
                             ... and passes ...

                             timer interrupt expires
                             now QEMU switches to CPU #1
                               receives IPI
                               signals completeness
                               WFI
                             QEMU switches to CPU #0
  polling done
  process timer interrupt
  ...


My timer is set up to generate an interrupt once a second, so I only get 1 IPI
interrupt per second on QEMU. When I run the test on real hardware (i.MX6Q), I 
get millions of IPIs instead.


I tried to fix this by forcing QEMU back into the main loop and added a call 
to qemu_notify_event() in the IPI-sending path of the ARM interrupt controller:


diff --git a/hw/intc/arm_gic.c b/hw/intc/arm_gic.c
index c1d2e70..20dba75 100644
--- a/hw/intc/arm_gic.c
+++ b/hw/intc/arm_gic.c
@@ -21,6 +21,7 @@
 #include "hw/sysbus.h"
 #include "gic_internal.h"
 #include "qom/cpu.h"
+#include "qemu/main-loop.h"
 
 //#define DEBUG_GIC
 
@@ -898,6 +899,7 @@ static void gic_dist_writel(void *opaque, hwaddr offset,
             target_cpu = ctz32(mask);
         }
         gic_update(s);
+        qemu_notify_event();
         return;
     }
     gic_dist_writew(opaque, offset, value & 0xffff, attrs);



It works as expected (I get thousands of IPIs per second now), but it does not
feel right, so is there a better way to improve the responsiveness of IPI 
handling in QEMU?

Best regards
Alex



Re: [Qemu-devel] QEMU ARM SMP: IPI delivery delayed until next main loop event // how to improve IPI latency?

2015-06-12 Thread Peter Maydell
On 12 June 2015 at 17:38, Alex Züpke alexander.zue...@hs-rm.de wrote:
 Hi,

 I'm benchmarking some IPI (== inter-processor-interrupt) synchronization 
 stuff of my custom kernel on QEMU ARM (qemu-system-arm -M vexpress-a15 -smp 
 2) and ran into the following problem: pending IPIs are delayed until the 
 QEMU main loop receives an event (for example the timer interrupt expires or 
 I press a key on the console).

 The following timing diagram tries to show this:

   CPU #0                     CPU #1
   ======                     ======
   ... other stuff ...        WFI (wait for interrupt, like x86 HLT)
   send SGI in MPCore
   polls for completeness
                              time passes ...
   polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...
   still polls ...
                              ... and passes ...

                              timer interrupt expires
                              now QEMU switches to CPU #1
                                receives IPI
                                signals completeness
                                WFI
                              QEMU switches to CPU #0
   polling done
   process timer interrupt
   ...

Right. The problem is that we don't have any way of telling
that CPU 0 is just sat busy waiting for CPU 1.

 It works as expected (I get thousands of IPIs per second now), but
 it does not feel right, so is there a better way to improve the
 responsiveness of IPI handling in QEMU?

Probably the best approach would be to have something in
arm_cpu_set_irq() which says "if we are CPU X and we've
just caused an interrupt to be set for CPU Y, then we
should ourselves yield back to the main loop".

Something like this, maybe, though I have done no more testing
than checking it doesn't actively break kernel booting :-)

--- a/target-arm/cpu.c
+++ b/target-arm/cpu.c
@@ -325,6 +325,18 @@ static void arm_cpu_set_irq(void *opaque, int irq, int level)
     default:
         hw_error("arm_cpu_set_irq: Bad interrupt line %d\n", irq);
     }
+
+    /* If we are currently executing code for CPU X, and this
+     * CPU we've just triggered an interrupt on is CPU Y, then
+     * make CPU X yield control back to the main loop at the
+     * end of the TB it's currently executing.
+     * This avoids problems where the interrupt was an IPI
+     * and CPU X would otherwise sit busy looping for the rest
+     * of its timeslice because Y hasn't had a chance to run.
+     */
+    if (current_cpu && current_cpu != cs) {
+        cpu_exit(current_cpu);
+    }
 }
 
 static void arm_cpu_kvm_set_irq(void *opaque, int irq, int level)


-- PMM



[Qemu-devel] QEMU with smp

2007-02-25 Thread Danny Chieh-Yao, Cheng

Hi all,

I am trying to run QEMU under Windows, but I can't get -smp 2 working. I 
know that QEMU under Windows is still alpha stage, but does anyone know 
if the -smp 2 option works at all under Windows? I am getting the error
"Can't find cpu1" and QEMU just halts.


Thanks,
Danny




[Qemu-devel] QEMU and SMP Option on dual core processor

2007-02-12 Thread Danny Chieh-Yao, Cheng

Hi all,

I am running a Centrino Duo on Windows XP Home SP2.

Here's the problem...

I want to run the option -smp 2 with QEMU, but when I start it, it
says it cannot find cpu1. When I look into Task Manager and the
affinity, it's running on CPU0, but if I use imagecfg to set QEMU to run
on CPU1, it STILL gives me the same error. Does anyone know about this problem?


Also, I tried to turn off one processor in the BIOS, but there's no such
option either; I'm using an ASUS A8JM laptop.


Regards,
Danny

