Re: [Qemu-devel] Help on TLB Flush

2015-02-13 Thread Paolo Bonzini


On 12/02/2015 22:57, Peter Maydell wrote:
 The only
 requirement is that if the CPU that did the TLB maintenance
 op executes a DMB (barrier) then the TLB op must finish
 before the barrier completes execution. So you could split
 the "kick off TLB invalidate" and "make sure all CPUs
 are done" phases if you wanted. [cf v8 ARM ARM rev A.e
 section D4.7.2 and in particular the subsection on
 ordering and completion.]

You can just make DMB start a new translation block.  Then when the TLB
flush helpers call cpu_exit() or cpu_interrupt() the flush request is
serviced.

Paolo



Re: [Qemu-devel] Help on TLB Flush

2015-02-13 Thread Paolo Bonzini


On 13/02/2015 10:37, Mark Burton wrote:
 the memory barrier is on the cpu requesting the flush isn’t it (not
 on the CPU that is being flushed)?

Oops, I misread Peter's explanation.

In that case, perhaps DMB can be treated in a similar way to WFI, using
cpu->halted.  Queueing work on other CPUs can be done with
async_run_on_cpu, which exits the idle loop in qemu_tcg_wait_io_event
(this avoids the deadlocks).  Checking that other CPUs have flushed the
TLBs can be done in cpu_has_work (always return false if cpu->halted ==
true and there are outstanding TLB requests).
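
A minimal sketch of the idea above, written as a single-threaded toy model in Python rather than as QEMU code. Every name here (HaltingCPU, request_remote_flush, barrier, run_round) is hypothetical; only the shape is taken from the description: the barrier halts the requesting CPU, flush work is queued on the other CPUs, and the "has work" check refuses to wake the halted CPU while its flush requests are still outstanding.

```python
# Toy model only: none of these names are QEMU APIs.
from collections import deque


class HaltingCPU:
    def __init__(self, index):
        self.index = index
        self.tlb = {}                # virtual page -> physical page
        self.halted = False
        self.work = deque()          # callables queued by other CPUs
        self.outstanding = 0         # flushes this CPU requested, not yet done

    def has_work(self):
        # A CPU halted at a barrier must not wake while flushes are pending.
        return not (self.halted and self.outstanding > 0)


def request_remote_flush(requester, target):
    """Queue a flush on the target and return immediately."""
    requester.outstanding += 1

    def do_flush():
        target.tlb.clear()
        requester.outstanding -= 1

    target.work.append(do_flush)


def barrier(cpu):
    """Model the barrier as a halt rather than a blocking wait."""
    cpu.halted = True


def run_round(cpus):
    """One scheduler pass: drain queued work, then wake CPUs that may run."""
    for cpu in cpus:
        while cpu.work:
            cpu.work.popleft()()
    for cpu in cpus:
        if cpu.halted and cpu.has_work():
            cpu.halted = False
```

Because the requester halts instead of blocking, it never holds a lock while waiting, which is presumably the property that sidesteps the deadlocks mentioned above.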

Paolo



Re: [Qemu-devel] Help on TLB Flush

2015-02-13 Thread Mark Burton
The memory barrier is on the CPU requesting the flush, isn’t it (not on the CPU 
that is being flushed)?
Cheers
Mark.

 On 13 Feb 2015, at 10:34, Paolo Bonzini pbonz...@redhat.com wrote:
 
 
 
 [...]




Re: [Qemu-devel] Help on TLB Flush

2015-02-13 Thread Lluís Vilanova
Mark Burton writes:

 On 13 Feb 2015, at 08:24, Peter Maydell peter.mayd...@linaro.org wrote:
 
 On 13 February 2015 at 07:16, Mark Burton mark.bur...@greensocs.com wrote:
 If the kernel is doing this - then effectively - for x86, each CPU only
 flushes its own TLB (from the perspective of QEMU) - correct?
 (In which case, for QEMU itself, for x86, we don't need to implement
 a global flush, and hence we don't need to build the mechanism to sync?)

 The semantics you need are "flush the QEMU TLB for CPU X" (where
 X may not be the CPU you're running on). This is what tlb_flush()
 does: it takes a CPU argument to act on. (Ditto tlb_flush_page, etc.)
 We then use that to implement the target's required semantics
 (eg in ARM the tlbiall_is_write() function is handled by iterating
 through all CPUs and calling tlb_flush on them).

 What Lluis implied seemed to be that the kernel arranged to signal the CPU 
 that would flush. Hence, (for X86), we would only ever flush our own TLB.

That's correct.

[...]
 For our immediate concern, in the interests of getting the thing working and
 making sure we’ve turned over all the stones, on ARM it MAY help us to check
 that the flush has happened ‘in the next memory barrier’.
 I don't know if that will help us or not, and even if it does, I
 agree with you, it would be messier than it need be.
 However, in the interests of making sure that there are no other issues, we
 may ‘hack’ something before we put in place a more elegant solution.
 (Right now, we have some mutex issues; shifting the sync to the barrier MAY
 help us avoid that. To be seen. And anyway, it would only be a temporary
 fix.)

But you shouldn't assume that everyone either uses x86's semantics (aka, each
CPU gets an IPI), or the ARM semantics you described where the global TLB flush
instruction has asynchronous effects. First, in ARM you still have to ensure
other CPUs did what you asked them to (whenever the arch manual says you must do
so). Second, it seems like ARM does not always behave in the way you described:

  http://lxr.free-electrons.com/source/arch/arm/kernel/smp.c?v=2.6.32#L630

Granted, this is just the same behaviour as x86, but no one guarantees you that
some other operation in any of the multiple architectures supported by QEMU will
never need a synchronous instruction with global effects.

I understand the pressure of getting something running and working from there,
but I think that having a framework for asynchronous cross-CPU messaging would be
rather useful in the future. That can then be complemented with a mechanism to
wait for these asynchronous messages. You can achieve any desired behaviour by
composing these two.
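
To make the composition concrete, here is a small sketch as a toy Python model, not QEMU code. All names (ToyVCPU, post, wait_all) are made up for illustration, and worker threads stand in for vCPUs. Posting a message is the asynchronous half and returns a handle; waiting on handles is the separate, optional half.

```python
# Toy model only: worker threads stand in for vCPUs.
import queue
import threading


class ToyVCPU(threading.Thread):
    """A worker thread with an inbox of (function, completion) messages."""

    def __init__(self, index):
        super().__init__(daemon=True)
        self.index = index
        self.tlb = {}
        self.inbox = queue.Queue()

    def run(self):
        while True:
            func, done = self.inbox.get()
            if func is None:         # shutdown sentinel
                done.set()
                return
            func(self)
            done.set()               # signal completion to any waiter


def post(cpu, func):
    """Asynchronous half: enqueue a message and return a handle at once."""
    done = threading.Event()
    cpu.inbox.put((func, done))
    return done


def wait_all(handles, timeout=5.0):
    """Waiting half: block until every posted message has been handled."""
    return all(h.wait(timeout) for h in handles)
```

A synchronous global flush is then `wait_all([post(c, flush) for c in cpus])`, while the deferred behaviour discussed for ARM keeps the handles and calls `wait_all` only when the barrier is reached.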


Cheers,
  Lluis

-- 
 And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer.
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth



Re: [Qemu-devel] Help on TLB Flush

2015-02-13 Thread Mark Burton
Agreed
Cheers
Mark.

 On 13 Feb 2015, at 14:30, Lluís Vilanova vilan...@ac.upc.edu wrote:
 
 [...]




Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton

 On 12 Feb 2015, at 16:31, Dr. David Alan Gilbert dgilb...@redhat.com wrote:
 
 * Mark Burton (mark.bur...@greensocs.com) wrote:
 
 On 12 Feb 2015, at 16:01, Peter Maydell peter.mayd...@linaro.org wrote:
 
 On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:
 
 On 12.02.2015, at 15:35, Mark Burton mark.bur...@greensocs.com wrote:
 We are proposing to implement this by signalling all other CPUs
 to exit (and requesting they flush before re-starting). In other
 words, this would happen asynchronously.
 
 For global flushes, give them a pointer payload along with the flush
 request and tell all cpus to increment it atomically. In your main
 thread, wait until *ptr == nKickedCpus.
 
 I bet this will not be the only situation where you want to
 do a "get all other CPUs to do $something and wait til they
 have done so" kind of operation, so some lightweight but generic
 infrastructure for doing that would not be a bad plan. (Similarly
 "get all other CPUs to stop, then I can do $something and let
 the others continue".)
 
 We tried this - we ended up in knots.
 We had 2 CPUs trying to flush at about the same time, both waiting for 
 the other.
 We had CPUs trying to get the global mutex to finish what they were 
 doing, while being told to flush.
 We had CPUs in the global mutex trying to do something that would cause a 
 flush… etc.
 We had spaghetti with extra Bolognese sauce…
 
 This is the hard problem of multithreaded emulation.
 You've always got to let CPUs get back to a point where you can
 invalidate a mapping/page quickly.
 
 Thus you've also got to be very careful about where any CPU might
 get into a loop or take another lock that would stop another CPU
 causing an invalidate.  Either that or you need a way of somehow
 breaking locks or recovering from the situation.

Indeed - 
for now, we’re building something which will likely be less than ideal. Once 
we have some sort of evidence that it works, and (hopefully) more reliably than 
the approach we have right now, then we can come up with a more elegant scheme.


 
 We eventually concluded, yes - in an infinite universe everything is 
 possible, but if we could simply do this ‘asynchronously’ then our lives 
 would be a LOT easier.
 e.g. asking all CPUs to “exit and do something” is easy; waiting for 
 them to do that is a whole other problem…
 
 Which is why you've got to bound how long it might take
 those CPUs to get back to you, and optimise out cases where
 it's not really needed later.
 
 Our question is - do we need this ‘sync’ (before the flush), or can we 
 actually allow CPUs to flush themselves asynchronously?
 
 Always assume the worst.

:-)

Cheers
Mark.

 




Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Peter Maydell
On 12 February 2015 at 15:19, Alexander Graf ag...@suse.de wrote:
 On 12.02.15 16:08, Mark Burton wrote:
 Our question is - do we need this ‘sync’ (before the flush),
 or can we actually allow CPU’s to flush themselves asynchronously….

 The respective target architecture specs will tell you. And I very much
 doubt that it is ok in most cases.

For ARM note that TLB maintenance operations do not have to
complete synchronously. They can be reordered relative to other
TLB maintenance ops or to loads or stores (by this CPU or
by other CPUs if this is a global invalidate). The only
requirement is that if the CPU that did the TLB maintenance
op executes a DMB (barrier) then the TLB op must finish
before the barrier completes execution. So you could split
the "kick off TLB invalidate" and "make sure all CPUs
are done" phases if you wanted. [cf v8 ARM ARM rev A.e
section D4.7.2 and in particular the subsection on
ordering and completion.]

This only applies to ARM guests, of course. (Other CPU
architectures are available. :-))

-- PMM



Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Lluís Vilanova
Mark Burton writes:

 On 12 Feb 2015, at 16:38, Alexander Graf ag...@suse.de wrote:
 
 
 
 On 12.02.15 15:58, Peter Maydell wrote:
 On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:
 almost nobody except x86 does global flushes
 
 All ARM TLB maintenance operations have both "this CPU only"
 and "all TLBs in the Inner Shareable domain" [that's ARM-speak
 for "every CPU core in the cluster"] variants (the latter
 being the TLB *IS operations). Looking at Linux's
 arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
 most of the operations defined there use the IS variants.
 
 Wow, did anyone benchmark this? I know that PPC switched away from
 global flushes and instead tracks the CPUs a task was running on to
 limit the scope of CPUs that need to flush.

 Doesn’t that mean you have to signal a specific CPU to cause it to flush 
 itself…. Isn’t that in itself expensive? Do you have to organise some sort of 
 atomicity yourself around that too?

Yup. AFAIR, Linux in x86-64 queues a request to a per-CPU request list, and uses
IPIs to signal these types of operations to the target CPU:

  http://lxr.free-electrons.com/source/kernel/smp.c?v=2.6.32#L386

Waiting for completion is implemented on top by incrementing some counter from
each CPU, and waiting for it to have the correct final value.

If something were implemented on these lines, it could be used as a generic
cross-CPU event messaging infrastructure (plus some interrupt bit in the CPU
structure that TCG would check to break away from guest code; I believe
something similar is already being used - icount? -).

PS: To be honest, I still don't know which TLBs we're talking about here, and
which cases trigger these TLB flush operations.


Cheers,
  Lluis




Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Peter Maydell
On 12 February 2015 at 15:38, Alexander Graf ag...@suse.de wrote:
 On 12.02.15 15:58, Peter Maydell wrote:
 All ARM TLB maintenance operations have both "this CPU only"
 and "all TLBs in the Inner Shareable domain" [that's ARM-speak
 for "every CPU core in the cluster"] variants (the latter
 being the TLB *IS operations). Looking at Linux's
 arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
 most of the operations defined there use the IS variants.

 Wow, did anyone benchmark this? I know that PPC switched away from
 global flushes and instead tracks the CPUs a task was running on to
 limit the scope of CPUs that need to flush.

That would be a valid implementation. The CPU has to behave
as the spec says it must, but there's no reason you couldn't
implement "flush by ASID for all TLBs" via some implementation-specific
tracking of ASID use per CPU to limit which cores
you sent the flush request to, if you thought that was a
better way to do it.
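
One way to picture that kind of tracking, as a toy Python model under stated assumptions: TLB entries are keyed by (ASID, virtual page), and the hypothetical AsidTracker records which CPUs have run with each ASID. Neither name comes from QEMU or from any real hardware implementation.

```python
# Toy model only: illustrates limiting flush targets by ASID usage.
class AsidTracker:
    """Remember which CPUs have run with each ASID since its last flush."""

    def __init__(self):
        self.users = {}              # asid -> set of CPU indices

    def note_use(self, cpu, asid):
        self.users.setdefault(asid, set()).add(cpu)

    def flush_targets(self, asid):
        """CPUs that must receive the flush-by-ASID request."""
        return sorted(self.users.pop(asid, set()))


def flush_by_asid(tlbs, tracker, asid):
    """Flush one ASID wherever it may be cached; return the CPUs contacted."""
    targets = tracker.flush_targets(asid)
    for cpu in targets:
        # Keep only the entries that belong to other ASIDs.
        tlbs[cpu] = {k: v for k, v in tlbs[cpu].items() if k[0] != asid}
    return targets
```

CPUs that never ran with the ASID are not contacted at all, which is where the saving would come from; the cost is the bookkeeping on every ASID switch.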

-- PMM



Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton
Up top - thanks Peter, I think you may have given us an idea!

 On 12 Feb 2015, at 23:10, Lluís Vilanova vilan...@ac.upc.edu wrote:
 
 Mark Burton writes:
 
 On 12 Feb 2015, at 16:38, Alexander Graf ag...@suse.de wrote:
 
 
 
 On 12.02.15 15:58, Peter Maydell wrote:
 On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:
 almost nobody except x86 does global flushes
 
 All ARM TLB maintenance operations have both "this CPU only"
 and "all TLBs in the Inner Shareable domain" [that's ARM-speak
 for "every CPU core in the cluster"] variants (the latter
 being the TLB *IS operations). Looking at Linux's
 arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
 most of the operations defined there use the IS variants.
 
 Wow, did anyone benchmark this? I know that PPC switched away from
 global flushes and instead tracks the CPUs a task was running on to
 limit the scope of CPUs that need to flush.
 
 Doesn’t that mean you have to signal a specific CPU to cause it to flush 
 itself…. Isn’t that in itself expensive? Do you have to organise some sort 
 of atomicity yourself around that too?
 
 Yup. AFAIR, Linux in x86-64 queues a request to a per-CPU request list, and 
 uses
 IPIs to signal these types of operations to the target CPU:
 
  http://lxr.free-electrons.com/source/kernel/smp.c?v=2.6.32#L386
 
 Waiting for completion is implemented on top by incrementing some counter from
 each CPU, and waiting for it to have the correct final value.

If the kernel is doing this, then effectively, for x86, each CPU only flushes 
its own TLB (from the perspective of QEMU) - correct?
(In which case, for QEMU itself, for x86, we don't need to implement a global 
flush, and hence we don't need to build the mechanism to sync?)

If I understand correctly then, the processor that causes some pain is the ARM, 
which has (and uses) global flush, but the mitigating factor is that those 
flushes can be asynchronous so long as they complete before a memory barrier.

Cheers

Mark.


 




Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton

 On 13 Feb 2015, at 08:24, Peter Maydell peter.mayd...@linaro.org wrote:
 
 On 13 February 2015 at 07:16, Mark Burton mark.bur...@greensocs.com wrote:
 If the kernel is doing this - then effectively - for x86, each CPU only
 flushes its own TLB (from the perspective of QEMU) - correct?
 (In which case, for QEMU itself, for x86, we don't need to implement
 a global flush, and hence we don't need to build the mechanism to sync?)

 The semantics you need are "flush the QEMU TLB for CPU X" (where
 X may not be the CPU you're running on). This is what tlb_flush()
 does: it takes a CPU argument to act on. (Ditto tlb_flush_page, etc.)
 We then use that to implement the target's required semantics
 (eg in ARM the tlbiall_is_write() function is handled by iterating
 through all CPUs and calling tlb_flush on them).

What Lluis implied seemed to be that the kernel arranged to signal the CPU that 
would flush. Hence, (for X86), we would only ever flush our own TLB.

 
 If you don't want the pain of checking the semantics of every
 backend and figuring out a new set of primitives to implement,
 then what you need to do is continue to provide the guarantees
 the current tlb_flush function does: when it returns then the
 CPU it's supposed to have acted on has definitely done so.
 
 You can try and be cleverer if you want to, but personally
 I would recommend keeping the scope of your work simple
 where you can.

Yes - though keeping it simple (silly) seems to have some complexities in this 
case, which is why we are trying to reduce the guarantees that tlb_flush() 
provides. 

At present, the ‘foreach cpu, tlb_flush()’ is effectively atomic, as no other 
CPU will be executing at the same time.
Adding multi-threading, we can already say that this ‘atomicity’ isn’t strictly 
required. As you say, the only thing tlb_flush needs to guarantee is that the 
CPU concerned has flushed. 
That already helps, and I agree with you that it is the right place to take 
tlb_flush().

Of course, when only the current CPU is flushed things are much simpler (and 
already handled)...


For our immediate concern, in the interests of getting the thing working and 
making sure we’ve turned over all the stones, on ARM it MAY help us to check 
that the flush has happened ‘in the next memory barrier’.
I don't know if that will help us or not, and even if it does, I 
agree with you, it would be messier than it need be.
However, in the interests of making sure that there are no other issues, we 
may ‘hack’ something before we put in place a more elegant solution. 
(Right now, we have some mutex issues; shifting the sync to the barrier MAY 
help us avoid that. To be seen. And anyway, it would only be a temporary 
fix.)

Cheers

Mark.



 
 -- PMM






Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Peter Maydell
On 13 February 2015 at 07:16, Mark Burton mark.bur...@greensocs.com wrote:
 If the kernel is doing this - then effectively - for x86, each CPU only
 flushes its own TLB (from the perspective of QEMU) - correct?
 (In which case, for QEMU itself, for x86, we don't need to implement
 a global flush, and hence we don't need to build the mechanism to sync?)

The semantics you need are "flush the QEMU TLB for CPU X" (where
X may not be the CPU you're running on). This is what tlb_flush()
does: it takes a CPU argument to act on. (Ditto tlb_flush_page, etc.)
We then use that to implement the target's required semantics
(eg in ARM the tlbiall_is_write() function is handled by iterating
through all CPUs and calling tlb_flush on them).
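
As a rough illustration of those semantics (a toy Python model, not the QEMU source; ToyCPU and the toy_* functions are made-up names that only mirror the shape described above), a per-CPU flush primitive and a global flush built by iterating over it could look like this:

```python
# Toy model only: each vCPU owns a private software TLB.
from dataclasses import dataclass, field


@dataclass
class ToyCPU:
    index: int
    tlb: dict = field(default_factory=dict)  # virtual page -> physical page


def toy_tlb_flush(cpu):
    """Drop every cached translation of one CPU; done when this returns."""
    cpu.tlb.clear()


def toy_tlb_flush_page(cpu, vpage):
    """Drop a single cached translation of one CPU, if present."""
    cpu.tlb.pop(vpage, None)


def toy_flush_all_cpus(cpus):
    """Target-level 'global' flush built from the per-CPU primitive."""
    for cpu in cpus:
        toy_tlb_flush(cpu)
```

The guarantee worth noticing is the one in the docstring: when the per-CPU call returns, that CPU's TLB really has been flushed. This holds trivially in a single-threaded model and is exactly what becomes hard once vCPUs run in parallel.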

If you don't want the pain of checking the semantics of every
backend and figuring out a new set of primitives to implement,
then what you need to do is continue to provide the guarantees
the current tlb_flush function does: when it returns then the
CPU it's supposed to have acted on has definitely done so.

You can try and be cleverer if you want to, but personally
I would recommend keeping the scope of your work simple
where you can.

-- PMM



Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Alexander Graf

 On 12.02.2015, at 15:35, Mark Burton mark.bur...@greensocs.com wrote:
 
 
 TLB Flush:
 
 We have spent a few days on this issue, and still haven’t resolved the best 
 path.
 
 Our solution seems to work, most of the time, but we still have some strange 
 issues - so I want to check that what we are proposing has a chance of 
 working.
 
 
 Our plan is to allow all CPU’s to continue. Potentially one CPU will want to 
 write to the TLBs. Subsequent to the write, it requests a TLB Flush.

Local or global? For local TLB flushes you don't notify the other CPUs at all. 
For global ones, the semantics of the call usually dictate atomicity.

 We are proposing to implement this by signalling all other CPU’s to exit (and 
 requesting they flush before re-starting). In other words, this would happen 
 asynchronously.

For global flushes, give them a pointer payload along with the flush request 
and tell all cpus to increment it atomically. In your main thread, wait until 
*ptr == nKickedCpus.

FWIW TLBs are always CPU local. When there's a global TLB flush instruction, 
it pretty much does stall the CPU, notifies the others to also flush their 
TLBs, waits and then continues.
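
The counter scheme above can be sketched as follows. This is a toy Python model, not QEMU code: AckCounter and global_flush are hypothetical names, short-lived threads stand in for the kicked CPUs, and a condition variable plays the role of the atomic increment plus the wait on *ptr.

```python
# Toy model only: threads stand in for the kicked vCPUs.
import threading


class AckCounter:
    """The 'pointer payload': a counter each kicked CPU bumps exactly once."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def ack(self):
        with self._cond:
            self._count += 1
            self._cond.notify_all()

    def wait_for(self, expected, timeout=None):
        with self._cond:
            return self._cond.wait_for(
                lambda: self._count >= expected, timeout)


def global_flush(tlbs, requester):
    """Flush own TLB, kick all others, block until each has acknowledged."""
    counter = AckCounter()
    others = [i for i in range(len(tlbs)) if i != requester]

    def remote_flush(i):
        tlbs[i].clear()              # the kicked CPU flushes its own TLB...
        counter.ack()                # ...then increments the shared counter

    tlbs[requester].clear()
    threads = [threading.Thread(target=remote_flush, args=(i,))
               for i in others]
    for t in threads:
        t.start()
    done = counter.wait_for(len(others), timeout=5.0)
    for t in threads:
        t.join()
    return done
```

The timeout is only there so that the sketch cannot hang; what a real implementation should do when a CPU fails to acknowledge is left open here.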

If this really does become a performance bottleneck (which I doubt, since 
almost nobody except x86 does global flushes), you can also do some nasty hacky 
tricks, such as (atomically) changing the valid bit in remote CPUs' TLB entries. 
But really only do this as a last resort if the clean version doesn't perform 
well.


Alex

 This means - there is a theoretical period of time when one CPU is writing to 
 the TLBs while other CPU’s are executing.  Our belief is that this has to be 
 handled by software anyway, and this should not be an issue from Qemu’s point 
 of view. 
 The alternative would be to force all other CPU’s to exit before writing the 
 TLB’s - this is both expensive and very painful to organise (as we get into 
 horrid deadlocks whichever way we turn)…
 
 We’d appreciate some thoughts on this...
 
 Cheers
 
 Mark.
 
 
 
 
 




[Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton

TLB Flush:

We have spent a few days on this issue, and still haven’t resolved the best 
path.

Our solution seems to work, most of the time, but we still have some strange 
issues - so I want to check that what we are proposing has a chance of working.


Our plan is to allow all CPUs to continue. Potentially one CPU will want to 
write to the TLBs. Subsequent to the write, it requests a TLB flush. We are 
proposing to implement this by signalling all other CPUs to exit (and 
requesting they flush before re-starting). In other words, this would happen 
asynchronously.

This means there is a theoretical period of time when one CPU is writing to 
the TLBs while other CPUs are executing.  Our belief is that this has to be 
handled by software anyway, and this should not be an issue from QEMU’s point 
of view. 
The alternative would be to force all other CPUs to exit before writing the 
TLBs - this is both expensive and very painful to organise (as we get into 
horrid deadlocks whichever way we turn)…

We’d appreciate some thoughts on this...

Cheers

Mark.





Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Peter Maydell
On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:

 On 12.02.2015, at 15:35, Mark Burton mark.bur...@greensocs.com wrote:
 We are proposing to implement this by signalling all other CPU’s
 to exit (and requesting they flush before re-starting). In other
 words, this would happen asynchronously.

 For global flushes, give them a pointer payload along with the flush
 request and tell all cpus to increment it atomically. In your main
 thread, wait until *ptr == nKickedCpus.

I bet this will not be the only situation where you want to
do a "get all other CPUs to do $something and wait til they
have done so" kind of operation, so some lightweight but generic
infrastructure for doing that would not be a bad plan. (Similarly
"get all other CPUs to stop, then I can do $something and let
the others continue".)

-- PMM



Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton
OK - Alex - your implication is that it has to be atomic, we need the sync…

:-(

I have a horrid feeling that the atomicity of global flush can’t be causing the 
(almost, but not quite reproducible) errors we’re seeing - but… anyway ;-)

Cheers

Mark.

 On 12 Feb 2015, at 15:45, Alexander Graf ag...@suse.de wrote:
 
 
 [...]




Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Alexander Graf


On 12.02.15 16:08, Mark Burton wrote:
 
 On 12 Feb 2015, at 16:01, Peter Maydell peter.mayd...@linaro.org wrote:

 On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:

 On 12.02.2015, at 15:35, Mark Burton mark.bur...@greensocs.com wrote:
 We are proposing to implement this by signalling all other CPU’s
 to exit (and requesting they flush before re-starting). In other
 words, this would happen asynchronously.

 For global flushes, give them a pointer payload along with the flush
 request and tell all cpus to increment it atomically. In your main
 thread, wait until *ptr == nKickedCpus.

 I bet this will not be the only situation where you want to
 do a "get all other CPUs to do $something and wait til they
 have done so" kind of operation, so some lightweight but generic
 infrastructure for doing that would not be a bad plan. (Similarly
 "get all other CPUs to stop, then I can do $something and let
 the others continue".)
 
 We tried this - we ended up in knots.
 We had 2 CPU’s trying to flush at about the same time, both waiting for the 
 other.
 We had CPU’s trying to get the global mutex to finish what they were doing, 
 while being told to flush, 
 We had CPU’s in the global mutex trying to do something that would cause a 
 flush… etc
 We had spaghetti with extra Bolognese sauce…
 
 We eventually concluded, yes - in an infinite universe everything is 
 possible, but if we could simply do this ‘asynchronously’ then our lives 
 would be a LOT easier.
 e.g.  - ask all CPU’s to “exit and do something” is easy -  wait for them to 
 do that is a whole other problem…
 
 Our question is - do we need this ‘sync’ (before the flush), or can we 
 actually allow CPU’s to flush themselves asynchronously….

The respective target architecture specs will tell you. And I very much
doubt that it is ok in most cases.


Alex



Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Dr. David Alan Gilbert
* Mark Burton (mark.bur...@greensocs.com) wrote:
 
  On 12 Feb 2015, at 16:01, Peter Maydell peter.mayd...@linaro.org wrote:
  
  On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:
  
  On 12.02.2015, at 15:35, Mark Burton mark.bur...@greensocs.com wrote:
  We are proposing to implement this by signalling all other CPU’s
  to exit (and requesting they flush before re-starting). In other
  words, this would happen asynchronously.
  
  For global flushes, give them a pointer payload along with the flush
  request and tell all cpus to increment it atomically. In your main
  thread, wait until *ptr == nKickedCpus.
  
  I bet this will not be the only situation where you want to
  do a "get all other CPUs to do $something and wait til they
  have done so" kind of operation, so some lightweight but generic
  infrastructure for doing that would not be a bad plan. (Similarly
  "get all other CPUs to stop, then I can do $something and let
  the others continue".)
 
 We tried this - we ended up in knots.
 We had 2 CPU’s trying to flush at about the same time, both waiting for the 
 other.
 We had CPU’s trying to get the global mutex to finish what they were doing, 
 while being told to flush, 
 We had CPU’s in the global mutex trying to do something that would cause a 
 flush… etc
 We had spaghetti with extra Bolognese sauce…

This is the hard problem of multithreaded emulation.
You've always got to let CPUs get back to a point where you can
invalidate a mapping/page quickly.

Thus you've also got to be very careful about where any CPU might
get into a loop or take another lock that would stop another CPU
causing an invalidate.  Either that or you need a way of somehow
breaking locks or recovering from the situation.

 We eventually concluded, yes - in an infinite universe everything is 
 possible, but if we could simply do this ‘asynchronously’ then our lives 
 would be a LOT easier.
 e.g.  - ask all CPU’s to “exit and do something” is easy -  wait for 
 them to do that is a whole other problem…

Which is why you've got to bound how long it might take
those CPUs to get back to you, and optimise out cases where
it's not really needed later.

 Our question is - do we need this ‘sync’ (before the flush), or can we 
 actually allow CPU’s to flush themselves asynchronously….

Always assume the worst.

Dave

 
 Cheers
 
 Mark.
 
 
 
  
  -- PMM
 
 
 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
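
One way to keep two simultaneous flushers from waiting on each other forever, in the spirit of "let CPUs get back to a point where you can invalidate", is for a waiting CPU to keep servicing its own pending invalidates while it spins. The sketch below is only an illustration under that assumption: the ticket counters, struct vcpu, ROUNDS and all function names are hypothetical, and pthreads stand in for vCPU threads.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define ROUNDS 1000 /* hypothetical number of flushes each CPU requests */

struct vcpu {
    atomic_int requested; /* tickets handed out by other CPUs */
    atomic_int completed; /* highest ticket this CPU has serviced */
    struct vcpu *peer;
};

static struct vcpu cpu_a, cpu_b;
static atomic_int finished;

/* Service any invalidate owed by this CPU; only the owner calls this. */
static void service_own_requests(struct vcpu *self)
{
    int want = atomic_load(&self->requested);

    if (atomic_load(&self->completed) < want) {
        /* a real implementation would flush the TLB here */
        atomic_store(&self->completed, want);
    }
}

/* Ask the peer to flush and wait, but never stop servicing our own queue. */
static void flush_peer_and_wait(struct vcpu *self)
{
    int ticket = atomic_fetch_add(&self->peer->requested, 1) + 1;

    while (atomic_load(&self->peer->completed) < ticket) {
        service_own_requests(self); /* omit this and two waiters deadlock */
        sched_yield();
    }
}

static void *vcpu_thread(void *opaque)
{
    struct vcpu *self = opaque;

    for (int i = 0; i < ROUNDS; i++) {
        flush_peer_and_wait(self);
    }
    /* Keep answering until the peer has finished its own rounds. */
    atomic_fetch_add(&finished, 1);
    while (atomic_load(&finished) < 2) {
        service_own_requests(self);
        sched_yield();
    }
    return NULL;
}

/* Returns 1 if both CPUs completed every request made of them. */
int run_mutual_flush_demo(void)
{
    pthread_t ta, tb;

    cpu_a.peer = &cpu_b;
    cpu_b.peer = &cpu_a;
    pthread_create(&ta, NULL, vcpu_thread, &cpu_a);
    pthread_create(&tb, NULL, vcpu_thread, &cpu_b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return atomic_load(&cpu_a.completed) == ROUNDS &&
           atomic_load(&cpu_b.completed) == ROUNDS;
}
```

The same rule has to hold for every other place a CPU can block, such as a wait on the global mutex. That is the part that is hard to guarantee in a real emulator.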



Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Peter Maydell
On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:
 almost nobody except x86 does global flushes

All ARM TLB maintenance operations have both "this CPU only"
and "all TLBs in the Inner Shareable domain" [that's ARM-speak
for "every CPU core in the cluster"] variants (the latter
being the TLB *IS operations). Looking at Linux's
arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
most of the operations defined there use the IS variants.

-- PMM



Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton

 On 12 Feb 2015, at 16:01, Peter Maydell peter.mayd...@linaro.org wrote:
 
 On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:
 
 On 12.02.2015, at 15:35, Mark Burton mark.bur...@greensocs.com wrote:
 We are proposing to implement this by signalling all other CPU’s
 to exit (and requesting they flush before re-starting). In other
 words, this would happen asynchronously.
 
 For global flushes, give them a pointer payload along with the flush
 request and tell all cpus to increment it atomically. In your main
 thread, wait until *ptr == nKickedCpus.
 
 I bet this will not be the only situation where you want to
 do a "get all other CPUs to do $something and wait til they
 have done so" kind of operation, so some lightweight but generic
 infrastructure for doing that would not be a bad plan. (Similarly
 "get all other CPUs to stop, then I can do $something and let
 the others continue".)

We tried this - we ended up in knots.
We had 2 CPU’s trying to flush at about the same time, both waiting for the 
other.
We had CPU’s trying to get the global mutex to finish what they were doing, 
while being told to flush, 
We had CPU’s in the global mutex trying to do something that would cause a 
flush… etc
We had spaghetti with extra Bolognese sauce…

We eventually concluded, yes - in an infinite universe everything is possible, 
but if we could simply do this ‘asynchronously’ then our lives would be a LOT 
easier.
e.g.  - ask all CPU’s to “exit and do something” is easy -  wait for them to do 
that is a whole other problem…

Our question is - do we need this ‘sync’ (before the flush), or can we actually 
allow CPU’s to flush themselves asynchronously….

Cheers

Mark.
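
For reference, the asynchronous variant proposed in this message (flag the other CPUs, ask them to exit, and let each one flush before it restarts) might be modelled as below. This is a single-threaded toy, not QEMU code: struct vcpu, the two flags, tlb_entries and the function names are all hypothetical. It only shows where the window sits in which a remote CPU still holds stale entries after the requester has moved on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define N_CPUS 3 /* hypothetical CPU count */

struct vcpu {
    atomic_bool flush_pending; /* set by another CPU, cleared by the owner */
    atomic_bool exit_request;  /* asks the CPU to leave translated code */
    int tlb_entries;           /* stand-in for cached translations */
};

static struct vcpu cpus[N_CPUS];

/* Requester: flag everyone else and return at once, without waiting. */
static void request_flush_async(int self)
{
    for (int i = 0; i < N_CPUS; i++) {
        if (i != self) {
            atomic_store(&cpus[i].flush_pending, true);
            atomic_store(&cpus[i].exit_request, true);
        }
    }
}

/* One pass of a CPU's execution loop: honour a pending flush, then run. */
static void vcpu_run_once(struct vcpu *cpu)
{
    atomic_store(&cpu->exit_request, false);
    if (atomic_exchange(&cpu->flush_pending, false)) {
        cpu->tlb_entries = 0; /* flush before restarting guest code */
    }
    cpu->tlb_entries++; /* running guest code refills the TLB */
}

/* Returns the stale entry count CPU 1 still holds right after the request. */
int run_async_demo(void)
{
    for (int i = 0; i < N_CPUS; i++) {
        cpus[i].tlb_entries = 0;
        vcpu_run_once(&cpus[i]);
        vcpu_run_once(&cpus[i]); /* each CPU now caches 2 entries */
    }
    request_flush_async(0);
    int stale = cpus[1].tlb_entries; /* still 2: nothing has flushed yet */
    vcpu_run_once(&cpus[1]);         /* flush happens here, then 1 refill */
    assert(cpus[1].tlb_entries == 1);
    return stale;
}
```

Whether the guest may observe that window is what the target architecture specification has to answer.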



 
 -- PMM






Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Alexander Graf


On 12.02.15 15:58, Peter Maydell wrote:
 On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:
 almost nobody except x86 does global flushes
 
 All ARM TLB maintenance operations have both "this CPU only"
 and "all TLBs in the Inner Shareable domain" [that's ARM-speak
 for "every CPU core in the cluster"] variants (the latter
 being the TLB *IS operations). Looking at Linux's
 arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
 most of the operations defined there use the IS variants.

Wow, did anyone benchmark this? I know that PPC switched away from
global flushes and instead tracks the CPUs a task was running on to
limit the scope of CPUs that need to flush.


Alex
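
The "track the CPUs a task was running on" idea can be illustrated with a per-address-space CPU mask, as in the sketch below. It makes no claim about how the PPC Linux code is actually written: struct task_mm, MAX_CPUS, flush_count and the function names are hypothetical, and incrementing a counter stands in for signalling a CPU to flush.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS 64 /* hypothetical limit so one 64-bit word holds the mask */

struct task_mm {
    uint64_t cpu_mask; /* bit n set: the task has run on CPU n */
};

static int flush_count[MAX_CPUS]; /* per-CPU count of flushes received */

/* Called whenever the task is scheduled onto a CPU. */
static void mm_note_ran_on(struct task_mm *mm, int cpu)
{
    mm->cpu_mask |= UINT64_C(1) << cpu;
}

/* Flush only the CPUs in the mask; returns how many were signalled. */
static int mm_flush_tlb(const struct task_mm *mm)
{
    int signalled = 0;

    for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
        if (mm->cpu_mask & (UINT64_C(1) << cpu)) {
            flush_count[cpu]++; /* stand-in for signalling that CPU */
            signalled++;
        }
    }
    return signalled;
}

/* A task that ran on CPUs 1 and 5 only disturbs those two CPUs. */
int run_cpumask_demo(void)
{
    struct task_mm mm = { 0 };

    mm_note_ran_on(&mm, 1);
    mm_note_ran_on(&mm, 5);
    mm_note_ran_on(&mm, 1); /* running there again changes nothing */
    return mm_flush_tlb(&mm);
}
```

The trade-off is that software has to maintain the mask and signal each target itself, instead of relying on a broadcast operation in hardware.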



Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton

 On 12 Feb 2015, at 16:38, Alexander Graf ag...@suse.de wrote:
 
 
 
 On 12.02.15 15:58, Peter Maydell wrote:
 On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:
 almost nobody except x86 does global flushes
 
 All ARM TLB maintenance operations have both "this CPU only"
 and "all TLBs in the Inner Shareable domain" [that's ARM-speak
 for "every CPU core in the cluster"] variants (the latter
 being the TLB *IS operations). Looking at Linux's
 arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
 most of the operations defined there use the IS variants.
 
 Wow, did anyone benchmark this? I know that PPC switched away from
 global flushes and instead tracks the CPUs a task was running on to
 limit the scope of CPUs that need to flush.

Doesn’t that mean you have to signal a specific CPU to cause it to flush 
itself? Isn’t that in itself expensive? Do you have to organise some sort of 
atomicity yourself around that too?

Cheers

Mark.



 
 
 Alex

