Re: [Qemu-devel] Help on TLB Flush
On 12/02/2015 22:57, Peter Maydell wrote:
> The only requirement is that if the CPU that did the TLB maintenance op executes a DMB (barrier) then the TLB op must finish before the barrier completes execution. So you could split the "kick off TLB invalidate" and "make sure all CPUs are done" phases if you wanted. [cf v8 ARM ARM rev A.e section D4.7.2 and in particular the subsection on ordering and completion.]

You can just make DMB start a new translation block. Then when the TLB flush helpers call cpu_exit() or cpu_interrupt() the flush request is serviced.

Paolo
Re: [Qemu-devel] Help on TLB Flush
On 13/02/2015 10:37, Mark Burton wrote:
> the memory barrier is on the cpu requesting the flush isn’t it (not on the CPU that is being flushed)?

Oops, I misread Peter's explanation.

In that case, perhaps DMB can be treated in a similar way as WFI, using cpu->halted. Queueing work on other CPUs can be done with async_run_on_cpu, which exits the idle loop in qemu_tcg_wait_io_event (this avoids the deadlocks). Checking that other CPUs have flushed the TLBs can be done in cpu_has_work (always return false if cpu->halted == true && there are outstanding TLB requests).

Paolo
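A minimal sketch of the rule Paolo describes, using a deliberately simplified stand-in for the vCPU state. Everything named `model_*` is hypothetical and is not QEMU's `CPUState` or its real `cpu_has_work`. The `waiting_on_barrier` flag is an assumption added here so that a CPU parked for the barrier can be told apart from one parked for WFI.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified vCPU state; not QEMU's CPUState. */
struct model_cpu {
    bool halted;              /* parked, as it would be for WFI */
    bool waiting_on_barrier;  /* assumption: parked because of the barrier */
    bool irq_pending;         /* an ordinary wake-up reason */
    int  outstanding_flushes; /* remote TLB flushes not yet acknowledged */
};

/* Guest executed the barrier: park only if flushes are still in flight. */
static void model_cpu_enter_barrier(struct model_cpu *cpu)
{
    if (cpu->outstanding_flushes > 0) {
        cpu->halted = true;
        cpu->waiting_on_barrier = true;
    }
}

/* A remote CPU reports that it has flushed its TLB. */
static void model_cpu_flush_acked(struct model_cpu *cpu)
{
    assert(cpu->outstanding_flushes > 0);
    cpu->outstanding_flushes--;
}

/* May this CPU leave the idle loop? */
static bool model_cpu_has_work(const struct model_cpu *cpu)
{
    if (!cpu->halted) {
        return true;
    }
    if (cpu->outstanding_flushes > 0) {
        return false; /* halted with outstanding TLB requests: stay parked */
    }
    return cpu->waiting_on_barrier || cpu->irq_pending;
}
```

With this shape, an interrupt arriving while flushes are outstanding does not wake the CPU early. Whether that is the right trade-off for a real target is not something the sketch settles.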
Re: [Qemu-devel] Help on TLB Flush
the memory barrier is on the cpu requesting the flush isn’t it (not on the CPU that is being flushed)?

Cheers
Mark.

On 13 Feb 2015, at 10:34, Paolo Bonzini pbonz...@redhat.com wrote:
> On 12/02/2015 22:57, Peter Maydell wrote:
>> The only requirement is that if the CPU that did the TLB maintenance op executes a DMB (barrier) then the TLB op must finish before the barrier completes execution. So you could split the "kick off TLB invalidate" and "make sure all CPUs are done" phases if you wanted. [cf v8 ARM ARM rev A.e section D4.7.2 and in particular the subsection on ordering and completion.]
> You can just make DMB start a new translation block. Then when the TLB flush helpers call cpu_exit() or cpu_interrupt() the flush request is serviced.
>
> Paolo
Re: [Qemu-devel] Help on TLB Flush
Mark Burton writes:
> On 13 Feb 2015, at 08:24, Peter Maydell peter.mayd...@linaro.org wrote:
>> On 13 February 2015 at 07:16, Mark Burton mark.bur...@greensocs.com wrote:
>>> If the kernel is doing this - then effectively - for X86, each CPU only flushes its own TLB (from the perspective of Qemu) - correct? (in which case, for Qemu itself - for x86) - we don’t need to implement a global flush, and hence we don’t need to build the mechanism to sync?
>> The semantics you need are "flush the QEMU TLB for CPU X" (where X may not be the CPU you're running on). This is what tlb_flush() does: it takes a CPU argument to act on. (Ditto tlb_flush_page, etc.) We then use that to implement the target's required semantics (eg in ARM the tlbiall_is_write() function is handled by iterating through all CPUs and calling tlb_flush on them).
> What Lluis implied seemed to be that the kernel arranged to signal the CPU that would flush. Hence, (for X86), we would only ever flush our own TLB.

That's correct.

[...]
> For our immediate concern, in the interests of getting the thing working and making sure we’ve turned over all the stones, on ARM - it MAY help us to check that the flush has happened ‘in the next memory barrier’…. - I don’t know if that will help us or not, and - even if it does, I agree with you, it would be more messy than it need be. However, in the interests of making sure that there are no other issues - we may ‘hack’ something before we put in place a more elegant solution…. (right now, we have some mutex issues, shifting the sync to the barrier MAY help us avoid that…. To Be Seen…. and anyway - it would only be a temporary fix).

But you shouldn't assume that everyone either uses x86's semantics (aka, each CPU gets an IPI), or the ARM semantics you described where the global TLB flush instruction has asynchronous effects.

First, in ARM you still have to ensure other CPUs did what you asked them to (whenever the arch manual says you must do so).

Second, it seems like ARM does not always behave in the way you described:

  http://lxr.free-electrons.com/source/arch/arm/kernel/smp.c?v=2.6.32#L630

Granted, this is just the same behaviour as x86, but no one guarantees you that some other operation in any of the multiple architectures supported by QEMU will never need a synchronous instruction with global effects.

I understand the pressure of getting something running and working from that, but I think that having a framework for asynchronous cross-CPU messaging would be rather useful in the future. That can then be complemented with a mechanism to wait for these asynchronous messages. You can achieve any desired behaviour by composing these two.

Cheers,
  Lluis

--
"And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
Re: [Qemu-devel] Help on TLB Flush
Agreed

Cheers
Mark.

On 13 Feb 2015, at 14:30, Lluís Vilanova vilan...@ac.upc.edu wrote:
> [...]
> I understand the pressure of getting something running and work from that, but I think that having a framework for asynchronous cross-CPU messaging would be rather useful in the future. That can be then complemented with a mechanism to wait for these asynchronous messages. You can achieve any desired behaviour by composing these two.
Re: [Qemu-devel] Help on TLB Flush
On 12 Feb 2015, at 16:31, Dr. David Alan Gilbert dgilb...@redhat.com wrote:
> * Mark Burton (mark.bur...@greensocs.com) wrote:
>> [...]
>> We tried this - we ended up in knots. We had 2 CPUs trying to flush at about the same time, both waiting for the other. We had CPUs trying to get the global mutex to finish what they were doing, while being told to flush. We had CPUs in the global mutex trying to do something that would cause a flush… etc. We had spaghetti with extra Bolognese sauce…
> This is the hard problem of multithreaded emulation. You've always got to let CPUs get back to a point where you can invalidate a mapping/page quickly. Thus you've also got to be very careful about where any CPU might get into a loop or take another lock that would stop another CPU causing an invalidate. Either that or you need a way of somehow breaking locks or recovering from the situation.

Indeed - for now - we’re building something which will likely be less than ideal. Once we have some sort of evidence that it works, and (hopefully) more reliably than the approach we have right now, then we come up with a more elegant scheme.

>> We eventually concluded, yes - in an infinite universe everything is possible, but if we could simply do this ‘asynchronously’ then our lives would be a LOT easier. e.g. - ask all CPUs to “exit and do something” is easy - wait for them to do that is a whole other problem…
> Which is why you've got to bound how long it might take those CPUs to get back to you, and optimise out cases where it's not really needed later.
>> Our question is - do we need this ‘sync’ (before the flush), or can we actually allow CPUs to flush themselves asynchronously….
> Always assume the worst.

:-)

Cheers
Mark.
Re: [Qemu-devel] Help on TLB Flush
On 12 February 2015 at 15:19, Alexander Graf ag...@suse.de wrote:
> On 12.02.15 16:08, Mark Burton wrote:
>> Our question is - do we need this ‘sync’ (before the flush), or can we actually allow CPUs to flush themselves asynchronously….
> The respective target architecture specs will tell you. And I very much doubt that it is ok in most cases.

For ARM note that TLB maintenance operations do not have to complete synchronously. They can be reordered relative to other TLB maintenance ops or to loads or stores (by this CPU or by other CPUs if this is a global invalidate). The only requirement is that if the CPU that did the TLB maintenance op executes a DMB (barrier) then the TLB op must finish before the barrier completes execution. So you could split the "kick off TLB invalidate" and "make sure all CPUs are done" phases if you wanted. [cf v8 ARM ARM rev A.e section D4.7.2 and in particular the subsection on ordering and completion.]

This only applies to ARM guests, of course. (Other CPU architectures are available. :-))

-- PMM
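The two-phase split can be sketched roughly as follows. This is a model under stated assumptions, not QEMU code: all `model_*` names are hypothetical, the TLB is reduced to a generation counter, and the CPU count is limited by the width of the `pending_from` bitmask. The message above says DMB. As far as I recall, the ARM ARM words the completion guarantee in terms of DSB, so the sketch just says "barrier" and leaves the exact instruction open.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4 /* hypothetical */

struct model_cpu {
    atomic_uint pending_from;   /* bit r set: CPU r asked us to flush */
    atomic_int  in_flight;      /* remote flushes we requested, not yet done */
    int         tlb_generation; /* bumped each time this CPU flushes */
};

static void model_cpus_init(struct model_cpu *cpus)
{
    for (int i = 0; i < MODEL_NR_CPUS; i++) {
        atomic_init(&cpus[i].pending_from, 0u);
        atomic_init(&cpus[i].in_flight, 0);
        cpus[i].tlb_generation = 0;
    }
}

/* Phase 1: flush locally, kick every other CPU, return without waiting. */
static void model_tlbi_broadcast(struct model_cpu *cpus, int self)
{
    unsigned bit = 1u << self;

    cpus[self].tlb_generation++;
    for (int i = 0; i < MODEL_NR_CPUS; i++) {
        if (i == self) {
            continue;
        }
        unsigned old = atomic_fetch_or(&cpus[i].pending_from, bit);
        if (!(old & bit)) {
            /* Count a target once, even if we kick it twice. */
            atomic_fetch_add(&cpus[self].in_flight, 1);
        }
    }
}

/* Run by a kicked CPU at a safe point, e.g. between translation blocks. */
static void model_cpu_service_flush(struct model_cpu *cpus, int self)
{
    unsigned from = atomic_exchange(&cpus[self].pending_from, 0u);

    if (from == 0) {
        return;
    }
    cpus[self].tlb_generation++;
    for (int r = 0; r < MODEL_NR_CPUS; r++) {
        if (from & (1u << r)) {
            atomic_fetch_sub(&cpus[r].in_flight, 1);
        }
    }
}

/* Phase 2: the barrier may only complete once nothing is in flight. */
static bool model_barrier_may_complete(struct model_cpu *cpus, int self)
{
    return atomic_load(&cpus[self].in_flight) == 0;
}
```

A requester that finds the barrier cannot complete yet would have to wait or be parked. How that waiting is done without deadlocking against the global mutex is exactly the open question in this thread, and the sketch does not answer it.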
Re: [Qemu-devel] Help on TLB Flush
Mark Burton writes:
> On 12 Feb 2015, at 16:38, Alexander Graf ag...@suse.de wrote:
>> On 12.02.15 15:58, Peter Maydell wrote:
>>> On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:
>>>> almost nobody except x86 does global flushes
>>> All ARM TLB maintenance operations have both "this CPU only" and "all TLBs in the Inner Shareable domain" [that's ARM-speak for "every CPU core in the cluster"] variants (the latter being the TLB *IS operations). Looking at Linux's arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h most of the operations defined there use the IS variants.
>> Wow, did anyone benchmark this? I know that PPC switched away from global flushes and instead tracks the CPUs a task was running on to limit the scope of CPUs that need to flush.
> Doesn’t that mean you have to signal a specific CPU to cause it to flush itself…. Isn’t that in itself expensive? Do you have to organise some sort of atomicity yourself around that too?

Yup. AFAIR, Linux in x86-64 queues a request to a per-CPU request list, and uses IPIs to signal these types of operations to the target CPU:

  http://lxr.free-electrons.com/source/kernel/smp.c?v=2.6.32#L386

Waiting for completion is implemented on top by incrementing some counter from each CPU, and waiting for it to have the correct final value.

If something were implemented on these lines, it could be used as a generic cross-CPU event messaging infrastructure (plus some interrupt bit in the CPU structure that TCG would check to break away from guest code; I believe something similar is already being used - icount? -).

PS: To be honest, I still don't know which TLBs we're talking about here, and which cases trigger these TLB flush operations.

Cheers,
  Lluis
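A rough sketch of the per-CPU request list plus "interrupt bit" idea, assuming a fixed-size ring and a single thread of control. The `model_*` names are hypothetical and nothing here is taken from the kernel or from QEMU. A real version would need a lock or a lock-free queue around the ring, which is left out to keep the shape visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_QUEUE_LEN 8 /* hypothetical; holds at most LEN - 1 items */

typedef void (*work_fn)(void *opaque);

struct work_item {
    work_fn fn;
    void *opaque;
};

struct model_cpu {
    struct work_item queue[MODEL_QUEUE_LEN];
    size_t head, tail;  /* ring indices; head == tail means empty */
    bool exit_request;  /* the bit the execution loop would check */
};

/* Queue work for a CPU and raise its exit bit (stands in for the IPI). */
static bool model_queue_work(struct model_cpu *cpu, work_fn fn, void *opaque)
{
    size_t next = (cpu->tail + 1) % MODEL_QUEUE_LEN;

    if (next == cpu->head) {
        return false; /* ring is full */
    }
    cpu->queue[cpu->tail].fn = fn;
    cpu->queue[cpu->tail].opaque = opaque;
    cpu->tail = next;
    cpu->exit_request = true;
    return true;
}

/* Run by the target CPU once it has left guest code; returns items run. */
static size_t model_cpu_drain(struct model_cpu *cpu)
{
    size_t ran = 0;

    cpu->exit_request = false;
    while (cpu->head != cpu->tail) {
        struct work_item w = cpu->queue[cpu->head];

        cpu->head = (cpu->head + 1) % MODEL_QUEUE_LEN;
        w.fn(w.opaque);
        ran++;
    }
    return ran;
}

/* Example work item: bump a counter, as a completion ack would. */
static void model_count_up(void *opaque)
{
    (*(int *)opaque)++;
}
```

Passing a shared counter as `opaque` gives the "wait for the correct final value" completion scheme on top of the same queue.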
Re: [Qemu-devel] Help on TLB Flush
On 12 February 2015 at 15:38, Alexander Graf ag...@suse.de wrote:
> On 12.02.15 15:58, Peter Maydell wrote:
>> All ARM TLB maintenance operations have both "this CPU only" and "all TLBs in the Inner Shareable domain" [that's ARM-speak for "every CPU core in the cluster"] variants (the latter being the TLB *IS operations). Looking at Linux's arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h most of the operations defined there use the IS variants.
> Wow, did anyone benchmark this? I know that PPC switched away from global flushes and instead tracks the CPUs a task was running on to limit the scope of CPUs that need to flush.

That would be a valid implementation. The CPU has to behave as the spec says it must, but there's no reason you couldn't implement "flush by ASID for all TLBs" via some implementation specific tracking of ASID use per CPU to limit which cores you sent the flush request to, if you thought that was a better way to do it.

-- PMM
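One way such per-CPU ASID tracking could look, as a sketch only. The names, the 256-entry ASID space and the 32-CPU bitmask are all assumptions made for illustration. It also assumes `asid_note_use` is called on every TLB fill, so that clearing the mask on a flush stays correct when a CPU refills right afterwards.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_ASIDS 256 /* hypothetical ASID space */

/* users[asid] has bit i set if CPU i may hold TLB entries for that ASID. */
struct asid_tracker {
    uint32_t users[MODEL_NR_ASIDS];
};

/* Record that a CPU filled a TLB entry tagged with this ASID. */
static void asid_note_use(struct asid_tracker *t, unsigned asid, unsigned cpu)
{
    t->users[asid] |= UINT32_C(1) << cpu;
}

/*
 * Return the set of CPUs that need a flush request for this ASID and
 * forget them: after they flush, none of them holds entries for it.
 */
static uint32_t asid_flush_targets(struct asid_tracker *t, unsigned asid)
{
    uint32_t targets = t->users[asid];

    t->users[asid] = 0;
    return targets;
}
```

CPUs whose bit is clear never receive the request, which is where the saving over a broadcast would come from. Whether the bookkeeping on every fill pays for itself is a benchmarking question the thread leaves open.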
Re: [Qemu-devel] Help on TLB Flush
Up top - thanks Peter, I think you may have given us an idea!

On 12 Feb 2015, at 23:10, Lluís Vilanova vilan...@ac.upc.edu wrote:
> Mark Burton writes:
>> On 12 Feb 2015, at 16:38, Alexander Graf ag...@suse.de wrote:
>>> On 12.02.15 15:58, Peter Maydell wrote:
>>>> On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:
>>>>> almost nobody except x86 does global flushes
>>>> All ARM TLB maintenance operations have both "this CPU only" and "all TLBs in the Inner Shareable domain" [that's ARM-speak for "every CPU core in the cluster"] variants (the latter being the TLB *IS operations). Looking at Linux's arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h most of the operations defined there use the IS variants.
>>> Wow, did anyone benchmark this? I know that PPC switched away from global flushes and instead tracks the CPUs a task was running on to limit the scope of CPUs that need to flush.
>> Doesn’t that mean you have to signal a specific CPU to cause it to flush itself…. Isn’t that in itself expensive? Do you have to organise some sort of atomicity yourself around that too?
> Yup. AFAIR, Linux in x86-64 queues a request to a per-CPU request list, and uses IPIs to signal these types of operations to the target CPU:
>
>   http://lxr.free-electrons.com/source/kernel/smp.c?v=2.6.32#L386
>
> Waiting for completion is implemented on top by incrementing some counter from each CPU, and waiting for it to have the correct final value.

If the kernel is doing this - then effectively - for X86, each CPU only flushes its own TLB (from the perspective of Qemu) - correct? (in which case, for Qemu itself - for x86) - we don’t need to implement a global flush, and hence we don’t need to build the mechanism to sync?

If I understand correctly then - the processor that causes some pain is the ARM that has (and uses) global flush, but the mitigating factor is that those flushes can be asynchronous so long as they complete before a memory barrier….

Cheers
Mark.

> If something were implemented on these lines, it could be used as a generic cross-CPU event messaging infrastructure (plus some interrupt bit in the CPU structure that TCG would check to break away from guest code; I believe something similar is already being used - icount? -).
>
> PS: To be honest, I still don't know which TLBs we're talking about here, and which cases trigger these TLB flush operations.
>
> Cheers,
>   Lluis
Re: [Qemu-devel] Help on TLB Flush
On 13 Feb 2015, at 08:24, Peter Maydell peter.mayd...@linaro.org wrote:
> On 13 February 2015 at 07:16, Mark Burton mark.bur...@greensocs.com wrote:
>> If the kernel is doing this - then effectively - for X86, each CPU only flushes its own TLB (from the perspective of Qemu) - correct? (in which case, for Qemu itself - for x86) - we don’t need to implement a global flush, and hence we don’t need to build the mechanism to sync?
> The semantics you need are "flush the QEMU TLB for CPU X" (where X may not be the CPU you're running on). This is what tlb_flush() does: it takes a CPU argument to act on. (Ditto tlb_flush_page, etc.) We then use that to implement the target's required semantics (eg in ARM the tlbiall_is_write() function is handled by iterating through all CPUs and calling tlb_flush on them).

What Lluis implied seemed to be that the kernel arranged to signal the CPU that would flush. Hence, (for X86), we would only ever flush our own TLB.

> If you don't want the pain of checking the semantics of every backend and figuring out a new set of primitives to implement, then what you need to do is continue to provide the guarantees the current tlb_flush function does: when it returns then the CPU it's supposed to have acted on has definitely done so. You can try and be cleverer if you want to, but personally I would recommend keeping the scope of your work simple where you can.

yes - though keeping it simple (silly) seems to have some complexities in this case, which is why we are trying to reduce the guarantees that tlb_flush() provides. At present - the ‘foreach cpu, tlb_flush()’ is effectively atomic, as no other CPU will be executing at the same time. Adding multi-threading, we can already say - this ‘atomicity’ isn’t strictly required. As you say, the only thing tlb_flush needs to guarantee is that the CPU concerned has flushed - that already helps. And I agree with you that this is the right place to take tlb_flush().

Of course, when only the current CPU is flushed things are much simpler (and already handled)...

For our immediate concern, in the interests of getting the thing working and making sure we’ve turned over all the stones, on ARM - it MAY help us to check that the flush has happened ‘in the next memory barrier’…. - I don’t know if that will help us or not, and - even if it does, I agree with you, it would be more messy than it need be. However, in the interests of making sure that there are no other issues - we may ‘hack’ something before we put in place a more elegant solution…. (right now, we have some mutex issues, shifting the sync to the barrier MAY help us avoid that…. To Be Seen…. and anyway - it would only be a temporary fix).

Cheers
Mark.
Re: [Qemu-devel] Help on TLB Flush
On 13 February 2015 at 07:16, Mark Burton mark.bur...@greensocs.com wrote:
> If the kernel is doing this - then effectively - for X86, each CPU only flushes its own TLB (from the perspective of Qemu) - correct? (in which case, for Qemu itself - for x86) - we don’t need to implement a global flush, and hence we don’t need to build the mechanism to sync?

The semantics you need are "flush the QEMU TLB for CPU X" (where X may not be the CPU you're running on). This is what tlb_flush() does: it takes a CPU argument to act on. (Ditto tlb_flush_page, etc.) We then use that to implement the target's required semantics (eg in ARM the tlbiall_is_write() function is handled by iterating through all CPUs and calling tlb_flush on them).

If you don't want the pain of checking the semantics of every backend and figuring out a new set of primitives to implement, then what you need to do is continue to provide the guarantees the current tlb_flush function does: when it returns then the CPU it's supposed to have acted on has definitely done so. You can try and be cleverer if you want to, but personally I would recommend keeping the scope of your work simple where you can.

-- PMM
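To make the "per-CPU primitive, composed by the target" shape concrete, here is a toy model. It is not QEMU's `tlb_flush()` or its TLB layout: the `model_*` names, the entry format and the sizes are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_TLB_SIZE 8 /* hypothetical, tiny on purpose */
#define MODEL_NR_CPUS  4 /* hypothetical */

struct model_tlb_entry {
    bool valid;
    unsigned long vaddr;
    unsigned long paddr;
};

struct model_cpu {
    struct model_tlb_entry tlb[MODEL_TLB_SIZE];
};

/* Flush the software TLB of one specific CPU; on return it is done. */
static void model_tlb_flush(struct model_cpu *cpu)
{
    memset(cpu->tlb, 0, sizeof(cpu->tlb)); /* valid becomes false */
}

/* A "global" flush expressed purely in terms of the per-CPU primitive. */
static void model_tlb_flush_all(struct model_cpu *cpus, size_t nr_cpus)
{
    for (size_t i = 0; i < nr_cpus; i++) {
        model_tlb_flush(&cpus[i]);
    }
}

/* Helper for checking: number of valid entries on one CPU. */
static size_t model_tlb_valid_count(const struct model_cpu *cpu)
{
    size_t n = 0;

    for (size_t i = 0; i < MODEL_TLB_SIZE; i++) {
        n += cpu->tlb[i].valid ? 1 : 0;
    }
    return n;
}
```

The guarantee discussed above is visible in the signature: `model_tlb_flush` has finished by the time it returns. That only holds trivially here because nothing else runs concurrently, and multi-threaded execution is what takes that away.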
Re: [Qemu-devel] Help on TLB Flush
On 12.02.2015, at 15:35, Mark Burton mark.bur...@greensocs.com wrote:
> TLB Flush:
>
> We have spent a few days on this issue, and still haven’t resolved the best path. Our solution seems to work, most of the time, but we still have some strange issues - so I want to check that what we are proposing has a chance of working.
>
> Our plan is to allow all CPUs to continue. Potentially one CPU will want to write to the TLBs. Subsequent to the write, it requests a TLB Flush.

Local or global? For local TLB flushes you don't notify the other CPUs at all. For global ones, the semantics of the call usually dictate atomicity.

> We are proposing to implement this by signalling all other CPUs to exit (and requesting they flush before re-starting). In other words, this would happen asynchronously.

For global flushes, give them a pointer payload along with the flush request and tell all cpus to increment it atomically. In your main thread, wait until *ptr == nKickedCpus.

FWIW TLBs are always CPU local. When there's a global TLB flush instruction, it pretty much does stall the CPU, notifies the others to also flush their TLBs, waits and then continues.

If this really does become a performance bottleneck (which I doubt it does, almost nobody except x86 does global flushes), you can also do some nasty hacky tricks, such as (atomically) change the valid bit in remote CPUs' TLB entries. But really only do this as a last resort if the clean version doesn't perform well.

Alex

> This means - there is a theoretical period of time when one CPU is writing to the TLBs while other CPUs are executing. Our belief is that this has to be handled by software anyway, and this should not be an issue from Qemu’s point of view.
>
> The alternative would be to force all other CPUs to exit before writing the TLBs - this is both expensive and very painful to organise (as we get into horrid deadlocks whichever way we turn)…
>
> We’d appreciate some thoughts on this...
>
> Cheers
> Mark.
[Qemu-devel] Help on TLB Flush
TLB Flush:

We have spent a few days on this issue, and still haven’t resolved the best path. Our solution seems to work, most of the time, but we still have some strange issues - so I want to check that what we are proposing has a chance of working.

Our plan is to allow all CPUs to continue. Potentially one CPU will want to write to the TLBs. Subsequent to the write, it requests a TLB Flush. We are proposing to implement this by signalling all other CPUs to exit (and requesting they flush before re-starting). In other words, this would happen asynchronously.

This means - there is a theoretical period of time when one CPU is writing to the TLBs while other CPUs are executing. Our belief is that this has to be handled by software anyway, and this should not be an issue from Qemu’s point of view.

The alternative would be to force all other CPUs to exit before writing the TLBs - this is both expensive and very painful to organise (as we get into horrid deadlocks whichever way we turn)…

We’d appreciate some thoughts on this...

Cheers
Mark.
Re: [Qemu-devel] Help on TLB Flush
On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:
> On 12.02.2015, at 15:35, Mark Burton mark.bur...@greensocs.com wrote:
>> We are proposing to implement this by signalling all other CPUs to exit (and requesting they flush before re-starting). In other words, this would happen asynchronously.
> For global flushes, give them a pointer payload along with the flush request and tell all cpus to increment it atomically. In your main thread, wait until *ptr == nKickedCpus.

I bet this will not be the only situation where you want to do a "get all other CPUs to do $something and wait til they have done so" kind of operation, so some lightweight but generic infrastructure for doing that would not be a bad plan. (Similarly "get all other CPUs to stop, then I can do $something and let the others continue".)

-- PMM
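The counter scheme Alex describes might look roughly like this with C11 atomics. The struct and function names are hypothetical, and the sketch only shows the bookkeeping. It says nothing about how the requester should wait, which matters a great deal once two CPUs request a flush at the same time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request object shared by the requester and all kicked CPUs. */
struct flush_request {
    atomic_int done; /* incremented once by every CPU that has flushed */
    int nr_kicked;   /* how many CPUs were asked to flush */
};

static void flush_request_init(struct flush_request *req, int nr_kicked)
{
    atomic_init(&req->done, 0);
    req->nr_kicked = nr_kicked;
}

/* Run by each kicked CPU after it has flushed its own TLB. */
static void flush_request_ack(struct flush_request *req)
{
    /* Release: the flush is visible before the count goes up. */
    atomic_fetch_add_explicit(&req->done, 1, memory_order_release);
}

/* Polled by the requester; true once every kicked CPU has acknowledged. */
static bool flush_request_complete(struct flush_request *req)
{
    return atomic_load_explicit(&req->done, memory_order_acquire)
           == req->nr_kicked;
}
```

Because the request object carries no TLB-specific state, the same three functions would serve any "ask everyone to do $something and wait" operation.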
Re: [Qemu-devel] Help on TLB Flush
OK - Alex - your implication is that it has to be atomic, we need the sync… :-(

I have a horrid feeling that the atomicity of global flush can’t be causing the (almost, but not quite reproducible) errors we’re seeing - but… anyway ;-)

Cheers
Mark.

On 12 Feb 2015, at 15:45, Alexander Graf ag...@suse.de wrote:
> [...]
> Local or global? For local TLB flushes you don't notify the other CPUs at all. For global ones, the semantics of the call usually dictate atomicity.
> [...]
> For global flushes, give them a pointer payload along with the flush request and tell all cpus to increment it atomically. In your main thread, wait until *ptr == nKickedCpus.
> [...]
Re: [Qemu-devel] Help on TLB Flush
On 12.02.15 16:08, Mark Burton wrote:
> On 12 Feb 2015, at 16:01, Peter Maydell peter.mayd...@linaro.org wrote:
>> On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote:
>>> On 12.02.2015, at 15:35, Mark Burton mark.bur...@greensocs.com wrote:
>>>> We are proposing to implement this by signalling all other CPUs to exit (and requesting they flush before re-starting). In other words, this would happen asynchronously.
>>> For global flushes, give them a pointer payload along with the flush request and tell all cpus to increment it atomically. In your main thread, wait until *ptr == nKickedCpus.
>> I bet this will not be the only situation where you want to do a "get all other CPUs to do $something and wait til they have done so" kind of operation, so some lightweight but generic infrastructure for doing that would not be a bad plan. (Similarly "get all other CPUs to stop, then I can do $something and let the others continue".)
> We tried this - we ended up in knots. We had 2 CPUs trying to flush at about the same time, both waiting for the other. We had CPUs trying to get the global mutex to finish what they were doing, while being told to flush. We had CPUs in the global mutex trying to do something that would cause a flush… etc. We had spaghetti with extra Bolognese sauce…
>
> We eventually concluded, yes - in an infinite universe everything is possible, but if we could simply do this ‘asynchronously’ then our lives would be a LOT easier. e.g. - ask all CPUs to “exit and do something” is easy - wait for them to do that is a whole other problem…
>
> Our question is - do we need this ‘sync’ (before the flush), or can we actually allow CPUs to flush themselves asynchronously….

The respective target architecture specs will tell you. And I very much doubt that it is ok in most cases.

Alex
Re: [Qemu-devel] Help on TLB Flush
* Mark Burton (mark.bur...@greensocs.com) wrote: On 12 Feb 2015, at 16:01, Peter Maydell peter.mayd...@linaro.org wrote: On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote: On 12.02.2015, at 15:35, Mark Burton mark.bur...@greensocs.com wrote: We are proposing to implement this by signalling all other CPU’s to exit (and requesting they flush before re-starting). In other words, this would happen asynchronously. For global flushes, give them a pointer payload along with the flush request and tell all cpus to increment it atomically. In your main thread, wait until *ptr == nKickedCpus. I bet this will not be the only situation where you want to do a “get all other CPUs to do $something and wait til they have done so” kind of operation, so some lightweight but generic infrastructure for doing that would not be a bad plan. (Similarly “get all other CPUs to stop, then I can do $something and let the others continue”.) We tried this - we ended up in knots. We had 2 CPU’s trying to flush at about the same time, both waiting for the other. We had CPU’s trying to get the global mutex to finish what they were doing, while being told to flush, We had CPU’s in the global mutex trying to do something that would cause a flush… etc We had spaghetti with extra Bolognese sauce… This is the hard problem of multithreaded emulation. You've always got to let CPUs get back to a point where you can invalidate a mapping/page quickly. Thus you've also got to be very careful about where any CPU might get into a loop or take another lock that would stop another CPU causing an invalidate. Either that or you need a way of somehow breaking locks or recovering from the situation. We eventually concluded, yes - in an infinite universe everything is possible, but if we could simply do this ‘asynchronously’ then our lives would be a LOT easier. e.g. - ask all CPU’s to “exit and do something” is easy - wait for them to do that is a whole other problem…
Which is why you've got to bound how long it might take those CPUs to get back to you, and optimise out cases where it's not really needed later. Our question is - do we need this ‘sync’ (before the flush), or can we actually allow CPU’s to flush themselves asynchronously…. Always assume the worst. Dave Cheers Mark. -- PMM -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
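[Editor's sketch] The "pointer payload plus atomic increment" idea quoted in this message can be illustrated roughly as follows. This is a minimal sketch, not QEMU code: VCpu, FlushRequest, vcpu_service_flush and global_flush_and_wait are hypothetical names, each kicked vCPU is modelled as a freshly created pthread, and the "flush" is only a counter. It also assumes there is a single requester at a time, so it does not address the two-CPUs-flushing-at-once deadlock described in the thread.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define NUM_VCPUS 4              /* hypothetical vCPU count */

/* Payload handed to every kicked vCPU (hypothetical type). */
typedef struct FlushRequest {
    atomic_int acks;             /* incremented once per vCPU that flushed */
} FlushRequest;

typedef struct VCpu {
    int index;
    atomic_int tlb_flushes;      /* stand-in for a real TLB flush */
    FlushRequest *req;
} VCpu;

/* Runs on each kicked vCPU: flush first, then acknowledge. */
static void *vcpu_service_flush(void *opaque)
{
    VCpu *cpu = opaque;
    atomic_fetch_add(&cpu->tlb_flushes, 1);
    atomic_fetch_add(&cpu->req->acks, 1);
    return NULL;
}

/* Requester side: kick n vCPUs, then wait until *ptr == nKickedCpus.
 * Returns the number of acknowledgements seen. */
static int global_flush_and_wait(VCpu cpus[], int n)
{
    FlushRequest req;
    pthread_t threads[NUM_VCPUS];

    assert(n <= NUM_VCPUS);
    atomic_init(&req.acks, 0);

    for (int i = 0; i < n; i++) {
        cpus[i].req = &req;
        int rc = pthread_create(&threads[i], NULL, vcpu_service_flush, &cpus[i]);
        assert(rc == 0);
        (void)rc;
    }

    /* The wait itself; a real implementation would have to bound this. */
    while (atomic_load(&req.acks) != n) {
        sched_yield();
    }

    for (int i = 0; i < n; i++) {
        pthread_join(threads[i], NULL);
    }
    return atomic_load(&req.acks);
}
```

The busy-wait loop is the part that the thread identifies as hard: if a kicked vCPU is itself blocked waiting for the requester (for a lock, or for its own flush to be acknowledged), the loop never terminates.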
Re: [Qemu-devel] Help on TLB Flush
On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote: almost nobody except x86 does global flushes All ARM TLB maintenance operations have both this CPU only and all TLBs in the Inner Shareable domain [that's ARM-speak for every CPU core in the cluster] variants (the latter being the TLB *IS operations). Looking at Linux's arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h most of the operations defined there use the IS variants. -- PMM
Re: [Qemu-devel] Help on TLB Flush
On 12 Feb 2015, at 16:01, Peter Maydell peter.mayd...@linaro.org wrote: On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote: On 12.02.2015, at 15:35, Mark Burton mark.bur...@greensocs.com wrote: We are proposing to implement this by signalling all other CPU’s to exit (and requesting they flush before re-starting). In other words, this would happen asynchronously. For global flushes, give them a pointer payload along with the flush request and tell all cpus to increment it atomically. In your main thread, wait until *ptr == nKickedCpus. I bet this will not be the only situation where you want to do a “get all other CPUs to do $something and wait til they have done so” kind of operation, so some lightweight but generic infrastructure for doing that would not be a bad plan. (Similarly “get all other CPUs to stop, then I can do $something and let the others continue”.) We tried this - we ended up in knots. We had 2 CPU’s trying to flush at about the same time, both waiting for the other. We had CPU’s trying to get the global mutex to finish what they were doing, while being told to flush, We had CPU’s in the global mutex trying to do something that would cause a flush… etc We had spaghetti with extra Bolognese sauce… We eventually concluded, yes - in an infinite universe everything is possible, but if we could simply do this ‘asynchronously’ then our lives would be a LOT easier. e.g. - ask all CPU’s to “exit and do something” is easy - wait for them to do that is a whole other problem… Our question is - do we need this ‘sync’ (before the flush), or can we actually allow CPU’s to flush themselves asynchronously…. Cheers Mark. -- PMM
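[Editor's sketch] The asynchronous alternative asked about in this message ("exit and flush before re-starting", with no wait on the requester side) might look roughly like this. It is a minimal sketch under stated assumptions: VCpu, vcpu_init, request_flush_all_async and vcpu_run_one_block are hypothetical names, a "translation block" is just a counter increment, and the requester is assumed to flush its own TLB separately.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_VCPUS 4              /* hypothetical vCPU count */

typedef struct VCpu {
    atomic_bool flush_pending;   /* set by a remote requester */
    int tlb_flushes;             /* stand-in for a real TLB flush */
    int blocks_executed;         /* stand-in for running guest code */
} VCpu;

static void vcpu_init(VCpu *cpu)
{
    atomic_init(&cpu->flush_pending, false);
    cpu->tlb_flushes = 0;
    cpu->blocks_executed = 0;
}

/* Requester side: mark every other vCPU and return immediately.
 * Nothing here waits for the flushes to happen. */
static void request_flush_all_async(VCpu cpus[], int n, int requester)
{
    for (int i = 0; i < n; i++) {
        if (i != requester) {
            atomic_store(&cpus[i].flush_pending, true);
        }
    }
}

/* Target side: service a pending flush before running more guest code. */
static void vcpu_run_one_block(VCpu *cpu)
{
    if (atomic_exchange(&cpu->flush_pending, false)) {
        cpu->tlb_flushes++;
    }
    cpu->blocks_executed++;
}
```

Because no vCPU ever waits on another, two simultaneous requesters cannot deadlock here. What the sketch does not provide is any point at which the requester knows the flushes have completed, which is exactly the open question in the message.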
Re: [Qemu-devel] Help on TLB Flush
On 12.02.15 15:58, Peter Maydell wrote: On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote: almost nobody except x86 does global flushes All ARM TLB maintenance operations have both this CPU only and all TLBs in the Inner Shareable domain [that's ARM-speak for every CPU core in the cluster] variants (the latter being the TLB *IS operations). Looking at Linux's arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h most of the operations defined there use the IS variants. Wow, did anyone benchmark this? I know that PPC switched away from global flushes and instead tracks the CPUs a task was running on to limit the scope of CPUs that need to flush. Alex
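[Editor's sketch] The approach described here, tracking which CPUs a task has run on so that only those need to flush, can be illustrated with a simple bitmask. This is only an illustration of the idea, not a description of how Linux on PPC implements it: TaskMm, task_scheduled_on and flush_task_tlb are hypothetical names, and the "flush" is a per-CPU counter.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS 64              /* one bit per CPU in a 64-bit mask */

/* Per-task record of the CPUs that may hold its translations. */
typedef struct TaskMm {
    uint64_t cpus_ran_on;
} TaskMm;

/* Called whenever the task is scheduled onto a CPU. */
static void task_scheduled_on(TaskMm *mm, unsigned cpu)
{
    assert(cpu < MAX_CPUS);
    mm->cpus_ran_on |= UINT64_C(1) << cpu;
}

/* Flush only the CPUs recorded in the mask.
 * flush_count[cpu] stands in for a real per-CPU TLB flush.
 * Returns how many CPUs had to be signalled. */
static unsigned flush_task_tlb(const TaskMm *mm, unsigned ncpus,
                               unsigned flush_count[])
{
    unsigned kicked = 0;

    assert(ncpus <= MAX_CPUS);
    for (unsigned cpu = 0; cpu < ncpus; cpu++) {
        if ((mm->cpus_ran_on >> cpu) & 1) {
            flush_count[cpu]++;
            kicked++;
        }
    }
    return kicked;
}
```

For a task that has only ever run on CPUs 1 and 3 of a four-CPU system, flush_task_tlb signals two CPUs instead of four. Each signalled CPU still has to be interrupted individually.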
Re: [Qemu-devel] Help on TLB Flush
On 12 Feb 2015, at 16:38, Alexander Graf ag...@suse.de wrote: On 12.02.15 15:58, Peter Maydell wrote: On 12 February 2015 at 14:45, Alexander Graf ag...@suse.de wrote: almost nobody except x86 does global flushes All ARM TLB maintenance operations have both this CPU only and all TLBs in the Inner Shareable domain [that's ARM-speak for every CPU core in the cluster] variants (the latter being the TLB *IS operations). Looking at Linux's arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h most of the operations defined there use the IS variants. Wow, did anyone benchmark this? I know that PPC switched away from global flushes and instead tracks the CPUs a task was running on to limit the scope of CPUs that need to flush. Doesn’t that mean you have to signal a specific CPU to cause it to flush itself…. Isn’t that in itself expensive? Do you have to organise some sort of atomicity yourself around that too? Cheers Mark. Alex