On 18.12.14 10:12, Mark Burton wrote:
>
>> On 17 Dec 2014, at 17:39, Peter Maydell <peter.mayd...@linaro.org> wrote:
>>
>> On 17 December 2014 at 16:29, Mark Burton <mark.bur...@greensocs.com> wrote:
>>>> On 17 Dec 2014, at 17:27, Peter Maydell <peter.mayd...@linaro.org> wrote:
>>>> I think a mutex is fine, personally -- I just don't want
>>>> to see fifteen hand-hacked mutexes in the target-* code.
>>>>
>>>
>>> Which would seem to favour the helper function approach?
>>> Or am I missing something?
>>
>> You need at least some support from QEMU core -- consider
>> what happens with this patch if the ldrex takes a data
>> abort, for instance.
>>
>> And if you need the "stop all other CPUs while I do this"
>
> It looks like a corner case, but working this through: the 'simple'
> put-a-mutex-around-the-atomic-instructions approach would indeed need to
> ensure that no other core was doing anything - that just happens to be
> true for QEMU today (or we would have to put a mutex around all writes) -
> in order to handle the case where a store exclusive must fail because a
> non-atomic instruction wrote a different value to the same address. This
> is currently guaranteed by the implementation in QEMU - how useful it is
> I don't know, but if we break it, we run the risk that something will
> fail (at the least, we could not claim to have kept things the same).
>
> This also has implications for the idea of adding TCG ops, I think...
> The ideal scenario is that we could 'fall back' on the same semantics
> that are there today - allowing specific target/host combinations to be
> optimised (and to improve their functionality).
> But that means, from within the TCG op, we would need a mechanism to
> cause the other TCG vCPUs to take an exit... etc etc... In the end, I'm
> sure it's possible, but it feels so awkward.

That's the nice thing about transactions - they guarantee that no other
CPU accesses the same cache line at the same time. So you're safe against
other vcpus even without blocking them manually.

For the non-transactional implementation we probably would need an "IPI
the others and halt them until we're done with the critical section"
approach. But I really wouldn't concentrate on making things fast on old
CPUs.

Also keep in mind that for the UP case we can always omit all the magic -
we only need to detect when we move into an SMP case (linux-user clone or
-smp on system emulation).
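To make the transactional path concrete, here is roughly what I have in
mind - a minimal sketch only, assuming an x86 host with TSX/RTM (the
_xbegin()/_xend() intrinsics from <immintrin.h>, built with -mrtm); the
helper names and the stop-the-world fallback are hypothetical:

/* Minimal sketch only: emulate a guest atomic compare-and-swap via an RTM
 * transaction on an x86 host with TSX. Everything except the _x*
 * intrinsics is hypothetical. */
#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>

/* Fallback: stop all other vcpus for the duration (see the sketch at the
 * end of this mail). */
bool cmpxchg_stop_the_world(uint32_t *addr, uint32_t old, uint32_t new);

bool cmpxchg_u32(uint32_t *addr, uint32_t old, uint32_t new)
{
    for (int retry = 0; retry < 4; retry++) {
        unsigned int status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            /* Inside the transaction: any other CPU touching this cache
             * line aborts us, so read-compare-write is atomic without
             * blocking anyone. */
            bool ok = (*addr == old);
            if (ok) {
                *addr = new;
            }
            _xend();
            return ok;
        }
        if (!(status & _XABORT_RETRY)) {
            break;      /* persistent abort - don't spin forever */
        }
    }
    return cmpxchg_stop_the_world(addr, old, new);
}

Inside the transaction we never have to coordinate with the other vcpus
at all; only the abort path drops down to the heavyweight fallback.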
> To re-cap where we are (for my own benefit if nobody else):
> We have several propositions in terms of implementing atomic
> instructions.
>
> 1/ We wrap the atomic instructions in a mutex using helper functions
> (this is the approach others have taken; it's simple, but it is not
> clean, as stated above).

This is horrible. Imagine you have this split approach with a load
exclusive and then a store, where the load takes the mutex and the store
releases it. If the store faults - a host segfault, or a guest data
abort - you'll be left with a dangling mutex. This stuff really belongs
into the TCG core.
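Just to spell out why, a deliberately broken sketch of what that option
looks like (all names hypothetical):

/* Deliberately broken sketch of option 1: the exclusive pair is split
 * across two helpers and the mutex lifetime spans guest instructions.
 * All names hypothetical. */
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t exclusive_lock = PTHREAD_MUTEX_INITIALIZER;

uint32_t helper_ldrex_u32(uint32_t *addr)
{
    pthread_mutex_lock(&exclusive_lock);    /* taken here... */
    return *addr;
}

uint32_t helper_strex_u32(uint32_t *addr, uint32_t val)
{
    /* ...but if the guest never issues the strex, or the store below
     * takes a data abort / host SIGSEGV and we longjmp out of the TB,
     * nothing ever unlocks: every other vcpu deadlocks on its next
     * exclusive access. Unwinding that correctly needs the TCG core. */
    *addr = val;
    pthread_mutex_unlock(&exclusive_lock);
    return 0;                               /* 0 == store succeeded */
}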
> 1.5/ We add a mechanism to ensure that when the mutex is taken, all
> other cores are 'stopped'.
>
> 2/ We add some TCG ops to effectively do the same thing, but this would
> give us the benefit of being able to provide better implementations.
> This is attractive, but we would end up needing ops to cover at least
> exclusive load/store and atomic compare-exchange. To me this looks less
> than elegant (being pulled close to the target, rather than being able
> to generalise), but it's not clear how we would implement the operations
> as we would like, with a machine instruction, unless we did split them
> out along these lines. This approach also (probably) requires the 1.5
> mechanism above.

I'm still in favor of just forcing the semantics of transactions onto
this. If the host doesn't implement transactions, tough luck - do the
"halt all others" IPI. (A rough sketch of that fallback is at the end of
this mail.)

> 3/ We have discussed a 'h/w' approach to the problem. In this case, all
> atomic instructions are forced to take the slow path, and additional
> flags are added to the memory API. We then deal with the issue closer to
> the memory, where we can record who has a lock on a memory address. For
> this to work, we would also either
> a) need to add an mprotect-type approach to ensure no 'non-atomic'
> writes occur, or
> b) need to force all cores to mark the page containing the exclusive
> memory as IO or similar, to ensure that all write accesses follow the
> slow path.
>
> 4/ There is an option to implement exclusive operations within the TCG
> using mprotect (and signal handlers). I have some concerns about this:
> would we have to have support for each host OS? I also think we might
> end up with a lot of protected regions causing a lot of SIGSEGVs,
> because an errant guest doesn't behave well - basically we will need to
> see the impact on performance. Finally, this will be really painful to
> deal with for cases where the exclusive memory is held in what QEMU
> considers IO space!
> In other words, putting the mprotect inside TCG looks to me like it's
> mutually exclusive to supporting a memory-based scheme like (3).

Again, I don't think it's worth caring about legacy host systems too
much. In a few years from now transactional memory will be commodity,
just like KVM is today.

Alex

> My personal preference is for 3b) - it is "safe", it's where the
> hardware is. 3a is an optimisation of that.
> To me, (2) is an optimisation again. We are effectively saying: if you
> are able to do this directly, then you don't need to pass via the slow
> path. Otherwise, you always have the option of reverting to the slow
> path.
>
> Frankly, 1 and 1.5 are hacks - they are not optimisations, they are
> just dirty hacks. However, their saving grace is that they are hacks
> that exist and "work". I dislike patching the hack, but it did seem to
> offer the fastest solution to get around this problem - at least for
> now. I am no longer convinced.
>
> 4/ is something I'd like other people's views on too... Is it a better
> approach? What about the slow path?
>
> I increasingly begin to feel that we should really approach this from
> the other end, and provide a 'correct' solution using the memory - then
> worry about making that faster...
>
> Cheers
>
> Mark
>
>> semantics linux-user currently uses then that definitely needs
>> core code support. (Maybe linux-user is being over-zealous
>> there; I haven't thought about it.)
>>
>> -- PMM
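For reference, the "halt all others" fallback mentioned above - a minimal
sketch, loosely modelled on the start_exclusive()/end_exclusive() pair
that linux-user already uses for ARM exclusives; the names and the kick
function here are illustrative, not QEMU's actual system-emulation API:

/* Minimal sketch of the stop-the-world fallback, loosely modelled on
 * linux-user's start_exclusive()/end_exclusive(). Names and the kick
 * function are illustrative only. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t excl_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  excl_cond = PTHREAD_COND_INITIALIZER;
static int running_cpus;        /* vcpus currently inside the exec loop */
static bool excl_pending;       /* somebody wants the world stopped */

void kick_all_vcpus(void);      /* force other vcpus out of their TBs */

/* Each vcpu brackets its execution loop with these two. */
void cpu_exec_start(void)
{
    pthread_mutex_lock(&excl_lock);
    while (excl_pending) {
        pthread_cond_wait(&excl_cond, &excl_lock);
    }
    running_cpus++;
    pthread_mutex_unlock(&excl_lock);
}

void cpu_exec_end(void)
{
    pthread_mutex_lock(&excl_lock);
    running_cpus--;
    pthread_cond_broadcast(&excl_cond);
    pthread_mutex_unlock(&excl_lock);
}

/* Stop everyone else; the lock is held until end_exclusive(). */
void start_exclusive(void)
{
    pthread_mutex_lock(&excl_lock);
    excl_pending = true;
    kick_all_vcpus();
    while (running_cpus > 1) {  /* wait until only we are left running */
        pthread_cond_wait(&excl_cond, &excl_lock);
    }
}

void end_exclusive(void)
{
    excl_pending = false;
    pthread_cond_broadcast(&excl_cond);
    pthread_mutex_unlock(&excl_lock);
}

A target helper for ldrex/strex would then bracket its access with
start_exclusive()/end_exclusive() whenever transactions are unavailable.
Heavyweight, but it only runs on hosts without transactions, and only
once we are actually in an SMP situation.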