Re: [Qemu-devel] [RFC PATCH 0/6] Clock and power gating support

2018-07-30 Thread Mark Burton

Hi Pete, sorry for the tardy reply, I'm not in the office.


You're right, we should have co-ordinated the two patches better, sorry for that.

Regarding this new patchset, we think it is to some degree complementary to the
clocktree one.
Here we simply add a couple of generic power states to a device. We introduce
an interface to control power and clock gating on any device, with sensible
generic behaviour (reset on power-on, disabling memory regions when asleep or
off). The device can specialize this to adopt specific behaviours.

It allows easy implementation of the "system controller", "pin controller",
"power management unit", or similar blocks commonly found in SoCs.
Specialization of devices can be implemented as needed when a machine
requires it.
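
To give a rough idea of the shape of this, here is a tiny stand-alone sketch; the
names and the hard-coded behaviour are only placeholders for illustration, not the
actual API in the patches:

    /* Illustrative only: two gating booleans plus overridable update hooks. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct GatedDevice GatedDevice;

    struct GatedDevice {
        bool powered;                             /* power state */
        bool clocked;                             /* clock state */
        void (*power_update)(GatedDevice *dev);   /* may be overridden by the device */
        void (*clock_update)(GatedDevice *dev);
    };

    /* Generic behaviour: reset on power-on, disable memory regions when off. */
    static void device_set_power(GatedDevice *dev, bool on)
    {
        if (dev->powered == on) {
            return;                               /* no state change, no callback */
        }
        dev->powered = on;
        printf(on ? "power on -> reset device\n"
                  : "power off -> disable memory regions\n");
        if (dev->power_update) {
            dev->power_update(dev);               /* device-specific specialization */
        }
    }

    static void device_set_clock(GatedDevice *dev, bool on)
    {
        if (dev->clocked == on) {
            return;
        }
        dev->clocked = on;
        if (dev->clock_update) {
            dev->clock_update(dev);
        }
    }

    int main(void)
    {
        GatedDevice d = { 0 };
        device_set_power(&d, true);   /* e.g. driven by a "power management unit" model */
        device_set_clock(&d, true);
        return 0;
    }

A "system controller" or "pin controller" model would then simply call the two
setters on the devices it gates.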


Cheers
Mark

On 27 July 2018 17:22:30 KONRAD Frederic  wrote:


On 07/27/2018 at 04:59 PM, Peter Maydell wrote:

On 27 July 2018 at 15:37, Damien Hedde  wrote:

This set of patches adds support for power and clock gating in device objects.
It adds two booleans to the state, one for the power state and one for the clock state.

The state is controlled through two functions, one for power and one for clock.
Two new methods, *power_update* and *clock_update*, are added to the device class;
they are called on state change and can be overridden.


So, you folks at Greensocs had a series a couple of years ago to try
to add clocktree support (which included handling things like guests
configuring clock rates, some clocks being "downstream" of others, and
so on). It didn't make it into mainline, though it did get a fair amount
of design/code review. How does this approach fit into / compare with
that ?

thanks
-- PMM


Hi all,

I didn't have time to push a new version of the clock series and
I lost track of it a little bit (it was 13 months ago). Sorry for
that :( ... It would still be nice to have, I think.

Cheers,
Fred







Re: [Qemu-devel] MTTCG next version?

2015-08-26 Thread Mark Burton
Just to remind everybody as well - we’ll have a call next Monday to co-ordinate.
It would be good to make sure everybody knows which bit of this everybody else
is committing to do, so we avoid duplication and treading on each other's patch
sets.

Cheers

Mark.

> On 26 Aug 2015, at 14:18, Frederic Konrad  wrote:
> 
> Hi everybody,
> 
> I'm trying to do the next version of the MTTCG work:
> 
> I would like to rebase on Alvise's atomic instruction branch:
>  - Alvise, can you rebase it on the 2.4.0 version without MTTCG support and then
>point me to the MTTCG specific changes so I can include them in my tree?
> I will add Paolo's linux-user and signal free qemu_cpu_kick series as well.
> 
> About tb_flush, we think we can do that without exiting:
>  - Use two buffers for tbs.
>  - Use a per-tb invalidated flag.
>  - When tb_flush happens, just invalidate all tbs in the buffer and swap to the
> second buffer:
>VCPUs which are executing code will discard their tb_jmp_cache when they
> exit
>(eg: run_on_cpu).
> 
> We also need to fix emulated data barriers so tlb_flushes are finished before the
> next instruction is executed. (That might mean only a data barrier breaks the TB.)
> 
> Protecting page->code_bitmap and cpu_breakpoint_insert changes will be 
> squashed in the tb_lock patch.
> 
> More tests must be done especially with gdbstub and icount.
> 
> Does that make sense?
> Fred
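
For what it's worth, the two-buffer tb_flush idea above looks roughly like this as a
stand-alone sketch (names invented for illustration, not the QEMU code):

    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_TBS 1024

    typedef struct TB {
        bool invalid;              /* per-TB invalidated flag */
        /* ... translated code, jump lists, ... */
    } TB;

    typedef struct TBBuffer {
        TB tbs[MAX_TBS];
        size_t count;
    } TBBuffer;

    static TBBuffer buffers[2];
    static int current_buf;        /* buffer new TBs are allocated from */

    /* tb_flush without stopping the world: mark every TB in the active
     * buffer invalid and switch allocation to the other buffer.  vCPUs
     * still running old code drop their tb_jmp_cache the next time they
     * exit (e.g. via run_on_cpu), so they stop chaining into stale TBs. */
    static void tb_flush_swap(void)
    {
        TBBuffer *old = &buffers[current_buf];
        for (size_t i = 0; i < old->count; i++) {
            old->tbs[i].invalid = true;
        }
        current_buf ^= 1;
        buffers[current_buf].count = 0;
    }

    int main(void)
    {
        buffers[current_buf].count = 3;   /* pretend three TBs exist */
        tb_flush_swap();                  /* old TBs invalidated, second buffer active */
        return 0;
    }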









Re: [Qemu-devel] MTTCG Tasks (kvmforum summary)

2015-09-04 Thread Mark Burton

> On 4 Sep 2015, at 11:41, Edgar E. Iglesias  wrote:
> 
> On Fri, Sep 04, 2015 at 11:25:33AM +0200, Paolo Bonzini wrote:
>> 
>> 
>> On 04/09/2015 09:49, Alex Bennée wrote:
>>> * Signal free qemu_cpu_kick (Paolo)
>>> 
>>> I don't know much about this patch set but I assume this avoids the need
>>> to catch signals and longjmp about just to wake up?
>> 
>> It was part of Fred's patches, so I've extracted it to its own series.
>> Removing 150 lines of code can't hurt.
>> 
>>> * Memory barrier support (need RFC for discussion)
>>> 
>>> I came to KVM forum with a back of the envelope idea we could implement
>>> one or two barrier ops (acquire/release?). Various suggestions of other
>>> types of memory behaviour have been suggested.
>>> 
>>> I'll try to pull together an RFC patch with design outline for
>>> discussion. It would be nice to be able to demonstrate barrier failures
>>> in my test cases as well ;-)
>> 
>> Emilio has something about it in his own MTTCG implementation.
>> 
>>> * longjmp in cpu_exec
>>> 
>>> Paolo is fairly sure that if you take page faults while IRQs happening
>>> problems will occur with cpu->interrupt_request. Does it need to take
>>> the BQL?
>>> 
>>> I'd like to see if we can get a torture test to stress this code
>>> although it will require IPI support in the unit tests.
>> 
>> It's x86-specific (hardware interrupts push to the stack and can cause a
>> page fault or other exception), so a unit test can be written for it.
>> 
>>> * tlb_flush and dmb behaviour (am I waiting for TLB flush?)
>>> 
>>> I think this means we need explicit memory barriers to sync updates to
>>> the tlb.
>> 
>> Yes.
>> 
>>> * tb_find_fast outside the lock
>>> 
>>> Currently it is a big performance win as the tb_find_fast has a lot of
>>> contention with other threads. However there is concern it needs to be
>>> properly protected.
>> 
>> This, BTW, can be done for user-mode emulation first, so it can go in
>> early.  Same for RCU-ized code_gen_buffer.
>> 
>>> * What to do about icount?
>>> 
>>> What is the impact of multi-thread on icount? Do we need to disable it
>>> for MTTCG or can it be correct per-cpu? Can it be updated lock-step?
>>> 
>>> We need some input from the guys that use icount the most.
>> 
>> That means Edgar. :)
> 
> Hi!
> 
> IMO it would be nice if we could run the cores in some kind of lock-step
> with a configurable amount of instructions that they can run ahead
> of time (X).
> 
> For example, if X is 1, every thread/core would checkpoint at
> 1 insn boundaries and wait for other cores. Between these
> checkpoints, the cores will not be in sync. We might need to
> consider synchronizing at I/O accesses as well to avoid weird
> timing issues when reading counter registers for example.
> 
> Of course the devil will be in the details but an approach roughly
> like that sounds useful to me.

And "works" in other domains.
Theoretically we don't need to sync at IO (dynamic quantums); for most systems
that have 'normal' IO it's usually less efficient, I believe. However, the
trouble is that the user typically doesn't know, and mucking about with quantum
lengths, dynamic quantum switches etc. is probably a royal pain in the butt. And
if you don't set your quantum right, the thing will run really slowly (or will
break)…

The choices are a rock or a hard place. Dynamic quantums risk being slow
(you'll be forcing an expensive 'sync' - all CPUs will have to exit etc.) on
each IO access from each core… not great. Syncing with host time (e.g. each
CPU tries to sync with the host clock as best it can) will fail when one or other
CPU can't keep up… In the end you end up leaving the user with a nice
long bit of string and a message saying "hang yourself here".
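
Just to make the fixed-quantum variant concrete, a toy stand-alone sketch of "run X
instructions, then checkpoint" (a pthread barrier stands in for whatever the real
implementation would use; all names and numbers are made up):

    #include <pthread.h>

    #define NR_CPUS 4
    #define QUANTUM 10000          /* "X": how far a core may run ahead */

    static pthread_barrier_t checkpoint;

    static void run_guest_instructions(int cpu, long budget)
    {
        (void)cpu;
        (void)budget;              /* stand-in for executing guest code */
    }

    static void *vcpu_thread(void *arg)
    {
        int cpu = (int)(long)arg;
        for (int step = 0; step < 5; step++) {
            run_guest_instructions(cpu, QUANTUM);
            /* nobody starts the next quantum until everyone finished this one */
            pthread_barrier_wait(&checkpoint);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t th[NR_CPUS];
        pthread_barrier_init(&checkpoint, NULL, NR_CPUS);
        for (long i = 0; i < NR_CPUS; i++) {
            pthread_create(&th[i], NULL, vcpu_thread, (void *)i);
        }
        for (int i = 0; i < NR_CPUS; i++) {
            pthread_join(th[i], NULL);
        }
        pthread_barrier_destroy(&checkpoint);
        return 0;
    }

The "dynamic quantum" variant would additionally force such a checkpoint (or an
exit) on each IO access, which is where the cost comes from.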

Cheers
Mark.

> 
> Are there any other ideas floating around that may be better?
> 
> BTW, where can I find the latest series? Is it on a git-repo/branch
> somewhere?
> 
> Best regards,
> Edgar









Re: [Qemu-devel] MTTCG Tasks (kvmforum summary)

2015-09-04 Thread Mark Burton

> On 4 Sep 2015, at 14:38, Lluís Vilanova  wrote:
> 
> dovgaluk  writes:
> 
>> Hi!
>> Alex Bennée wrote on 2015-09-04 10:49:
>>> * What to do about icount?
>>> 
>>> What is the impact of multi-thread on icount? Do we need to disable it
>>> for MTTCG or can it be correct per-cpu? Can it be updated lock-step?
> 
>> Why can't we have separate icount for each CPU?
>> Then virtual timer will be assigned to one of them.
> 
> My understanding is that icount means by design that time should be
> synchronized between cpus, where the number of executed instructions is the 
> time
> unit. If all elements worked under this assumption (I'm afraid that's not the
> case for I/O devices), it should be possible to reproduce executions by 
> setting
> icount to 1.
> 
> Now, MTTCG faces the same icount accuracy problems that the current TCG
> implementation deals with (only at different scale). The naive implementation 
> is
> to execute 1 instruction per CPU in lockstep. TCG currently relaxes this at 
> the
> translation block level.
> 
> The MTTCG implementation could do something similar, but just at a different
> (configurable?) granularity. Every N per-cpu instructions, synchronize all 
> CPUs
> until each has, at least, arrived at that time step, then proceed with the 
> next
> batch. Ideally, this synchronization delay (N) could be adapted dynamically.

This is often called a Quantum.

Cheers
Mark

> 
> My half cent.
> 
> Lluis
> 
> -- 
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
> Tollbooth









Re: [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation

2015-07-10 Thread Mark Burton

To be clear, for a normal user (e.g. they boot linux, they run some apps, 
etc)..., if they use only one core, is it true that they will see no difference 
in performance?
For a ‘normal user’ who does use multi-core, are you saying that a typical boot 
is slower?

Cheers

Mark.

> On 10 Jul 2015, at 10:23, Alvise Rigo  wrote:
> 
> * Performance considerations
> This implementation shows good results while booting a Linux kernel,
> where tons of flushes affect the overall performance. A complete ARM
> Linux boot, without any filesystem, requires 30% longer compared to
> the mttcg implementation, but benefits from being able to offer
> the infrastructure to handle atomic instructions on any architecture.
> Compared to the current TCG upstream, it is 40% faster with four
> vCPUs and 2.1 times faster with 8 vCPUs.
> In addition, there is still margin to improve this performance, since at
> the moment the TLB is flushed quite often, probably more than required.









Re: [Qemu-devel] Summary MTTCG related patch sets

2015-07-20 Thread Mark Burton
Huge thanks Alex, really good summary
Cheers
Mark.

> On 20 Jul 2015, at 18:17, Alex Bennée  wrote:
> 
> Hi,
> 
> Following this afternoon's call I thought I'd summarise the state of the
> various patch series and their relative dependencies. We re-stated the
> aim should be to get what is up-streamable through the review process
> and heading for merge so the delta for a full working MTTCG can be as
> low as possible. There was some concern about the practicality of
> submitting patches where the full benefit will not be seen until MTTCG
> is finally merged.
> 
> On the patch submission note could I encourage posting public git trees
> along with the patches for ease of review?
> 
> BQL lock breaking patches, Paolo/Jan
>  - required for working virt-io in MTTCG
>  - supersedes some of Fred's patches
>  - merged upstream as of -rc0
> 
> TCG async_safe_work, Fred
>  - http://git.greensocs.com/fkonrad/mttcg.git async_work_v3
>  - [1437144337-21442-1-git-send-email-fred.kon...@greensocs.com]
>  - split from earlier MTTCG patch series
>  - needed for cross-cpu sync mechanism for main series and slow-path
>  - candidate for upstreaming, but only MTTCG uses for now?
> 
> Slow-path for atomic instruction translation, Alvise
>  - [1436516626-8322-1-git-send-email-a.r...@virtualopensystems.com]
>  - Needs re-basing to use TCG async_safe_work
>  - Earlier part of series (pre MTTCG) could be upstreamed as is
>  - Concern about performance impact in non-MTTCG scenarios
>  - Single CPU thread impact may be minimal with latest version, needs
>  benchmarking
>  - Also incomplete backend support, would BACKEND_HAS_LLSC_OPS be
>  acceptable to maintainers while support is added by more knowledgeable
>  backend people for non-x86/arm backends?
> 
> Multi-threaded TCG V6, Fred
>  - g...@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v6
>  - [1435330053-18733-1-git-send-email-fred.kon...@greensocs.com]
>  - Needs re-basing on top of latest -rc (BQL breaking)
>  - Contains the rest of the MTTCG work (tb locking, tlb stuff etc)
>  - Currently target-arm only, other builds broken
> 
> As far as balancing the desire to get things upstreamed versus having a
> stable base for testing I suggest we try an approach like this:
> 
>  - select the current upstream -rc as the common base point
>  - create a branch from -rc with:
>- stuff submitted for upstream (reviewed, not nacked)
>- doesn't break any tree
>- has minimal performance impact 
> 
> Then both Fred and Alvise could base their trees of this point and we
> aim to rebase onto a new branch each time the patches get merged into a
> new upstream RC. The question then becomes how to deal with any
> cross-dependencies between the slow-path and the main MTTCG branches?
> 
> I suspect the initial common branch point would just be
> 2.4.0-rc1+safe_async.
> 
> Does that sound workable?
> 
> -- 
> Alex Bennée









Re: [Qemu-devel] RFC Multi-threaded TCG design document

2015-06-15 Thread Mark Burton
I think we SHOULD use the wiki - and keep it current. A lot of what you have is
in the wiki too, but I'd like to see the wiki updated.
We will add our stuff there too…

Cheers
Mark.



> On 15 Jun 2015, at 12:06, Alex Bennée  wrote:
> 
> 
> Frederic Konrad  writes:
> 
>> On 12/06/2015 18:37, Alex Bennée wrote:
>>> Hi,
>> 
>> Hi Alex,
>> 
>> I've completed some of the points below. We will also work on a design 
>> decisions
>> document to add to this one.
>> 
>> We probably want to merge that with what we did on the wiki?
>> http://wiki.qemu.org/Features/tcg-multithread
> 
> Well hopefully there is cross-over as I started with the wiki as a basis
> ;-)
> 
> Do we want to just keep the wiki as the live design document or put
> pointers to the current drafts? I'm hoping eventually the page will just
> point to the design in the doc directory at git.qemu.org.
> 
>>> One thing that Peter has been asking for is a design document for the
>>> way we are going to approach multi-threaded TCG emulation. I started
>>> with the information that was captured on the wiki and tried to build on
>>> that. It's almost certainly incomplete but I thought it would be worth
>>> posting for wider discussion early rather than later.
>>> 
>>> One obvious omission at the moment is the lack of discussion about other
>>> non-TLB shared data structures in QEMU (I'm thinking of the various
>>> dirty page tracking bits, I'm sure there is more).
>>> 
>>> I've also deliberately tried to avoid documenting the design decisions
>>> made in the current Greensoc's patch series. This is so we can
>>> concentrate on the big picture before getting side-tracked into the
>>> implementation details.
>>> 
>>> I have now started digging into the Greensocs code in earnest and the
>>> plan is eventually the design and the implementation will converge on a
>>> final documented complete solution ;-)
>>> 
>>> Anyway as ever I look forward to the comments and discussion:
>>> 
>>> STATUS: DRAFTING
>>> 
>>> Introduction
>>> 
>>> 
>>> This document outlines the design for multi-threaded TCG emulation.
>>> The original TCG implementation was single threaded and dealt with
>>> multiple CPUs with simple round-robin scheduling. This simplified a
>>> lot of things but became increasingly limited as systems being
>>> emulated gained additional cores and per-core performance gains for host
>>> systems started to level off.
>>> 
>>> Memory Consistency
>>> ==
>>> 
>>> Between emulated guests and host systems there are a range of memory
>>> consistency models. While emulating weakly ordered systems on strongly
>>> ordered hosts shouldn't cause any problems the same is not true for
>>> the reverse setup.
>>> 
>>> The proposed design currently does not address the problem of
>>> emulating strong ordering on a weakly ordered host although even on
>>> strongly ordered systems software should be using synchronisation
>>> primitives to ensure correct operation.
>>> 
>>> Memory Barriers
>>> ---
>>> 
>>> Barriers (sometimes known as fences) provide a mechanism for software
>>> to enforce a particular ordering of memory operations from the point
>>> of view of external observers (e.g. another processor core). They can
>>> apply to any memory operations as well as just loads or stores.
>>> 
>>> The Linux kernel has an excellent write-up on the various forms of
>>> memory barrier and the guarantees they can provide [1].
>>> 
>>> Barriers are often wrapped around synchronisation primitives to
>>> provide explicit memory ordering semantics. However they can be used
>>> by themselves to provide safe lockless access by ensuring for example
>>> a signal flag will always be set after a payload.
>>> 
>>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>> 
>>> This would enforce a strong load/store ordering so all loads/stores
>>> complete at the memory barrier. On single-core non-SMP strongly
>>> ordered backends this could become a NOP.
>>> 
>>> There may be a case for further refinement if this causes performance
>>> bottlenecks.
>>> 
>>> Memory Control and Maintenance
>>> --
>>> 
>>> This includes a class of instructions for controlling system cache
>>> behaviour. While QEMU doesn't model cache behaviour these instructions
>>> are often seen when code modification has taken place to ensure the
>>> changes take effect.
>>> 
>>> Synchronisation Primitives
>>> --
>>> 
>>> There are two broad types of synchronisation primitives found in
>>> modern ISAs: atomic instructions and exclusive regions.
>>> 
>>> The first type offers a simple atomic instruction which will guarantee
>>> some sort of test and conditional store will be truly atomic w.r.t.
>>> other cores sharing access to the memory. The classic example is the
>>> x86 cmpxchg instruction.
>>> 
>>> The second type offers a pair of load/store instructions which offer a
>>> guarantee that a region of memory has not been touched between t

Re: [Qemu-devel] RFC Multi-threaded TCG design document

2015-06-17 Thread Mark Burton

> On 17 Jun 2015, at 18:57, Dr. David Alan Gilbert  wrote:
> 
> * Alex Bennée (alex.ben...@linaro.org) wrote:
>> Hi,
> 
>> Shared Data Structures
>> ==
>> 
>> Global TCG State
>> 
>> 
>> We need to protect the entire code generation cycle including any post
>> generation patching of the translated code. This also implies a shared
>> translation buffer which contains code running on all cores. Any
>> execution path that comes to the main run loop will need to hold a
>> mutex for code generation. This also includes times when we need flush
>> code or jumps from the tb_cache.
>> 
>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>> and jump cache modification
> 
> I don't think that you require a shared translation buffer between
> cores to do this - although it *might* be the easiest way.
> You could have a per-core translation buffer, the only requirement is
> that most invalidation operations happen on all the buffers
> (although that might depend on the emulated architecture).
> With a per-core translation buffer, each core could generate new translations
> without locking the other cores as long as no one is doing invalidations.

I agree it’s not a design requirement - however we’ve kind of gone round this 
loop in terms of getting things to work.
Fred will doubtless fill in some details, but basically it looks like making 
the TCG so you could run several in parallel is a nightmare. We seem to get 
reasonable performance having just one CPU at a time generating TBs.  At the 
same time, of course, the way Qemu is constructed there are actually several 
‘layers’ of buffer - from the CPU local ones through to the TB ‘pool’. So, 
actually, by accident or design, we benefit from a sort of caching structure.


> 
>> Memory maps and TLBs
>> 
>> 
>> The memory handling code is fairly critical to the speed of memory
>> access in the emulated system.
>> 
>>  - Memory regions (dividing up access to PIO, MMIO and RAM)
>>  - Dirty page tracking (for code gen, migration and display)
>>  - Virtual TLB (for translating guest address->real address)
>> 
>> There is a both a fast path walked by the generated code and a slow
>> path when resolution is required. When the TLB tables are updated we
>> need to ensure they are done in a safe way by bringing all executing
>> threads to a halt before making the modifications.
>> 
>> DESIGN REQUIREMENTS:
>> 
>>  - TLB Flush All/Page
>>- can be across-CPUs
>>- will need all other CPUs brought to a halt
>>  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>- This is a per-CPU table - by definition can't race
>>- updated by its own thread when the slow-path is forced
>> 
>> Emulated hardware state
>> ---
>> 
>> Currently the hardware emulation has no protection against
>> multiple-accesses. However guest systems accessing emulated hardware
>> should be carrying out their own locking to prevent multiple CPUs
>> confusing the hardware. Of course there is no guarantee that there
>> couldn't be a broken guest that doesn't lock so you could get racing
>> accesses to the hardware.
>> 
>> There is the class of paravirtualized hardware (VIRTIO) that works in
>> a purely mmio mode. Often setting flags directly in guest memory as a
>> result of a guest triggered transaction.
>> 
>> DESIGN REQUIREMENTS:
>> 
>>  - Access to IO Memory should be serialised by an IOMem mutex
>>  - The mutex should be recursive (e.g. allowing pid to relock itself)
>> 
>> IO Subsystem
>> 
>> 
>> The I/O subsystem is heavily used by KVM and has seen a lot of
>> improvements to offload I/O tasks to dedicated IOThreads. There should
>> be no additional locking required once we reach the Block Driver.
>> 
>> DESIGN REQUIREMENTS:
>> 
>>  - The dataplane should continue to be protected by the iothread locks
> 
> Watch out for where DMA invalidates the translated code.
> 


Need to check - that might be a great catch!

Cheers

Mark.

> Dave
> 
>> 
>> 
>> References
>> ==
>> 
>> [1] 
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>> 
>> 
>> 
>> -- 
>> Alex Bennée
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK






Re: [Qemu-devel] [RFC PATCH V3] Use atomic cmpxchg to atomically check the exclusive value in a STREX

2015-06-18 Thread Mark Burton
for the 1<… - a little worrying, I'll check.

> On 18 Jun 2015, at 17:56, Peter Maydell  wrote:
> 
> On 18 June 2015 at 16:44,   wrote:
>> +uint64_t oldval, *p;
>> +p = address_space_map(cs->as, paddr, &len, true);
>> +if (len == 8 << size) {
>> +oldval = (uint64_t)env->exclusive_val;
>> +result = (atomic_cmpxchg(p, oldval, (uint64_t)newval) == oldval);
> 
> You can't do an atomic operation on a type that's larger than
> the pointer size of the host system. That means that for
> code that isn't host specific, like this, in practice you
> can't do an atomic operation on a larger size than 4 bytes.
> 

I thought they were polymorphic across all types, I didn’t notice the caveat of 
the size, sorry about that. That makes things more entertaining :-)

Cheers
Mark.



> (It happens to work on x86, I think, but this won't build
> on ppc32, for instance.)
> 
> thanks
> -- PMM






Re: [Qemu-devel] [RFC PATCH V3] Use atomic cmpxchg to atomically check the exclusive value in a STREX

2015-06-19 Thread Mark Burton

> On 18 Jun 2015, at 21:53, Peter Maydell  wrote:
> 
> On 18 June 2015 at 19:32, Mark Burton  wrote:
>> for the 1<… - a little worrying, I'll check.
>> 
>>> On 18 Jun 2015, at 17:56, Peter Maydell  wrote:
>>> 
>>> On 18 June 2015 at 16:44,   wrote:
>>>> +uint64_t oldval, *p;
>>>> +p = address_space_map(cs->as, paddr, &len, true);
>>>> +if (len == 8 << size) {
>>>> +oldval = (uint64_t)env->exclusive_val;
>>>> +result = (atomic_cmpxchg(p, oldval, (uint64_t)newval) == oldval);
>>> 
>>> You can't do an atomic operation on a type that's larger than
>>> the pointer size of the host system. That means that for
>>> code that isn't host specific, like this, in practice you
>>> can't do an atomic operation on a larger size than 4 bytes.
>>> 
>> 
>> I thought they were polymorphic across all types, I didn’t notice
>> the caveat of the size, sorry about that. That makes things more
>> entertaining :-)
> 
> It's polymorphic across most types... It's been suggested
> that the macros should refuse types with size > ptrsize
> on all systems, so you don't have to get bitten by a
> ppc32 compile failure after the fact, but I don't think that
> anybody's written the patch yet.
> 

That would be sensible.

In the meantime,
it looks to me like (most implementations of) 32-bit x86 support double-word
cmpxchg - but I don't know if the library uses that, and I'd have to find a
machine to try it on…
In any case, we could support that relatively easily, it seems.
So we're talking about 64-bit ARM ldrex/strex being run, in multi-thread mode,
on a multi-core, 32-bit, probably non-x86 host…
We can add mutexes around the ld/strex and effectively implement the mechanism
as it is upstream today (fixing where the address/data is stored). That would
give some protection, but it would not have the advantage of the atomicity that
gives us somewhat better protection against a non-exclusive store breaking the
lock.

If we were to do that, we would be saying: on 32-bit, non-x86 hosts, with a
64-bit multi-core, multi-threaded ARM guest that uses normal stores to break an
exclusive lock, we would have a race condition.

I'm minded to say that we should simply not support that case for now. I
propose that the multi-thread switch will only allow you to switch into
multi-thread mode if the maximum cmpxchg that can be supported is the same
size as or bigger than the guest ptrsize.

Does that seem reasonable?

Do you have a better idea?

Does anybody know if the current atomic_cmpxchg will support 64-bit on a
(normal) 32-bit x86, or do we need to special-case that with cmpxchg8b? (I get
the impression that it will automatically use cmpxchg8b, but not cmpxchg16b -
but I'm by no means sure.)
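
For reference, this is easy to probe with a stand-alone test; whether it builds, and
whether the 64-bit CAS turns into an inline cmpxchg8b or a library call, depends on
the host architecture and compiler flags, which is exactly the portability question
(so treat this as an experiment, not a statement of fact):

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t exclusive_val;   /* stand-in for the saved exclusive value */

    int main(void)
    {
        uint64_t expected = 0;
        uint64_t newval = 0x123456789abcdef0ULL;
        /* 8-byte compare-and-swap via the GCC __sync builtin */
        uint64_t old = __sync_val_compare_and_swap(&exclusive_val, expected, newval);
        printf("64-bit cas %s\n", old == expected ? "succeeded" : "failed");
        return 0;
    }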

Cheers

Mark.


> -- PMM






Re: [Qemu-devel] [RFC PATCH V3] Use atomic cmpxchg to atomically check the exclusive value in a STREX

2015-06-19 Thread Mark Burton

> On 19 Jun 2015, at 09:31, Paolo Bonzini  wrote:
> 
> 
> 
> On 19/06/2015 09:29, Mark Burton wrote:
>> Does anybody know if the current atomic_cmpxchg will support 64 bit
>> on a (normal) 32 bit x86, or do we need to special case that with
>> cmpxchg8b ? (I get the impression that it will automatically use
>> cmpxchg8b, but not cmpxchg16b - but I’m by no means sure).
> 
> Both cmpxchg8b and cmpxchg16b are used, respectively on 32-bit and
> 64-bit x86.
> 

Thanks Paolo, so we are OK for x86, but we would need to disable multi-thread
for other 32-bit hosts, and provide a correct implementation for non-multi-thread…
You don't happen to know of a convenient macro we can use to test for '32-bit
hosts that don't support 64-bit cmpxchg…'?
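
One possibility (my assumption, not something already in the tree): GCC predefines
__GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 when it can emit an inline 8-byte
compare-and-swap for the target, so a guard along these lines could be used to
refuse the multi-thread switch on hosts that can't do it:

    /* Hypothetical compile-time guard for "host has a 64-bit cmpxchg". */
    #ifdef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8
    #define HOST_HAS_CMPXCHG64 1
    #else
    #define HOST_HAS_CMPXCHG64 0   /* e.g. some 32-bit hosts: disable multi-thread mode */
    #endif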

Cheers

Mark.



> Paolo







Re: [Qemu-devel] [RFC PATCH V3] Use atomic cmpxchg to atomically check the exclusive value in a STREX

2015-06-19 Thread Mark Burton

> On 19 Jun 2015, at 09:42, Paolo Bonzini  wrote:
> 
> 
> 
> On 19/06/2015 09:40, Mark Burton wrote:
>>> On 19/06/2015 09:29, Mark Burton wrote:
>>>> Does anybody know if the current atomic_cmpxchg will support
>>>> 64 bit on a (normal) 32 bit x86, or do we need to special
>>>> case that with cmpxchg8b ? (I get the impression that it will
>>>> automatically use cmpxchg8b, but not cmpxchg16b - but I’m by
>>>> no means sure).
>>> 
>>> Both cmpxchg8b and cmpxchg16b are used, respectively on 32-bit
>>> and 64-bit x86.
>> 
>> Thanks Paolo, so  we are OK for x86, but we would need to disable
>> multi-thread for other 32 bit hosts, and provide a correct
>> implementation for non multi-thread…
> 
> But Alvise's implementation for example would work there.  It is just
> this optimization (that is also not architecturally correct on ARM) that
> is problematic.

Yes, that is exactly correct.
Cheers
Mark.

> 
> Paolo
> 
>> You dont happen to know of a
>> convenient macro we can use to test for ’32 bit hosts that dont
>> support 64bit cmpxchg ….’
> 






Re: [Qemu-devel] [RFC PATCH V6 15/18] cpu: introduce tlb_flush*_all.

2015-07-06 Thread Mark Burton
Paolo, Alex, Alexander,

Talking to Fred after the call about ways of avoiding the ‘stop the world’ (or 
rather ‘sync the world’) - we already discussed this on this thread.
One thing that would be very helpful would be some test cases around this. We 
could then use Fred’s code to check some of the possible solutions out….

I’m not sure if there is wiggle room in Peter’s statement below. Can the TLB 
operation be completed on one core, but not ‘seen’ by other cores until they 
hit an exit…..?

Cheers

Mark.


> On 26 Jun 2015, at 18:30, Frederic Konrad  wrote:
> 
> On 26/06/2015 18:08, Peter Maydell wrote:
>> On 26 June 2015 at 17:01, Paolo Bonzini  wrote:
>>> On 26/06/2015 17:54, Frederic Konrad wrote:
>>>> So what happens is:
>>>> An ARM instruction wants to clear the tlb of all VCPUs, eg: the IS version of
>>>> TLBIALL.
>>>> The VCPU which executes the TLBIALL_IS can't flush the tlb of the other VCPUs.
>>>> It will just ask all VCPU threads to exit and to do tlb_flush, hence the
>>>> async_work.
>>>> 
>>>> Maybe the big issue might be the memory barrier instructions here, which I haven't
>>>> checked.
>>> Yeah, ISTR that in some cases you have to wait for other CPUs to
>>> invalidate the TLB before proceeding.  Maybe it's only when you have a
>>> dmb instruction, but it's probably simpler for QEMU to always do it
>>> synchronously.
>> Yeah, the ARM architectural requirement here is that the TLB
>> operation is complete after a DSB instruction executes. (True for
>> any TLB op, not just the all-CPUs ones). NB that we also call
>> tlb_flush() from target-arm/ code for some things like "we just
>> updated a system register"; some of those have "must take effect
>> immediately" semantics.
>> 
>> In any case, for generic code we have to also consider the
>> semantics of non-ARM guests...
>> 
>> thanks
>> -- PMM
> Yes this is not the case as I implemented it.
> 
> The rest of the TB will be executed before the tlb_flush work really happens.
> The old version did this, was slow, and was a mess (if two VCPUs want to
> tlb_flush
> at the same time plus another tlb_flush_page... it becomes tricky..)
> 
> I think it's not really terrible if the other VCPU executes some stuff before
> doing the
> tlb_flush? So the solution would be only to cut the TranslationBlock after
> instructions
> which require a tlb_flush?
> 
> Thanks,
> Fred
> 




Re: [Qemu-devel] [RFC] reverse execution.

2013-05-17 Thread Mark Burton
I wish I could say I understood it better, but at this point any insight would 
be gratefully received. However, what does seem clear is that the intent and 
purpose of Icount is subtly different, and possibly orthogonal to what we're 
trying to achieve.

And - actually, determinism (or the lack of it) is definitely an issue, but -
for now - we have spent much of this week finding a bit of code that avoids any
non-deterministic behavior - simply so we can make sure the mechanisms work -
THEN we will tackle the thorny subject of what is causing non-deterministic
behavior (by which, I _suspect_ I mean, what devices or timers are not adhering
to the icount mechanism).

To recap, as I understand things, setting the icount value in the command line 
is intended to give a rough "instructions per second" mechanism. One of the 
effects of that is to make things more deterministic.  Our overall intent is to 
allow the user that has hit a bug, to step backwards.

After much discussion (!) I'm convinced by the argument that I might in the end 
want both of these things. I might want to set some sort of instructions per 
second value (and change it between runs), and if/when I hit a bug, go 
backwards.

Thus far, so good. 

Underneath the hood, icount keeps a counter in the TCG environment which is
decremented (as Fred says), and the icount mechanism plays with it as it sees
fit.
The bottom line is that, orthogonal to this, we need a separate 'counter' which 
is almost identical to the icount counter, in order to count instructions for 
the reverse execution mechanism.

We have looked at re-using the icount counter as Fred said, but that soon ends 
you up in a whole heap of pain. Our conclusion - it would be much cleaner to 
have a separate dedicated counter, then you can simply use either mechanism 
independent of the other.
On this subject - I would like to hear any and all views.

Having said all of that, in BOTH cases, we need determinism.

In our case, determinism is very tightly defined (which - I suspect - may not be
the case for icount). In our case, having returned to a snapshot, the
subsequent execution must follow the EXACT SAME path that it did last time. No
ifs, no buts. No IO, no income tax, no VAT, no money back, no guarantee…

Right now, what Fred has found is that sometimes things 'drift'… we will (of
course) be looking into that. But, for now, our principal concern is to take a
simple bit of code, with no IO, and nothing that causes non-determinism - save
a snapshot at the beginning of the sequence, run, hit a breakpoint, return to
the snapshot, and be able to _exactly_ return to the place we came from.

As Fred said, we imagined that we could do this based on TBs, at least at a
'block' level (which actually may be good enough for us). However, our
mechanism for counting TBs was badly broken. Nonetheless, we learnt a lot
about TBs - and about some of the non-deterministic behavior that will come to
haunt us later. We also concluded that counting TBs is always going to be
second rate, and if we're going to do this properly, we need to count
instructions. Finally, we have concluded that re-using the icount counters is
going to be very painful; we need to re-use the same mechanism, but we need
dedicated counters…


Again, please, all - pitch in and say what you think. Fred and I have been
scratching our heads all week on this, and I'm not convinced we have come up
with the right answers, so any input would be most welcome.


Cheers

Mark.






On 17 May 2013, at 19:54, Peter Maydell wrote:

> On 17 May 2013 18:23, KONRAD Frédéric  wrote:
>> It appeared that the replay is not deterministic even with icount:
>>- the whole icount mechanism is not saved with save_vm (which can be
>> achieved by moving qemu_icount to TimerState according to Paolo)
>>- replaying the same thing two times and stopping at a specific
>> breakpoint shows two different vmclocks, so replaying the
>>same amount of time doesn't work, and we abandoned this idea.
> 
> Personally I think icount is supposed to be deterministic,
> and if it isn't then it should be fixed to be so. Does anybody
> who understands it better than me disagree?
> 
> thanks
> -- PMM






Re: [Qemu-devel] [RFC] reverse execution.

2013-05-19 Thread Mark Burton
Spot on Peter,
The (simplistic) plan is simply to take a snapshot at regular intervals; when
you want to step backwards, you return to a snapshot and then re-run forwards
to 'just before you started'.

To answer Blauwirbel, we can't "approximate" this - or 'binary search' for the
right place - we absolutely have to know how to stop exactly 'just
before'. To do that we need a measure of where we are. Time isn't really what
we want (it has a nasty habit of moving forwards even when we are pretending to
reverse it!). Ideally we need an instruction count. (I say ideally; we could
actually get away with something like a block count, but if blocks are linked
together etc., we would have to be careful.)

I still favor a dedicated counter for this; it seems to make more sense. But
I am wondering about a 'basic block counter' rather than an 'instruction
counter'.
Note - what I understand by a basic block is something that ends in a
jump/branch of some description. Hence, one thing you can say about a
basic block is that each PC value within it is unique. So if I know the
number of basic blocks executed, and the current PC, then I should be able to
re-run to there (assuming a deterministic system, of course).
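
To make that concrete, the bookkeeping could be as small as this (a stand-alone
sketch with invented names):

    #include <stdint.h>
    #include <stdbool.h>

    /* Where we want to stop on the deterministic re-run. */
    typedef struct ReplayPoint {
        uint64_t block_count;   /* basic blocks executed since the snapshot */
        uint64_t pc;            /* PC to stop at (unique within that block) */
    } ReplayPoint;

    static uint64_t blocks_executed;

    /* Called at the entry of every executed basic block during the re-run. */
    bool reached_target_block(const ReplayPoint *target)
    {
        return ++blocks_executed == target->block_count;
    }

    /* Once the block count matches, single-step until the PC matches too. */
    bool reached_target_pc(const ReplayPoint *target, uint64_t current_pc)
    {
        return current_pc == target->pc;
    }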

I'd be interested to know (a) if there is a sensible place for adding a basic 
block counter, and (b) if people like this route better or worse than an 
instruction counter?

Cheers

Mark.


On 19 May 2013, at 09:21, Peter Maydell wrote:

> On 19 May 2013 05:37, Rob Landley  wrote:
>> On 05/17/2013 12:23:51 PM, KONRAD Frédéric wrote:
>>> It appeared that the replay is not deterministic even with icount:
> 
>> You're aware that reverse execution means you have the "come from" problem,
>> right? (The opposite of goto.)
>> 
>> You literally _can't_ figure out your control flow by running the code
>> backwards. It's equivalent to solving the halting problem. The best you can
>> do is log and replay.
> 
> Yes, of course -- 'reverse execution' is just the usual phrase
> for the user-visible effect. However if your *forwards* replay
> isn't deterministic then it all goes pear-shaped, which is
> what Fred is complaining about.
> 
> -- PMM






Re: [Qemu-devel] [RFC] reverse execution.

2013-05-20 Thread Mark Burton


On 19 May 2013, at 23:20, Peter Maydell  wrote:

> On 19 May 2013 21:09, Mark Burton  wrote:
>>Note - what I understand by a basic block is something that ends in a
>> jump/branch of some description. Hence, one thing I think you can say about a
>> basic block is that each PC value within it is unique. Hence, if I know the
>> number of basic blocks executed, and the current PC, then I should be able to
>> re-run to there (assuming a deterministic system of course).
> 
> Assuming your rerun is strictly deterministic so you always exit
> the basic block the same way each time, then yes, this amounts
> to optimising the "instruction count" by doing it as "X basic
> blocks then Y instructions".

Yes, that's exactly what I'm saying.
But
I guess it's not quite so simple...

> You could actually have this
> really do the instruction count for you, if you track insns
> per block at translation time. (There is some fiddling necessary
> when we take an unexpected exception in the middle of a block
> due to a load/store fault.)


To be clear, we're talking about eg a memory exception that effectively causes 
an unexpected branch in the middle of a basic block.

The basic blocks used to execute the exception will be counted of course.

The issue is the instruction count of executed instructions as you enter the 
exception will be wrong.

However it will be consistent and deterministic, so long as the exception 
itself is deterministic... Which it should be.

The rule that the pc will be unique will be maintained.
Hence if we choose the 'count blocks' and then find the pc I believe we might 
'get away' with this case?
If we count the instructions themselves then we have some fixing to do, as you 
say.

> 
>> I'd be interested to know (a) if there is a sensible place for
>> adding a basic block counter, and (b) if people like this route
>> better or worse than an instruction counter?
> 
> I think you're probably best off getting the instruction counter
> working properly before trying to optimise it...
> 

I agree, 

The advantage of your approach is that the instruction counter may have other 
uses, and it's robust.

We can start, as you say, simply, and then use a basic block approach to 
optimise it if need be...

Still, if people see no use in an instruction count, and if people can point at 
a decent place to annotate basic blocks, then we can do without it

Cheers

Mark
> thanks
> -- PMM



Re: [Qemu-devel] [RFC] reverse execution.

2013-05-20 Thread Mark Burton


On 19 May 2013, at 23:39, Rob Landley  wrote:

> On 05/19/2013 03:09:14 PM, Mark Burton wrote:
>> Spot on Peter,
>> The (simplistic) plan is simply to take a snapshot at regular intervals,
>> when you want to step backwards, you return to a snapshot, and then re-run
>> forwards to 'just before you started'.
> 
> You'd have to snapshot all of memory because any of it could be used for 
> branching decisions. You'd have to

Yes. Qemu helps us there we believe.

> snapshot the state of I/O devices for the same reason.

Exactly.
> This includes serial ports and keyboards and your hardware random number 
> generator and the timer interrupt and disk interrupts, all of which you have 
> to log and replay the input from, and get the timing exactly right for the 
> interrupts they

Yes. We are not there yet, but:
A) icount seems to help make some of this more deterministic.
B) we can record the IO queues' activities; this has to be done for migration
too (as I understand it).

But - as I say, we're not there yet...

> generate. (It has to happen at the right spot because it's used to update the 
> random number pool, it can affect scheduling decisions...)
> 
> Good luck with that.
> 
> A horrid thing you might do is log the instruction pointer every time it 
> changes into a (giant) ring buffer. Possibly instrument tcg to write up that 
> register every time it has to actually know it (jumps and when interrupts 
> happen). (You don't have to know "advanced to next instruction". You do have 
> to know "advanced to something other than next instruction".) It'll be slow 
> and painful, but might be possible.
> 

Actually I don't believe this is enough - when the code accesses the device it
needs to get the right values; it's not good enough to force the processor to
a specific address...

But, maybe I misunderstand your idea?

Cheers

Mark

> Then again to make it work you'd have to log not just where you went but 
> where you came _from_ so you could see that you got there and it's time to 
> jump again (interrupts again, doesn't mean it's a normal departure point, 
> could be a signal). And the problem is do you record the target's idea of 
> "from" or the translated host idea of "from"...
> 
> Rob



Re: [Qemu-devel] [RFC 02/10] use a different translation block list for each cpu.

2015-01-29 Thread Mark Burton
I’ll let Fred answer the other points you make - which might help explain what 
we’re finding..
But - for this one…

The idea for now is to keep things simple and have a thread per CPU and a
'cache' per thread. (Later we can look at reducing the caches.)

What we mean by a 'cache' needs to be clearer. I don't think 'cache' is a
helpful word, but it's the one that's been used elsewhere… anyway. Actually,
what we (think we) mean is this first_tb list (in each page block). In other
words, as things stand, each CPU that is going to use this page is also going
to build its own translation block list. We end up with pages x CPU first_tb
lists…
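
In code terms it is just the difference between one list head per page and one per
page per CPU (simplified from the patch being discussed):

    #define MAX_CPUS 4

    typedef struct TranslationBlock TranslationBlock;

    /* today: one shared TB list per guest page */
    struct PageDescShared {
        TranslationBlock *first_tb;
    };

    /* this RFC: one TB list head per vCPU, i.e. pages x CPUs lists */
    struct PageDescPerCpu {
        TranslationBlock *first_tb[MAX_CPUS];
    };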

Does that make sense?

Cheers

Mark.







> On 29 Jan 2015, at 16:24, Peter Maydell  wrote:
> 
>> +TranslationBlock *first_tb[MAX_CPUS];
> 
> Do we really need to know this for every CPU, or just for
> the one that's using this PageDesc? I am assuming we're going to make
> the l1_map be per-CPU.






Re: [Qemu-devel] [RFC PATCH v8 04/21] replay: internal functions for replay log

2015-01-30 Thread Mark Burton
I believe that's what we concluded too.
Cheers
Mark.

> On 30 Jan 2015, at 14:06, Paolo Bonzini  wrote:
> 
> 
> 
> On 30/01/2015 13:56, Pavel Dovgaluk wrote:
>>> Could this be static? (I haven't checked).
>> 
>> No, because it is used from several replay files.
> 
> I wonder if that's a layering violation.
> 
>>> Perhaps qemu_system_vmstop_request_prepare +
>>> qemu_system_vmstop_request(RUN_STATE_PAUSED) instead of exit?  Those two
>>> functions are thread-safe.
>> 
>> There is no need to stop when replay file is over (because we cannot replay 
>> more).
>> Should we send shutdown request instead?
> 
> I thought about it.  I think no, because shutdown is irreversible (see
> runstate_needs_reset).  Just pausing seemed to be the right compromise,
> and then the next "cont" can run the VM out of replay mode.
> 
> Paolo






Re: [Qemu-devel] CPU TLB flush with multithread TCG.

2015-02-11 Thread Mark Burton

> On 11 Feb 2015, at 04:33, Alex Bennée  wrote:
> 
> 
> Frederic Konrad  writes:
> 
>> Hi everybody,
>> 
>> In multithread, tlb_flush is broken as CPU A can flush another CPU B while
>> CPU B can be
>> executing code, and fixing this can be quite hard:
>>   * We need to exit the CPU which is flushed.
>>   * Makes sure the CPU is stopped.
>>   * Then we can flush tlb.
>> The big issues are:
>>   * Two threads can be doing a flush at the same time.
>>   * Something can restart the CPU during the flush.
>> 
>> A better idea I think is that instead of flushing tlb we can put a flag 
>> in CPUState such
>> as flush_request and ask the cpu to exit.
>> Then later once the CPU is exited we can flush tlbs if flush_request is set.
>> It will ensure that the CPU won't execute code as it's associated thread 
>> will be
>> flushing.
>> 
>> Can this work?
> 
> Does this imply deferring the work? Surely if we don't flush when
> instructed things could break down very quickly?
> 

That's exactly the question.
Cheers
Mark.

> 
> 
>> 
>> Thanks,
>> Fred
> 
> -- 
> Alex Bennée






[Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton

TLB Flush:

We have spent a few days on this issue, and still haven’t resolved the best 
path.

Our solution seems to work, most of the time, but we still have some strange 
issues - so I want to check that what we are proposing has a chance of working.


Our plan is to allow all CPUs to continue. Potentially one CPU will want to
write to the TLBs. Subsequent to the write, it requests a TLB flush. We are
proposing to implement this by signalling all other CPUs to exit (and
requesting they flush before re-starting). In other words, this would happen
asynchronously.

This means there is a theoretical period of time when one CPU is writing to
the TLBs while other CPUs are executing. Our belief is that this has to be
handled by software anyway, and this should not be an issue from Qemu's point
of view.
The alternative would be to force all other CPUs to exit before writing the
TLBs - this is both expensive and very painful to organise (as we get into
horrid deadlocks whichever way we turn)…
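
A stand-alone sketch of the asynchronous scheme (placeholder names, not the QEMU
API): the CPU doing the TLB maintenance only sets a flag and kicks the others, and
each vCPU flushes its own TLB the next time it leaves the execution loop:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NR_CPUS 4

    typedef struct VCpu {
        atomic_bool tlb_flush_pending;
        /* ... registers, TLB, ... */
    } VCpu;

    static VCpu cpus[NR_CPUS];

    static void request_exit(VCpu *cpu) { (void)cpu; }     /* cpu_exit()-style kick */
    static void local_tlb_flush(VCpu *cpu) { (void)cpu; }

    /* Called by the CPU that executed e.g. a TLB-invalidate-all instruction. */
    void tlb_flush_all_async(VCpu *self)
    {
        for (int i = 0; i < NR_CPUS; i++) {
            if (&cpus[i] == self) {
                local_tlb_flush(self);                     /* own TLB: flush right away */
                continue;
            }
            atomic_store(&cpus[i].tlb_flush_pending, true);
            request_exit(&cpus[i]);                        /* ask it to leave the TB loop */
        }
    }

    /* Each vCPU thread runs this whenever it exits generated code. */
    void vcpu_handle_exit(VCpu *self)
    {
        if (atomic_exchange(&self->tlb_flush_pending, false)) {
            local_tlb_flush(self);
        }
    }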

We’d appreciate some thoughts on this...

Cheers

Mark.





[Qemu-devel] TCG Multithreading performance improvement

2014-11-26 Thread Mark Burton


Hi all,

We are now actively going to pursue TCG Multithreading to improve the 
performance of the TCG for Qemu models that include multiple cores.

We have set up a wiki page to track the project 
http://wiki.qemu.org/Features/tcg-multithread 


At this point, I would like to invite everybody to email us ideas about how the 
project should progress, and ideas that might be useful (e.g. people who have 
tried this before, source code that might be helpful, what order we should 
attack things in etc)…

So - PLEASE let us know if you have interest in this topic, or information that 
might help

Cheers

Mark.







Re: [Qemu-devel] TCG Multithreading performance improvement

2014-11-26 Thread Mark Burton

> On 26 Nov 2014, at 20:12, Lluís Vilanova  wrote:
> 
> Mark Burton writes:
> 
>> Hi all,
>> We are now actively going to pursue TCG Multithreading to improve the
>> performance of the TCG for Qemu models that include multiple cores.
> 
>> We have set up a wiki page to track the project 
>> http://wiki.qemu.org/Features/tcg-multithread
> 
>> At this point, I would like to invite everybody to email us ideas about how 
>> the
>> project should progress, and ideas that might be useful (e.g. people who have
>> tried this before, source code that might be helpful, what order we should
>> attack things in etc)…
> 
>> So - PLEASE let us know if you have interest in this topic, or information 
>> that
>> might help
> 
> This is great news! I've added a new subsection on the wiki to hold some
> problems that should be kept in mind while designing a solution.
> 
> For now, I've added memory consistency and instruction atomicity issues in it.

Thanks! - Those are the two big issues, I think :-)))


Cheers

Mark.

> 
> 
> Best,
>  Lluis
> 
> -- 
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
> Tollbooth






[Qemu-devel] Update on TCG Multithreading

2014-12-01 Thread Mark Burton

All - first a huge thanks for those who have contributed, and those who have 
expressed an interest in helping out.

One issue I’d like to see more opinions on is the question of a cache per core, 
or a shared cache.
I have heard anecdotal evidence that a shared cache gives a major performance 
benefit….
Does anybody have anything more concrete?
(of course we will get numbers in the end if we implement the hybrid scheme as 
suggested in the wiki - but I’d still appreciate any feedback).

Our next plan is to start putting an implementation plan together. Probably 
quite sketchy at this point, and we hope to start coding shortly.


Cheers

Mark.







[Qemu-devel] MultiThread TCG mail list

2014-12-03 Thread Mark Burton

All - to make things easier to track, there is now a mail list specifically for 
MultiThread development issues
mt...@listserver.greensocs.com 
You can subscribe etc here:
http://listserver.greensocs.com/wws/info/mttcg 


If you send to this mail list, please make sure to copy qemu-devel as well.


Cheers

Mark.





Re: [Qemu-devel] KVM call for agenda for 2014-12-08

2014-12-03 Thread Mark Burton
Hi Juan, is this for the 9th, or did I get the day wrong?

Anyway - I would like to talk about Multi-core - a huge thank you to everybody 
for your feedback, we’ll be starting work on this, and I’d like to bring a 
proposal in terms of the path we’ll take and get consensus on the first steps.

Cheers
Mark.

> On 2 Dec 2014, at 20:56, Juan Quintela  wrote:
> 
> 
> Hi
> 
> Please, send any topic that you are interested in covering.
> 
> Thanks, Juan.
> 
> By popular demand, a google calendar public entry with it
> 
> https://www.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
> 
> (Let me know if you have any problems with the calendar entry)
> 
> If you need phone number details,  contact me privately
> 
> Thanks, Juan.
> 
> PD.  Use the google calendar entry to now the time, I gave up at getting
> three timezones right.
> 






[Qemu-devel] Atomic Instructions - comments please

2014-12-15 Thread Mark Burton
Comments please….

Choices for Atomic instructions:
The current approach (for ARM at least) for Ld and St exclusive inside Qemu
simply records the address that the exclusive load reads from and the value it
reads. When an exclusive store happens, it checks that the value and address
remain the same, otherwise it fails.

This just doesn’t match the architecturally defined functionality. For 
instance, a write will succeed if another thread intervenes with a write at the 
same address with the same (old) value.

However - for most people, most of the time, this semantic seems to work.

One proposal is 'simply' to add a mutex around this code, such that
multi-threaded TCG will correctly update/read these saved addresses/values.
This _should_ maintain the status quo. Things that were broken before will
remain broken; nothing new should break. The concern is that the fact that the
TCG was previously uni-threaded MAY be masking problems with this code that we
are not taking into account.
A second proposal is to somehow re-use the dirty bit mechanism.
However, the dirty bit approach seems to be too coarse-grained (per
page), and (I think) only covers writes, which seems like a limitation that is
not ideal…

A third proposal is to mark pages as IO when a ld/st ex is performed to them,
then to augment the memory API to indicate that a ld/st ex is in hand, allowing
the memory chain to decide if it should allow the write or cause a fault. This
would seem to be the closest approach to the real H/W. However it means marking
pages as IO (which could cause us issues if there is code on that page, or slow
things down) - and it means adding to the memory API.
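
As a minimal stand-alone sketch of the first (mutex) proposal, assuming the saved
address/value scheme stays exactly as it is today (placeholder names):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    static pthread_mutex_t exclusive_lock = PTHREAD_MUTEX_INITIALIZER;

    typedef struct ExclusiveState {
        uint64_t addr;      /* address the exclusive load read from */
        uint64_t val;       /* value it read */
    } ExclusiveState;

    /* load-exclusive: record what we read, under the lock */
    void helper_ldrex(ExclusiveState *env, uint64_t addr, uint64_t val)
    {
        pthread_mutex_lock(&exclusive_lock);
        env->addr = addr;
        env->val = val;
        pthread_mutex_unlock(&exclusive_lock);
    }

    /* store-exclusive: succeed only if address and value are unchanged;
     * the store itself happens under the same lock */
    bool helper_strex(ExclusiveState *env, uint64_t addr, uint64_t newval,
                      uint64_t *mem)
    {
        bool ok;
        pthread_mutex_lock(&exclusive_lock);
        ok = (env->addr == addr && env->val == *mem);
        if (ok) {
            *mem = newval;
        }
        pthread_mutex_unlock(&exclusive_lock);
        return ok;
    }

As noted above, this keeps today's semantics (a plain store can still break the
monitor unnoticed); it only makes the bookkeeping itself thread-safe.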

Cheers

Mark.







Re: [Qemu-devel] TCG multithread plan.

2014-12-15 Thread Mark Burton
(please note the address of the mttcg list server is
mt...@listserver.greensocs.com)

> On 9 Dec 2014, at 18:57, Lluís Vilanova  wrote:
> 
> Frederic Konrad writes:
> 
>> Hi everybody,
>> Here is the plan we will follow:
> 
>> We will be focusing - from the outset - on the end goal of multi-threaded 
>> TCG in full system emulation mode. On the way, we expect this will ‘fix’ 
>> user mode.
> 
>> The plan is:
> 
>> * Create one cache per CPU as a first step. We can do more next and share a 
>> cache.
>> * Update tb_* to add a pointer to their cache.
>> * Add atomic instruction support to the TGC (first on ARM).
>> * Make tb_invalidate work between all cache.
>> * Modify main-loop for multi-thread.
>> * Memory access (eg: for device) are not thread safe that need to be fixed. 
>> Initial plan simply globally mutex memory accesses - this may be optimised 
>> later.
>> * For now, irq handler for CPU seems ok but we need to check.
> 
>> We will discuss this during the call tomorrow.
> 
> Is there any plan to have the notion of "memory coherency domains"? (shortened
> as MCD in the wiki)
> 
> Even though it's not that useful now, it could be used in the future to relax
> the atomicity of operations when different devices operate on different MCDs 
> and
> TCG is not able to map that into an atomic host operation (aka, has to use
> locks). A system with a CPU and a GPU would be a good example of that.
> 

This would, I think, be a vote for doing atomicity via the 'io' path?

Cheers

Mark.

> 
> Thanks,
> Lluis
> 
> -- 
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
> Tollbooth




Re: [Qemu-devel] Atomic Instructions - comments please

2014-12-15 Thread Mark Burton
We will roll a patch for this approach shortly.
For the 'better' approach - I think it's something we will consider doing… but,
as you say, one thing at a time.
I don't think it will be too bad to implement, given what already exists in the
TLBs (except if we have to protect (for some architecture or other) against
non-atomic reads from an address marked atomic, I think). I think we can treat
this independently (unless we discover that it won't work without it :-) )


Cheers

Mark.

> On 15 Dec 2014, at 14:28, Peter Maydell  wrote:
> 
> On 15 December 2014 at 12:56, Mark Burton  wrote:
>> One proposal is ‘simply’ to add a mutex around this code, such
>> that multi-threaded TCG will correctly update/read these saved
>> address/values.
>> This _should_ maintain the status-quo. Things that were broken
>> before will remain broken, nothing new should break. The concern
>> is that the fact that the TCG was previously uni-threaded MAY be
>> masking problems with this code that we are not taking into account.
> 
> Personally I would start out with this approach. We're going to
> need a "do this whole sequence atomically wrt other guest CPUs"
> mechanism anyway, so it's not implementing something we wouldn't
> otherwise need. And it's the simple thing to do. It's certainly
> possible to do a more architecturally correct ld/st exclusive
> implementation along the lines of how we manage TB invalidation
> with the dirty bitmap, but if we can do without it then we
> should try to keep the scope of this project constrained: it's
> a big enough job as it is.
> 
> -- PMM






Re: [Qemu-devel] Atomic Instructions - comments please

2014-12-15 Thread Mark Burton
(not address of mttcg list server)

> On 15 Dec 2014, at 14:32, Paolo Bonzini  wrote:
> 
> 
> 
> On 15/12/2014 14:28, Peter Maydell wrote:
>> Personally I would start out with this approach. We're going to
>> need a "do this whole sequence atomically wrt other guest CPUs"
>> mechanism anyway, so it's not implementing something we wouldn't
>> otherwise need. And it's the simple thing to do. It's certainly
>> possible to do a more architecturally correct ld/st exclusive
>> implementation along the lines of how we manage TB invalidation
>> with the dirty bitmap, but if we can do without it then we
>> should try to keep the scope of this project constrained: it's
>> a big enough job as it is.
> 
> How would "add a mutex" work unless you add a mutex or CAS to each and
> every qemu_st operation?

it’ll do what it does now…. e.g. it won’t ‘work’ the way the architecture is 
defined…. but it will work as well as it does now….

Cheers

Mark

> 
> Paolo






Re: [Qemu-devel] Atomic Instructions - comments please

2014-12-15 Thread Mark Burton

> On 15 Dec 2014, at 14:39, Peter Maydell  wrote:
> 
> [I'm getting bounces from mt...@greensocs.com so have taken
> them off cc:
> 550 5.1.1 : Recipient address rejected: User
> unknown in virtual mailbox table]
> 

the address should be: mt...@listserver.greensocs.com
Not sure who introduced the other address..

Cheers
Mark.



> On 15 December 2014 at 13:32, Paolo Bonzini  wrote:
>> 
>> 
>> On 15/12/2014 14:28, Peter Maydell wrote:
>>> Personally I would start out with this approach. We're going to
>>> need a "do this whole sequence atomically wrt other guest CPUs"
>>> mechanism anyway, so it's not implementing something we wouldn't
>>> otherwise need. And it's the simple thing to do. It's certainly
>>> possible to do a more architecturally correct ld/st exclusive
>>> implementation along the lines of how we manage TB invalidation
>>> with the dirty bitmap, but if we can do without it then we
>>> should try to keep the scope of this project constrained: it's
>>> a big enough job as it is.
>> 
>> How would "add a mutex" work unless you add a mutex or CAS to each and
>> every qemu_st operation?
> 
> Same way our current approach works -- we simply don't implement
> "stores interrupt ll/sc operations": only a store-conditional
> can break a load-locked's lock. In practice this works ok
> because the stereotypical sequences that guests use don't rely
> on this part of the spec.
> 
> -- PMM






Re: [Qemu-devel] [RFC PATCH] target-arm: protect cpu_exclusive_*.

2014-12-17 Thread Mark Burton

> On 17 Dec 2014, at 11:28, Alexander Graf  wrote:
> 
> 
> 
> On 17.12.14 11:27, Frederic Konrad wrote:
>> On 16/12/2014 17:37, Peter Maydell wrote:
>>> On 16 December 2014 at 09:13,   wrote:
 From: KONRAD Frederic 
 
 This adds a lock to avoid multiple exclusive access at the same time
 in case of
 TCG multithread.
>> Hi Peter,
>> 
>>> This feels to me like it's not really possible to review on
>>> its own, since you can't see how it fits into the design of
>>> the rest of the multithreading support.
>> true the only thing we observe is that it didn't change anything right now.
>> 
>>> The other approach here rather than having a pile of mutexes
>>> in the target-* code would be to have TCG IR support for
>>> "begin critical section"/"end critical section". Then you
>>> could have the main loop ensure that no other CPU is running
>>> at the same time as the critical-section code. (linux-user
>>> already has an ad-hoc implementation of this for the
>>> exclusives.)
>>> 
>>> -- PMM
>>> 
>> What do you mean by TCG IR?
> 
> TCP ops. The nice thing is that TCG could translate those into
> transactions if the host supports them as well.
> 

How’s that different in reality from what we have now?
Cheers
Mark.

> 
> Alex






Re: [Qemu-devel] [RFC PATCH] target-arm: protect cpu_exclusive_*.

2014-12-17 Thread Mark Burton

> On 17 Dec 2014, at 11:45, Alexander Graf  wrote:
> 
> 
> 
> On 17.12.14 11:31, Mark Burton wrote:
>> 
>>> On 17 Dec 2014, at 11:28, Alexander Graf  wrote:
>>> 
>>> 
>>> 
>>> On 17.12.14 11:27, Frederic Konrad wrote:
>>>> On 16/12/2014 17:37, Peter Maydell wrote:
>>>>> On 16 December 2014 at 09:13,   wrote:
>>>>>> From: KONRAD Frederic 
>>>>>> 
>>>>>> This adds a lock to avoid multiple exclusive access at the same time
>>>>>> in case of
>>>>>> TCG multithread.
>>>> Hi Peter,
>>>> 
>>>>> This feels to me like it's not really possible to review on
>>>>> its own, since you can't see how it fits into the design of
>>>>> the rest of the multithreading support.
>>>> true the only thing we observe is that it didn't change anything right now.
>>>> 
>>>>> The other approach here rather than having a pile of mutexes
>>>>> in the target-* code would be to have TCG IR support for
>>>>> "begin critical section"/"end critical section". Then you
>>>>> could have the main loop ensure that no other CPU is running
>>>>> at the same time as the critical-section code. (linux-user
>>>>> already has an ad-hoc implementation of this for the
>>>>> exclusives.)
>>>>> 
>>>>> -- PMM
>>>>> 
>>>> What do you mean by TCG IR?
>>> 
>>> TCP ops. The nice thing is that TCG could translate those into
>>> transactions if the host supports them as well.
>>> 
>> 
>> Hows that different in reality from what we have now?
>> Cheers
>> Mark.
> 
> The current code can't optimize things in TCG. There's a good chance
> your TCG host implementation can have an optimization pass that creates
> host cmpxchg instructions or maybe even transaction blocks out of the
> critical sections.
> 
> 


Ok - I get it - I see the value - so long as it’s possible to do. It would 
solve a lot of problems...

We were not (yet) trying to fix that, we were simply asking the question: if we 
add these mutexes - do we have any detrimental impact on anything?
Seems like the answer is that adding the mutexes is fine - it doesn’t seem to 
have a performance impact or anything. Good.

But - I see what you mean - if we implemented this as an op, then it would be 
much simpler to optimise/fix properly afterwards - and - that “fix” might not 
even need to deal with the whole memory chain issue - maybe….. 

Cheers

Mark.









Re: [Qemu-devel] [RFC PATCH] target-arm: protect cpu_exclusive_*.

2014-12-17 Thread Mark Burton
Actually - I don’t see any other option.
Playing with the ideas - it seems to me that if we were to implement ‘generic’ 
lock/unlock instructions which could then somehow be ‘combined’ with 
loads/stores, then we would be relying on an optimisation step to ‘notice’ that 
this could be combined into e.g. a store EX on ARM, or whatever. That strikes 
me as risky.

But then - if we add load/store exclusive type operations - that’s great for 
e.g. ARM on X86, but does it really cover other architectures well?

I am worried that if we go this path, we will soon end up with a lot of 
architecturally specific TCG ops….

Cheers

Mark.

> On 17 Dec 2014, at 12:25, Paolo Bonzini  wrote:
> 
> 
> 
> On 17/12/2014 12:18, Alexander Graf wrote:
>> 
>> So I think the best way to go forward would be to add transaction_start
>> and transaction_end opcodes to TCG and implement them as mutex locks
>> today. When you get the chance to get yourself a machine that supports
>> actual TM, try to replace them with transaction start/end blocks and
>> have the normal mutex code as fallback if the transaction fails.
> 
> Or implement load_locked/store_conditional TCG ops.  They can be
> implemented as transactions, hardware ll/sc, or something slow that uses
> the MMU.
> 
> Paolo






Re: [Qemu-devel] [RFC PATCH] target-arm: protect cpu_exclusive_*.

2014-12-17 Thread Mark Burton
Sorry - I should have replied to this, Peter.
I agree with you - I don’t know how much overlap we’ll find with different 
architectures.
But if we stick to the more generic ‘lock/unlock’, I don’t see how this is going 
to help us output thread-safe code without going through a mutex - at which 
point we are back to square 1.

Cheers
Mark.

> On 17 Dec 2014, at 12:36, Peter Maydell  wrote:
> 
> On 17 December 2014 at 11:25, Paolo Bonzini  wrote:
>> 
>> 
>> On 17/12/2014 12:18, Alexander Graf wrote:
>>> 
>>> So I think the best way to go forward would be to add transaction_start
>>> and transaction_end opcodes to TCG and implement them as mutex locks
>>> today. When you get the chance to get yourself a machine that supports
>>> actual TM, try to replace them with transaction start/end blocks and
>>> have the normal mutex code as fallback if the transaction fails.
>> 
>> Or implement load_locked/store_conditional TCG ops.  They can be
>> implemented as transactions, hardware ll/sc, or something slow that uses
>> the MMU.
> 
> You'd need to compare the semantics of ll/sc across the various
> architectures to find out if there was anything you could
> actually meaningfully define as useful TCG op semantics there.
> 
> -- PMM






Re: [Qemu-devel] [RFC PATCH] target-arm: protect cpu_exclusive_*.

2014-12-17 Thread Mark Burton

> On 17 Dec 2014, at 17:27, Peter Maydell  wrote:
> 
> On 17 December 2014 at 16:17, Mark Burton  wrote:
>> Sorry - I should have replied to this Peter
>> I agree with you - I dont know how much overlap we’ll find with different 
>> architectures.
>> But if we stick to the more generic ‘lock/unlock’, I dont see how this is 
>> going to help us output thread safe code without going thought a mutex - at 
>> which point we are back to square 1.
> 
> I think a mutex is fine, personally -- I just don't want
> to see fifteen hand-hacked mutexes in the target-* code.
> 

Which would seem to favour the helper function approach?
Or am I missing something?
If we can’t arrange for the target code to optimise the mutex away and use 
host-native instructions, then I don’t really see the benefit of complicating 
the IR and the target code?

Cheers

Mark.

> -- PMM






Re: [Qemu-devel] [RFC PATCH] target-arm: protect cpu_exclusive_*.

2014-12-18 Thread Mark Burton

> On 17 Dec 2014, at 17:39, Peter Maydell  wrote:
> 
> On 17 December 2014 at 16:29, Mark Burton  wrote:
>>> On 17 Dec 2014, at 17:27, Peter Maydell  wrote:
>>> I think a mutex is fine, personally -- I just don't want
>>> to see fifteen hand-hacked mutexes in the target-* code.
>>> 
>> 
>> Which would seem to favour the helper function approach?
>> Or am I missing something?
> 
> You need at least some support from QEMU core -- consider
> what happens with this patch if the ldrex takes a data
> abort, for instance.
> 
> And if you need the "stop all other CPUs while I do this”

It looks like a corner case, but working this through - the ’simple’ 
put-a-mutex-around-the-atomic-instructions approach would indeed need to ensure 
that no other core was doing anything - that just happens to be true for qemu 
today (or - we would have to put a mutex around all writes) - in order to ensure 
the case where a store exclusive could potentially fail if a non-atomic 
instruction wrote (a different value) to the same address. This is currently 
guaranteed by the implementation in Qemu - how useful it is I don’t know, but if 
we break it, we run the risk that something will fail (at the least, we could 
not claim to have kept things the same).

This also has implications for the idea of adding TCG ops I think...
The ideal scenario is that we could ‘fall back’ on the same semantics that are 
there today - allowing specific target/host combinations to be optimised (and 
to improve their functionality). 
But that means, from within the TCG op, we would need to have a mechanism to 
cause the other TCG vCPUs to take an exit…. etc etc… In the end, I’m sure it’s 
possible, but it feels so awkward.

To re-cap where we are (for my own benefit if nobody else):
We have several propositions in terms of implementing Atomic instructions

1/ We wrap the atomic instructions in a mutex using helper functions (this is 
the approach others have taken, it’s simple, but it is not clean, as stated 
above).

1.5/ We add a mechanism to ensure that when the mutex is taken, all other cores 
are ‘stopped’.

2/ We add some TCG ops to effectively do the same thing, but this would give us 
the benefit of being able to provide better implementations. This is 
attractive, but we would end up needing ops to cover at least exclusive 
load/store and atomic compare exchange. To me this looks less than elegant 
(being pulled close to the target, rather than being able to generalise), but 
it’s not clear how we would implement the operations as we would like, with a 
machine instruction, unless we did split them out along these lines. This 
approach also (probably) requires the 1.5 mechanism above.

3/ We have discussed a ‘h/w’ approach to the problem. In this case, all atomic 
instructions are forced to take the slow path - and a additional flags are 
added to the memory API. We then deal with the issue closer to the memory where 
we can record who has a lock on a memory address. For this to work - we would 
also either
a) need to add a mprotect type approach to ensure no ‘non atomic’ writes occur 
- or
b) need to force all cores to mark the page with the exclusive memory as IO or 
similar to ensure that all write accesses followed the slow path.

4/ There is an option to implement exclusive operations within the TCG using 
mprotect (and signal handlers). I have some concerns on this: would we have to 
have support for each host O/S…. I also think we might end up with a lot of 
protected regions causing a lot of SIGSEGVs because an errant guest doesn’t 
behave well - basically we will need to see the impact on performance - finally 
- this will be really painful to deal with for cases where the exclusive memory 
is held in what Qemu considers IO space !!!
In other words - putting the mprotect inside TCG looks to me like it’s 
mutually exclusive to supporting a memory-based scheme like (3).


My personal preference is for 3b) - it is “safe”; it’s where the hardware is.
3a is an optimisation of that.
To me, (2) is an optimisation again. We are effectively saying, if you are able 
to do this directly, then you don’t need to pass via the slow path. Otherwise, 
you always have the option of reverting to the slow path.

Frankly - 1 and 1.5 are hacks - they are not optimisations, they are just dirty 
hacks. However - their saving grace is that they are hacks that exist and 
“work”. I dislike patching the hack, but it did seem to offer the fastest 
solution to get around this problem - at least for now. I am no longer 
convinced.

4/ is something I’d like other people’s views on too… Is it a better approach? 
What about the slow path?

I increasingly begin to feel that we should really approach this from the other 
end, and provide a ‘correct’ solution using the memory - then worry about 
making that faster…

Cheers

Mark.








> semantics linux-user currently uses t

Re: [Qemu-devel] [RFC PATCH] target-arm: protect cpu_exclusive_*.

2014-12-18 Thread Mark Burton

> On 18 Dec 2014, at 13:24, Alexander Graf  wrote:
> 
> 
> 
> On 18.12.14 10:12, Mark Burton wrote:
>> 
>>> On 17 Dec 2014, at 17:39, Peter Maydell  wrote:
>>> 
>>> On 17 December 2014 at 16:29, Mark Burton  wrote:
>>>>> On 17 Dec 2014, at 17:27, Peter Maydell  wrote:
>>>>> I think a mutex is fine, personally -- I just don't want
>>>>> to see fifteen hand-hacked mutexes in the target-* code.
>>>>> 
>>>> 
>>>> Which would seem to favour the helper function approach?
>>>> Or am I missing something?
>>> 
>>> You need at least some support from QEMU core -- consider
>>> what happens with this patch if the ldrex takes a data
>>> abort, for instance.
>>> 
>>> And if you need the "stop all other CPUs while I do this”
>> 
>> It looks like a corner case, but working this through - the ’simple’ put a 
>> mutex around the atomic instructions approach would indeed need to ensure 
>> that no other core was doing anything - that just happens to be true for 
>> qemu today (or - we would have to put a mutex around all writes); in order 
>> to ensure the case where a store exclusive could potential fail if a 
>> non-atomic instruction wrote (a different value) to the same address. This 
>> is currently guarantee by the implementation in Qemu - how useful it is I 
>> dont know, but if we break it, we run the risk that something will fail (at 
>> the least, we could not claim to have kept things the same).
>> 
>> This also has implications for the idea of adding TCG ops I think...
>> The ideal scenario is that we could ‘fallback’ on the same semantics that 
>> are there today - allowing specific target/host combinations to be optimised 
>> (and to improve their functionality). 
>> But that means, from within the TCG Op, we would need to have a mechanism, 
>> to cause other TCG’s to take an exit…. etc etc… In the end, I’m sure it’s 
>> possible, but it feels so awkward.
> 
> That's the nice thing about transactions - they guarantee that no other
> CPU accesses the same cache line at the same time. So you're safe
> against other vcpus even without blocking them manually.
> 
> For the non-transactional implementation we probably would need an "IPI
> others and halt them until we're done with the critical section"
> approach. But I really wouldn't concentrate on making things fast on old
> CPUs.
> 
> Also keep in mind that for the UP case we can always omit all the magic
> - we only need to detect when we move into an SMP case (linux-user clone
> or -smp on system).
> 
>> 
>> To re-cap where we are (for my own benefit if nobody else):
>> We have several propositions in terms of implementing Atomic instructions
>> 
>> 1/ We wrap the atomic instructions in a mutex using helper functions (this 
>> is the approach others have taken, it’s simple, but it is not clean, as 
>> stated above).
> 
> This is horrible. Imagine you have this split approach with a load
> exclusive and then store whereas the load starts mutex usage and the
> store stop is. At that point if the store creates a segfault you'll be
> left with a dangling mutex.

Sorry - probably a misunderstanding here - the plan for (1) was to simply wrap 
the individual instructions - where today they set/check the address/value pair 
used to determine if a store exclusive can succeed. Adding the mutex gives you 
protection against other atomic operations on the same memory but not against 
other non-atomic instructions - which is a bad thing.

Cheers
Mark.

> 
> This stuff really belongs into the TCG core.
> 
>> 
>> 1.5/ We add a mechanism to ensure that when the mutex is taken, all other 
>> cores are ‘stopped’.
>> 
>> 2/ We add some TCG ops to effectively do the same thing, but this would give 
>> us the benefit of being able to provide better implementations. This is 
>> attractive, but we would end up needing ops to cover at least exclusive 
>> load/store and atomic compare exchange. To me this looks less than elegant 
>> (being pulled close to the target, rather than being able to generalise), 
>> but it’s not clear how we would implement the operations as we would like, 
>> with a machine instruction, unless we did split them out along these lines. 
>> This approach also (probably) requires the 1.5 mechanism above.
> 
> I'm still in favor of just forcing the semantics of transactions onto
> this. If the host doesn't implement transactions, tough luck - do the
> "halt all others" IPI.
> 
>> 
>> 3/ 

Re: [Qemu-devel] [RFC PATCH] target-arm: protect cpu_exclusive_*.

2014-12-18 Thread Mark Burton


> On 18/12/2014 13:24, Alexander Graf wrote:
>> That's the nice thing about transactions - they guarantee that no other
>> CPU accesses the same cache line at the same time. So you're safe
>> against other vcpus even without blocking them manually.
>> 
>> For the non-transactional implementation we probably would need an "IPI
>> others and halt them until we're done with the critical section"
>> approach. But I really wouldn't concentrate on making things fast on old
>> CPUs.
> 
> The non-transactional implementation can use softmmu to trap access to
> the page from other VCPUs.  This makes it possible to implement (at the
> cost of speed) the same semantics on all hosts.
> 
> Paolo

I believe what you’re describing - using transactional memory, or using softmmu - 
amounts to either option 3 below or option 4.
Relying on it totally was option 4. 

Seems to me, the problem with that option is that support for some hosts will 
be a pain, and covering everything will take some time :-(

Option 3 suggests that we build a ‘slow path’ mechanism first - make sure that 
works (as a backup), and then add optimisations for specific hosts/guests 
afterwards. To me that still seems preferable?

Cheers

Mark.




> On 18 Dec 2014, at 13:24, Alexander Graf  wrote:
> 
> 
> 
> On 18.12.14 10:12, Mark Burton wrote:
>> 
>>> On 17 Dec 2014, at 17:39, Peter Maydell  wrote:
>>> 
>>> On 17 December 2014 at 16:29, Mark Burton  wrote:
>>>>> On 17 Dec 2014, at 17:27, Peter Maydell  wrote:
>>>>> I think a mutex is fine, personally -- I just don't want
>>>>> to see fifteen hand-hacked mutexes in the target-* code.
>>>>> 
>>>> 
>>>> Which would seem to favour the helper function approach?
>>>> Or am I missing something?
>>> 
>>> You need at least some support from QEMU core -- consider
>>> what happens with this patch if the ldrex takes a data
>>> abort, for instance.
>>> 
>>> And if you need the "stop all other CPUs while I do this”
>> 
>> It looks like a corner case, but working this through - the ’simple’ put a 
>> mutex around the atomic instructions approach would indeed need to ensure 
>> that no other core was doing anything - that just happens to be true for 
>> qemu today (or - we would have to put a mutex around all writes); in order 
>> to ensure the case where a store exclusive could potential fail if a 
>> non-atomic instruction wrote (a different value) to the same address. This 
>> is currently guarantee by the implementation in Qemu - how useful it is I 
>> dont know, but if we break it, we run the risk that something will fail (at 
>> the least, we could not claim to have kept things the same).
>> 
>> This also has implications for the idea of adding TCG ops I think...
>> The ideal scenario is that we could ‘fallback’ on the same semantics that 
>> are there today - allowing specific target/host combinations to be optimised 
>> (and to improve their functionality). 
>> But that means, from within the TCG Op, we would need to have a mechanism, 
>> to cause other TCG’s to take an exit…. etc etc… In the end, I’m sure it’s 
>> possible, but it feels so awkward.
> 
> That's the nice thing about transactions - they guarantee that no other
> CPU accesses the same cache line at the same time. So you're safe
> against other vcpus even without blocking them manually.
> 
> For the non-transactional implementation we probably would need an "IPI
> others and halt them until we're done with the critical section"
> approach. But I really wouldn't concentrate on making things fast on old
> CPUs.
> 
> Also keep in mind that for the UP case we can always omit all the magic
> - we only need to detect when we move into an SMP case (linux-user clone
> or -smp on system).
> 
>> 
>> To re-cap where we are (for my own benefit if nobody else):
>> We have several propositions in terms of implementing Atomic instructions
>> 
>> 1/ We wrap the atomic instructions in a mutex using helper functions (this 
>> is the approach others have taken, it’s simple, but it is not clean, as 
>> stated above).
> 
> This is horrible. Imagine you have this split approach with a load
> exclusive and then store whereas the load starts mutex usage and the
> store stop is. At that point if the store creates a segfault you'll be
> left with a dangling mutex.
> 
> This stuff really belongs into the TCG core.
> 
>> 
>> 1.5/ We add a mechanism to ensure that when the mutex is taken, all other 
>> cores

Re: [Qemu-devel] [RFC PATCH] target-arm: protect cpu_exclusive_*.

2014-12-18 Thread Mark Burton

> On 18 Dec 2014, at 15:44, Alexander Graf  wrote:
> 
> 
> 
> On 18.12.14 15:20, Mark Burton wrote:
>> 
>> 
>>> On 18/12/2014 13:24, Alexander Graf wrote:
>>>> That's the nice thing about transactions - they guarantee that no other
>>>> CPU accesses the same cache line at the same time. So you're safe
>>>> against other vcpus even without blocking them manually.
>>>> 
>>>> For the non-transactional implementation we probably would need an "IPI
>>>> others and halt them until we're done with the critical section"
>>>> approach. But I really wouldn't concentrate on making things fast on old
>>>> CPUs.
>>> 
>>> The non-transactional implementation can use softmmu to trap access to
>>> the page from other VCPUs.  This makes it possible to implement (at the
>>> cost of speed) the same semantics on all hosts.
>>> 
>>> Paolo
>> 
>> I believe what your describing, using transactional memory, or using softmmu 
>> amounts to either option 3 below or option 4.
>> Relying on it totally was option 4. 
>> 
>> Seems to me, the problem with that option is that support for some hosts 
>> will be a pain, and covering everything will take some time :-(
>> 
>> Option 3 suggests that we build a ‘slow path’ mechanism first - make sure 
>> that works (as a backup), and then add optimisations for specific 
>> hosts/guests afterwards. To me that still seems preferable?
> 
> Yes, the only thing I'm in favor for here is to align the semantics with
> what transactional memory would give you. That way moving to the fast
> implementation will be easy and people would actually want to use
> multi-threaded TCG ;)

In other words — the back-end (slow path) memory interface should look 
‘transactional’…?
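
(Roughly along these lines, I suppose - an illustration only, using Intel RTM 
intrinsics with a plain mutex as the fallback; a real lock-elision scheme would 
also have to make the transaction abort whenever the fallback lock is already 
held:)

#include <immintrin.h>     /* _xbegin/_xend, needs -mrtm and RTM hardware */
#include <pthread.h>

static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Run an atomic guest sequence either as a hardware transaction or,
 * if the transaction cannot start/commit, under the fallback mutex. */
static void run_atomic_sequence(void (*body)(void *), void *opaque)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        body(opaque);                        /* transactional fast path */
        _xend();
    } else {
        pthread_mutex_lock(&fallback_lock);  /* slow path */
        body(opaque);
        pthread_mutex_unlock(&fallback_lock);
    }
}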

Cheers

Mark.




> 
> 
> Alex






Re: [Qemu-devel] [RFC PATCH] target-arm: protect cpu_exclusive_*.

2014-12-18 Thread Mark Burton
>> 
>> In other words — the back-end (slow path) memory interface should look 
>> ‘transactional’…?
> 
> Yeah, the semantics should be tied to what TM would give you. We can
> always be more safe than TM in our fallback implementation, but I
> wouldn't want to see semantic optimizations tied to the MMIO
> implementation put in.
> 
> This is mostly theory though, try to write the code and see where things
> fall apart, then we'll be in a much better position to rationalize on
> where to do things differently.
> 
> 

absolutely. 
Seems like this is a good way forward… thanks all for your input.

Cheers
Mark.




> Alex






Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton

> On 12 Feb 2015, at 16:01, Peter Maydell  wrote:
> 
> On 12 February 2015 at 14:45, Alexander Graf  wrote:
>> 
>>> On 12.02.2015, at 15:35, Mark Burton  wrote:
>>> We are proposing to implement this by signalling all other CPU’s
>>> to exit (and requesting they flush before re-starting). In other
>>> words, this would happen asynchronously.
>> 
>> For global flushes, give them a pointer payload along with the flush
>> request and tell all cpus to increment it atomically. In your main
>> thread, wait until *ptr == nKickedCpus.
> 
> I bet this will not be the only situation where you want to
> do an "get all other CPUs to do $something and wait til they
> have done so" kind of operation, so some lightweight but generic
> infrastructure for doing that would not be a bad plan. (Similarly
> "get all other CPUs to stop, then I can do $something and let
> the others continue”.)

We tried this - we ended up in knots.
We had 2 CPU’s trying to flush at about the same time, both waiting for the 
other.
We had CPU’s trying to get the global mutex to finish what they were doing, 
while being told to flush, 
We had CPU’s in the global mutex trying to do something that would cause a 
flush… etc
We had spaghetti with extra Bolognese sauce…

We eventually concluded, yes - in an infinite universe everything is possible, 
but if we could simply do this ‘asynchronously’ then our lives would be a LOT 
easier.
e.g.  - ask all CPU’s to “exit and do something” is easy -  wait for them to do 
that is a whole other problem…

Our question is - do we need this ‘sync’ (before the flush), or can we actually 
allow CPU’s to flush themselves asynchronously….
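
(For reference, the counter scheme Alexander describes above would look roughly 
like the sketch below - queue_flush_work() is a hypothetical per-vCPU work 
queue, not an existing QEMU call, and the spin would obviously be replaced by 
something smarter:)

/* Sketch: ask every other vCPU to flush, then wait until each one has
 * reported back by incrementing 'done'. */
static void request_global_flush_and_wait(CPUState *self)
{
    CPUState *cpu;
    int done = 0;
    int kicked = 0;

    CPU_FOREACH(cpu) {
        if (cpu != self) {
            /* hypothetical: the vCPU flushes its own TLB, then does an
             * atomic increment of 'done' before resuming execution */
            queue_flush_work(cpu, &done);
            cpu_exit(cpu);           /* make it leave the TB it is in */
            kicked++;
        }
    }
    while (atomic_read(&done) != kicked) {
        /* spin (or better, wait on a condition variable) */
    }
}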

Cheers

Mark.



> 
> -- PMM






Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton
OK - Alex - your implication is that it has to be atomic, we need the sync…

:-(

I have a horrid feeling that the atomicity of global flush can’t be causing the 
(almost, but not quite reproducible) errors we’re seeing - but… anyway ;-)

Cheers

Mark.

> On 12 Feb 2015, at 15:45, Alexander Graf  wrote:
> 
> 
>> On 12.02.2015, at 15:35, Mark Burton  wrote:
>> 
>> 
>> TLB Flush:
>> 
>> We have spent a few days on this issue, and still haven’t resolved the best 
>> path.
>> 
>> Our solution seems to work, most of the time, but we still have some strange 
>> issues - so I want to check that what we are proposing has a chance of 
>> working.
>> 
>> 
>> Our plan is to allow all CPU’s to continue. Potentially one CPU will want to 
>> write to the TLBs. Subsequent to the write, it requests a TLB Flush.
> 
> Local or global? For local TLB flushes you don't notify the other CPUs at 
> all. For global ones, the semantics of the call usually dictate atomicity.
> 
>> We are proposing to implement this by signalling all other CPU’s to exit 
>> (and requesting they flush before re-starting). In other words, this would 
>> happen asynchronously.
> 
> For global flushes, give them a pointer payload along with the flush request 
> and tell all cpus to increment it atomically. In your main thread, wait until 
> *ptr == nKickedCpus.
> 
> FWIW TLBs are always CPU local. When there's a "global TLB flush" 
> instruction, it pretty much does stall the CPU, notifies the others to also 
> flush their TLBs, waits and then continues.
> 
> If this really does become a performance bottleneck (which I doubt it does, 
> almost nobody except x86 does global flushes), you can also do some nasty 
> hacky tricks, such as (atomically) change the valid bit in remote CPUs TLB 
> entries. But really only do this as a last resort if the clean version 
> doesn't perform well.
> 
> 
> Alex
> 
>> This means - there is a theoretical period of time when one CPU is writing 
>> to the TLBs while other CPU’s are executing.  Our belief is that this has to 
>> be handled by software anyway, and this should not be an issue from Qemu’s 
>> point of view. 
>> The alternative would be to force all other CPU’s to exit before writing the 
>> TLB’s - this is both expensive and very painful to organise (as we get into 
>> horrid deadlocks whichever way we turn)…
>> 
>> We’d appreciate some thoughts on this...
>> 
>> Cheers
>> 
>> Mark.
>> 
>> 
>> 
>> 
>> 
> 






Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton

> On 12 Feb 2015, at 16:38, Alexander Graf  wrote:
> 
> 
> 
> On 12.02.15 15:58, Peter Maydell wrote:
>> On 12 February 2015 at 14:45, Alexander Graf  wrote:
>>> almost nobody except x86 does global flushes
>> 
>> All ARM TLB maintenance operations have both "this CPU only"
>> and "all TLBs in the Inner Shareable domain" [that's ARM-speak
>> for "every CPU core in the cluster"] variants (the latter
>> being the TLB *IS operations). Looking at Linux's
>> arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
>> most of the operations defined there use the IS variants.
> 
> Wow, did anyone benchmark this? I know that PPC switched away from
> global flushes and instead tracks the CPUs a task was running on to
> limit the scope of CPUs that need to flush.

Doesn’t that mean you have to signal a specific CPU to cause it to flush 
itself…. Isn’t that in itself expensive? Do you have to organise some sort of 
atomicity yourself around that too?

Cheers

Mark.



> 
> 
> Alex






Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton

> On 12 Feb 2015, at 16:31, Dr. David Alan Gilbert  wrote:
> 
> * Mark Burton (mark.bur...@greensocs.com) wrote:
>> 
>>> On 12 Feb 2015, at 16:01, Peter Maydell  wrote:
>>> 
>>> On 12 February 2015 at 14:45, Alexander Graf  wrote:
>>>> 
>>>>> On 12.02.2015, at 15:35, Mark Burton  wrote:
>>>>> We are proposing to implement this by signalling all other CPU’s
>>>>> to exit (and requesting they flush before re-starting). In other
>>>>> words, this would happen asynchronously.
>>>> 
>>>> For global flushes, give them a pointer payload along with the flush
>>>> request and tell all cpus to increment it atomically. In your main
>>>> thread, wait until *ptr == nKickedCpus.
>>> 
>>> I bet this will not be the only situation where you want to
>>> do an "get all other CPUs to do $something and wait til they
>>> have done so" kind of operation, so some lightweight but generic
>>> infrastructure for doing that would not be a bad plan. (Similarly
>>> "get all other CPUs to stop, then I can do $something and let
>>> the others continue”.)
>> 
>> We tried this - we ended up in knots.
>> We had 2 CPU’s trying to flush at about the same time, both waiting for 
>> the other.
>> We had CPU’s trying to get the global mutex to finish what they were 
>> doing, while being told to flush, 
>> We had CPU’s in the global mutex trying to do something that would cause a 
>> flush… etc
>> We had spaghetti with extra Bolognese sauce…
> 
> This is the hard problem of multithreaded emulation.
> You've always got to let CPUs get back to a point where you can
> invalidate a mapping/page quickly.
> 
> Thus you've also got to be very careful about where any CPU might
> get into a loop or take another lock that would stop another CPU
> causing an invalidate.  Either that or you need a way of somehow
> breaking locks or recovering from the situation.

Indeed - 
for now - we’re building something which will likely be less than ideal. Once 
we have some sort of evidence that it works, and (hopefully) works more reliably 
than the approach we have right now, then we’ll come up with a more elegant scheme.


> 
>> We eventually concluded, yes - in an infinite universe everything is 
>> possible, but if we could simply do this ‘asynchronously’ then our lives 
>> would be a LOT easier.
>> e.g.  - ask all CPU’s to “exit and do something” is easy -  wait for 
>> them to do that is a whole other problem…
> 
> Which is why you've got to bound how long it might take
> those CPUs to get back to you, and optimise out cases where
> it's not really needed later.
> 
>> Our question is - do we need this ‘sync’ (before the flush), or can we 
>> actually allow CPU’s to flush themselves asynchronously….
> 
> Always assume the worst.

:-)

Cheers
Mark.

> 
> Dave
> 
>> 
>> Cheers
>> 
>> Mark.
>> 
>> 
>> 
>>> 
>>> -- PMM
>> 
>> 
>> 
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK






Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton
Up top - thanks Peter, I think you may have given us an idea!

> On 12 Feb 2015, at 23:10, Lluís Vilanova  wrote:
> 
> Mark Burton writes:
> 
>>> On 12 Feb 2015, at 16:38, Alexander Graf  wrote:
>>> 
>>> 
>>> 
>>> On 12.02.15 15:58, Peter Maydell wrote:
>>>> On 12 February 2015 at 14:45, Alexander Graf  wrote:
>>>>> almost nobody except x86 does global flushes
>>>> 
>>>> All ARM TLB maintenance operations have both "this CPU only"
>>>> and "all TLBs in the Inner Shareable domain" [that's ARM-speak
>>>> for "every CPU core in the cluster"] variants (the latter
>>>> being the TLB *IS operations). Looking at Linux's
>>>> arch/arm64/mm/tlb.S and arch/arm64/include/asm/tlbflush.h
>>>> most of the operations defined there use the IS variants.
>>> 
>>> Wow, did anyone benchmark this? I know that PPC switched away from
>>> global flushes and instead tracks the CPUs a task was running on to
>>> limit the scope of CPUs that need to flush.
> 
>> Doesn’t that mean you have to signal a specific CPU to cause it to flush 
>> itself…. Isn’t that in itself expensive? Do you have to organise some sort 
>> of atomicity yourself around that too?
> 
> Yup. AFAIR, Linux in x86-64 queues a request to a per-CPU request list, and 
> uses
> IPIs to signal these types of operations to the target CPU:
> 
>  http://lxr.free-electrons.com/source/kernel/smp.c?v=2.6.32#L386
> 
> Waiting for completion is implemented on top by incrementing some counter from
> each CPU, and waiting for it to have the correct final value.

If the kernel is doing this - then effectively - for X86, each CPU only flushes 
its own TLB (from the perspective of Qemu) - correct?
(In which case, for Qemu itself - for x86 - we don’t need to implement a global 
flush, and hence we don’t need to build the mechanism to sync?)

If I understand correctly then - the processor that causes some pain is the ARM, 
which has (and uses) global flush, but the mitigating factor is that those 
flushes can be asynchronous so long as they complete before a memory barrier….

Cheers

Mark.


> 
> If something were implemented on these lines, it could be used as a generic
> cross-CPU event messaging infrastructure (plus some interrupt bit in the CPU
> structure that TCG would check to break away from guest code; I believe
> something similar is already being used - icount? -).
> 
> PS: To be honest, I still don't know which TLBs we're talking about here, and
>which cases trigger these TLB flush operations.
> 
> 
> Cheers,
>  Lluis
> 
> -- 
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
> Tollbooth






Re: [Qemu-devel] Help on TLB Flush

2015-02-12 Thread Mark Burton

> On 13 Feb 2015, at 08:24, Peter Maydell  wrote:
> 
> On 13 February 2015 at 07:16, Mark Burton  wrote:
>> If the kernel is doing this - then effectively - for X86, each CPU only
>> flush’s it’s own TLB (from the perspective of Qemu) - correct?
>> (in which case, for Qemu itself - for x86) - we dont need to implement
>> a global flush, and hence we dont need to build the mechanism to sync ?

> The semantics you need are "flush the QEMU TLB for CPU X" (where
> X may not be the CPU you're running on). This is what tlb_flush()
> does: it takes a CPU argument to act on. (Ditto tlb_flush_page, etc.)
> We then use that to implement the target's required semantics
> (eg in ARM the tlbiall_is_write() function is handled by iterating
> through all CPUs and calling tlb_flush on them).

What Lluis implied seemed to be that the kernel arranged to signal the CPU that 
would flush. Hence, (for X86), we would only ever flush our own TLB.

> 
> If you don't want the pain of checking the semantics of every
> backend and figuring out a new set of primitives to implement,
> then what you need to do is continue to provide the guarantees
> the current tlb_flush function does: when it returns then the
> CPU it's supposed to have acted on has definitely done so.
> 
> You can try and be cleverer if you want to, but personally
> I would recommend keeping the scope of your work simple
> where you can.

yes - though keeping it simple (silly) seems to have some complexities in this 
case, which is why we are trying to reduce the guarantees that tlb_flush() 
provides. 

At present - the ‘foreach cpu, tlb_flush()’ is effectively atomic, as no other 
CPU will be executing at the same time.
Adding multi-threading, we can already say this ‘atomicity’ isn’t strictly 
required. As you say, the only thing tlb_flush needs to guarantee is that the 
CPU concerned has flushed. 
- that already helps. And I agree with you that this is the right place to take 
tlb_flush().

Of course, when only the current CPU is flushed things are much simpler (and 
already handled)...
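
(For reference, the upstream shape Peter describes is roughly this, from 
target-arm/helper.c - the MTTCG question is simply what the other vCPUs are 
doing while this loop runs:)

static void tlbiall_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
                             uint64_t value)
{
    CPUState *other_cs;

    CPU_FOREACH(other_cs) {
        tlb_flush(other_cs, 1);     /* flush the QEMU TLB of every vCPU */
    }
}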


For our immediate concern, in the interests of getting the thing working and 
making sure we’ve turned over all the stones, on ARM - it MAY help us to check 
that the flush has happened ‘in the next memory barrier’….
	- I don’t know if that will help us or not, and - even if it does, I 
agree with you, it would be messier than it needs to be.
However, in the interests of making sure that there are no other issues - we 
may ‘hack’ something before we put in place a more elegant solution…. 
(right now, we have some mutex issues, shifting the sync to the barrier MAY 
help us avoid that…. To Be Seen…. and anyway - it would only be a temporary 
fix).

Cheers

Mark.



> 
> -- PMM






Re: [Qemu-devel] Help on TLB Flush

2015-02-13 Thread Mark Burton
the memory barrier is on the cpu requesting the flush isn’t it (not on the CPU 
that is being flushed)?
Cheers
Mark.

> On 13 Feb 2015, at 10:34, Paolo Bonzini  wrote:
> 
> 
> 
> On 12/02/2015 22:57, Peter Maydell wrote:
>> The only
>> requirement is that if the CPU that did the TLB maintenance
>> op executes a DMB (barrier) then the TLB op must finish
>> before the barrier completes execution. So you could split
>> the "kick off TLB invalidate" and "make sure all CPUs
>> are done" phases if you wanted. [cf v8 ARM ARM rev A.e
>> section D4.7.2 and in particular the subsection on
>> "ordering and completion".]
> 
> You can just make DMB start a new translation block.  Then when the TLB
> flush helpers call cpu_exit() or cpu_interrupt() the flush request is
> serviced.
> 
> Paolo






Re: [Qemu-devel] Help on TLB Flush

2015-02-13 Thread Mark Burton
Agreed
Cheers
Mark.

> On 13 Feb 2015, at 14:30, Lluís Vilanova  wrote:
> 
> Mark Burton writes:
> 
>>> On 13 Feb 2015, at 08:24, Peter Maydell  wrote:
>>> 
>>> On 13 February 2015 at 07:16, Mark Burton  wrote:
>>>> If the kernel is doing this - then effectively - for X86, each CPU only
>>>> flush’s it’s own TLB (from the perspective of Qemu) - correct?
>>>> (in which case, for Qemu itself - for x86) - we dont need to implement
>>>> a global flush, and hence we dont need to build the mechanism to sync ?
> 
>>> The semantics you need are "flush the QEMU TLB for CPU X" (where
>>> X may not be the CPU you're running on). This is what tlb_flush()
>>> does: it takes a CPU argument to act on. (Ditto tlb_flush_page, etc.)
>>> We then use that to implement the target's required semantics
>>> (eg in ARM the tlbiall_is_write() function is handled by iterating
>>> through all CPUs and calling tlb_flush on them).
> 
>> What Lluis implied seemed to be that the kernel arranged to signal the CPU 
>> that would flush. Hence, (for X86), we would only ever flush our own TLB.
> 
> That's correct.
> 
> [...]
>> For our immediate concern, in the interests of getting the thing working and
>> making sure we’ve turned over all the stones, on ARM - it MAY help us to 
>> check
>> that the flush has happened ‘in the next memory barrier’….
>>  - I dont know if that will help us or not, and - even if it does, I 
>> agree with you, it would be more messy than it need be.
>> However, in the interests of making sure that there are no other issues - we 
>> may ‘hack’ something before we put in place a more elegant solution…. 
>> (right now, we have some mutex issues, shifting the sync to the barrier MAY 
>> help us avoid that…. To Be Seen…. and anyway - it would only be a temporary 
>> fix).
> 
> But you shouldn't assume that everyone either uses x86's semantics (aka, each
> CPU gets an IPI), or the ARM semantics you described where the global TLB 
> flush
> instruction has asynchronous effects. First, in ARM you still have to ensure
> other CPUs did what you asked them to (whenever the arch manual says you must 
> do
> so). Second, it seems like ARM does not always behave in the way you 
> described:
> 
>  http://lxr.free-electrons.com/source/arch/arm/kernel/smp.c?v=2.6.32#L630
> 
> Granted, this is just the same behaviour as x86, but noone guarantees you that
> some other operation in any of the multiple architectures supported by QEMU 
> will
> never need a synchronous instruction with global effects.
> 
> I understand the pressure of getting something running and work from that, 
> but I
> think that having a framework for asynchronous cross-CPU messaging would be
> rather useful in the future. That can be then complemented with a mechanism to
> wait for these asynchronous messages. You can achieve any desired behaviour by
> composing these two.
> 
> 
> Cheers,
>  Lluis
> 
> -- 
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
> Tollbooth






Re: [Qemu-devel] [RFC 01/10] target-arm: protect cpu_exclusive_*.

2015-02-26 Thread Mark Burton

> On 26 Feb 2015, at 23:56, Peter Maydell  wrote:
> 
> On 27 February 2015 at 03:09, Frederic Konrad  
> wrote:
>> On 29/01/2015 16:17, Peter Maydell wrote:
>>> 
>>> On 16 January 2015 at 17:19,   wrote:
 
 From: KONRAD Frederic 
 
 This adds a lock to avoid multiple exclusive access at the same time in
 case of
 TCG multithread.
> 
>>> All the same comments I had on this patch earlier still apply:
>>> 
>>>  * I think adding mutex handling code to all the target-*
>>>frontends rather than providing facilities in common
>>>code for them to use is the wrong approach
>>>  * You will fail to unlock the mutex if the ldrex or strex
>>>takes a data abort
>>>  * This is making no attempt to learn from or unify with
>>>the existing attempts at handling exclusives in linux-user.
>>>When we've done this work we should have a single
>>>mechanism for handling exclusives in a multithreaded
>>>host environment which is used by both softmmu and useronly
>>>configs
> 
>> We decided to implement the whole atomic instruction inside an helper
> 
> ...which is a different approach which still isn't really
> addressing any of my remarks in the list above…

We agree on the above point. For atomic instructions, I think we discussed at 
length what to do. However we chose to ‘ignore’ the problem for now and to 
‘hack’ something just to get it working initially. Basically at this stage we 
want something mostly working so we can work on individual bits of the code and 
address them in more careful detail. Our issue with Atomic-ness is that the 
‘hack’ we put in place seems to allow a race condition, and we can’t see why :-(
But - overall, the plan is absolutely to provide a better implementation….

> 
>> but is
>> that
>> possible to get the data with eg: cpu_physical_memory_rw instead of the
>> normal
>> generated code?
> 
> cpu_physical_memory_rw would bypass the TLB and so be much slower.
> Make sure you use the functions which go via the TLB if you do
> this in a helper (and remember that they will longjmp out on a
> tlb miss!)

At this point speed isn’t our main concern - it’s simplicity of implementation 
- we want it to work, then we can worry about a better implementation (which 
likely should not go this path at all - as discussed above).
Given that - isn’t it reasonable to pass through cpu_physical_memory_rw - and 
hence not have to worry about the long jump ? Or am I missing something?

> 
>> One other thing which looks suspicious it seems there is one pair of
>> exclusive_addr/exclusive_val per CPU is that normal?
> 
> Pretty sure we've already discussed how the current ldrex/strex
> implementation is not architecturally correct. I think this is
> another of those areas.

We have indeed discussed this - but this is a surprise. What we’ve found is 
that the ‘globals’ (which, when we discussed it, we assumed were indeed global) 
seem not to be global at all. 
If 2 CPU’s both wanted to write the same strex, and both wrote the address and 
current values to these variables, right now I believe it would be 
theoretically possible that both CPU’s would end up successfully completing the 
strex. If the TB exited on the second branch, I suspect you could have a race 
condition which would lead to a failure? However I guess this is unlikely.


> 
> In general I'd be much happier seeing a proper sketch of your
> design, what data structures etc you intend to share between
> CPUs and which are per-CPU, what generic mechanisms you plan
> to provide to allow targets to implement atomic instructions, etc.
> It's quite hard to see the whole picture at the moment.


Agreed - at this point - as I say - we are trying to get something to be ‘just' 
stable enough that we can then use it to move forward from.
But I totally agree - we should be clearer about the overall picture, (once we 
can see the wood for the trees)

Cheers

Mark.


> 
> -- PMM






Re: [Qemu-devel] [RFC 01/10] target-arm: protect cpu_exclusive_*.

2015-03-03 Thread Mark Burton
Hi Peter, thanks for the feedback

So - what we have discovered is that using the slow path as discussed has a 
significant performance hit (no surprise), likewise the user-mode exit 
mechanism. However we are investigating that in parallel...

However, for now, we have left the TCG doing the majority of the work. 
(This was supposed to be the ‘quick and easy’ approach - we have already 
discussed better approaches to atomic instructions at length - but they are 
even more invasive!)

We have tried several schemes, and not found anything totally satisfactory: we 
have a gremlin. There seems to be a race condition somewhere which effectively 
means that the guest ‘ticket’ mechanism inside the guest kernel goes off by one, 
and therefore the guest kernel stalls as it never gets the ticket it’s looking 
for…. (it’s off by one in the ‘one too few’ sense - both CPU’s end up looking 
for tickets that are higher than the current finished ticket)...

The mechanism to hand out the tickets is of course an LDREX/STREX, leading us to 
believe that is the cause of our problems; however, I am increasingly sceptical.


Our scheme is actually quite simple now:
We keep the same overall scheme as exists upstream today - we store the address 
and value.
We provide a lock which is used for LDREX, STREX, CLREX, and in the 
raise_exception code.
For LDREX we check that no other CPU is tagging the address. Because of the 
lock, we can be sure no STREX is executing so we are always safe to ‘steal’ a 
tag from another CPU (which is what the ARM ARM defines).
STREX and CLREX are similar to upstream...
We are also careful to invalidate the address tag if we take an exception.

The drawback of this scheme is of course that we are not totally protected 
against non-exclusive writes, which could potentially fall between our reading 
a value and checking it against our saved value. But I am not convinced that is 
our problem (I have checked to make sure we don’t get non-exclusive writes).

This scheme would seem to be elegant and simple - but it suffers from the one 
small drawback of still having the issue as above…
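
(For clarity, the LDREX side of the scheme described above looks roughly like 
this - the helper name and structure are illustrative, not the exact patch:)

/* Under the global exclusive lock: steal any other vCPU's tag on this
 * address (as the ARM ARM permits), then record our own reservation. */
void helper_ldrex_claim(CPUARMState *env, uint64_t addr)
{
    CPUState *cs;

    arm_exclusive_lock();
    CPU_FOREACH(cs) {
        CPUARMState *other = &ARM_CPU(cs)->env;
        if (other != env && other->exclusive_addr == addr) {
            other->exclusive_addr = -1;      /* steal the tag */
        }
    }
    env->exclusive_addr = addr;
    arm_exclusive_unlock();
}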


Cheers

Mark.

ps. on our bug - we believe somehow the STREX is being marked as failed, but 
actually succeeds in writing something.  There are only 3 ways the strex can fail:
1/ the address doesn't match - in which case no ST will be attempted
2/ the value doesn't match - which means - no ST attempted
3/ the store starts, but causes a TLB fill/exception…

The 3rd has 2 possibilities - the TLB is filled, and the store goes ahead 
totally normally - there should be no ‘fail’ - or an exception is generated in 
which case we will long jump away and never return. 

Am I missing something?





> On 2 Mar 2015, at 13:27, Peter Maydell  wrote:
> 
> On 27 February 2015 at 16:54, Mark Burton  wrote:
>> 
>>> On 26 Feb 2015, at 23:56, Peter Maydell  wrote:
>>> cpu_physical_memory_rw would bypass the TLB and so be much slower.
>>> Make sure you use the functions which go via the TLB if you do
>>> this in a helper (and remember that they will longjmp out on a
>>> tlb miss!)
>> 
>> At this point speed isn’t our main concern - it’s simplicity of 
>> implementation - we want it to work, then we can worry about a better 
>> implementation (which likely should not go this path at all - as discussed 
>> above).
>> Given that - isn’t it reasonable to pass through cpu_physical_memory_rw - 
>> and hence not have to worry about the long jump ? Or am I missing something?
> 
> If you use cpu_physical_memory_rw you need to do the
> virt-to-phys translation by hand (preferably via the TLB).
> That might be something you needed to do anyway if we want
> to have architecturally correct monitors that work on
> physaddrs rather than vaddrs, but if not then the two
> step process is a bit awkward.
> 
>>> Pretty sure we've already discussed how the current ldrex/strex
>>> implementation is not architecturally correct. I think this is
>>> another of those areas.
>> 
>> We have indeed discussed this - but this is a surprise.
> 
> You're right that I didn't specifically realise this exact
> part of our incorrectness earlier.
> 
> -- PMM






Re: [Qemu-devel] [RFC 01/10] target-arm: protect cpu_exclusive_*.

2015-03-03 Thread Mark Burton

> On 3 Mar 2015, at 16:32, Paolo Bonzini  wrote:
> 
> 
> 
> On 03/03/2015 16:29, Mark Burton wrote:
>> 
>> ps. on our bug - we believe somehow the STREX is being marked as
>> failed, but actually succeeds to write something.  There are only 3
>> ways the strex can fail: 1/ the address doesn't match - in which case
>> no ST will be attempted 2/ the value doesn't match - which means - no
>> ST attempted 3/ the store starts, but causes a TLB fill/exception…
>> 
>> The 3rd has 2 possibilities - the TLB is filled, and the store goes
>> ahead totally normally - there should be no ‘fail’ - or an exception
>> is generated in which case we will long jump away and never return.
> 
> When do you release the lock?
> 
(Thanks Paolo!)

We release the lock either
a) at the end of the strex
or
b) in ‘raise_exception’

Cheers

Mark.

> Paolo






Re: [Qemu-devel] [RFC 01/10] target-arm: protect cpu_exclusive_*.

2015-03-03 Thread Mark Burton

we’ll try and clean a patch up to show just this…….
THANKS!


Cheers
Mark.

> On 3 Mar 2015, at 16:34, Paolo Bonzini  wrote:
> 
> 
> 
> On 03/03/2015 16:33, Mark Burton wrote:
>> 
>>> On 3 Mar 2015, at 16:32, Paolo Bonzini  wrote:
>>> 
>>> 
>>> 
>>> On 03/03/2015 16:29, Mark Burton wrote:
>>>> 
>>>> ps. on our bug - we believe somehow the STREX is being marked as
>>>> failed, but actually succeeds to write something.  There are only 3
>>>> ways the strex can fail: 1/ the address doesn't match - in which case
>>>> no ST will be attempted 2/ the value doesn't match - which means - no
>>>> ST attempted 3/ the store starts, but causes a TLB fill/exception…
>>>> 
>>>> The 3rd has 2 possibilities - the TLB is filled, and the store goes
>>>> ahead totally normally - there should be no ‘fail’ - or an exception
>>>> is generated in which case we will long jump away and never return.
>>> 
>>> When do you release the lock?
>>> 
>> (Thanks Paolo!)
>> 
>> We release the lock in either
>> a) the end of the strex
>> or
>> b) in the ‘raise_exception’
> 
> That seems right... Can you post the patch?
> 
> Paolo


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




Re: [Qemu-devel] [RFC] Adding multithreads to LDREX/STREX.

2015-03-03 Thread Mark Burton

> On 3 Mar 2015, at 18:09, Paolo Bonzini  wrote:
> 
> 
> 
> On 03/03/2015 17:47, Mark Burton wrote:
>> +inline void arm_exclusive_lock(void)
>> +{
>> +if (!cpu_have_exclusive_lock) {
>> +qemu_mutex_lock(&cpu_exclusive_lock);
>> +cpu_have_exclusive_lock = true;
>> +}
>> +}
>> +
>> +inline void arm_exclusive_unlock(void)
>> +{
>> +if (cpu_have_exclusive_lock) {
>> +cpu_have_exclusive_lock = false;
>> +qemu_mutex_unlock(&cpu_exclusive_lock);
>> +}
>> +}
>> +
> 
> This is almost but not quite a recursive mutex.  What about changing it
> like this:
> 
> - arm_exclusive_lock just takes the lock and sets the flag; no "if"
> 
> - arm_exclusive_unlock does the opposite, again no "if"
> 
> - raise_exception checks the flag and skips "arm_exclusive_lock()" if
> already set
> 

yes - that’s better - though I would rather live without the if altogether. 
There are only a couple of places where they are ‘needed’ and I should have 
checked explicitly there - call it laziness ;-)
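(For illustration, a minimal standalone model of the shape Paolo suggests - plain pthread mutex instead of QemuMutex, and raise_exception_model() is an invented stand-in: lock/unlock become unconditional, and only the exception path keeps an "if" so it does not re-take a mutex this thread already holds.)

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t cpu_exclusive_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread bool cpu_have_exclusive_lock;

static void arm_exclusive_lock(void)
{
    pthread_mutex_lock(&cpu_exclusive_lock);
    cpu_have_exclusive_lock = true;
}

static void arm_exclusive_unlock(void)
{
    cpu_have_exclusive_lock = false;
    pthread_mutex_unlock(&cpu_exclusive_lock);
}

/* invented stand-in for raise_exception(): it can be entered with or
 * without the lock held, so this is the only place that keeps an "if" */
static void raise_exception_model(void)
{
    if (!cpu_have_exclusive_lock) {
        arm_exclusive_lock();
    }
    /* the real code would clear env->exclusive_addr and longjmp away;
     * here we just drop the lock again */
    arm_exclusive_unlock();
}

int main(void)
{
    arm_exclusive_lock();        /* e.g. taken at the start of the STREX helper */
    raise_exception_model();     /* no deadlock: it sees the flag and skips the lock */
    printf("ok\n");
    return 0;
}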

> The only other question I have is this:
> 
>> +gen_helper_atomic_check(success, cpu_env, addr);
>> +tcg_gen_brcondi_i32(TCG_COND_EQ, success, 0, done_label);
> 
> Are you setting cpu_R[rd] correctly in this case?  That is, should you
> be jumping to fail_label instead?  That could case a failure to be
> reported as a success.


I almost jumped on a plane and came and kissed you - almost…. 
but - actually we set R[rd] to 1 above the branch and only reset it to 0 
afterwards… so I think the functionality is correct… :-(((

thanks though - your help is much appreciated.

Cheers

Mark.



> 
> Paolo
> 
>> +tcg_temp_free_i32(success);
>> +
>> +/* Store should be OK, let's check the value */
>> +tmp = load_reg(s, rt);
>> +TCGv_i64 val64=tcg_temp_new_i64();
>>switch (size) {
>>case 0:
>>gen_aa32_ld8u(tmp, addr, get_mem_index(s));
>> @@ -7450,21 +7447,19 @@ static void gen_store_exclusive(DisasContext *s,
>> int rd, int rt, int rt2,
>>abort();
>>}
>> 
>> -val64 = tcg_temp_new_i64();
>>if (size == 3) {
>>TCGv_i32 tmp2 = tcg_temp_new_i32();
>>TCGv_i32 tmp3 = tcg_temp_new_i32();
>> +
>>tcg_gen_addi_i32(tmp2, addr, 4);
>>gen_aa32_ld32u(tmp3, tmp2, get_mem_index(s));
>> -tcg_temp_free_i32(tmp2);
>>tcg_gen_concat_i32_i64(val64, tmp, tmp3);
>> -tcg_temp_free_i32(tmp3);
>> +tcg_temp_free_i32(tmp2);
>>} else {
>> -tcg_gen_extu_i32_i64(val64, tmp);
>> +  tcg_gen_extu_i32_i64(val64, tmp);
>>}
>>tcg_temp_free_i32(tmp);
>> -
>> -tcg_gen_brcond_i64(TCG_COND_NE, val64, cpu_exclusive_val, fail_label);
>> +tcg_gen_brcond_i64(TCG_COND_NE, cpu_exclusive_val, val64, fail_label);
>>tcg_temp_free_i64(val64);
>> 
>>tmp = load_reg(s, rt);
>> @@ -7489,12 +7484,16 @@ static void gen_store_exclusive(DisasContext *s,
>> int rd, int rt, int rt2,
>>gen_aa32_st32(tmp, addr, get_mem_index(s));
>>tcg_temp_free_i32(tmp);
>>}
>> +
>> +gen_helper_atomic_release(cpu_env);
>>tcg_gen_movi_i32(cpu_R[rd], 0);
>>tcg_gen_br(done_label);
>> +
>>gen_set_label(fail_label);
>> +gen_helper_atomic_release(cpu_env);
>>tcg_gen_movi_i32(cpu_R[rd], 1);
>> +
>>gen_set_label(done_label);
>> -tcg_gen_movi_i64(cpu_exclusive_addr, -1);


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




[Qemu-devel] Fwd: [RFC] Adding multithreads to LDREX/STREX.

2015-03-03 Thread Mark Burton
Paolo - here is a partially cleaned up patch - it’s still not quite right - but 
maybe it’s enough so that you can see what we’re doing.
There are changes in here that wouldn’t be sensible to upstream - like changing 
the address from 64 to 32 bit and needlessly moving it about - that’s just 
because we’ve been trying to play with things to get them working.

Hopefully you can see the wood for the trees

Cheers

Mark.



> Begin forwarded message:
> 
> From: fred.kon...@greensocs.com
> To: mark.bur...@greensocs.com
> Cc: fred.kon...@greensocs.com
> Subject: [RFC] Adding multithreads to LDREX/STREX.
> Date: 3 March 2015 17:34:36 CET
> 
> From: KONRAD Frederic 
> 
> ---
> target-arm/cpu.c   | 21 +++
> target-arm/cpu.h   |  7 -
> target-arm/helper.c|  4 +++
> target-arm/helper.h|  7 +
> target-arm/machine.c   |  2 +-
> target-arm/op_helper.c | 70 +-
> target-arm/translate.c | 55 +++
> 7 files changed, 135 insertions(+), 31 deletions(-)
> 
> diff --git a/target-arm/cpu.c b/target-arm/cpu.c
> index 285947f..e08c486 100644
> --- a/target-arm/cpu.c
> +++ b/target-arm/cpu.c
> @@ -31,6 +31,26 @@
> #include "sysemu/kvm.h"
> #include "kvm_arm.h"
> 
> +/* Protect cpu_exclusive_* variable .*/
> +__thread bool cpu_have_exclusive_lock = false;
> +QemuMutex cpu_exclusive_lock;
> +
> +inline void arm_exclusive_lock(void)
> +{
> +if (!cpu_have_exclusive_lock) {
> +qemu_mutex_lock(&cpu_exclusive_lock);
> +cpu_have_exclusive_lock = true;
> +}
> +}
> +
> +inline void arm_exclusive_unlock(void)
> +{
> +if (cpu_have_exclusive_lock) {
> +cpu_have_exclusive_lock = false;
> +qemu_mutex_unlock(&cpu_exclusive_lock);
> +}
> +}
> +
> static void arm_cpu_set_pc(CPUState *cs, vaddr value)
> {
> ARMCPU *cpu = ARM_CPU(cs);
> @@ -374,6 +394,7 @@ static void arm_cpu_initfn(Object *obj)
> cpu->psci_version = 2; /* TCG implements PSCI 0.2 */
> if (!inited) {
> inited = true;
> +qemu_mutex_init(&cpu_exclusive_lock);
> arm_translate_init();
> }
> }
> diff --git a/target-arm/cpu.h b/target-arm/cpu.h
> index cd7a9e8..6b4c38e 100644
> --- a/target-arm/cpu.h
> +++ b/target-arm/cpu.h
> @@ -450,8 +450,10 @@ typedef struct CPUARMState {
> float_status fp_status;
> float_status standard_fp_status;
> } vfp;
> -uint64_t exclusive_addr;
> +uint32_t exclusive_addr;
> uint64_t exclusive_val;
> +bool exclusive_in_store;
> +bool exclusive_store_fail;
> uint64_t exclusive_high;
> #if defined(CONFIG_USER_ONLY)
> uint64_t exclusive_test;
> @@ -1819,4 +1821,7 @@ enum {
> QEMU_PSCI_CONDUIT_HVC = 2,
> };
> 
> +void arm_exclusive_lock(void);
> +void arm_exclusive_unlock(void);
> +
> #endif
> diff --git a/target-arm/helper.c b/target-arm/helper.c
> index 151f96b..11a8d9a 100644
> --- a/target-arm/helper.c
> +++ b/target-arm/helper.c
> @@ -4256,6 +4256,10 @@ void arm_cpu_do_interrupt(CPUState *cs)
> 
> arm_log_exception(cs->exception_index);
> 
> +arm_exclusive_lock();
> +env->exclusive_addr = -1;
> +arm_exclusive_unlock();
> +
> if (arm_is_psci_call(cpu, cs->exception_index)) {
> arm_handle_psci_call(cpu);
> qemu_log_mask(CPU_LOG_INT, "...handled as PSCI call\n");
> diff --git a/target-arm/helper.h b/target-arm/helper.h
> index dec3728..fd8eb71 100644
> --- a/target-arm/helper.h
> +++ b/target-arm/helper.h
> @@ -529,6 +529,13 @@ DEF_HELPER_2(dc_zva, void, env, i64)
> DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
> DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
> 
> +DEF_HELPER_0(exclusive_lock, void)
> +DEF_HELPER_0(exclusive_unlock, void)
> +DEF_HELPER_2(atomic_check, i32, env, i32)
> +DEF_HELPER_1(atomic_release, void, env)
> +DEF_HELPER_1(atomic_clear, void, env)
> +DEF_HELPER_2(atomic_claim, void, env, i32)
> +
> #ifdef TARGET_AARCH64
> #include "helper-a64.h"
> #endif
> diff --git a/target-arm/machine.c b/target-arm/machine.c
> index c29e7a2..3f9c9c8 100644
> --- a/target-arm/machine.c
> +++ b/target-arm/machine.c
> @@ -270,7 +270,7 @@ const VMStateDescription vmstate_arm_cpu = {
> VMSTATE_VARRAY_INT32(cpreg_vmstate_values, ARMCPU,
>  cpreg_vmstate_array_len,
>  0, vmstate_info_uint64, uint64_t),
> -VMSTATE_UINT64(env.exclusive_addr, ARMCPU),
> +//VMSTATE_UINT64(env.exclusive_addr, ARMCPU),
> VMSTATE_UINT64(env.exclusive_val, ARMCPU),
> VMSTATE_UINT64(env.exclusive_high, ARMCPU),
> VMSTATE_UINT64(env.features, ARMCPU),
> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
> index 2bed914..4f0fcc1 100644
> --- a/target-arm/op_helper.c
> +++ b/target-arm/op_helper.c
> @@ -29,10 +29,78 @@ static void raise_exception(CPUARMState *env, int tt)
> ARMCPU *cpu = arm_env_get_cpu(env);
> 

Re: [Qemu-devel] global_mutex and multithread.

2015-01-15 Thread Mark Burton
I think we call that flag “please don’t reallocate this TB until at least after 
a CPU has exited and we do a global flush”… So if we sync and get all CPUs to 
exit on a global flush, this flag is only there as a figment of our imagination…
i.e. we’re safe without it?

Wish I could say the same of global_mutex :-(
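(For illustration, a toy stop-the-world model of the "big hammer" Alex describes below - pthreads plus C11 atomics, all names invented, not QEMU code: ask every vCPU thread to leave its execution loop, wait until they have all stopped, do the global operation, then let them continue.)

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define NCPUS 2

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  stopped_cond = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  resume_cond  = PTHREAD_COND_INITIALIZER;
static atomic_bool exit_request;
static int stopped;

static void *vcpu_thread(void *arg)
{
    (void)arg;
    for (;;) {
        usleep(1000);                        /* "execute translated blocks" */
        if (atomic_load(&exit_request)) {
            pthread_mutex_lock(&lock);
            stopped++;
            pthread_cond_signal(&stopped_cond);
            while (atomic_load(&exit_request)) {
                pthread_cond_wait(&resume_cond, &lock);
            }
            pthread_mutex_unlock(&lock);
        }
    }
    return NULL;
}

static void flush_all_tbs(void)
{
    printf("global flush done with every vCPU halted\n");
}

static void big_hammer(void (*fn)(void))
{
    pthread_mutex_lock(&lock);
    atomic_store(&exit_request, true);
    while (stopped < NCPUS) {                /* wait for every vCPU to stop */
        pthread_cond_wait(&stopped_cond, &lock);
    }
    fn();                                    /* safe: nobody is executing code */
    stopped = 0;
    atomic_store(&exit_request, false);
    pthread_cond_broadcast(&resume_cond);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t threads[NCPUS];
    for (int i = 0; i < NCPUS; i++) {
        pthread_create(&threads[i], NULL, vcpu_thread, NULL);
    }
    big_hammer(flush_all_tbs);
    return 0;                                /* process exit tears the vCPUs down */
}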


Cheers

Mark.

> On 15 Jan 2015, at 14:30, Frederic Konrad  wrote:
> 
> On 15/01/2015 12:14, Alexander Graf wrote:
>> 
>> On 15.01.15 12:12, Paolo Bonzini wrote:
>>> [now with correct listserver address]
>>> 
>>> On 15/01/2015 11:25, Frederic Konrad wrote:
 Hi everybody,
 
 In case of multithread TCG what is the best way to handle
 qemu_global_mutex?
 We though to have one mutex per vcpu and then synchronize vcpu threads when
 they exit (eg: in tcg_exec_all).
 
 Is that making sense?
>>> The basic ideas from Jan's patch in
>>> http://article.gmane.org/gmane.comp.emulators.qemu/118807 still apply.
>>> 
>>> RAM block reordering doesn't exist anymore, having been replaced with
>>> mru_block.
>>> 
>>> The patch reacquired the lock when entering MMIO or PIO emulation.
>>> That's enough while there is only one VCPU thread.
>>> 
>>> Once you have >1 VCPU thread you'll need the RCU work that I am slowly
>>> polishing and sending out.  That's because one device can change the
>>> memory map, and that will cause a tlb_flush for all CPUs in tcg_commit,
>>> and that's not thread-safe.
>> You'll have a similar problem for tb_flush() if you use a single tb
>> cache. Just introduce a big hammer function for now that IPIs all the
>> other threads, waits until they halted, do the atomic instruction (like
>> change the memory map or flush the tb cache), then let them continue.
>> 
>> We can later one-by-one get rid of all callers of this.
>> 
>> 
>> Alex
> Maybe we can put a flag in the tb to say it's being executed so tb_alloc 
> won't try to
> realloc it?
> 
> Maybe it's a bad idea and will be actually slower than exiting and waiting 
> all the other
> cpu.
> 
> Fred


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




Re: [Qemu-devel] global_mutex and multithread.

2015-01-15 Thread Mark Burton
Still in agony on this issue - I’ve CC’d Jan as his patch looks important…

the patch below would seem to offer by far and away the best result here. (If 
only we could get it working ;-) )
It allows threads to proceed as we want them to, and it means we don’t have 
to ‘count’ the number of CPUs that are executing code (and could therefore 
potentially access IO space)…

However - if we go this route - the current patch is only for x86 (apart from 
the fact that we still seem to land in a deadlock…).

One thing I wonder - why do we need to go to the extent of mutexing in the TCG 
like this? Why can’t you simply put a mutex get/release on the slow path? If 
the core is going to do ‘fast path’ access to the memory - even if that memory 
was IO mapped - would it matter if it didn’t have the mutex?

(It would help - I think - if we understood why you believed this patch 
wouldn’t work with SMP - I thought that was to do with the ‘round-robin’ 
mechanism - we’ve removed that for multi-thread anyway - but I guess we may 
have missed something there?)

Cheers

Mark.


> On 15 Jan 2015, at 12:12, Paolo Bonzini  wrote:
> 
> [now with correct listserver address]
> 
> On 15/01/2015 11:25, Frederic Konrad wrote:
>> Hi everybody,
>> 
>> In case of multithread TCG what is the best way to handle
>> qemu_global_mutex?
>> We though to have one mutex per vcpu and then synchronize vcpu threads when
>> they exit (eg: in tcg_exec_all).
>> 
>> Is that making sense?
> 
> The basic ideas from Jan's patch in
> http://article.gmane.org/gmane.comp.emulators.qemu/118807 still apply.
> 
> RAM block reordering doesn't exist anymore, having been replaced with
> mru_block.
> 
> The patch reacquired the lock when entering MMIO or PIO emulation.
> That's enough while there is only one VCPU thread.
> 
> Once you have >1 VCPU thread you'll need the RCU work that I am slowly
> polishing and sending out.  That's because one device can change the
> memory map, and that will cause a tlb_flush for all CPUs in tcg_commit,
> and that's not thread-safe.
> 
> And later on, once devices start being converted to run outside the BQL,
> that can be changed to use new functions address_space_rw_unlocked /
> io_mem_read_unlocked / io_mem_write_unlocked.  Something like that is
> already visible at https://github.com/bonzini/qemu/commits/rcu (ignore
> patches after "kvm: Switch to unlocked MMIO").
> 
> Paolo
> 
> 
> 


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




Re: [Qemu-devel] global_mutex and multithread.

2015-01-15 Thread Mark Burton

> On 15 Jan 2015, at 21:27, Paolo Bonzini  wrote:
> 
> 
> 
> On 15/01/2015 20:07, Mark Burton wrote:
>> However - if we go this route -the current patch is only for x86.
>> (apart from the fact that we still seem to land in a deadlock…)
> 
> Jan said he had it working at least on ARM (MusicPal).

yeah - our problem is when we enable multiple threads - which I don’t believe 
Jan did…
Indeed - he specifically says that doesn’t work…. :-)

> 
>> One thing I wonder - why do we need to go to the extent of mutexing
>> in the TCG like this? Why can’t you simply put a mutex get/release on
>> the slow path? If the core is going to do ‘fast path’ access to the
>> memory, - even if that memory was IO mapped - would it matter if it
>> didn’t have the mutex?
> 
> Because there is no guarantee that the memory map isn't changed by a
> core under the feet of another.  The TLB (in particular the "iotlb") is
> only valid with reference to a particular memory map.

> 
> Changes to the memory map certainly happen in the slow path, but lookups
> are part of the fast path.  Even an rwlocks is too slow for a fast path,
> hence the plan of going with RCU.

Could we arrange the world such that lookups ‘succeed’ (the wheels don’t fall 
off) - either getting the old value, or the new, but not getting rubbish - and 
we still only take the mutex if we are going to make alterations to the MM 
itself? (I haven’t looked at the code around that… so sorry if the question is 
ridiculous).

Cheers

Mark.

> 
> Paolo


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




Re: [Qemu-devel] global_mutex and multithread.

2015-01-15 Thread Mark Burton

> On 15 Jan 2015, at 22:41, Paolo Bonzini  wrote:
> 
> 
> 
> On 15/01/2015 21:53, Mark Burton wrote:
>>> Jan said he had it working at least on ARM (MusicPal).
>> 
>> yeah - our problem is when we enable multi-threads - which I dont believe 
>> Jan did…
> 
> Multithreaded TCG, or single-threaded TCG with SMP?

He mentions SMP - I assume that’s single-threaded ….

> 
>>>> One thing I wonder - why do we need to go to the extent of mutexing
>>>> in the TCG like this? Why can’t you simply put a mutex get/release on
>>>> the slow path? If the core is going to do ‘fast path’ access to the
>>>> memory, - even if that memory was IO mapped - would it matter if it
>>>> didn’t have the mutex?
>>> 
>>> Because there is no guarantee that the memory map isn't changed by a
>>> core under the feet of another.  The TLB (in particular the "iotlb") is
>>> only valid with reference to a particular memory map.
>> 
>>> 
>>> Changes to the memory map certainly happen in the slow path, but lookups
>>> are part of the fast path.  Even an rwlocks is too slow for a fast path,
>>> hence the plan of going with RCU.
>> 
>> Could we arrange the world such that lookups ‘succeed’ (the wheels
>> dont fall off) -ether getting the old value, or the new, but not getting
>> rubbish - and we still only take the mutex if we are going to make
>> alterations to the MM itself? (I have’t looked at the code around that…
>> so sorry if the question is ridiculous).
> 
> That's the definition of RCU. :)  Look at the docs in
> http://permalink.gmane.org/gmane.comp.emulators.qemu/313929 for more
> information. :)

Ahh - I see !
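(For illustration, a minimal standalone sketch of the "readers see either the old or the new value, never rubbish" idea - plain C11 atomics plus a writer mutex, invented names; this is not QEMU's RCU implementation, which must also defer freeing the old map until all readers are done.)

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

typedef struct MemMap { int generation; } MemMap;

static MemMap map_a = { 1 }, map_b = { 2 };
static MemMap * _Atomic current_map = &map_a;
static pthread_mutex_t map_writer_lock = PTHREAD_MUTEX_INITIALIZER;

/* fast path: no lock at all - a reader sees either the old map or the
 * new one, never a half-updated mixture */
static int lookup(void)
{
    MemMap *m = atomic_load_explicit(&current_map, memory_order_acquire);
    return m->generation;
}

/* slow path: only writers serialise, and they publish the new map with a
 * single atomic pointer store */
static void change_memory_map(MemMap *new_map)
{
    pthread_mutex_lock(&map_writer_lock);
    atomic_store_explicit(&current_map, new_map, memory_order_release);
    /* real RCU must also wait for all readers before freeing the old map */
    pthread_mutex_unlock(&map_writer_lock);
}

int main(void)
{
    printf("%d\n", lookup());        /* 1 */
    change_memory_map(&map_b);
    printf("%d\n", lookup());        /* 2 */
    return 0;
}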

> 
> It's still not trivial to make it 100% correct, but at the same time
> it's not too hard to prepare something decent to play with.  Also, most
> of the work can be done with KVM so it's more or less independent from
> what you guys have been doing so far.

Yes - the issue is if we end up relying on it.
But - I see what you mean - these 2 things can ‘dovetail’ together 
“independently” - so - Jan’s patch will be good for now, and then later we can 
use RCU to make it work more generally (and more efficiently).

So - our only small problem is getting Jan’s patch to work for multi-thread :-))

Cheers

Mark.

> 
> Paolo


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




Re: [Qemu-devel] global_mutex and multithread.

2015-01-16 Thread Mark Burton

> On 16 Jan 2015, at 09:07, Jan Kiszka  wrote:
> 
> On 2015-01-16 08:25, Mark Burton wrote:
>> 
>>> On 15 Jan 2015, at 22:41, Paolo Bonzini  wrote:
>>> 
>>> 
>>> 
>>> On 15/01/2015 21:53, Mark Burton wrote:
>>>>> Jan said he had it working at least on ARM (MusicPal).
>>>> 
>>>> yeah - our problem is when we enable multi-threads - which I dont believe 
>>>> Jan did…
>>> 
>>> Multithreaded TCG, or single-threaded TCG with SMP?
>> 
>> He mentions SMP, - I assume thats single-threaded ….
> 
> Yes, I didn't patched anything towards multi-threaded SMP. Main reason:
> there was no answer on how to emulated the memory models of that target
> architecture over the host one which is mandatory if you let the
> emulated CPUs run unsynchronized in parallel. Did this change?
> 

No - we just decided to stop pressing the button…
I think this is the ‘x86 on ARM’ issue? - our plan is to get ARM on x86 
working (or that class), and then worry about the other way round.

I don’t see why SMP ARM on x86 wouldn’t work with your patch?

Cheers

Mark.

>> 
>>> 
>>>>>> One thing I wonder - why do we need to go to the extent of mutexing
>>>>>> in the TCG like this? Why can’t you simply put a mutex get/release on
>>>>>> the slow path? If the core is going to do ‘fast path’ access to the
>>>>>> memory, - even if that memory was IO mapped - would it matter if it
>>>>>> didn’t have the mutex?
>>>>> 
>>>>> Because there is no guarantee that the memory map isn't changed by a
>>>>> core under the feet of another.  The TLB (in particular the "iotlb") is
>>>>> only valid with reference to a particular memory map.
>>>> 
>>>>> 
>>>>> Changes to the memory map certainly happen in the slow path, but lookups
>>>>> are part of the fast path.  Even an rwlocks is too slow for a fast path,
>>>>> hence the plan of going with RCU.
>>>> 
>>>> Could we arrange the world such that lookups ‘succeed’ (the wheels
>>>> dont fall off) -ether getting the old value, or the new, but not getting
>>>> rubbish - and we still only take the mutex if we are going to make
>>>> alterations to the MM itself? (I have’t looked at the code around that…
>>>> so sorry if the question is ridiculous).
>>> 
>>> That's the definition of RCU. :)  Look at the docs in
>>> http://permalink.gmane.org/gmane.comp.emulators.qemu/313929 for more
>>> information. :)
>> 
>> Ahh - I see !
>> 
>>> 
>>> It's still not trivial to make it 100% correct, but at the same time
>>> it's not too hard to prepare something decent to play with.  Also, most
>>> of the work can be done with KVM so it's more or less independent from
>>> what you guys have been doing so far.
>> 
>> Yes - the issue is if we end up relying on it.
>> But - I see what you mean - these 2 things can ‘dovetail’ together 
>> “independently” - so - Jan’s patch will be good for now, and then later we 
>> can use RCU to make it work more generally (and more efficiently).
>> 
>> So - our only small problem is getting Jan’s patch to work for multi-thread 
>> :-))
> 
> See above regarding the potential dimension.
> 
> Jan
> 
> -- 
> Siemens AG, Corporate Technology, CT RTC ITP SES-DE
> Corporate Competence Center Embedded Linux


 +44 (0)20 7100 3485 x 210
+33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




Re: [Qemu-devel] [RFC PATCH] Use atomic cmpxchg to atomically check the exclusive value in a STREX

2015-06-09 Thread Mark Burton

> On 9 Jun 2015, at 11:12, Alex Bennée  wrote:
> 
> 
> fred.kon...@greensocs.com writes:
> 
>> From: KONRAD Frederic 
>> 
>> This mechanism replaces the existing load/store exclusive mechanism which 
>> seems
>> to be broken for multithread.
>> It follows the intention of the existing mechanism and stores the target 
>> address
>> and data values during a load operation and checks that they remain unchanged
>> before a store.
>> 
>> In common with the older approach, this provides weaker semantics than 
>> required
>> in that it could be that a different processor writes the same value as a
>> non-exclusive write, however in practise this seems to be irrelevant.
> 
> You can WFE on the global monitor and use it for a lockless sleep on a
> flag value. I don't think the Linux kernel does it but certainly
> anything trying to be power efficient may use it.
> 

I don’t see how this changes the logic, actually.
The assumption is - I believe - that you spin-lock waiting for a mutex to be 
free, using ld/strex; if you fail to get the mutex you execute a WFE to lower 
your power consumption. I don’t see how that changes the semantics of the 
ld/strex part of the equation?

As far as I can see, the semantics we have are still robust enough to handle 
that.


>> 
>> The old implementation didn’t correctly store it’s values as globals, but 
>> rather
>> kept a local copy per CPU.
>> 
>> This new mechanism stores the values globally and also uses the atomic 
>> cmpxchg
>> macros to ensure atomicity - it is therefore very efficient and threadsafe.
>> 
>> Signed-off-by: KONRAD Frederic 
>> ---
>> target-arm/cpu.c   |  21 +++
>> target-arm/cpu.h   |   6 ++
>> target-arm/helper.c|  12 
>> target-arm/helper.h|   6 ++
>> target-arm/op_helper.c | 146 
>> -
>> target-arm/translate.c |  98 ++---
>> 6 files changed, 207 insertions(+), 82 deletions(-)
>> 
>> diff --git a/target-arm/cpu.c b/target-arm/cpu.c
>> index 4a888ab..0400061 100644
>> --- a/target-arm/cpu.c
>> +++ b/target-arm/cpu.c
>> @@ -31,6 +31,26 @@
>> #include "sysemu/kvm.h"
>> #include "kvm_arm.h"
>> 
>> +/* Protect cpu_exclusive_* variable .*/
>> +__thread bool cpu_have_exclusive_lock;
>> +QemuMutex cpu_exclusive_lock;
>> +
>> +inline void arm_exclusive_lock(void)
>> +{
>> +if (!cpu_have_exclusive_lock) {
>> +qemu_mutex_lock(&cpu_exclusive_lock);
>> +cpu_have_exclusive_lock = true;
>> +}
>> +}
>> +
>> +inline void arm_exclusive_unlock(void)
>> +{
>> +if (cpu_have_exclusive_lock) {
>> +cpu_have_exclusive_lock = false;
>> +qemu_mutex_unlock(&cpu_exclusive_lock);
>> +}
>> +}
> 
> I don't quite follow. If these locks are mean to be protecting access to
> variables then how do they do that? The lock won't block if another
> thread is currently messing with the protected values.
> 

We are only protecting the stored exclusive values here, as the current 
implementation has them - but this time at least they are CPU specific.
It’s not protecting the actual memory address.
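(For illustration, a standalone sketch of the cmpxchg check the commit message describes - C11 atomics rather than QEMU's atomic macros, invented names; the real patch also has to reach guest memory through the softmmu TLB.)

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint32_t guest_word;      /* toy stand-in for one guest location */
static uint32_t exclusive_val;           /* sampled by the LDREX helper */

static void ldrex_model(void)
{
    exclusive_val = atomic_load(&guest_word);
}

/* returns 0 on success, 1 on failure, matching the ARM STREX result */
static int strex_model(uint32_t new_val)
{
    uint32_t expected = exclusive_val;
    return atomic_compare_exchange_strong(&guest_word, &expected, new_val)
           ? 0 : 1;
}

int main(void)
{
    ldrex_model();
    printf("%d\n", strex_model(1));      /* 0: nothing touched the word */
    ldrex_model();
    atomic_store(&guest_word, 99);       /* another CPU's intervening write */
    printf("%d\n", strex_model(2));      /* 1: value changed underneath us */
    return 0;
}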


>> +
>> static void arm_cpu_set_pc(CPUState *cs, vaddr value)
>> {
>> ARMCPU *cpu = ARM_CPU(cs);
>> @@ -425,6 +445,7 @@ static void arm_cpu_initfn(Object *obj)
>> cpu->psci_version = 2; /* TCG implements PSCI 0.2 */
>> if (!inited) {
>> inited = true;
>> +qemu_mutex_init(&cpu_exclusive_lock);
>> arm_translate_init();
>> }
>> }
>> diff --git a/target-arm/cpu.h b/target-arm/cpu.h
>> index 21b5b8e..257ed05 100644
>> --- a/target-arm/cpu.h
>> +++ b/target-arm/cpu.h
>> @@ -506,6 +506,9 @@ static inline bool is_a64(CPUARMState *env)
>> int cpu_arm_signal_handler(int host_signum, void *pinfo,
>>void *puc);
>> 
>> +int arm_get_phys_addr(CPUARMState *env, target_ulong address, int 
>> access_type,
>> +  hwaddr *phys_ptr, int *prot, target_ulong *page_size);
>> +
>> /**
>>  * pmccntr_sync
>>  * @env: CPUARMState
>> @@ -1922,4 +1925,7 @@ enum {
>> QEMU_PSCI_CONDUIT_HVC = 2,
>> };
>> 
>> +void arm_exclusive_lock(void);
>> +void arm_exclusive_unlock(void);
>> +
>> #endif
>> diff --git a/target-arm/helper.c b/target-arm/helper.c
>> index 3da0c05..31ee1e5 100644
>> --- a/target-arm/helper.c
>> +++ b/target-arm/helper.c
>> @@ -23,6 +23,14 @@ static inline int get_phys_addr(CPUARMState *env, 
>> target_ulong address,
>> #define PMCRE   0x1
>> #endif
>> 
>> +int arm_get_phys_addr(CPUARMState *env, target_ulong address, int 
>> access_type,
>> +  hwaddr *phys_ptr, int *prot, target_ulong *page_size)
>> +{
>> +MemTxAttrs attrs = {};
>> +return get_phys_addr(env, address, access_type, cpu_mmu_index(env),
>> + phys_ptr, &attrs, prot, page_size);
>> +}
>> +
>> static int vfp_gdb_get_reg(CPUARMState *env, uint8_t *buf, int reg)
>> {
>> int nregs;
>> @@ -4731,6 +4739,10 @@ void arm_cpu_do_interrupt(CPUState *c

Re: [Qemu-devel] [RFC PATCH] Use atomic cmpxchg to atomically check the exclusive value in a STREX

2015-06-09 Thread Mark Burton

> On 9 Jun 2015, at 15:55, Alex Bennée  wrote:
> 
> 
> Alex Bennée  writes:
> 
>> fred.kon...@greensocs.com writes:
>> 
>>> From: KONRAD Frederic 
>>> 
>>> This mechanism replaces the existing load/store exclusive mechanism which 
>>> seems
>>> to be broken for multithread.
>>> It follows the intention of the existing mechanism and stores the target 
>>> address
>>> and data values during a load operation and checks that they remain 
>>> unchanged
>>> before a store.
>>> 
>>> In common with the older approach, this provides weaker semantics than 
>>> required
>>> in that it could be that a different processor writes the same value as a
>>> non-exclusive write, however in practise this seems to be irrelevant.
>> 
> 
>>> 
>>> +/* Protect cpu_exclusive_* variable .*/
>>> +__thread bool cpu_have_exclusive_lock;
>>> +QemuMutex cpu_exclusive_lock;
>>> +
>>> +inline void arm_exclusive_lock(void)
>>> +{
>>> +if (!cpu_have_exclusive_lock) {
>>> +qemu_mutex_lock(&cpu_exclusive_lock);
>>> +cpu_have_exclusive_lock = true;
>>> +}
>>> +}
>>> +
>>> +inline void arm_exclusive_unlock(void)
>>> +{
>>> +if (cpu_have_exclusive_lock) {
>>> +cpu_have_exclusive_lock = false;
>>> +qemu_mutex_unlock(&cpu_exclusive_lock);
>>> +}
>>> +}
>> 
>> I don't quite follow. If these locks are mean to be protecting access to
>> variables then how do they do that? The lock won't block if another
>> thread is currently messing with the protected values.
> 
> Having re-read after coffee I'm still wondering why we need the
> per-thread bool? All the lock/unlock pairs are for critical sections so
> don't we just want to serialise on the qemu_mutex_lock(), what do the
> flags add apart from allowing you to next locks that shouldn't happen?
> 
> 

you mean the “cpu_have_exclusive_lock” bools?

Cheers
Mark.

> -- 
> Alex Bennée


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




Re: [Qemu-devel] [RFC PATCH] Use atomic cmpxchg to atomically check the exclusive value in a STREX

2015-06-09 Thread Mark Burton

> On 9 Jun 2015, at 15:59, Alex Bennée  wrote:
> 
> 
> fred.kon...@greensocs.com writes:
> 
>> From: KONRAD Frederic 
>> 



>> +DEF_HELPER_4(atomic_cmpxchg64, i32, env, i32, i64, i32)
>> +DEF_HELPER_2(atomic_check, i32, env, i32)
>> +DEF_HELPER_1(atomic_release, void, env)
> 
> Apologies for being split over several mails, I didn't notice before

No problem - personally I prefer to keep the threads separate.

> that atomic_check/release don't seem to be called at all in this patch.
> Is there later work that uses it?

I suspect that this is just a case of not cleaning up - I think we used check 
and release, and then ditched them… I’ll check (and release :-) )

Cheers
Mark.





Re: [Qemu-devel] [RFC 00/10] MultiThread TCG.

2015-03-29 Thread Mark Burton
To add some detail:
Unfortunately Fred is away this week, so we won’t get this patch set to you as 
quickly as I’d have liked.

We have a ‘working’ implementation - where ‘working’ is limited to a couple of 
SMP cores, booting and running Dhrystone. The performance improvement we get is 
close to the 2x that you might expect (especially for such a simple test as 
Dhrystone). HOWEVER…

We still have the ‘hacked’ version of tb_first (if you remember, we have made 
that into an array, addressed by current_cpu). This has two effects (at least):
1/ it’s probably unnecessary, but we DO need to protect CPUs from 
each other when they invalidate tb_first (probably not when they write, as the 
current implementation only allows 1 thread to generate TBs)
2/ it breaks GDB, as current_cpu isn’t set in the IO thread, so that 
just seg-faults and dies…
We are in the process of working out how to fix this mess cleanly. 

So - Fred is unwilling to send the patch set as it stands, because frankly this 
part is totally broken.

There is an independent patch set that needs splitting out which deals with 
just the atomic instruction issue - specifically for ARM (though I guess it’s 
applicable across the board)…

So - in short - I HOPE to get the patch set onto the reflector sometime next 
week, and I’m sorry for the delay.

When we do send it - it will be sent RFC.
It is possible that we all agree that the atomic instruction fix is applicable, 
but the rest is almost certainly not. It’s there to show progress and the 
direction we’re going and give everybody a basis from which to discuss the 
issues…. not more.


Cheers

Mark.



> On 27 Mar 2015, at 11:37, Frederic Konrad  wrote:
> 
> Hi,
> 
> Yes a v2 will come soon.
> Actually I'm trying to unbreak GDB stub and a strange segfault with smp > 2.
> 
> Fred
> 
> On 27/03/2015 11:08, Alex Bennée wrote:
>> fred.kon...@greensocs.com writes:
>> 
>>> From: KONRAD Frederic 
>>> 
>>> Hi everybody,
>>> 
>>> This is the start of our work on TCG multithread.
>> It's been a while since the first RFC are we likely to see a v2 of the
>> patch series any time soon?
>> 
>>> We send it for comment to be sure we are taking the right direction.
>>> We already discussed the first patch but we keep it for simplicity.
>>> 
>>> We choice to keep a common tbs array for all VCPU but protect it with the
>>> tb_lock from TBContext. Then for each PageDesc we have a tb list per VCPU.
>>> 
>>> Known issues:
>>>   * Some random deadlock.
>>>   * qemu_pause_cond is broken we can't quit QEMU.
>>>   * tb_flush is broken we must make sure no VCPU are executing code.
>>> 
>>> Jan Kiszka (1):
>>>   Drop global lock during TCG code execution
>>> 
>>> KONRAD Frederic (9):
>>>   target-arm: protect cpu_exclusive_*.
>>>   use a different translation block list for each cpu.
>>>   replace spinlock by QemuMutex.
>>>   remove unused spinlock.
>>>   extract TBContext from TCGContext.
>>>   protect TBContext with tb_lock.
>>>   tcg: remove tcg_halt_cond global variable.
>>>   cpu: remove exit_request global.
>>>   tcg: switch on multithread.
>>> 
>>>  cpu-exec.c|  47 +++
>>>  cpus.c| 122 +++
>>>  cputlb.c  |   5 ++
>>>  exec.c|  25 ++
>>>  include/exec/exec-all.h   |   4 +-
>>>  include/exec/spinlock.h   |  49 ---
>>>  include/qom/cpu.h |   1 +
>>>  linux-user/main.c |   6 +-
>>>  scripts/checkpatch.pl |   9 +-
>>>  softmmu_template.h|   6 ++
>>>  target-arm/cpu.c  |  14 
>>>  target-arm/cpu.h  |   3 +
>>>  target-arm/helper.h   |   3 +
>>>  target-arm/op_helper.c|  10 +++
>>>  target-arm/translate.c|   6 ++
>>>  target-i386/mem_helper.c  |  16 +++-
>>>  target-i386/misc_helper.c |  27 +-
>>>  tcg/i386/tcg-target.c |   8 ++
>>>  tcg/tcg.h |   3 +-
>>>  translate-all.c   | 208 
>>> +++---
>>>  vl.c  |   6 ++
>>>  21 files changed, 335 insertions(+), 243 deletions(-)
>>>  delete mode 100644 include/exec/spinlock.h
> 


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




Re: [Qemu-devel] [RFC 00/10] MultiThread TCG.

2015-03-30 Thread Mark Burton
understood.
Cheers
Mark.

> On 30 Mar 2015, at 23:46, Peter Maydell  wrote:
> 
> On 30 March 2015 at 07:52, Mark Burton  wrote:
>> So - Fred is unwilling to send the patch set as it stands, because frankly 
>> this part is totally broken.
>> 
>> There is an independent patch set that needs splitting out which deals with 
>> just the atomic instruction issue - specifically for ARM (though I guess 
>> it’s applicable across the board)…
>> 
>> So - in short - I HOPE to get the patch set onto the reflector sometime next 
>> week, and I’m sorry for the delay.
> 
> What I really want to see is not so much the patch set
> but the design sketch I asked for that lists the
> various data structures and indicates which ones
> are going to be per-cpu, which ones will be shared
> (and with what locking), etc.
> 
> -- PMM


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation

2015-05-06 Thread Mark Burton
A massive thank you for doing this work Alvise,

On our side, the patch we suggested is only applicable to ARM, though the 
mechanism would work for any CPU,
	- BUT
It doesn’t force atomic instructions out through the slow path. This is either 
a very good thing (it’s much faster) or a very bad thing (it doesn’t allow you 
to treat them in the IO space), depending on your point of view.

Depending on what the rest of the community thinks, it seems to me we should 
apply both patches, so that e.g. ARM’s existing atomic instructions run much 
faster and above all more ‘accurately’ (with the patch we’ve provided), and 
the same mechanism can be applied to all other architectures - but we can - 
somehow - swap in this more ‘controllable’ implementation when e.g. the mutex 
is located in IO space….
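(For illustration, a toy standalone model of the bitmap-gated LoadLink/StoreConditional scheme the cover letter below describes - invented names, one flag per page, no TLB or real vCPUs; the actual series clears a bit in a dirty bitmap and flushes the other vCPUs' TLBs so their stores are forced through the slow path.)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGES 4

static uint32_t mem[PAGES];
static bool excl_clean[PAGES];    /* "no store since the last LoadLink" flag */

static uint32_t loadlink(unsigned page)
{
    excl_clean[page] = true;      /* the series clears a dirty-bitmap bit and
                                     flushes the other vCPUs' TLBs here */
    return mem[page];
}

/* every store from another vCPU is forced through this slow path */
static void slow_store(unsigned page, uint32_t val)
{
    excl_clean[page] = false;     /* any intervening store spoils the link */
    mem[page] = val;
}

static int storecond(unsigned page, uint32_t val)
{
    if (!excl_clean[page]) {
        return 1;                 /* fail, like STREX returning 1 */
    }
    mem[page] = val;
    excl_clean[page] = false;
    return 0;
}

int main(void)
{
    loadlink(0);
    printf("%d\n", storecond(0, 1));   /* 0: success */
    loadlink(0);
    slow_store(0, 7);                  /* another vCPU wrote to the same page */
    printf("%d\n", storecond(0, 2));   /* 1: failure */
    return 0;
}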

Cheers

Mark.

> On 6 May 2015, at 17:38, Alvise Rigo  wrote:
> 
> This patch series provides an infrastructure for atomic
> instruction implementation in QEMU, paving the way for TCG multi-threading.
> The adopted design does not rely on host atomic
> instructions and is intended to propose a 'legacy' solution for
> translating guest atomic instructions.
> 
> The underlying idea is to provide new TCG instructions that guarantee
> atomicity to some memory accesses or in general a way to define memory
> transactions. More specifically, a new pair of TCG instructions are
> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
> LoadLink and StoreConditional primitives (only 32 bit variant
> implemented).  In order to achieve this, a new bitmap is added to the
> ram_list structure (always unique) which flags all memory pages that
> could not be accessed directly through the fast-path, due to previous
> exclusive operations. This new bitmap is coupled with a new TLB flag
> which forces the slow-path exectuion. All stores which take place
> between an LL/SC operation by other vCPUs in the same memory page, will
> fail the subsequent StoreConditional.
> 
> In theory, the provided implementation of TCG LoadLink/StoreConditional
> can be used to properly handle atomic instructions on any architecture.
> 
> The new slow-path is implemented such that:
> - the LoadLink behaves as a normal load slow-path, except for cleaning
>  the dirty flag in the bitmap. The TLB entries created from now on will
>  force the slow-path. To ensure it, we flush the TLB cache for the
>  other vCPUs
> - the StoreConditional behaves as a normal store slow-path, except for
>  checking the state of the dirty bitmap and returning 0 or 1 whether or
>  not the StoreConditional succeeded (0 when no vCPU has touched the
>  same memory in the mean time).
> 
> All those write accesses that are forced to follow the 'legacy'
> slow-path will set the accessed memory page to dirty.
> 
> In this series only the ARM ldrex/strex instructions are implemented.
> The code was tested with bare-metal test cases and with Linux, using
> upstream QEMU.
> 
> This work has been sponsored by Huawei Technologies Dusseldorf GmbH.
> 
> Alvise Rigo (5):
>  exec: Add new exclusive bitmap to ram_list
>  Add new TLB_EXCL flag
>  softmmu: Add helpers for a new slow-path
>  tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>  target-arm: translate: implement qemu_ldlink and qemu_stcond ops
> 
> cputlb.c|  11 ++-
> include/exec/cpu-all.h  |   1 +
> include/exec/cpu-defs.h |   2 +
> include/exec/memory.h   |   3 +-
> include/exec/ram_addr.h |  19 +++-
> softmmu_llsc_template.h | 233 
> softmmu_template.h  |  52 ++-
> target-arm/translate.c  |  94 ++-
> tcg/arm/tcg-target.c| 105 --
> tcg/tcg-be-ldst.h   |   2 +
> tcg/tcg-op.c|  20 +
> tcg/tcg-op.h|   3 +
> tcg/tcg-opc.h   |   4 +
> tcg/tcg.c   |   2 +
> tcg/tcg.h   |  20 +
> 15 files changed, 538 insertions(+), 33 deletions(-)
> create mode 100644 softmmu_llsc_template.h
> 
> -- 
> 2.4.0
> 


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation

2015-05-06 Thread Mark Burton
By the way - would it help to send the atomic patch we did separately from the 
whole MTTCG patch set?
Or have you already taken a look at that - it’s pretty short.

Cheers

Mark.


> On 6 May 2015, at 17:51, Paolo Bonzini  wrote:
> 
> On 06/05/2015 17:38, Alvise Rigo wrote:
>> This patch series provides an infrastructure for atomic
>> instruction implementation in QEMU, paving the way for TCG multi-threading.
>> The adopted design does not rely on host atomic
>> instructions and is intended to propose a 'legacy' solution for
>> translating guest atomic instructions.
>> 
>> The underlying idea is to provide new TCG instructions that guarantee
>> atomicity to some memory accesses or in general a way to define memory
>> transactions. More specifically, a new pair of TCG instructions are
>> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
>> LoadLink and StoreConditional primitives (only 32 bit variant
>> implemented).  In order to achieve this, a new bitmap is added to the
>> ram_list structure (always unique) which flags all memory pages that
>> could not be accessed directly through the fast-path, due to previous
>> exclusive operations. This new bitmap is coupled with a new TLB flag
>> which forces the slow-path exectuion. All stores which take place
>> between an LL/SC operation by other vCPUs in the same memory page, will
>> fail the subsequent StoreConditional.
>> 
>> In theory, the provided implementation of TCG LoadLink/StoreConditional
>> can be used to properly handle atomic instructions on any architecture.
>> 
>> The new slow-path is implemented such that:
>> - the LoadLink behaves as a normal load slow-path, except for cleaning
>>  the dirty flag in the bitmap. The TLB entries created from now on will
>>  force the slow-path. To ensure it, we flush the TLB cache for the
>>  other vCPUs
>> - the StoreConditional behaves as a normal store slow-path, except for
>>  checking the state of the dirty bitmap and returning 0 or 1 whether or
>>  not the StoreConditional succeeded (0 when no vCPU has touched the
>>  same memory in the mean time).
>> 
>> All those write accesses that are forced to follow the 'legacy'
>> slow-path will set the accessed memory page to dirty.
>> 
>> In this series only the ARM ldrex/strex instructions are implemented.
>> The code was tested with bare-metal test cases and with Linux, using
>> upstream QEMU.
>> 
>> This work has been sponsored by Huawei Technologies Dusseldorf GmbH.
>> 
>> Alvise Rigo (5):
>>  exec: Add new exclusive bitmap to ram_list
>>  Add new TLB_EXCL flag
>>  softmmu: Add helpers for a new slow-path
>>  tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>>  target-arm: translate: implement qemu_ldlink and qemu_stcond ops
> 
> That's pretty cool.
> 
> Paolo


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




Re: [Qemu-devel] [RFC 0/5] Slow-path for atomic instruction translation

2015-05-06 Thread Mark Burton

> On 6 May 2015, at 18:19, alvise rigo  wrote:
> 
> Hi Mark,
> 
> Firstly, thank you for your feedback.
> 
> On Wed, May 6, 2015 at 5:55 PM, Mark Burton  wrote:
>> A massive thank you for doing this work Alvise,
>> 
>> On our side, the patch we suggested is only applicable for ARM, though the 
>> mechanism would work for any CPU,
>>- BUT
>> It doesn’t force atomic instructions out through the slow path. This is 
>> either a very good thing (it’s much faster), or a very bad thing (it doesn’t 
>> allow you to treat them in the IO space), depending on your point of view.
> 
> Indeed, this is for sure a more invasive approach, but it's made on
> purpose to have control over those non-atomic stores that might modify
> the 'linked' memory.
> 

exactly

:-)
Cheers
Mark.

>> 
>> Depending on what the rest of the community thinks, it seems to me we should 
>> apply both patches so that e.g. ARM’s existing atomic instructions run much 
>> faster and above all more ‘accurately’ - (with the patch we’ve provided),  
>> and the same mechanism can be applied to all other architectures - but we 
>> can - somehow - swap for this more ‘controllable’ implementation when e.g. 
>> the mutex is located in IO space….
> 
> Yes, this makes sense.
> 
> Thank you,
> alvise
> 
>> 
>> Cheers
>> 
>> Mark.
>> 
>>> On 6 May 2015, at 17:38, Alvise Rigo  wrote:
>>> 
>>> This patch series provides an infrastructure for atomic
>>> instruction implementation in QEMU, paving the way for TCG multi-threading.
>>> The adopted design does not rely on host atomic
>>> instructions and is intended to propose a 'legacy' solution for
>>> translating guest atomic instructions.
>>> 
>>> The underlying idea is to provide new TCG instructions that guarantee
>>> atomicity to some memory accesses or in general a way to define memory
>>> transactions. More specifically, a new pair of TCG instructions are
>>> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
>>> LoadLink and StoreConditional primitives (only 32 bit variant
>>> implemented).  In order to achieve this, a new bitmap is added to the
>>> ram_list structure (always unique) which flags all memory pages that
>>> could not be accessed directly through the fast-path, due to previous
>>> exclusive operations. This new bitmap is coupled with a new TLB flag
>>> which forces the slow-path exectuion. All stores which take place
>>> between an LL/SC operation by other vCPUs in the same memory page, will
>>> fail the subsequent StoreConditional.
>>> 
>>> In theory, the provided implementation of TCG LoadLink/StoreConditional
>>> can be used to properly handle atomic instructions on any architecture.
>>> 
>>> The new slow-path is implemented such that:
>>> - the LoadLink behaves as a normal load slow-path, except for cleaning
>>> the dirty flag in the bitmap. The TLB entries created from now on will
>>> force the slow-path. To ensure it, we flush the TLB cache for the
>>> other vCPUs
>>> - the StoreConditional behaves as a normal store slow-path, except for
>>> checking the state of the dirty bitmap and returning 0 or 1 whether or
>>> not the StoreConditional succeeded (0 when no vCPU has touched the
>>> same memory in the mean time).
>>> 
>>> All those write accesses that are forced to follow the 'legacy'
>>> slow-path will set the accessed memory page to dirty.
>>> 
>>> In this series only the ARM ldrex/strex instructions are implemented.
>>> The code was tested with bare-metal test cases and with Linux, using
>>> upstream QEMU.
>>> 
>>> This work has been sponsored by Huawei Technologies Dusseldorf GmbH.
>>> 
>>> Alvise Rigo (5):
>>> exec: Add new exclusive bitmap to ram_list
>>> Add new TLB_EXCL flag
>>> softmmu: Add helpers for a new slow-path
>>> tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>>> target-arm: translate: implement qemu_ldlink and qemu_stcond ops
>>> 
>>> cputlb.c|  11 ++-
>>> include/exec/cpu-all.h  |   1 +
>>> include/exec/cpu-defs.h |   2 +
>>> include/exec/memory.h   |   3 +-
>>> include/exec/ram_addr.h |  19 +++-
>>> softmmu_llsc_template.h | 233 
>>> 
>>> softmmu_template.h  |  52 ++-
>>> target-arm/translate.c  |  94 ++-
>>> tcg/arm/tcg-target.c| 105 --
>>> tcg/tcg-be-ldst.h   |   2 +
>>> tcg/tcg-op.c|  20 +
>>> tcg/tcg-op.h|   3 +
>>> tcg/tcg-opc.h   |   4 +
>>> tcg/tcg.c   |   2 +
>>> tcg/tcg.h   |  20 +
>>> 15 files changed, 538 insertions(+), 33 deletions(-)
>>> create mode 100644 softmmu_llsc_template.h
>>> 
>>> --
>>> 2.4.0
>>> 
>> 
>> 
>> +44 (0)20 7100 3485 x 210
>> +33 (0)5 33 52 01 77x 210
>> 
>>+33 (0)603762104
>>mark.burton
>> 


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton




Re: [Qemu-devel] MTTCG Sync-up call today? Agenda items?

2016-04-11 Thread Mark Burton
Good plan :-)
Cheers
Mark.

> On 11 Apr 2016, at 13:21, Alex Bennée  wrote:
> 
> Hi,
> 
> It's been awhile since we synced-up with quite weeks and Easter out of
> the way are we good for a call today?
> 
> Some items I can think would be worth covering:
> 
>  - State of MTTCG enabled LL/SC
> 
>I think Alvise was looking at some run-loop changes for the MTTCG
>enabled part of his LL/SC patch set. I haven't heard anything for a
>while.
> 
>  - Linaro's current efforts
> 
>Sergey is currently working with us in up-streaming MTTCG related
>patches. We've taken stuff from Paolo and Fred's trees and push
>several series to the list for review:
>  - various TCG clean-ups
>  - atomic jump patching
>  - base enabling patches
> 
>  - Memory ordering work
> 
>I put this up as a suggested project for GSoC and we had several
>applicants. We are currently awaiting Google's decision on slot
>allocations so hopefully we'll have an extra pair of hands on this
>chunk.
> 
>  - Emilio's work
> 
>Emilio posted a series last year with some alternative approaches to
>solving some of the MTTCG problems. He's back and working on this
>again and has posted his qht series as a precursor to his next
>revision of the MTTCG tree.
> 
>  - TCG maintainers view
> 
>It would be useful if we had a view from the TCG maintainers of how
>we are doing and if the approaches have a chance of getting merged.
> 
>  - Timescales
> 
>Internally at Linaro we've been discussing potential timescales. We are
>currently aiming to have all major pieces of MTTCG posted on list and
>being iterated through reviews before this years KVM Forum. This will
>leave us KVM forum to resolve any remaining barriers to eventual
>up-streaming.
> 
> Anything else worthy of discussion? If you need the number to call ping
> me off list.
> 
> --
> Alex Bennée


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton








Re: [Qemu-devel] MTTCG Sync-up call today? Agenda items?

2016-04-11 Thread Mark Burton
So see you all online - on the normal number

Cheers
Mark.

> On 11 Apr 2016, at 13:45, alvise rigo  wrote:
> 
> Hi Alex,
> 
> On Mon, Apr 11, 2016 at 1:21 PM, Alex Bennée  wrote:
>> 
>> Hi,
>> 
>> It's been awhile since we synced-up with quite weeks and Easter out of
>> the way are we good for a call today?
> 
> 
> Indeed, it has been a while.
> 
>> 
>> 
>> Some items I can think would be worth covering:
>> 
>>  - State of MTTCG enabled LL/SC
>> 
>>I think Alvise was looking at some run-loop changes for the MTTCG
>>enabled part of his LL/SC patch set. I haven't heard anything for a
>>while.
> 
> 
> I've been quite busy lately, but expect the v8 of the LL/SC patch
> series by the end of this week. Sorry for the delay.
> 
> In any case, I'm in if any call will be held.
> 
> Regards,
> alvise
> 
>> 
>> 
>>  - Linaro's current efforts
>> 
>>Sergey is currently working with us in up-streaming MTTCG related
>>patches. We've taken stuff from Paolo and Fred's trees and push
>>several series to the list for review:
>>  - various TCG clean-ups
>>  - atomic jump patching
>>  - base enabling patches
>> 
>>  - Memory ordering work
>> 
>>I put this up as a suggested project for GSoC and we had several
>>applicants. We are currently awaiting Google's decision on slot
>>allocations so hopefully we'll have an extra pair of hands on this
>>chunk.
>> 
>>  - Emilio's work
>> 
>>Emilio posted a series last year with some alternative approaches to
>>solving some of the MTTCG problems. He's back and working on this
>>again and has posted his qht series as a precursor to his next
>>revision of the MTTCG tree.
>> 
>>  - TCG maintainers view
>> 
>>It would be useful if we had a view from the TCG maintainers of how
>>we are doing and if the approaches have a chance of getting merged.
>> 
>>  - Timescales
>> 
>>Internally at Linaro we've been discussing potential timescales. We are
>>currently aiming to have all major pieces of MTTCG posted on list and
>>being iterated through reviews before this years KVM Forum. This will
>>leave us KVM forum to resolve any remaining barriers to eventual
>>up-streaming.
>> 
>> Anything else worthy of discussion? If you need the number to call ping
>> me off list.
>> 
>> --
>> Alex Bennée


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton








Re: [Qemu-devel] MTTCG Sync-up call today? Agenda items?

2016-04-11 Thread Mark Burton
Thanks Alex,
sorry I was so ‘mute’ - seems I have gremlins on the phone line again!
Cheers

Good to see Sergey on, I’ve added him to the list

Mark.

> On 11 Apr 2016, at 15:38, Alex Bennée  wrote:
> 
> 
> Alex Bennée  writes:
> 
>> Hi,
>> 
>> It's been awhile since we synced-up with quite weeks and Easter out of
>> the way are we good for a call today?
>> 
>> Some items I can think would be worth covering:
> 
> Well it mostly seemed to be me talking but here are my notes:
> 
>  - LL/SC Patch Status (Alvise)
>- v8 will be posted this week, follow-up MTTCG due
>  - will require async_safe_work
>  - AJB will post his revised re-based safe work series
>  - Linaro's approach going forward (Alex)
>- Take the uncontroversial clean-ups, push the upstream (mostly Sergey)
>- Provide a common set of base patches (enables MTTCG but not arch support)
>- (ADDED) Deferred work patches, needed separately for LL/SC MTTCG work
>- Following up with individual architecture enabling series (ARMv7, v8, 
> onwards)
>  - Memory ordering work (Alex)
>- Several applicants to proposed GSoC project
>- Will move forward once Google confirm funding
>  - Emilio's work
>- Emilio is back working on MTTCG, based on common base-patches
>- Others have seen his QHT proposals, review feedback from TCG devs looks 
> good
>  - TCG Maintainers view
>- Not had much feedback on later series, Alex will reach out
>  - Timescales
>- Linaro is committed to having all major work on list by KVM Forum 2016
>- Hopefully by KVM Forum we can work on roadmap to merging
>- Alex will be there, too early for most of the others to confirm
>  - Any other business
>- Mark commented it was nice to see things moving forward again ;-)
> 
> --
> Alex Bennée


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton








Re: [Qemu-devel] MTTCG Sync-up call today?

2016-04-25 Thread Mark Burton
Fred’s away this week too

Cheers
Mark.

> On 25 Apr 2016, at 12:32, alvise rigo  wrote:
> 
> Hi Alex,
> 
> On Mon, Apr 25, 2016 at 11:53 AM, Alex Bennée  > wrote:
> Hi,
> 
> We are due to have a sync-up call today but I don't think I'll be able
> to make it thanks to a very rough voice courtesy of my
> petri-dishes/children. However since the last call:
> 
>  * Posted final parts of the MTTCG patch set
>  * Lots of reviews
> 
> Please welcome Pranith to the group who is participating as a GSoC
> student. His project will be looking at the modelling of memory barriers
> in MTTCG.
> 
> My plan for the next week is look in more detail at the tb_find_fast
> lock breaking patch. I had to drop it from the original series due to
> regressions but it has a massive effect on performance that means it
> needs sorting. I'll probably start re-building a tree with Emilio's QHT
> patches included which will allow for lockless lookups in the fast path
> and then start re-basing the enabling and ARMv7 patches on top of that.
> 
> I'll also do another review pass of Alvise' LL/SC patches but a fresh
> pair of eyes on them will be appreciated.
> 
> Alvise, how are you doing with the MTTCG version?
> 
> I'm working on this, mostly I'm investigating some issues related to the 
> dirty bitmaps' code that lately changed a bit.
> Apart from that, the branch is almost ready.
> 
> Best regards,
> alvise
> 
> 
> --
> Alex Bennée
> 


 +44 (0)20 7100 3485 x 210
 +33 (0)5 33 52 01 77x 210

+33 (0)603762104
mark.burton
 




