Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu

2017-07-11 Thread Paolo Bonzini
On 11/07/2017 00:13, Emilio G. Cota wrote:
> On Mon, Jul 10, 2017 at 17:33:07 -0400, Paolo Bonzini wrote:
>>
>>> I agree that it would be nice to have the same mechanism for all.
>>>
>>> The main hurdle I see is how to allow for concurrent code generation while
>>> minimizing flushes of the single, fixed-size[*] code_gen_buffer.
>>> In user-mode this is tricky because there is no way to bound the number
>>> of threads that might be spawned by the guest code (I don't think reading
>>> /proc/sys/kernel/threads-max is a viable solution here).
>>>
>>> Switching to a "__thread *tcg_ctx_ptr" model will help minimize
>>> user-mode/softmmu differences though. The only remaining difference would be
>>> that user-mode would need tb_lock() around tb_gen_code, whereas softmmu
>>> wouldn't, but everything else would be the same.
>>
>> Hmm, tb_gen_code is already protected by mmap_lock in linux-user, so you 
>> wouldn't
>> get any parallelism.  On the other hand, you could just say that the 
>> fixed-size
>> code_gen_buffer is protected by mmap_lock, which doesn't exist for softmmu.
> 
> Yes. tb_lock/mmap_lock, or like they're called in some asserts, memory_lock.
> 
> A way to get some parallelism in user-mode given the constraints
> would be to share regions among TCG threads. Threads would still need to take
> a per-region lock, but it wouldn't be a global lock so that would scale 
> better.
> 
> I'm not sure we really need that much parallelism for code generation in 
> user-mode,
> though. So I wouldn't focus on this until seeing benchmarks that have a clear
> bottleneck due to "memory_lock".

I agree.  Still, we could minimize the differences by protecting
tb_gen_code only with mmap_lock, instead of mmap_lock+tb_lock.

Paolo



Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu

2017-07-10 Thread Emilio G. Cota
On Mon, Jul 10, 2017 at 17:33:07 -0400, Paolo Bonzini wrote:
> 
> > I agree that it would be nice to have the same mechanism for all.
> > 
> > The main hurdle I see is how to allow for concurrent code generation while
> > minimizing flushes of the single, fixed-size[*] code_gen_buffer.
> > In user-mode this is tricky because there is no way to bound the number
> > of threads that might be spawned by the guest code (I don't think reading
> > /proc/sys/kernel/threads-max is a viable solution here).
> > 
> > Switching to a "__thread *tcg_ctx_ptr" model will help minimize
> > user-mode/softmmu differences though. The only remaining difference would be
> > that user-mode would need tb_lock() around tb_gen_code, whereas softmmu
> > wouldn't, but everything else would be the same.
> 
> Hmm, tb_gen_code is already protected by mmap_lock in linux-user, so you 
> wouldn't
> get any parallelism.  On the other hand, you could just say that the 
> fixed-size
> code_gen_buffer is protected by mmap_lock, which doesn't exist for softmmu.

Yes. tb_lock/mmap_lock, or like they're called in some asserts, memory_lock.

A way to get some parallelism in user-mode given the constraints
would be to share regions among TCG threads. Threads would still need to take
a per-region lock, but it wouldn't be a global lock so that would scale better.

I'm not sure we really need that much parallelism for code generation in 
user-mode,
though. So I wouldn't focus on this until seeing benchmarks that have a clear
bottleneck due to "memory_lock".

E.



Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu

2017-07-10 Thread Paolo Bonzini

> I agree that it would be nice to have the same mechanism for all.
> 
> The main hurdle I see is how to allow for concurrent code generation while
> minimizing flushes of the single, fixed-size[*] code_gen_buffer.
> In user-mode this is tricky because there is no way to bound the number
> of threads that might be spawned by the guest code (I don't think reading
> /proc/sys/kernel/threads-max is a viable solution here).
> 
> Switching to a "__thread *tcg_ctx_ptr" model will help minimize
> user-mode/softmmu differences though. The only remaining difference would be
> that user-mode would need tb_lock() around tb_gen_code, whereas softmmu
> wouldn't, but everything else would be the same.

Hmm, tb_gen_code is already protected by mmap_lock in linux-user, so you 
wouldn't
get any parallelism.  On the other hand, you could just say that the fixed-size
code_gen_buffer is protected by mmap_lock, which doesn't exist for softmmu.

Paolo



Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu

2017-07-10 Thread Emilio G. Cota
On Mon, Jul 10, 2017 at 14:05:01 +0200, Paolo Bonzini wrote:
> On 09/07/2017 09:50, Emilio G. Cota wrote:
> > User-mode is kept out of this: contention due to concurrent translation
> > is more commonly found in full-system mode.
> 
> Out of curiosity, is it harder or you just didn't try?  It would be nice
> if the commit message mentioned the problems (if any) in addition to the
> reason why you didn't do it.
> 
> Having similar policies for user and softmmu emulation is much more
> maintainable (for an earlier example, see the unification of user mode
> emulation's start/end_exclusive logic with softmmu's "safe work").

I agree that it would be nice to have the same mechanism for all.

The main hurdle I see is how to allow for concurrent code generation while
minimizing flushes of the single, fixed-size[*] code_gen_buffer.
In user-mode this is tricky because there is no way to bound the number
of threads that might be spawned by the guest code (I don't think reading
/proc/sys/kernel/threads-max is a viable solution here).

Switching to a "__thread *tcg_ctx_ptr" model will help minimize
user-mode/softmmu differences though. The only remaining difference would be
that user-mode would need tb_lock() around tb_gen_code, whereas softmmu
wouldn't, but everything else would be the same.

E.

[*] Note that in user-mode we use code_gen_buffer defined at compile-time
as a static buffer[].



Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu

2017-07-10 Thread Paolo Bonzini
On 09/07/2017 09:50, Emilio G. Cota wrote:
> User-mode is kept out of this: contention due to concurrent translation
> is more commonly found in full-system mode.

Out of curiosity, is it harder or you just didn't try?  It would be nice
if the commit message mentioned the problems (if any) in addition to the
reason why you didn't do it.

Having similar policies for user and softmmu emulation is much more
maintainable (for an earlier example, see the unification of user mode
emulation's start/end_exclusive logic with softmmu's "safe work").

Paolo



Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu

2017-07-09 Thread Emilio G. Cota
On Sun, Jul 09, 2017 at 11:48:53 -1000, Richard Henderson wrote:
> On 07/09/2017 11:29 AM, Emilio G. Cota wrote:
(snip)
> >Exactly. Also, in user-mode "vCPU threads" (i.e. host threads) come and
> >go all the time, so this doesn't work well with having a single
> >code_gen_buffer, which I assumed was non-negotiable.
> 
> Ah, yes.  For any subdivision N of code_gen_buffer that we choose, at some
> point we may have N+1 threads and need to do Something Else.
> 
> That's probably something worth commenting somewhere with the "first"
> #ifndef CONFIG_USER_ONLY.

Yep, will do.

Thanks,

Emilio



Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu

2017-07-09 Thread Richard Henderson

On 07/09/2017 11:29 AM, Emilio G. Cota wrote:

On Sun, Jul 09, 2017 at 11:19:37 -1000, Richard Henderson wrote:

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:

This allows us to generate TCG code in parallel. MTTCG already uses
it, although the next commit pushes down a lock to actually
perform parallel generation.

User-mode is kept out of this: contention due to concurrent translation
is more commonly found in full-system mode.


Um, why do you believe that?  Are you suggesting that a multi-threaded
user-only guest is much more likely to share TBs and do much less code
generation total?


Exactly. Also, in user-mode "vCPU threads" (i.e. host threads) come and
go all the time, so this doesn't work well with having a single
code_gen_buffer, which I assumed was non-negotiable.


Ah, yes.  For any subdivision N of code_gen_buffer that we choose, at some 
point we may have N+1 threads and need to do Something Else.


That's probably something worth commenting somewhere with the "first" #ifndef 
CONFIG_USER_ONLY.



r~



Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu

2017-07-09 Thread Emilio G. Cota
On Sun, Jul 09, 2017 at 11:19:37 -1000, Richard Henderson wrote:
> On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> >This allows us to generate TCG code in parallel. MTTCG already uses
> >it, although the next commit pushes down a lock to actually
> >perform parallel generation.
> >
> >User-mode is kept out of this: contention due to concurrent translation
> >is more commonly found in full-system mode.
> 
> Um, why do you believe that?  Are you suggesting that a multi-threaded
> user-only guest is much more likely to share TBs and do much less code
> generation total?

Exactly. Also, in user-mode "vCPU threads" (i.e. host threads) come and
go all the time, so this doesn't work well with having a single
code_gen_buffer, which I assumed was non-negotiable.

> At the moment I think it's just a confusing distinction.  As proven by some
> of the comment adjustments you made.
> 
> >-TCGContext tcg_ctx;
> >+TCG_THREAD TCGContext tcg_ctx;
> 
> This is a really large structure, and it's not needed by any of the I/O
> threads.  We're probably better off dynamically allocating this ourselves
> and do something like
(snip)

Agreed, will look into this.

E.



Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu

2017-07-09 Thread Richard Henderson

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:

This allows us to generate TCG code in parallel. MTTCG already uses
it, although the next commit pushes down a lock to actually
perform parallel generation.

User-mode is kept out of this: contention due to concurrent translation
is more commonly found in full-system mode.


Um, why do you believe that?  Are you suggesting that a multi-threaded 
user-only guest is much more likely to share TBs and do much less code 
generation total?


At the moment I think it's just a confusing distinction.  As proven by some of 
the comment adjustments you made.



-TCGContext tcg_ctx;
+TCG_THREAD TCGContext tcg_ctx;


This is a really large structure, and it's not needed by any of the I/O 
threads.  We're probably better off dynamically allocating this ourselves and 
do something like


__thread TCGContext *tcg_ctx_ptr;
#define tcg_ctx (*tcg_ctx_ptr)

You could then even have tcg_init_ctx be the object instead of a pointer to the 
object.


for the main thread, so you don't have to have an extra pointer to this; just 
reference it by symbol.



-static struct tcg_temp_info temps[TCG_MAX_TEMPS];
-static TCGTempSet temps_used;
+static TCG_THREAD struct tcg_temp_info temps[TCG_MAX_TEMPS];
+static TCG_THREAD TCGTempSet temps_used;



I've been meaning to dynamically allocate these for a while now.  I think that 
would be better than putting this large array in TLS.  Which, again, is not 
needed for I/O threads.



r~



Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu

2017-07-09 Thread Richard Henderson

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:

I was not sure about tci_regs. From code inspection it seems that
they have to be per-thread, so I converted them, but I do not think
anyone has ever tried to get MTTCG working with TCI.


Yes, those should be per-thread.

Really, they should be on the stack in tcg_qemu_tb_exec, but the helpers aren't 
really structured to support that.



r~