Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
On 11/07/2017 00:13, Emilio G. Cota wrote: > On Mon, Jul 10, 2017 at 17:33:07 -0400, Paolo Bonzini wrote: >> >>> I agree that it would be nice to have the same mechanism for all. >>> >>> The main hurdle I see is how to allow for concurrent code generation while >>> minimizing flushes of the single, fixed-size[*] code_gen_buffer. >>> In user-mode this is tricky because there is no way to bound the number >>> of threads that might be spawned by the guest code (I don't think reading >>> /proc/sys/kernel/threads-max is a viable solution here). >>> >>> Switching to a "__thread *tcg_ctx_ptr" model will help minimize >>> user-mode/softmmu differences though. The only remaining difference would be >>> that user-mode would need tb_lock() around tb_gen_code, whereas softmmu >>> wouldn't, but everything else would be the same. >> >> Hmm, tb_gen_code is already protected by mmap_lock in linux-user, so you >> wouldn't >> get any parallelism. On the other hand, you could just say that the >> fixed-size >> code_gen_buffer is protected by mmap_lock, which doesn't exist for softmmu. > > Yes. tb_lock/mmap_lock, or like they're called in some asserts, memory_lock. > > A way to get some parallelism in user-mode given the constraints > would be to share regions among TCG threads. Threads would still need to take > a per-region lock, but it wouldn't be a global lock so that would scale > better. > > I'm not sure we really need that much parallelism for code generation in > user-mode, > though. So I wouldn't focus on this until seeing benchmarks that have a clear > bottleneck due to "memory_lock". I agree. Still, we could minimize the differences by protecting tb_gen_code only with mmap_lock, instead of mmap_lock+tb_lock. Paolo
Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
On Mon, Jul 10, 2017 at 17:33:07 -0400, Paolo Bonzini wrote: > > > I agree that it would be nice to have the same mechanism for all. > > > > The main hurdle I see is how to allow for concurrent code generation while > > minimizing flushes of the single, fixed-size[*] code_gen_buffer. > > In user-mode this is tricky because there is no way to bound the number > > of threads that might be spawned by the guest code (I don't think reading > > /proc/sys/kernel/threads-max is a viable solution here). > > > > Switching to a "__thread *tcg_ctx_ptr" model will help minimize > > user-mode/softmmu differences though. The only remaining difference would be > > that user-mode would need tb_lock() around tb_gen_code, whereas softmmu > > wouldn't, but everything else would be the same. > > Hmm, tb_gen_code is already protected by mmap_lock in linux-user, so you > wouldn't > get any parallelism. On the other hand, you could just say that the > fixed-size > code_gen_buffer is protected by mmap_lock, which doesn't exist for softmmu. Yes. tb_lock/mmap_lock, or like they're called in some asserts, memory_lock. A way to get some parallelism in user-mode given the constraints would be to share regions among TCG threads. Threads would still need to take a per-region lock, but it wouldn't be a global lock so that would scale better. I'm not sure we really need that much parallelism for code generation in user-mode, though. So I wouldn't focus on this until seeing benchmarks that have a clear bottleneck due to "memory_lock". E.
Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
> I agree that it would be nice to have the same mechanism for all. > > The main hurdle I see is how to allow for concurrent code generation while > minimizing flushes of the single, fixed-size[*] code_gen_buffer. > In user-mode this is tricky because there is no way to bound the number > of threads that might be spawned by the guest code (I don't think reading > /proc/sys/kernel/threads-max is a viable solution here). > > Switching to a "__thread *tcg_ctx_ptr" model will help minimize > user-mode/softmmu differences though. The only remaining difference would be > that user-mode would need tb_lock() around tb_gen_code, whereas softmmu > wouldn't, but everything else would be the same. Hmm, tb_gen_code is already protected by mmap_lock in linux-user, so you wouldn't get any parallelism. On the other hand, you could just say that the fixed-size code_gen_buffer is protected by mmap_lock, which doesn't exist for softmmu. Paolo
Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
On Mon, Jul 10, 2017 at 14:05:01 +0200, Paolo Bonzini wrote: > On 09/07/2017 09:50, Emilio G. Cota wrote: > > User-mode is kept out of this: contention due to concurrent translation > > is more commonly found in full-system mode. > > Out of curiosity, is it harder or you just didn't try? It would be nice > if the commit message mentioned the problems (if any) in addition to the > reason why you didn't do it. > > Having similar policies for user and softmmu emulation is much more > maintainable (for an earlier example, see the unification of user mode > emulation's start/end_exclusive logic with softmmu's "safe work"). I agree that it would be nice to have the same mechanism for all. The main hurdle I see is how to allow for concurrent code generation while minimizing flushes of the single, fixed-size[*] code_gen_buffer. In user-mode this is tricky because there is no way to bound the number of threads that might be spawned by the guest code (I don't think reading /proc/sys/kernel/threads-max is a viable solution here). Switching to a "__thread *tcg_ctx_ptr" model will help minimize user-mode/softmmu differences though. The only remaining difference would be that user-mode would need tb_lock() around tb_gen_code, whereas softmmu wouldn't, but everything else would be the same. E. [*] Note that in user-mode we use code_gen_buffer defined at compile-time as a static buffer[].
Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
On 09/07/2017 09:50, Emilio G. Cota wrote: > User-mode is kept out of this: contention due to concurrent translation > is more commonly found in full-system mode. Out of curiosity, is it harder or you just didn't try? It would be nice if the commit message mentioned the problems (if any) in addition to the reason why you didn't do it. Having similar policies for user and softmmu emulation is much more maintainable (for an earlier example, see the unification of user mode emulation's start/end_exclusive logic with softmmu's "safe work"). Paolo
Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
On Sun, Jul 09, 2017 at 11:48:53 -1000, Richard Henderson wrote: > On 07/09/2017 11:29 AM, Emilio G. Cota wrote: (snip) > >Exactly. Also, in user-mode "vCPU threads" (i.e. host threads) come and > >go all the time, so this doesn't work well with having a single > >code_gen_buffer, which I assumed was non-negotiable. > > Ah, yes. For any subdivision N of code_gen_buffer that we choose, at some > point we may have N+1 threads and need to do Something Else. > > That's probably something worth commenting somewhere with the "first" > #ifndef CONFIG_USER_ONLY. Yep, will do. Thanks, Emilio
Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
On 07/09/2017 11:29 AM, Emilio G. Cota wrote: On Sun, Jul 09, 2017 at 11:19:37 -1000, Richard Henderson wrote: On 07/08/2017 09:50 PM, Emilio G. Cota wrote: This allows us to generate TCG code in parallel. MTTCG already uses it, although the next commit pushes down a lock to actually perform parallel generation. User-mode is kept out of this: contention due to concurrent translation is more commonly found in full-system mode. Um, why do you believe that? Are you suggesting that a multi-threaded user-only guest is much more likely to share TBs and do much less code generation total? Exactly. Also, in user-mode "vCPU threads" (i.e. host threads) come and go all the time, so this doesn't work well with having a single code_gen_buffer, which I assumed was non-negotiable. Ah, yes. For any subdivision N of code_gen_buffer that we choose, at some point we may have N+1 threads and need to do Something Else. That's probably something worth commenting somewhere with the "first" #ifndef CONFIG_USER_ONLY. r~
Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
On Sun, Jul 09, 2017 at 11:19:37 -1000, Richard Henderson wrote: > On 07/08/2017 09:50 PM, Emilio G. Cota wrote: > >This allows us to generate TCG code in parallel. MTTCG already uses > >it, although the next commit pushes down a lock to actually > >perform parallel generation. > > > >User-mode is kept out of this: contention due to concurrent translation > >is more commonly found in full-system mode. > > Um, why do you believe that? Are you suggesting that a multi-threaded > user-only guest is much more likely to share TBs and do much less code > generation total? Exactly. Also, in user-mode "vCPU threads" (i.e. host threads) come and go all the time, so this doesn't work well with having a single code_gen_buffer, which I assumed was non-negotiable. > At the moment I think it's just a confusing distinction. As proven by some > of the comment adjustments you made. > > >-TCGContext tcg_ctx; > >+TCG_THREAD TCGContext tcg_ctx; > > This is a really large structure, and it's not needed by any of the I/O > threads. We're probably better off dynamically allocating this ourselves > and do something like (snip) Agreed, will look into this. E.
Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
On 07/08/2017 09:50 PM, Emilio G. Cota wrote: This allows us to generate TCG code in parallel. MTTCG already uses it, although the next commit pushes down a lock to actually perform parallel generation. User-mode is kept out of this: contention due to concurrent translation is more commonly found in full-system mode. Um, why do you believe that? Are you suggesting that a multi-threaded user-only guest is much more likely to share TBs and do much less code generation total? At the moment I think it's just a confusing distinction. As proven by some of the comment adjustments you made. -TCGContext tcg_ctx; +TCG_THREAD TCGContext tcg_ctx; This is a really large structure, and it's not needed by any of the I/O threads. We're probably better off dynamically allocating this ourselves and do something like __thread TCGContext *tcg_ctx_ptr; #define tcg_ctx (*tcg_ctx_ptr) You could then even have tcg_init_ctx be the object instead of a pointer to the object. for the main thread, so you don't have to have an extra pointer to this; just reference it by symbol. -static struct tcg_temp_info temps[TCG_MAX_TEMPS]; -static TCGTempSet temps_used; +static TCG_THREAD struct tcg_temp_info temps[TCG_MAX_TEMPS]; +static TCG_THREAD TCGTempSet temps_used; I've been meaning to dynamically allocate these for a while now. I think that would be better than putting this large array in TLS. Which, again, is not needed for I/O threads. r~
Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
On 07/08/2017 09:50 PM, Emilio G. Cota wrote: I was not sure about tci_regs. From code inspection it seems that they have to be per-thread, so I converted them, but I do not think anyone has ever tried to get MTTCG working with TCI. Yes, those should be per-thread. Really, they should be on the stack in tcg_qemu_tb_exec, but the helpers aren't really structured to support that. r~