Re: [Qemu-devel] [PATCH v3 3/4] cputlb: serialize tlb updates with env->tlb_lock

2018-10-08 Thread Alex Bennée


Emilio G. Cota  writes:

> On Mon, Oct 08, 2018 at 14:57:18 +0100, Alex Bennée wrote:
>> Emilio G. Cota  writes:
>> > The readers that do not hold tlb_lock must use atomic reads when
>> > reading .addr_write, since this field can be updated by other threads;
>> > the conversion to atomic reads is done in the next patch.
>>
>> We don't enforce this for the TCG code - but rely on the backend ISAs
>> to avoid torn reads from updates from cputlb that could invalidate an
>> entry.
>
> We do enforce it though; the TLB reads we emit in TCG backend
> code are appropriately sized to guarantee atomic reads.
>
>> > -/* For atomic correctness when running MTTCG we need to use the right
>> > - * primitives when copying entries */
>> > -static inline void copy_tlb_helper(CPUTLBEntry *d, CPUTLBEntry *s,
>> > -   bool atomic_set)
>> > +/* Called with tlb_lock held */
>> > +static inline void copy_tlb_helper_locked(CPUTLBEntry *d, const CPUTLBEntry *s)
>> >  {
>> > -#if TCG_OVERSIZED_GUEST
>> >  *d = *s;
>>
>> In general I'm happy with the patch set but what ensures that this
>> always does the right thing with respect to the TCG code reads that race with it?
>
> copy_tlb_helper is only called by the "owner" CPU, so it cannot
> race with TCG code (i.e. the owner thread cannot race with itself).
>
> I wanted to add an assert_cpu_is_self(cpu) here, but that needs
> a CPUState pointer. Maybe I should just get rid of the function?
> All the callers have the assert, so that might make the code
> clearer.

I'm happy keeping the function and just expanding the comment:

/* Called with tlb_lock held and only ever from the vCPU context */
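
That is, keeping the body exactly as in your patch, something along these
lines (sketch only):

    /* Called with tlb_lock held and only ever from the vCPU context */
    static inline void copy_tlb_helper_locked(CPUTLBEntry *d,
                                              const CPUTLBEntry *s)
    {
        *d = *s;
    }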

Reviewed-by: Alex Bennée 

>
> Thanks,
>
>   Emilio


--
Alex Bennée



Re: [Qemu-devel] [PATCH v3 3/4] cputlb: serialize tlb updates with env->tlb_lock

2018-10-08 Thread Emilio G. Cota
On Mon, Oct 08, 2018 at 14:57:18 +0100, Alex Bennée wrote:
> Emilio G. Cota  writes:
> > The readers that do not hold tlb_lock must use atomic reads when
> > reading .addr_write, since this field can be updated by other threads;
> > the conversion to atomic reads is done in the next patch.
> 
> We don't enforce this for the TCG code - but rely on the backend ISAs
> to avoid torn reads from updates from cputlb that could invalidate an
> entry.

We do enforce it though; the TLB reads we emit in TCG backend
code are appropriately sized to guarantee atomic reads.
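
On the C side, the conversion in the next patch boils down to routing the
lockless reads of .addr_write through an atomic read, roughly along these
lines (a sketch only; the helper name here is just for illustration):

    /* Read .addr_write without holding tlb_lock.  Oversized guests
     * (64-bit target_ulong on a 32-bit host) cannot use MTTCG, so a
     * plain read suffices there; otherwise use an atomic read so we
     * never see a torn value from an updater that holds tlb_lock. */
    static inline target_ulong tlb_addr_write(const CPUTLBEntry *entry)
    {
    #if TCG_OVERSIZED_GUEST
        return entry->addr_write;
    #else
        return atomic_read(&entry->addr_write);
    #endif
    }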

> > -/* For atomic correctness when running MTTCG we need to use the right
> > - * primitives when copying entries */
> > -static inline void copy_tlb_helper(CPUTLBEntry *d, CPUTLBEntry *s,
> > -   bool atomic_set)
> > +/* Called with tlb_lock held */
> > +static inline void copy_tlb_helper_locked(CPUTLBEntry *d, const CPUTLBEntry *s)
> >  {
> > -#if TCG_OVERSIZED_GUEST
> >  *d = *s;
> 
> In general I'm happy with the patch set but what ensures that this
> always does the right thing with respect to the TCG code reads that race with it?

copy_tlb_helper is only called by the "owner" CPU, so it cannot
race with TCG code (i.e. the owner thread cannot race with itself).

I wanted to add an assert_cpu_is_self(cpu) here, but that needs
a CPUState pointer. Maybe I should just get rid of the function?
All the callers have the assert, so that might make the code
clearer.
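
For illustration, a call site would then look roughly like this (the
identifiers below are placeholders, not a literal hunk from the patch):

    /* The assertion and the lock live in the caller, so the copy helper
     * itself stays trivial. */
    static void tlb_update_entry_example(CPUState *cpu, CPUArchState *env,
                                         int mmu_idx, size_t index,
                                         const CPUTLBEntry *new_entry)
    {
        assert_cpu_is_self(cpu);
        qemu_spin_lock(&env->tlb_lock);
        copy_tlb_helper_locked(&env->tlb_table[mmu_idx][index], new_entry);
        qemu_spin_unlock(&env->tlb_lock);
    }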

Thanks,

Emilio




Re: [Qemu-devel] [PATCH v3 3/4] cputlb: serialize tlb updates with env->tlb_lock

2018-10-08 Thread Alex Bennée


Emilio G. Cota  writes:

> Currently we rely on atomic operations for cross-CPU invalidations.
> There are two cases that these atomics miss: cross-CPU invalidations
> can race with either (1) vCPU threads flushing their TLB, which
> happens via memset, or (2) vCPUs calling tlb_reset_dirty on their TLB,
> which updates .addr_write with a regular store. This results in
> undefined behaviour, since we're mixing regular and atomic ops
> on concurrent accesses.
>
> Fix it by using tlb_lock, a per-vCPU lock. All updaters of tlb_table
> and the corresponding victim cache now hold the lock.
> The readers that do not hold tlb_lock must use atomic reads when
> reading .addr_write, since this field can be updated by other threads;
> the conversion to atomic reads is done in the next patch.

We don't enforce this for the TCG code - but rely on the backend ISAs
to avoid torn reads from updates from cputlb that could invalidate an
entry.

>
> Note that an alternative fix would be to expand the use of atomic ops.
> However, in the case of TLB flushes this would have a huge performance
> impact, since (1) TLB flushes can happen very frequently and (2) we
> currently use a full memory barrier to flush each TLB entry, and a TLB
> has many entries. Instead, acquiring the lock is barely slower than a
> full memory barrier since it is uncontended, and with a single lock
> acquisition we can flush the entire TLB.
>
> Tested-by: Alex Bennée 
> Signed-off-by: Emilio G. Cota 
> ---
>  include/exec/cpu-defs.h |   3 +
>  accel/tcg/cputlb.c  | 152 +---
>  2 files changed, 84 insertions(+), 71 deletions(-)
>
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index a171ffc1a4..4ff62f32bf 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -24,6 +24,7 @@
>  #endif
>
>  #include "qemu/host-utils.h"
> +#include "qemu/thread.h"
>  #include "qemu/queue.h"
>  #ifdef CONFIG_TCG
>  #include "tcg-target.h"
> @@ -142,6 +143,8 @@ typedef struct CPUIOTLBEntry {
>
>  #define CPU_COMMON_TLB \
>  /* The meaning of the MMU modes is defined in the target code. */   \
> +/* tlb_lock serializes updates to tlb_table and tlb_v_table */  \
> +QemuSpin tlb_lock;  \
>  CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];  \
>  CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];   \
>  CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE];\
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index f6b388c961..2b0ff47fdf 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -75,6 +75,9 @@ QEMU_BUILD_BUG_ON(NB_MMU_MODES > 16);
>
>  void tlb_init(CPUState *cpu)
>  {
> +CPUArchState *env = cpu->env_ptr;
> +
> +qemu_spin_init(&env->tlb_lock);
>  }
>
>  /* flush_all_helper: run fn across all cpus
> @@ -129,8 +132,17 @@ static void tlb_flush_nocheck(CPUState *cpu)
>  atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
>  tlb_debug("(count: %zu)\n", tlb_flush_count());
>
> +/*
> + * tlb_table/tlb_v_table updates from any thread must hold tlb_lock.
> + * However, updates from the owner thread (as is the case here; see the
> + * above assert_cpu_is_self) do not need atomic_set because all reads
> + * that do not hold the lock are performed by the same owner thread.
> + */
> +qemu_spin_lock(&env->tlb_lock);
>  memset(env->tlb_table, -1, sizeof(env->tlb_table));
>  memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
> +qemu_spin_unlock(&env->tlb_lock);
> +
>  cpu_tb_jmp_cache_clear(cpu);
>
>  env->vtlb_index = 0;
> @@ -182,6 +194,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
>
>  tlb_debug("start: mmu_idx:0x%04lx\n", mmu_idx_bitmask);
>
> +qemu_spin_lock(&env->tlb_lock);
>  for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
>
>  if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
> @@ -191,6 +204,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
>  memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
>  }
>  }
> +qemu_spin_unlock(&env->tlb_lock);
>
>  cpu_tb_jmp_cache_clear(cpu);
>
> @@ -247,19 +261,24 @@ static inline bool tlb_hit_page_anyprot(CPUTLBEntry *tlb_entry,
> tlb_hit_page(tlb_entry->addr_code, page);
>  }
>
> -static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong page)
> +/* Called with tlb_lock held */
> +static inline void tlb_flush_entry_locked(CPUTLBEntry *tlb_entry,
> +  target_ulong page)
>  {
>  if (tlb_hit_page_anyprot(tlb_entry, page)) {
>  memset(tlb_entry, -1, sizeof(*tlb_entry));
>  }
>  }
>
> -static inline void tlb_flush_vtlb_page(CPUArchState *env, int mmu_idx,
> -   target_ulong page)
> +/* Called with tlb_lock held */
>