On 21.08.19 21:19, Richard Henderson wrote:
> On 8/21/19 10:37 AM, David Hildenbrand wrote:
>> Hah, guess what, I implemented a similar variant of "fetch all
>> of the host addresses" *but* it is not as easy as you might
>> think (sorry for the bad news).
>
> I think it is, because I didn't think it *that* easy. :-)

:) hehe

>> There are certain cases where we can't get access to the raw host
>> page. Namely, cpu watchpoints, LAP, NODIRTY. In summary: this won't
>> work as you describe. (my first approach did exactly this)
>
> NODIRTY and LAP are automatically handled via probe_write
> faulting instead of returning the address. I think there
> may be a bug in probe_write at present not checking

For LAP pages we immediately set TLB_INVALID_MASK again, to trigger a
new fault on the next write access (only). That could be handled in
tlb_vaddr_to_host(), by simply returning the address of the page after
trying to fill the TLB and succeeding (I implemented that; that's the
easy part).

TLB_NOTDIRTY and TLB_MMIO are the real issue. We don't want to refault,
we want to treat that memory like I/O memory and route it via
MemoryRegionOps - e.g., watch_mem_ops in qemu/exec.c. So it's not a
fault but I/O memory. That's why we don't expose that memory via
tlb_vaddr_to_host(). Faulting in case of TLB_NOTDIRTY or TLB_MMIO would
be bad.
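Roughly the direction I mean for tlb_vaddr_to_host(), sketched (this is
not the actual code - the entry lookup and hit check are simplified,
only the write case is shown, and tlb_fill() is internal to
accel/tcg/cputlb.c, so it would have to live there):

void *tlb_vaddr_to_host_sketch(CPUArchState *env, target_ulong addr,
                               MMUAccessType access_type, int mmu_idx)
{
    CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
    target_ulong tlb_addr = tlb_addr_write(entry);

    if (tlb_addr & TLB_INVALID_MASK) {
        /* Refill; a genuine access exception unwinds via longjmp. */
        tlb_fill(env_cpu(env), addr, 0, access_type, mmu_idx, 0);
        entry = tlb_entry(env, mmu_idx, addr);
        /*
         * LAP re-sets TLB_INVALID_MASK even after a successful fill;
         * ignore it here and hand out the host address for this one
         * access only.
         */
        tlb_addr = tlb_addr_write(entry) & ~TLB_INVALID_MASK;
    }
    if (tlb_addr & (TLB_NOTDIRTY | TLB_MMIO)) {
        /* I/O-like memory that must go through MemoryRegionOps. */
        return NULL;
    }
    return (void *)((uintptr_t)addr + entry->addend);
}

So after a successful fill we could hand out the host address although
LAP re-set TLB_INVALID_MASK - but NOTDIRTY/MMIO still have to take the
I/O path.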
> Watchpoints could be handled the same way, if we were to
> export check_watchpoint from exec.c. Indeed, I see no way
> to handle watchpoints correctly if we don't. I think that's
> an outstanding bug with probe_write.

probe_write() performs the MMU translation. If that succeeds, there is
no fault. If there are watchpoints, the memory is treated like I/O and
the memory access is rerouted. I think this works as designed.

> Any other objections? I certainly think that restricting the
> size of such operations to one page is a large simplification
> over the S390Access array thing that you create in this patch.

You mean two pages, I assume. Yeah, I could certainly simplify the
prototype patch I have here quite a lot. 2 pages should be enough for
everybody ;)
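(Side note on "two pages": MVC moves at most 256 bytes, so the range
can cross at most one page boundary. A hypothetical helper to compute
the split - adj_len_to_page() in mem_helper.c already gives us the
per-chunk version of this:)

static int pages_touched(uint64_t vaddr, uint32_t len, uint32_t part_len[2])
{
    /* bytes left until the end of the first page */
    uint32_t first = MIN(len, TARGET_PAGE_SIZE - (vaddr & ~TARGET_PAGE_MASK));

    part_len[0] = first;
    part_len[1] = len - first;
    return part_len[1] ? 2 : 1;
}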
The basic question is: should we try to somehow work around I/O memory
access (including NOTDIRTY and watchpoints) in tlb_vaddr_to_host(), or
perform access in these cases via cpu_physical_memory_write()? It feels
somewhat wrong to me to tune tlb_vaddr_to_host() to always return the
address of a page although we are dealing with MemoryRegionOps that
want a more controlled access mechanism.
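Condensed to its essence, the second variant - what access_set() and
access_copy() in the patch below do - would look like this (simplified
sketch, not the patch code itself; write side only):

static void access_write_chunk(S390Access *wr, const void *src,
                               target_ulong len)
{
    if (wr->isHaddr) {
        /* We hold a raw host pointer - plain RAM, no fault possible. */
        memcpy(wr->haddr, src, len);
        wr->haddr += len;
    } else {
        /*
         * I/O-like destination (NOTDIRTY, watchpoints, ...): access the
         * guest physical address so MemoryRegionOps stay in control.
         */
        cpu_physical_memory_write(wr->paddr, src, len);
        wr->paddr += len;
    }
    wr->size -= len;
}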
>
> r~
>
>> The following patch requires another re-factoring
>> (tcg_s390_cpu_mmu_translate), but you should get the idea.
>>
>>
>> From 0cacd2aea3dbc25e93492cca04f6c866b86d7f8a Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand <da...@redhat.com>
>> Date: Tue, 20 Aug 2019 09:37:11 +0200
>> Subject: [PATCH v1] s390x/tcg: Fault-safe MVC (MOVE) implementation
>>
>> MVC can cross page boundaries. In case we fault on the second page, we
>> have already partially copied data. If we have overlaps, we would
>> trigger a fault after having partially moved data, eventually having
>> our original data already overwritten. When continuing after the
>> fault, we would try to move already modified data, not the original
>> data - very bad.
>>
>> glibc started to use MVC for forward memmove() and is able to trigger
>> exactly this corruption (via rpmbuild and rpm). Fedora 31 (rawhide)
>> currently fails to install as we trigger rpm database corruptions due
>> to this bug.
>>
>> We need a way to translate a virtual address range to individual pages
>> that we can access later on without triggering faults. Probing all
>> virtual addresses once before the read/write is not sufficient - the
>> guest could have modified the page tables (e.g., write-protect, map
>> out) in between, so we could fault on any new tlb_fill() - we have to
>> skip any new MMU translations.
>>
>> Unfortunately, there are TLB entries for which we cannot get a host
>> address (esp., watchpoints, LAP, NOTDIRTY) - in these cases we cannot
>> avoid a new MMU translation when using the ordinary ld/st helpers.
>> Let's fall back to guest physical addresses in these cases, which we
>> access via cpu_physical_memory_(read|write).
>>
>> This change reduced the boot time for s390x guests (to prompt) from
>> ~1:29 min to ~1:19 min in my tests. For example, LAP protected pages
>> are now only translated once when writing to them using MVC and we
>> don't always fall back to byte-based copies.
>>
>> We will want to use the same mechanism for other accesses as well
>> (e.g., mvcl), so prepare for that right away.
>>
>> Signed-off-by: David Hildenbrand <da...@redhat.com>
>> ---
>>  target/s390x/mem_helper.c | 213 +++++++++++++++++++++++++++++++++++---
>>  1 file changed, 200 insertions(+), 13 deletions(-)
>>
>> diff --git a/target/s390x/mem_helper.c b/target/s390x/mem_helper.c
>> index 91ba2e03d9..1ca293e00d 100644
>> --- a/target/s390x/mem_helper.c
>> +++ b/target/s390x/mem_helper.c
>> @@ -24,8 +24,10 @@
>>  #include "exec/helper-proto.h"
>>  #include "exec/exec-all.h"
>>  #include "exec/cpu_ldst.h"
>> +#include "exec/cpu-common.h"
>>  #include "qemu/int128.h"
>>  #include "qemu/atomic128.h"
>> +#include "tcg_s390x.h"
>>
>>  #if !defined(CONFIG_USER_ONLY)
>>  #include "hw/s390x/storage-keys.h"
>> @@ -104,6 +106,181 @@ static inline void cpu_stsize_data_ra(CPUS390XState *env, uint64_t addr,
>>      }
>>  }
>>
>> +/*
>> + * An access covers one page, except for the start/end of the translated
>> + * virtual address range.
>> + */
>> +typedef struct S390Access {
>> +    union {
>> +        char *haddr;
>> +        hwaddr paddr;
>> +    };
>> +    uint16_t size;
>> +    bool isHaddr;
>> +} S390Access;
>> +
>> +/*
>> + * Prepare access to a virtual address range, guaranteeing we won't trigger
>> + * faults during the actual access. Sometimes we can't get access to the
>> + * host address (e.g., LAP, cpu watchpoints/PER, clean pages, ...). Then, we
>> + * translate to guest physical addresses instead. We'll have to perform
>> + * slower, indirect, access to these physical addresses then.
>> + */
>> +static void access_prepare_idx(CPUS390XState *env, S390Access access[],
>> +                               int nb_access, vaddr vaddr, target_ulong size,
>> +                               MMUAccessType access_type, int mmu_idx,
>> +                               uintptr_t ra)
>> +{
>> +    int i = 0;
>> +    int cur_size;
>> +
>> +    /*
>> +     * After we obtained the host address of a TLB entry, that entry might
>> +     * be invalidated again - e.g., via tlb_set_dirty(), via another
>> +     * tlb_fill(). We assume here that it is fine to temporarily store the
>> +     * host address to access it later - we didn't agree to any tlb flush and
>> +     * there seems to be no mechanism protecting the return value of
>> +     * tlb_vaddr_to_host().
>> +     */
>> +    while (size) {
>> +        g_assert(i < nb_access);
>> +        cur_size = adj_len_to_page(size, vaddr);
>> +
>> +        access[i].isHaddr = true;
>> +        access[i].haddr = tlb_vaddr_to_host(env, vaddr, access_type, mmu_idx);
>> +        if (!access[i].haddr) {
>> +            access[i].isHaddr = false;
>> +            tcg_s390_cpu_mmu_translate(env, vaddr, cur_size, access_type,
>> +                                       mmu_idx, false, &access[i].paddr,
>> +                                       NULL, ra);
>> +        }
>> +        access[i].size = cur_size;
>> +
>> +        vaddr += cur_size;
>> +        size -= cur_size;
>> +        i++;
>> +    }
>> +
>> +    /* let's zero-out the remaining entries, so we have a size of 0 */
>> +    if (i < nb_access) {
>> +        memset(&access[i], 0, sizeof(S390Access) * (nb_access - i));
>> +    }
>> +}
>> +
>> +static void access_prepare(CPUS390XState *env, S390Access access[],
>> +                           int nb_access, target_ulong vaddr,
>> +                           target_ulong size, MMUAccessType access_type,
>> +                           uintptr_t ra)
>> +{
>> +    int mmu_idx = cpu_mmu_index(env, false);
>> +
>> +    access_prepare_idx(env, access, nb_access, vaddr, size, access_type,
>> +                       mmu_idx, ra);
>> +}
>> +
>> +static void access_set(CPUS390XState *env, S390Access write[], int nb_write,
>> +                       uint8_t byte, target_ulong size)
>> +{
>> +    target_ulong cur_size;
>> +    void *buf = NULL;
>> +    int w = 0;
>> +
>> +    while (size) {
>> +        g_assert(w < nb_write);
>> +        if (!write[w].size) {
>> +            w++;
>> +            continue;
>> +        }
>> +
>> +        cur_size = MIN(size, write[w].size);
>> +        if (write[w].isHaddr) {
>> +            memset(write[w].haddr, byte, cur_size);
>> +            write[w].haddr += cur_size;
>> +        } else {
>> +#ifndef CONFIG_USER_ONLY
>> +            if (!buf) {
>> +                buf = g_malloc(TARGET_PAGE_SIZE);
>> +                memset(buf, byte, TARGET_PAGE_SIZE);
>> +            }
>> +            cpu_physical_memory_write(write[w].paddr, buf, cur_size);
>> +            write[w].paddr += cur_size;
>> +#else
>> +            g_assert_not_reached();
>> +#endif
>> +        }
>> +        write[w].size -= cur_size;
>> +        size -= cur_size;
>> +    }
>> +    g_free(buf);
>> +}
>> +
>> +/*
>> + * Copy memory in chunks up to chunk_size. If the ranges don't overlap or
>> + * if it's a forward move, this function behaves like memmove().
>> + *
>> + * To achieve a backwards byte-by-byte copy (e.g., MVC), the chunk_size
>> + * must not be bigger than the address difference (in the worst case, 1 byte).
>> + */
>> +static void access_copy(CPUS390XState *env, S390Access write[], int nb_write,
>> +                        S390Access read[], int nb_read, target_ulong size,
>> +                        target_ulong chunk_size)
>> +{
>> +    target_ulong cur_size;
>> +    void *buf = NULL;
>> +    int r = 0;
>> +    int w = 0;
>> +
>> +    g_assert(chunk_size > 0);
>> +    chunk_size = MIN(chunk_size, TARGET_PAGE_SIZE);
>> +
>> +    while (size) {
>> +        g_assert(w < nb_write);
>> +        g_assert(r < nb_read);
>> +        if (!write[w].size) {
>> +            w++;
>> +            continue;
>> +        }
>> +        if (!read[r].size) {
>> +            r++;
>> +            continue;
>> +        }
>> +        cur_size = MIN(MIN(MIN(size, write[w].size), read[r].size), chunk_size);
>> +
>> +        if (write[w].isHaddr && read[r].isHaddr) {
>> +            memmove(write[w].haddr, read[r].haddr, cur_size);
>> +            write[w].haddr += cur_size;
>> +            read[r].haddr += cur_size;
>> +#ifndef CONFIG_USER_ONLY
>> +        } else if (!write[w].isHaddr && read[r].isHaddr) {
>> +            cpu_physical_memory_write(write[w].paddr, read[r].haddr, cur_size);
>> +            write[w].paddr += cur_size;
>> +            read[r].haddr += cur_size;
>> +        } else if (write[w].isHaddr && !read[r].isHaddr) {
>> +            cpu_physical_memory_read(read[r].paddr, write[w].haddr, cur_size);
>> +            write[w].haddr += cur_size;
>> +            read[r].paddr += cur_size;
>> +        } else {
>> +            if (!buf) {
>> +                buf = g_malloc(chunk_size);
>> +            }
>> +            cpu_physical_memory_read(read[r].paddr, buf, cur_size);
>> +            cpu_physical_memory_write(write[w].paddr, buf, cur_size);
>> +            write[w].paddr += cur_size;
>> +            read[r].paddr += cur_size;
>> +#else
>> +        } else {
>> +            g_assert_not_reached();
>> +#endif
>> +        }
>> +        write[w].size -= cur_size;
>> +        read[r].size -= cur_size;
>> +        size -= cur_size;
>> +    }
>> +    g_free(buf);
>> +}
>> +
>>  static void fast_memset(CPUS390XState *env, uint64_t dest, uint8_t byte,
>>                          uint32_t l, uintptr_t ra)
>>  {
>> @@ -302,24 +479,34 @@ uint32_t HELPER(oc)(CPUS390XState *env, uint32_t l, uint64_t dest,
>>  static uint32_t do_helper_mvc(CPUS390XState *env, uint32_t l, uint64_t dest,
>>                                uint64_t src, uintptr_t ra)
>>  {
>> -    uint32_t i;
>> +    /* 256 bytes cannot cross more than two pages */
>> +    S390Access read[2];
>> +    S390Access write[2];
>>
>>      HELPER_LOG("%s l %d dest %" PRIx64 " src %" PRIx64 "\n",
>>                 __func__, l, dest, src);
>> +    l++;
>>
>> -    /* mvc and memmove do not behave the same when areas overlap! */
>> -    /* mvc with source pointing to the byte after the destination is the
>> -       same as memset with the first source byte */
>> +    g_assert(l <= 256);
>> +    access_prepare(env, write, ARRAY_SIZE(write), dest, l, MMU_DATA_STORE, ra);
>> +
>> +    /*
>> +     * The result of MVC is as if moving single bytes from left to right
>> +     * (in contrast to memmove()). It can be used like memset().
>> +     */
>>      if (dest == src + 1) {
>> -        fast_memset(env, dest, cpu_ldub_data_ra(env, src, ra), l + 1, ra);
>> -    } else if (dest < src || src + l < dest) {
>> -        fast_memmove(env, dest, src, l + 1, ra);
>> -    } else {
>> -        /* slow version with byte accesses which always work */
>> -        for (i = 0; i <= l; i++) {
>> -            uint8_t x = cpu_ldub_data_ra(env, src + i, ra);
>> -            cpu_stb_data_ra(env, dest + i, x, ra);
>> -        }
>> +        access_set(env, write, ARRAY_SIZE(write),
>> +                   cpu_ldub_data_ra(env, src, ra), l);
>> +        return env->cc_op;
>> +    }
>> +
>> +    access_prepare(env, read, ARRAY_SIZE(read), src, l, MMU_DATA_LOAD, ra);
>> +    if (dest < src || src + l <= dest) {
>> +        access_copy(env, write, ARRAY_SIZE(write), read, ARRAY_SIZE(read), l,
>> +                    TARGET_PAGE_SIZE);
>> +    } else if (src < dest) {
>> +        access_copy(env, write, ARRAY_SIZE(write), read, ARRAY_SIZE(read), l,
>> +                    dest - src);
>> +    }
>>
>>      return env->cc_op;
>>  }
>>

-- 

Thanks,

David / dhildenb