On 21.08.19 21:19, Richard Henderson wrote:
> On 8/21/19 10:37 AM, David Hildenbrand wrote:
>> Hah, guess what, I implemented a similar variant of "fetch all
>> of the host addresses" *but* it is not as easy as you might
>> think (sorry for the bad news).
>
> I think it is, because I didn't think it *that* easy. :-)

:) hehe

>> There are certain cases where we can't get access to the raw host
>> page. Namely, cpu watchpoints, LAP, NODIRTY. In summary: this won't
>> work as you describe. (my first approach did exactly this)
>
> NODIRTY and LAP are automatically handled via probe_write
> faulting instead of returning the address. I think there
> may be a bug in probe_write at present not checking

For LAP pages we immediately set TLB_INVALID_MASK again, to trigger a
new fault on the next write access (only). That could be handled in
tlb_vaddr_to_host(), by simply returning the address of the page after
trying to fill the TLB and succeeding (I implemented that; that's the
easy part).

TLB_NOTDIRTY and TLB_MMIO are the real issue. We don't want to refault,
we want to treat that memory like I/O memory and route it via
MemoryRegionOps - e.g., watch_mem_ops in qemu/exec.c. So it's not a
fault but I/O memory. That's why we don't expose that memory via
tlb_vaddr_to_host(). Faulting in case of TLB_NOTDIRTY or TLB_MMIO would
be bad.
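Roughly the direction I mean for tlb_vaddr_to_host(), sketched (this is
not the actual code - the entry lookup and hit check are simplified,
only the write case is shown, and tlb_fill() is internal to
accel/tcg/cputlb.c, so it would have to live there):

void *tlb_vaddr_to_host_sketch(CPUArchState *env, target_ulong addr,
                               MMUAccessType access_type, int mmu_idx)
{
    CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
    target_ulong tlb_addr = tlb_addr_write(entry);

    if (tlb_addr & TLB_INVALID_MASK) {
        /* Refill; a genuine access exception unwinds via longjmp. */
        tlb_fill(env_cpu(env), addr, 0, access_type, mmu_idx, 0);
        entry = tlb_entry(env, mmu_idx, addr);
        /*
         * LAP re-sets TLB_INVALID_MASK even after a successful fill;
         * ignore it here and hand out the host address for this one
         * access only.
         */
        tlb_addr = tlb_addr_write(entry) & ~TLB_INVALID_MASK;
    }
    if (tlb_addr & (TLB_NOTDIRTY | TLB_MMIO)) {
        /* I/O-like memory that must go through MemoryRegionOps. */
        return NULL;
    }
    return (void *)((uintptr_t)addr + entry->addend);
}

So after a successful fill we could hand out the host address although
LAP re-set TLB_INVALID_MASK - but NOTDIRTY/MMIO still have to take the
I/O path.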
> Watchpoints could be handled the same way, if we were to
> export check_watchpoint from exec.c. Indeed, I see no way
> to handle watchpoints correctly if we don't. I think that's
> an outstanding bug with probe_write.

probe_write() performs the MMU translation. If that succeeds, there is
no fault. If there are watchpoints, the memory is treated like I/O and
the memory access is rerouted. I think this works as designed.

> Any other objections? I certainly think that restricting the
> size of such operations to one page is a large simplification
> over the S390Access array thing that you create in this patch.

You mean two pages, I assume. Yeah, I could certainly simplify the
prototype patch I have here quite a lot. 2 pages should be enough for
everybody ;)
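(Side note on "two pages": MVC moves at most 256 bytes, so the range
can cross at most one page boundary. A hypothetical helper to compute
the split - adj_len_to_page() in mem_helper.c already gives us the
per-chunk version of this:)

static int pages_touched(uint64_t vaddr, uint32_t len, uint32_t part_len[2])
{
    /* bytes left until the end of the first page */
    uint32_t first = MIN(len, TARGET_PAGE_SIZE - (vaddr & ~TARGET_PAGE_MASK));

    part_len[0] = first;
    part_len[1] = len - first;
    return part_len[1] ? 2 : 1;
}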
The basic question is: should we try to somehow work around I/O memory
access (including NOTDIRTY and watchpoints) in tlb_vaddr_to_host(), or
perform access in these cases via cpu_physical_memory_write()? It feels
somewhat wrong to me to tune tlb_vaddr_to_host() to always return the
address of a page although we are dealing with MemoryRegionOps that
want a more controlled access mechanism.
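Condensed to its essence, the second variant - what access_set() and
access_copy() in the patch below do - would look like this (simplified
sketch, not the patch code itself; write side only):

static void access_write_chunk(S390Access *wr, const void *src,
                               target_ulong len)
{
    if (wr->isHaddr) {
        /* We hold a raw host pointer - plain RAM, no fault possible. */
        memcpy(wr->haddr, src, len);
        wr->haddr += len;
    } else {
        /*
         * I/O-like destination (NOTDIRTY, watchpoints, ...): access the
         * guest physical address so MemoryRegionOps stay in control.
         */
        cpu_physical_memory_write(wr->paddr, src, len);
        wr->paddr += len;
    }
    wr->size -= len;
}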
>
> r~
>
>> The following patch requires another re-factoring
>> (tcg_s390_cpu_mmu_translate), but you should get the idea.
>>
>>
>> From 0cacd2aea3dbc25e93492cca04f6c866b86d7f8a Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand <da...@redhat.com>
>> Date: Tue, 20 Aug 2019 09:37:11 +0200
>> Subject: [PATCH v1] s390x/tcg: Fault-safe MVC (MOVE) implementation
>>
>> MVC can cross page boundaries. In case we fault on the second page, we
>> have already partially copied data. If we have overlaps, we would
>> trigger a fault after having partially moved data, eventually having
>> our original data already overwritten. When continuing after the
>> fault, we would try to move already modified data, not the original
>> data - very bad.
>>
>> glibc started to use MVC for forward memmove() and is able to trigger
>> exactly this corruption (via rpmbuild and rpm). Fedora 31 (rawhide)
>> currently fails to install as we trigger rpm database corruptions due
>> to this bug.
>>
>> We need a way to translate a virtual address range to individual pages
>> that we can access later on without triggering faults. Probing all
>> virtual addresses once before the read/write is not sufficient - the
>> guest could have modified the page tables (e.g., write-protect, map
>> out) in between, so we could fault on any new tlb_fill() - we have to
>> skip any new MMU translations.
>>
>> Unfortunately, there are TLB entries for which we cannot get a host
>> address (esp., watchpoints, LAP, NOTDIRTY) - in these cases we cannot
>> avoid a new MMU translation when using the ordinary ld/st helpers.
>> Let's fall back to guest physical addresses in these cases, which we
>> access via cpu_physical_memory_(read|write).
>>
>> This change reduced the boot time for s390x guests (to prompt) from
>> ~1:29 min to ~1:19 min in my tests. For example, LAP protected pages
>> are now only translated once when writing to them using MVC and we
>> don't always fall back to byte-based copies.
>>
>> We will want to use the same mechanism for other accesses as well
>> (e.g., mvcl), so prepare for that right away.
>>
>> Signed-off-by: David Hildenbrand <da...@redhat.com>
>> ---
>>  target/s390x/mem_helper.c | 213 +++++++++++++++++++++++++++++++++++---
>>  1 file changed, 200 insertions(+), 13 deletions(-)
>>
>> diff --git a/target/s390x/mem_helper.c b/target/s390x/mem_helper.c
>> index 91ba2e03d9..1ca293e00d 100644
>> --- a/target/s390x/mem_helper.c
>> +++ b/target/s390x/mem_helper.c
>> @@ -24,8 +24,10 @@
>>  #include "exec/helper-proto.h"
>>  #include "exec/exec-all.h"
>>  #include "exec/cpu_ldst.h"
>> +#include "exec/cpu-common.h"
>>  #include "qemu/int128.h"
>>  #include "qemu/atomic128.h"
>> +#include "tcg_s390x.h"
>>
>>  #if !defined(CONFIG_USER_ONLY)
>>  #include "hw/s390x/storage-keys.h"
>> @@ -104,6 +106,181 @@ static inline void cpu_stsize_data_ra(CPUS390XState *env, uint64_t addr,
>>      }
>>  }
>>
>> +/*
>> + * An access covers one page, except for the start/end of the translated
>> + * virtual address range.
>> + */
>> +typedef struct S390Access {
>> +    union {
>> +        char *haddr;
>> +        hwaddr paddr;
>> +    };
>> +    uint16_t size;
>> +    bool isHaddr;
>> +} S390Access;
>> +
>> +/*
>> + * Prepare access to a virtual address range, guaranteeing we won't trigger
>> + * faults during the actual access. Sometimes we can't get access to the
>> + * host address (e.g., LAP, cpu watchpoints/PER, clean pages, ...). Then, we
>> + * translate to guest physical addresses instead. We'll have to perform
>> + * slower, indirect, access to these physical addresses then.
>> + */
>> +static void access_prepare_idx(CPUS390XState *env, S390Access access[],
>> +                               int nb_access, vaddr vaddr, target_ulong size,
>> +                               MMUAccessType access_type, int mmu_idx,
>> +                               uintptr_t ra)
>> +{
>> +    int i = 0;
>> +    int cur_size;
>> +
>> +    /*
>> +     * After we obtained the host address of a TLB entry, that entry might
>> +     * be invalidated again - e.g., via tlb_set_dirty(), via another
>> +     * tlb_fill(). We assume here that it is fine to temporarily store the
>> +     * host address to access it later - we didn't agree to any tlb flush and
>> +     * there seems to be no mechanism protecting the return value of
>> +     * tlb_vaddr_to_host().
>> +     */
>> +    while (size) {
>> +        g_assert(i < nb_access);
>> +        cur_size = adj_len_to_page(size, vaddr);
>> +
>> +        access[i].isHaddr = true;
>> +        access[i].haddr = tlb_vaddr_to_host(env, vaddr, access_type, mmu_idx);
>> +        if (!access[i].haddr) {
>> +            access[i].isHaddr = false;
>> +            tcg_s390_cpu_mmu_translate(env, vaddr, cur_size, access_type,
>> +                                       mmu_idx, false, &access[i].paddr,
>> +                                       NULL, ra);
>> +        }
>> +        access[i].size = cur_size;
>> +
>> +        vaddr += cur_size;
>> +        size -= cur_size;
>> +        i++;
>> +    }
>> +
>> +    /* let's zero-out the remaining entries, so we have a size of 0 */
>> +    if (i < nb_access) {
>> +        memset(&access[i], 0, sizeof(S390Access) * (nb_access - i));
>> +    }
>> +}
>> +
>> +static void access_prepare(CPUS390XState *env, S390Access access[],
>> +                           int nb_access, target_ulong vaddr,
>> +                           target_ulong size, MMUAccessType access_type,
>> +                           uintptr_t ra)
>> +{
>> +    int mmu_idx = cpu_mmu_index(env, false);
>> +
>> +    access_prepare_idx(env, access, nb_access, vaddr, size, access_type,
>> +                       mmu_idx, ra);
>> +}
>> +
>> +static void access_set(CPUS390XState *env, S390Access write[], int nb_write,
>> +                       uint8_t byte, target_ulong size)
>> +{
>> +    target_ulong cur_size;
>> +    void *buf = NULL;
>> +    int w = 0;
>> +
>> +    while (size) {
>> +        g_assert(w < nb_write);
>> +        if (!write[w].size) {
>> +            w++;
>> +            continue;
>> +        }
>> +
>> +        cur_size = MIN(size, write[w].size);
>> +        if (write[w].isHaddr) {
>> +            memset(write[w].haddr, byte, cur_size);
>> +            write[w].haddr += cur_size;
>> +        } else {
>> +#ifndef CONFIG_USER_ONLY
>> +            if (!buf) {
>> +                buf = g_malloc(TARGET_PAGE_SIZE);
>> +                memset(buf, byte, TARGET_PAGE_SIZE);
>> +            }
>> +            cpu_physical_memory_write(write[w].paddr, buf, cur_size);
>> +            write[w].paddr += cur_size;
>> +#else
>> +            g_assert_not_reached();
>> +#endif
>> +        }
>> +        write[w].size -= cur_size;
>> +        size -= cur_size;
>> +    }
>> +    g_free(buf);
>> +}
>> +
>> +/*
>> + * Copy memory in chunks up to chunk_size. If the ranges don't overlap or
>> + * if it's a forward move, this function behaves like memmove().
>> + *
>> + * To achieve a backwards byte-by-byte copy (e.g., MVC), the chunk_size
>> + * must not be bigger than the address difference (in the worst case, 1 byte).
>> + */
>> +static void access_copy(CPUS390XState *env, S390Access write[], int nb_write,
>> +                        S390Access read[], int nb_read, target_ulong size,
>> +                        target_ulong chunk_size)
>> +{
>> +    target_ulong cur_size;
>> +    void *buf = NULL;
>> +    int r = 0;
>> +    int w = 0;
>> +
>> +    g_assert(chunk_size > 0);
>> +    chunk_size = MIN(chunk_size, TARGET_PAGE_SIZE);
>> +
>> +    while (size) {
>> +        g_assert(w < nb_write);
>> +        g_assert(r < nb_read);
>> +        if (!write[w].size) {
>> +            w++;
>> +            continue;
>> +        }
>> +        if (!read[r].size) {
>> +            r++;
>> +            continue;
>> +        }
>> +        cur_size = MIN(MIN(MIN(size, write[w].size), read[r].size), chunk_size);
>> +
>> +        if (write[w].isHaddr && read[r].isHaddr) {
>> +            memmove(write[w].haddr, read[r].haddr, cur_size);
>> +            write[w].haddr += cur_size;
>> +            read[r].haddr += cur_size;
>> +#ifndef CONFIG_USER_ONLY
>> +        } else if (!write[w].isHaddr && read[r].isHaddr) {
>> +            cpu_physical_memory_write(write[w].paddr, read[r].haddr, cur_size);
>> +            write[w].paddr += cur_size;
>> +            read[r].haddr += cur_size;
>> +        } else if (write[w].isHaddr && !read[r].isHaddr) {
>> +            cpu_physical_memory_read(read[r].paddr, write[w].haddr, cur_size);
>> +            write[w].haddr += cur_size;
>> +            read[r].paddr += cur_size;
>> +        } else {
>> +            if (!buf) {
>> +                buf = g_malloc(chunk_size);
>> +            }
>> +            cpu_physical_memory_read(read[r].paddr, buf, cur_size);
>> +            cpu_physical_memory_write(write[w].paddr, buf, cur_size);
>> +            write[w].paddr += cur_size;
>> +            read[r].paddr += cur_size;
>> +#else
>> +        } else {
>> +            g_assert_not_reached();
>> +#endif
>> +        }
>> +        write[w].size -= cur_size;
>> +        read[r].size -= cur_size;
>> +        size -= cur_size;
>> +    }
>> +    g_free(buf);
>> +}
>> +
>>  static void fast_memset(CPUS390XState *env, uint64_t dest, uint8_t byte,
>>                          uint32_t l, uintptr_t ra)
>>  {
>> @@ -302,24 +479,34 @@ uint32_t HELPER(oc)(CPUS390XState *env, uint32_t l, uint64_t dest,
>>  static uint32_t do_helper_mvc(CPUS390XState *env, uint32_t l, uint64_t dest,
>>                                uint64_t src, uintptr_t ra)
>>  {
>> -    uint32_t i;
>> +    /* 256 bytes cannot cross more than two pages */
>> +    S390Access read[2];
>> +    S390Access write[2];
>>
>>      HELPER_LOG("%s l %d dest %" PRIx64 " src %" PRIx64 "\n",
>>                 __func__, l, dest, src);
>> +    l++;
>>
>> -    /* mvc and memmove do not behave the same when areas overlap! */
>> -    /* mvc with source pointing to the byte after the destination is the
>> -       same as memset with the first source byte */
>> +    g_assert(l <= 256);
>> +    access_prepare(env, write, ARRAY_SIZE(write), dest, l, MMU_DATA_STORE, ra);
>> +
>> +    /*
>> +     * The result of MVC is as if moving single bytes from left to right
>> +     * (in contrast to memmove()). It can be used like memset().
>> +     */
>>      if (dest == src + 1) {
>> -        fast_memset(env, dest, cpu_ldub_data_ra(env, src, ra), l + 1, ra);
>> -    } else if (dest < src || src + l < dest) {
>> -        fast_memmove(env, dest, src, l + 1, ra);
>> -    } else {
>> -        /* slow version with byte accesses which always work */
>> -        for (i = 0; i <= l; i++) {
>> -            uint8_t x = cpu_ldub_data_ra(env, src + i, ra);
>> -            cpu_stb_data_ra(env, dest + i, x, ra);
>> -        }
>> +        access_set(env, write, ARRAY_SIZE(write),
>> +                   cpu_ldub_data_ra(env, src, ra), l);
>> +        return env->cc_op;
>> +    }
>> +
>> +    access_prepare(env, read, ARRAY_SIZE(read), src, l, MMU_DATA_LOAD, ra);
>> +    if (dest < src || src + l <= dest) {
>> +        access_copy(env, write, ARRAY_SIZE(write), read, ARRAY_SIZE(read), l,
>> +                    TARGET_PAGE_SIZE);
>> +    } else if (src < dest) {
>> +        access_copy(env, write, ARRAY_SIZE(write), read, ARRAY_SIZE(read), l,
>> +                    dest - src);
>> +    }
>>
>>      return env->cc_op;
>>  }
>>

-- 

Thanks,

David / dhildenb