On Thu, Oct 1, 2020 at 10:36 PM Kirill A. Shutemov
<kirill.shute...@linux.intel.com> wrote:
>
> On Thu, Oct 01, 2020 at 05:09:02PM -0700, Lokesh Gidra wrote:
> > On Thu, Oct 1, 2020 at 9:00 AM Kalesh Singh <kaleshsi...@google.com> wrote:
> > >
> > > On Thu, Oct 1, 2020 at 8:27 AM Kirill A. Shutemov
> > > <kirill.shute...@linux.intel.com> wrote:
> > > >
> > > > On Wed, Sep 30, 2020 at 03:42:17PM -0700, Lokesh Gidra wrote:
> > > > > On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> > > > > <kirill.shute...@linux.intel.com> wrote:
> > > > > >
> > > > > > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > > > > > mremap time can be optimized by moving entries at the PMD/PUD
> > > > > > > level if the source and destination addresses are PMD/PUD-aligned
> > > > > > > and PMD/PUD-sized. Enable moving at the PMD and PUD levels on
> > > > > > > arm64 and x86. Other architectures where this type of move is
> > > > > > > supported and known to be safe can also opt in to these
> > > > > > > optimizations by enabling HAVE_MOVE_PMD and HAVE_MOVE_PUD.
> > > > > > >
> > > > > > > Observed performance improvements for remapping a PUD-aligned
> > > > > > > 1GB-sized region on x86 and arm64:
> > > > > > >
> > > > > > > - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > > > > > - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up
> > > > > > >
> > > > > > > - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > > > > > - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > > > > > >
> > > > > > > Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD give a total of
> > > > > > > ~150x speed up on arm64.
> > > > > >
> > > > > > Is there a *real* workload that benefits from HAVE_MOVE_PUD?
> > > > > >
> > > > > We have a Java garbage collector under development which requires
> > > > > moving physical pages of a multi-gigabyte heap using mremap. During
> > > > > this move, the application threads have to be paused for
> > > > > correctness. It is critical to keep this pause as short as possible
> > > > > to avoid jitter during user interaction. This is where
> > > > > HAVE_MOVE_PUD will greatly help.
> > > >
> > > > Any chance to quantify the effect of mremap() with and without
> > > > HAVE_MOVE_PUD?
> > > >
> > > > I doubt it's a major contributor to the GC pause. I expect you need
> > > > to move tens of gigs to get a sizable effect. And if your GC
> > > > routinely moves tens of gigs, maybe the problem is somewhere else?
> > > >
> > > > I'm asking for numbers, because an increase in complexity comes with
> > > > a cost. If it doesn't provide a substantial benefit to a real
> > > > workload, maintaining the code forever doesn't make sense.
> > >
> > mremap is indeed the biggest contributor to the GC pause. It has to
> > take place in what is typically known as a 'stop-the-world' pause,
> > wherein all application threads are paused. During this pause the GC
> > thread flips the GC roots (threads' stacks, globals, etc.), and then
> > resumes the threads along with concurrent compaction of the heap. This
> > GC-root flip differs depending on which compaction algorithm is being
> > used.
> >
> > In our case it involves updating object references in the threads'
> > stacks and remapping the Java heap to a different location. The
> > threads' stacks can be handled in parallel with the mremap, so the
> > dominant factor is indeed the cost of mremap. From patches 2 and 4, it
> > is clear that remapping 1GB without this optimization takes ~9ms on
> > arm64.
> >
> > Although this mremap has to happen only once every GC cycle, and the
> > typical size is also not going to be more than a GB or two, pausing
> > application threads for ~9ms is guaranteed to cause jitter. OTOH,
> > with this optimization, the mremap is reduced to ~60us, which is a
> > totally acceptable pause time.
> >
> > Unfortunately, the implementation of the new GC algorithm hasn't yet
> > reached the point where I can quantify the effect of this
> > optimization. But I can confirm that without this optimization the
> > new GC will not be approved.
>
> IIUC, the 9ms -> 90us improvement is attributed to the combination of
> HAVE_MOVE_PMD and HAVE_MOVE_PUD, right? I expect HAVE_MOVE_PMD to be
> reasonable for some workloads, but the marginal benefit of
> HAVE_MOVE_PUD is in doubt. Do you see it being useful for your
> workload?
>
Yes, 9ms -> 90us is when both are combined. Past experience has been
that even a stop-the-world pause as short as ~1ms is prone to cause
jitter. HAVE_MOVE_PMD alone only takes us part of the way, so
HAVE_MOVE_PUD is required to bring the mremap cost down to an
acceptable level.
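To make the numbers concrete, here is a minimal userspace sketch (not
from the series) of how such a 1GB move can be timed. The fixed
addresses are hypothetical, chosen 1GB-aligned so the PUD-level path
can be taken, and error handling is abbreviated:

/* Sketch (not from the series): time a PUD-aligned 1GB mremap. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define GB (1UL << 30)
/* Hypothetical 1GB-aligned source/destination addresses. */
#define SRC_ADDR ((void *)(4UL * GB))
#define DST_ADDR ((void *)(8UL * GB))

int main(void)
{
	struct timespec t0, t1;
	void *src, *dst;

	src = mmap(SRC_ADDR, GB, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (src == MAP_FAILED)
		return 1;
	memset(src, 1, GB);	/* fault in the pages and page tables */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	dst = mremap(src, GB, GB, MREMAP_MAYMOVE | MREMAP_FIXED, DST_ADDR);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	if (dst == MAP_FAILED)
		return 1;

	printf("mremap(1GB) took %ld us\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000L +
	       (t1.tv_nsec - t0.tv_nsec) / 1000L);
	return 0;
}

Populating the region first matters: the optimization moves existing
page-table entries, so an untouched mapping would have nothing to move.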
Ideally, I was hoping that the functionality of HAVE_MOVE_PMD could be
extended to all levels of the hierarchical page table, simplifying the
implementation in the process. But unfortunately, going by patch 3,
that doesn't seem to be possible.

> --
>  Kirill A. Shutemov
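FWIW, the core of the PMD-level move is small. A rough sketch,
simplified from the series' move_normal_pmd() with the page-table
locking, TLB flushing and validity checks elided, looks like this:

/*
 * Simplified sketch of the PMD-level move: instead of copying 512
 * individual PTEs, the whole PTE page covering the 2MB range is
 * detached from the source PMD entry and relinked at the destination.
 */
static bool move_pmd_sketch(struct vm_area_struct *vma,
			    pmd_t *old_pmd, pmd_t *new_pmd)
{
	pmd_t pmd;

	/* The destination slot must be empty, the source populated. */
	if (!pmd_none(*new_pmd) || pmd_none(*old_pmd))
		return false;

	/* Detach the PTE page from the source entry... */
	pmd = *old_pmd;
	pmd_clear(old_pmd);

	/* ...and hang it off the destination entry. */
	pmd_populate(vma->vm_mm, new_pmd, pmd_pgtable(pmd));

	/* The old range still has to be flushed from the TLB. */
	return true;
}

The PUD variant is structurally the same, but it has to be written
against pud_t and its own set of helpers, which is presumably why a
single implementation generic over all page-table levels isn't
feasible.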