From what I see here I would deduce:
1- THP can give a huge performance gain (when using PreTouch, in some cases, possibly when not playing with off-heap too much)
2- But it will increase hiccups

A bit like the throughput collector. So my current takeaway is:
- Use THP if you care about throughput only
- If you care about latency, just don't
- If you really care about throughput, use non-transparent huge pages

Is that accurate?
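For reference, "non-transparent" huge pages here means explicitly reserved (hugetlbfs) pages, which the JVM can use via -XX:+UseLargePages. A minimal sketch of the Linux-side setup, with illustrative values (the page count must be sized to the heap, gid 1001 is a placeholder, and the JVM's user typically also needs a matching memlock ulimit):

$ echo 2048 | sudo tee /proc/sys/vm/nr_hugepages        # reserve 2048 x 2MB pages (= 4GB) up front
$ echo 1001 | sudo tee /proc/sys/vm/hugetlb_shm_group   # group allowed to use huge-page shm
$ java -Xms4g -Xmx4g -XX:+UseLargePages -XX:+AlwaysPreTouch ...

Because the pages are reserved ahead of time, there is no allocation-time defragmentation to stall on; the trade-off is that the reservation is static, and if it can't be satisfied the JVM typically falls back to regular pages with a warning.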
On 9 August 2017 at 04:51, Peter Veentjer <alarmnum...@gmail.com> wrote:
> Thanks for your very useful replies Gil.
>
> Question:
>
> Using huge pages can give a big performance boost:
>
> https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/
>
> $ time java -Xms4T -Xmx4T -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch
> real 13m58.167s
> user 43m37.519s
> sys 1011m25.740s
>
> $ time java -Xms4T -Xmx4T -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
> real 2m14.758s
> user 1m56.488s
> sys 73m59.046s
>
> But THP seems to be unusable. Does this effectively mean that we can't benefit from THP under Linux?
>
> So far it looks like a damned-if-you-do, damned-if-you-don't situation.
>
> Or should we move to non-transparent huge pages?
>
> On Tuesday, August 8, 2017 at 7:44:25 PM UTC+3, Gil Tene wrote:
>>
>> On Monday, August 7, 2017 at 11:50:27 AM UTC-7, Alen Vrečko wrote:
>>>
>>> Saw this a while back.
>>>
>>> https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/
>>>
>>> Basically using THP/defrag with madvise and using -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch JVM opts.
>>>
>>> Looks like the defrag cost should be paid in full at startup due to AlwaysPreTouch. Never got around to testing this in production. Just have THP disabled. Thoughts?
>>
>> The above flags would only cover the Java heap in a Java application, so obviously THP for non-Java things doesn't get helped by that.
>>
>> And for Java stuff, unfortunately, there are lots of non-Java-heap things that are exposed to THP's potentially huge on-demand faulting latencies. The JVM manages lots of memory outside of the Java heap for various things (GC stuff, stacks, Metaspace, code cache, JIT compiler things, and a whole bunch of runtime stuff), and the application itself will often be using off-heap memory intentionally (e.g. via DirectByteBuffers) or inadvertently (e.g. when libraries make either temporary or lasting use of off-heap memory). E.g. even simple socket I/O involves some use of off-heap memory as an intermediate storage location.
>>
>> As a simple demonstration of why THP artifacts in non-Java-heap memory are a key problem for Java apps: I first ran into these THP issues by experience, with Zing, right around the time that RHEL 6 turned it on. We found out the hard way that we have to turn it off to maintain reasonable latency profiles. And since Zing has always *ensured* that 2MB pages are used for everything in the heap, the code cache, and virtually all GC support structures, it is clearly the THP impact on all the rest of the stuff that has caused us to deal with it and recommend against its use. The way we see THP manifest regularly (when left on) is with occasional huge TTSPs (time to safepoint) [huge in Zing terms, meaning anything from a few msec to 100s of msec], which we know are there because we specifically log and chart TTSPs. But high TTSPs are just a symptom: since we only measure TTSP when we actually try to bring threads to a safepoint, and doing that is a relatively (dynamically) rare event, whenever we see actual high TTSPs in our logs it is likely that similar-sized disruptions are occurring at the application level, but at a much higher frequency than that of the high TTSPs we observe.
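One way to check whether synchronous compaction is actually stalling threads on a given box is the kernel's counters in /proc/vmstat. A sketch with annotations; the values are made up, and exact counter names vary somewhat across kernel versions:

$ egrep 'compact_stall|compact_fail|thp_fault' /proc/vmstat
compact_stall 1203        # faults that stalled for direct (synchronous) compaction
compact_fail 87
thp_fault_alloc 402113    # THP faults satisfied with a 2MB page
thp_fault_fallback 5120   # THP faults that fell back to 4KB pages

A compact_stall count that keeps climbing while the application runs is a strong hint that faulting threads are taking the allocation-time compaction path described below.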
>>
>>> - Alen
>>>
>>> 2017-08-07 20:14 GMT+02:00 Gil Tene <g...@azul.com>:
>>> > To highlight the problematic path in current THP kernel implementations, here is an example call trace that can happen (pulled from the discussion linked to below). It shows that a simple on-demand page fault in regular anonymous memory (e.g. when a normal malloc call is made and manipulates a malloc-managed 2MB area, or when the resulting malloc'ed struct is written to) can end up compacting an entire zone (which can be the vast majority of system memory) in a single call, using the faulting thread. The specific example stack trace is taken from a situation where that fault took so long (on a NUMA system) that a soft lockup was triggered, showing the call took longer than 22 seconds (!!!). But even without the NUMA or migrate_pages aspects, compaction of a single zone can take 100s of msec or more.
>>> >
>>> > Browsing through the current kernel code (e.g. http://elixir.free-electrons.com/linux/latest/source/mm/compaction.c#L1722) seems to show that this is still the likely path that would be taken when no free 2MB pages are found in current kernels :-(
>>> >
>>> > And this situation will naturally occur under all sorts of common timing conditions (I/O fragmenting free memory into 4KB pages (but no free 2MB ones), background compaction/defrag falling behind during some heavy kernel-driven I/O spike, and some unlucky thread doing a malloc when the 2MB physical free list is exhausted).
>>> >
>>> > kernel: Call Trace:
>>> > kernel: [<ffffffff81179d8f>] compaction_alloc+0x1cf/0x240
>>> > kernel: [<ffffffff811b15ce>] migrate_pages+0xce/0x610
>>> > kernel: [<ffffffff81179bc0>] ? isolate_freepages_block+0x380/0x380
>>> > kernel: [<ffffffff8117abb9>] compact_zone+0x299/0x400
>>> > kernel: [<ffffffff8117adbc>] compact_zone_order+0x9c/0xf0
>>> > kernel: [<ffffffff8117b171>] try_to_compact_pages+0x121/0x1a0
>>> > kernel: [<ffffffff815ff336>] __alloc_pages_direct_compact+0xac/0x196
>>> > kernel: [<ffffffff81160758>] __alloc_pages_nodemask+0x788/0xb90
>>> > kernel: [<ffffffff810b11c0>] ? task_numa_fault+0x8d0/0xbb0
>>> > kernel: [<ffffffff811a24aa>] alloc_pages_vma+0x9a/0x140
>>> > kernel: [<ffffffff811b674b>] do_huge_pmd_anonymous_page+0x10b/0x410
>>> > kernel: [<ffffffff81182334>] handle_mm_fault+0x184/0xd60
>>> > kernel: [<ffffffff8160f1e6>] __do_page_fault+0x156/0x520
>>> > kernel: [<ffffffff8118a945>] ? change_protection+0x65/0xa0
>>> > kernel: [<ffffffff811a0dbb>] ? change_prot_numa+0x1b/0x40
>>> > kernel: [<ffffffff810adb86>] ? task_numa_work+0x266/0x300
>>> > kernel: [<ffffffff8160f5ca>] do_page_fault+0x1a/0x70
>>> > kernel: [<ffffffff81013b0c>] ? do_notify_resume+0x9c/0xb0
>>> > kernel: [<ffffffff8160b808>] page_fault+0x28/0x30
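The knobs controlling whether a fault can take that path live in sysfs (on RHEL 6 the directory is named redhat_transparent_hugepage instead). A sketch; the bracketed entry marks the active mode, and the set of available modes varies by kernel version:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never

# opt in only for regions that madvise(MADV_HUGEPAGE), as in the suggestion above:
$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# or turn THP off entirely:
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled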
>>> >
>>> > On Monday, August 7, 2017 at 10:25:50 AM UTC-7, Gil Tene wrote:
>>> >>
>>> >> THP certainly sits in my "just don't do it" list of tuning things, due to its fundamental, dramatic latency disruption in current implementations, seen as occasional 10s to 100s of msec (and sometimes even 1sec+) stalls on something as simple and common as a 32 byte malloc. THP is a form of in-kernel GC. And the current THP implementation involves potential and occasional synchronous, stop-the-world compaction done at allocation time, on or by any application thread that does an mmap or a malloc.
>>> >>
>>> >> I dug up an e-mail I wrote on the subject (to a recipient on this list) back in Jan 2013 [see below]. While it has some specific links (including a stack trace showing the kernel de-fragging the whole system on a single mmap call), note that this material is now 4.5 years old, and things *might* have changed or improved to some degree. While I've seen no recent first-hand evidence of efforts to improve things on the don't-dramatically-stall-malloc (or other mappings) front, I haven't been following it very closely (I just wrote it off as "let's check again in 5 years"). If someone else here knows of some actual improvements to this picture in recent years, or of efforts or discussions in the Linux kernel community on this subject, please point to them.
>>> >>
>>> >> IMO, the notion of THP is not flawed. The implementation is. And I believe that the implementation of THP *can* be improved to be much more robust and to avoid forcing occasional huge latency artifacts on memory-allocating threads:
>>> >>
>>> >> 1. The first (huge) step in improving things would be to never-ever-ever-ever have a mapping thread spend any time performing any kind of defragmentation, and to simply accept 4KB mappings when no 2MB physical pages are available. Let background defragmentation do all the work (including converting 4KB-allocated-but-2MB-contiguous ranges to 2MB mappings).
>>> >>
>>> >> 2. The second level (much needed, but at an order of magnitude of 10s of milliseconds rather than the current 100s of msec or more) would be to make background defragmentation work without stalling foreground access to a currently-being-defragmented 2MB region. I.e. don't stall access for the duration of a 2MB defrag operation (which can take several msec).
>>> >>
>>> >> While both of these are needed for a "don't worry about it" mode of use (which something called "transparent" really should aim for), #1 is a much easier step than #2. Without it, THP can cause application pauses (to any Linux app) that are often worse than e.g. HotSpot Java GC pauses. Which is ironic.
>>> >>
>>> >> -- Gil.
>>> >>
>>> >> -------------------------------
>>> >>
>>> >> The problem is not the background defrag operation. The problem is synchronous defragging done on allocation, where THP on means a 2MB allocation will attempt to allocate a 2MB contiguous page, and if it can't find one, it may end up defragging an entire zone before the allocation completes. The /sys/kernel/mm/transparent_hugepage/defrag setting only controls the background...
>>> >>
>>> >> Here is something I wrote up on it internally after much investigation:
>>> >>
>>> >> Transparent huge pages (THP) is a feature Red Hat championed and introduced in RHEL 6.x; it got into the upstream kernel around the ~2.6.38 time, and it generally exists in all Linux 3.x kernels and beyond (so it exists in both SLES 11 SP2 and in Ubuntu 12.04 LTS). With transparent huge pages, the kernel *attempts* to use 2MB page mappings to map contiguous and aligned memory ranges of that size (which are quite common in many program scenarios), but will break those into 4KB mappings when needed (e.g. when it cannot satisfy them with 2MB pages, or when it needs to swap or page out the memory, since paging is done 4KB at a time). With such a mixed approach, some sort of "defragmenter" or "compactor" is required to exist, because without it simple fragmentation will (over time) make 2MB contiguous physical pages a rare thing, and performance will tend to degrade over time. As a result, and in order to support THP, Linux kernels will attempt to defragment (or "compact") memory and memory zones. This can be done either by unmapping pages, copying their contents to a new compacted space, and mapping them in the new location, or by potentially forcing individually mapped 4KB pages in a 2MB physical page out (via swapping, or by paging them out if they are file system pages), and reclaiming the 2MB contiguous page when that is done. 4KB pages that were forced out will come back in as needed (swapped back in on demand, or paged back in on demand).
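To see how much of a process is actually being backed by 2MB mappings at any given moment, the AnonHugePages counters are the usual place to look. A sketch with illustrative values (12345 is a placeholder pid):

$ grep AnonHugePages /proc/meminfo
AnonHugePages:   6291456 kB

$ grep AnonHugePages /proc/12345/smaps | awk '$2 > 0'    # per-mapping breakdown, non-zero entries only
AnonHugePages:    524288 kB
AnonHugePages:      2048 kB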
>>> >>
>>> >> Defragmentation/compaction with THP can happen in two places:
>>> >>
>>> >> 1. First, there is a background defragmenter (a process called "khugepaged") that goes around and compacts 2MB physical pages by pushing their 4KB pages out when possible. This background defragger could potentially cause pages to be swapped out if swapping is enabled, even with no swapping pressure in place.
>>> >>
>>> >> 2. "Synchronous compaction": in some cases, an on-demand page fault (e.g. when first accessing a newly allocated 4KB page created via mmap() or malloc()) could end up trying to compact memory in order to fault into a 2MB physical page instead of a 4KB page (this can be seen in the stack trace discussed in this posting, for example: https://access.redhat.com/solutions/1560893). When this happens, a single 4KB allocation could end up waiting for an attempt to compact an entire "zone" of pages, even if those are compacted purely through in-memory moves with no I/O. It can also be blocked waiting for disk I/O, as seen in some stack traces in related discussions.
>>> >>
>>> >> More details can be found in places like this:
>>> >> http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt
>>> >> http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf
>>> >>
>>> >> And examples of avoiding thrashing by disabling THP in RHEL 6.2 are around:
>>> >> http://oaktable.net/content/linux-6-transparent-huge-pages-and-hadoop-workloads
>>> >> http://techaticpsr.blogspot.com/2012/04/its-official-we-have-no-love-for.html
>>> >>
>>> >> BOTTOM LINE: Transparent Huge Pages is a well-intended idea that helps compact physical memory and use more optimal mappings in the kernel, but it can come with some significant (and often surprising) latency impacts. I recommend we turn it off by default in Zing installations, and it appears that many other software packages (including most DBs, and many Java-based apps) recommend the same.
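For anyone following that recommendation: echoing into sysfs does not survive a reboot. The usual way to make it stick is the kernel command line, e.g. (the grub file location and the config-regeneration command are distro-dependent):

# /etc/default/grub
GRUB_CMDLINE_LINUX="... transparent_hugepage=never"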