From what I see here I would deduce:
1- THP can give a huge performance gain (when using PreTouch, in some cases, possibly when not playing with off-heap too much)
2- But it will increase hiccups

A bit like the throughput collector. So my current takeaway is:
- Use THP if you care about throughput only
- If you care about latency, just don't
- If you really care about throughput, use non-transparent huge pages

Is that accurate?
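For reference, "non-transparent" huge pages here means explicitly reserved (hugetlbfs) pages, which the JVM can use via -XX:+UseLargePages. A minimal sketch of the Linux-side setup, with illustrative values (the page count must be sized to the heap, gid 1001 is a placeholder, and the JVM's user typically also needs a matching memlock ulimit):

$ echo 2048 | sudo tee /proc/sys/vm/nr_hugepages        # reserve 2048 x 2MB pages (= 4GB) up front
$ echo 1001 | sudo tee /proc/sys/vm/hugetlb_shm_group   # group allowed to use huge-page shm
$ java -Xms4g -Xmx4g -XX:+UseLargePages -XX:+AlwaysPreTouch ...

Because the pages are reserved ahead of time, there is no allocation-time defragmentation to stall on; the trade-off is that the reservation is static, and if it can't be satisfied the JVM typically falls back to regular pages with a warning.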
On 9 August 2017 at 04:51, Peter Veentjer <alarmnum...@gmail.com> wrote:
> Thanks for your very useful replies Gil.
>
> Question:
>
> Using huge pages can give a big performance boost:
>
> https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/
>
> $ time java -Xms4T -Xmx4T -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch
> real 13m58.167s
> user 43m37.519s
> sys 1011m25.740s
>
> $ time java -Xms4T -Xmx4T -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
> real 2m14.758s
> user 1m56.488s
> sys 73m59.046s
>
> But THP seems to be unusable. Does this effectively mean that we can't benefit from THP under Linux?
>
> So far it looks like a damned-if-you-do, damned-if-you-don't situation.
>
> Or should we move to non-transparent huge pages?
>
> On Tuesday, August 8, 2017 at 7:44:25 PM UTC+3, Gil Tene wrote:
>>
>> On Monday, August 7, 2017 at 11:50:27 AM UTC-7, Alen Vrečko wrote:
>>>
>>> Saw this a while back.
>>>
>>> https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/
>>>
>>> Basically using THP/defrag with madvise and using -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch JVM opts.
>>>
>>> Looks like the defrag cost should be paid in full at startup due to AlwaysPreTouch. Never got around to testing this in production. Just have THP disabled. Thoughts?
>>
>> The above flags would only cover the Java heap in a Java application, so obviously THP for non-Java things doesn't get helped by that.
>>
>> And for Java stuff, unfortunately, there are lots of non-Java-heap things that are exposed to THP's potentially huge on-demand faulting latencies. The JVM manages lots of memory outside of the Java heap for various things (GC stuff, stacks, Metaspace, code cache, JIT compiler things, and a whole bunch of runtime stuff), and the application itself will often be using off-heap memory intentionally (e.g. via DirectByteBuffers) or inadvertently (e.g. when libraries make either temporary or lasting use of off-heap memory). E.g. even simple socket I/O involves some use of off-heap memory as an intermediate storage location.
>>
>> As a simple demonstration of why THP artifacts in non-Java-heap memory are a key problem for Java apps: I first ran into these THP issues by experience, with Zing, right around the time that RHEL 6 turned it on. We found out the hard way that we have to turn it off to maintain reasonable latency profiles. And since Zing has always *ensured* that 2MB pages are used for everything in the heap, the code cache, and virtually all GC support structures, it is clearly the THP impact on all the rest of the stuff that has caused us to deal with it and recommend against its use. The way we see THP manifest regularly (when left on) is with occasional huge TTSPs (time to safepoint) [huge in Zing terms, meaning anything from a few msec to 100s of msec], which we know are there because we specifically log and chart TTSPs. But high TTSPs are just a symptom: since we only measure TTSP when we actually try to bring threads to a safepoint, and doing that is a relatively (dynamically) rare event, whenever we see actual high TTSPs in our logs it is likely that similar-sized disruptions are occurring at the application level, but at a much higher frequency than that of the high TTSPs we observe.
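One way to check whether synchronous compaction is actually stalling threads on a given box is the kernel's counters in /proc/vmstat. A sketch with annotations; the values are made up, and exact counter names vary somewhat across kernel versions:

$ egrep 'compact_stall|compact_fail|thp_fault' /proc/vmstat
compact_stall 1203        # faults that stalled for direct (synchronous) compaction
compact_fail 87
thp_fault_alloc 402113    # THP faults satisfied with a 2MB page
thp_fault_fallback 5120   # THP faults that fell back to 4KB pages

A compact_stall count that keeps climbing while the application runs is a strong hint that faulting threads are taking the allocation-time compaction path described below.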
>>
>>> - Alen
>>>
>>> 2017-08-07 20:14 GMT+02:00 Gil Tene <g...@azul.com>:
>>> > To highlight the problematic path in current THP kernel implementations, here is an example call trace that can happen (pulled from the discussion linked to below). It shows that a simple on-demand page fault in regular anonymous memory (e.g. when a normal malloc call is made and manipulates a malloc-managed 2MB area, or when the resulting malloc'ed struct is written to) can end up compacting an entire zone (which can be the vast majority of system memory) in a single call, using the faulting thread. The specific example stack trace is taken from a situation where that fault took so long (on a NUMA system) that a soft lockup was triggered, showing the call took longer than 22 seconds (!!!). But even without the NUMA or migrate_pages aspects, compaction of a single zone can take 100s of msec or more.
>>> >
>>> > Browsing through the current kernel code (e.g. http://elixir.free-electrons.com/linux/latest/source/mm/compaction.c#L1722) seems to show that this is still the likely path that would be taken when no free 2MB pages are found in current kernels :-(
>>> >
>>> > And this situation will naturally occur under all sorts of common timing conditions (I/O fragmenting free memory into 4KB pages (but no free 2MB ones), background compaction/defrag falling behind during some heavy kernel-driven I/O spike, and some unlucky thread doing a malloc when the 2MB physical free list is exhausted).
>>> >
>>> > kernel: Call Trace:
>>> > kernel: [<ffffffff81179d8f>] compaction_alloc+0x1cf/0x240
>>> > kernel: [<ffffffff811b15ce>] migrate_pages+0xce/0x610
>>> > kernel: [<ffffffff81179bc0>] ? isolate_freepages_block+0x380/0x380
>>> > kernel: [<ffffffff8117abb9>] compact_zone+0x299/0x400
>>> > kernel: [<ffffffff8117adbc>] compact_zone_order+0x9c/0xf0
>>> > kernel: [<ffffffff8117b171>] try_to_compact_pages+0x121/0x1a0
>>> > kernel: [<ffffffff815ff336>] __alloc_pages_direct_compact+0xac/0x196
>>> > kernel: [<ffffffff81160758>] __alloc_pages_nodemask+0x788/0xb90
>>> > kernel: [<ffffffff810b11c0>] ? task_numa_fault+0x8d0/0xbb0
>>> > kernel: [<ffffffff811a24aa>] alloc_pages_vma+0x9a/0x140
>>> > kernel: [<ffffffff811b674b>] do_huge_pmd_anonymous_page+0x10b/0x410
>>> > kernel: [<ffffffff81182334>] handle_mm_fault+0x184/0xd60
>>> > kernel: [<ffffffff8160f1e6>] __do_page_fault+0x156/0x520
>>> > kernel: [<ffffffff8118a945>] ? change_protection+0x65/0xa0
>>> > kernel: [<ffffffff811a0dbb>] ? change_prot_numa+0x1b/0x40
>>> > kernel: [<ffffffff810adb86>] ? task_numa_work+0x266/0x300
>>> > kernel: [<ffffffff8160f5ca>] do_page_fault+0x1a/0x70
>>> > kernel: [<ffffffff81013b0c>] ? do_notify_resume+0x9c/0xb0
>>> > kernel: [<ffffffff8160b808>] page_fault+0x28/0x30
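The knobs controlling whether a fault can take that path live in sysfs (on RHEL 6 the directory is named redhat_transparent_hugepage instead). A sketch; the bracketed entry marks the active mode, and the set of available modes varies by kernel version:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ cat /sys/kernel/mm/transparent_hugepage/defrag
[always] madvise never

# opt in only for regions that madvise(MADV_HUGEPAGE), as in the suggestion above:
$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# or turn THP off entirely:
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled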
>>> >
>>> > On Monday, August 7, 2017 at 10:25:50 AM UTC-7, Gil Tene wrote:
>>> >>
>>> >> THP certainly sits in my "just don't do it" list of tuning things, due to its fundamental, dramatic latency disruption in current implementations, seen as occasional 10s to 100s of msec (and sometimes even 1sec+) stalls on something as simple and common as a 32 byte malloc. THP is a form of in-kernel GC. And the current THP implementation involves potential and occasional synchronous, stop-the-world compaction done at allocation time, on or by any application thread that does an mmap or a malloc.
>>> >>
>>> >> I dug up an e-mail I wrote on the subject (to a recipient on this list) back in Jan 2013 [see below]. While it has some specific links (including a stack trace showing the kernel de-fragging the whole system on a single mmap call), note that this material is now 4.5 years old, and things *might* have changed or improved to some degree. While I've seen no recent first-hand evidence of efforts to improve things on the don't-dramatically-stall-malloc (or other mappings) front, I haven't been following it very closely (I just wrote it off as "let's check again in 5 years"). If someone else here knows of some actual improvements to this picture in recent years, or of efforts or discussions in the Linux kernel community on this subject, please point to them.
>>> >>
>>> >> IMO, the notion of THP is not flawed. The implementation is. And I believe that the implementation of THP *can* be improved to be much more robust and to avoid forcing occasional huge latency artifacts on memory-allocating threads:
>>> >>
>>> >> 1. The first (huge) step in improving things would be to never-ever-ever-ever have a mapping thread spend any time performing any kind of defragmentation, and to simply accept 4KB mappings when no 2MB physical pages are available. Let background defragmentation do all the work (including converting 4KB-allocated-but-2MB-contiguous ranges to 2MB mappings).
>>> >>
>>> >> 2. The second level (much needed, but at an order of magnitude of 10s of milliseconds rather than the current 100s of msec or more) would be to make background defragmentation work without stalling foreground access to a currently-being-defragmented 2MB region. I.e. don't stall access for the duration of a 2MB defrag operation (which can take several msec).
>>> >>
>>> >> While both of these are needed for a "don't worry about it" mode of use (which something called "transparent" really should aim for), #1 is a much easier step than #2. Without it, THP can cause application pauses (to any Linux app) that are often worse than e.g. HotSpot Java GC pauses. Which is ironic.
>>> >>
>>> >> -- Gil.
>>> >>
>>> >> -------------------------------
>>> >>
>>> >> The problem is not the background defrag operation. The problem is synchronous defragging done on allocation, where THP on means a 2MB allocation will attempt to allocate a 2MB contiguous page, and if it can't find one, it may end up defragging an entire zone before the allocation completes. The /sys/kernel/mm/transparent_hugepage/defrag setting only controls the background...
>>> >>
>>> >> Here is something I wrote up on it internally after much investigation:
>>> >>
>>> >> Transparent huge pages (THP) is a feature Red Hat championed and introduced in RHEL 6.x; it got into the upstream kernel around the ~2.6.38 time, and it generally exists in all Linux 3.x kernels and beyond (so it exists in both SLES 11 SP2 and in Ubuntu 12.04 LTS). With transparent huge pages, the kernel *attempts* to use 2MB page mappings to map contiguous and aligned memory ranges of that size (which are quite common in many program scenarios), but will break those into 4KB mappings when needed (e.g. when it cannot satisfy them with 2MB pages, or when it needs to swap or page out the memory, since paging is done 4KB at a time). With such a mixed approach, some sort of "defragmenter" or "compactor" is required to exist, because without it simple fragmentation will (over time) make 2MB contiguous physical pages a rare thing, and performance will tend to degrade over time. As a result, and in order to support THP, Linux kernels will attempt to defragment (or "compact") memory and memory zones. This can be done either by unmapping pages, copying their contents to a new compacted space, and mapping them in the new location, or by potentially forcing individually mapped 4KB pages in a 2MB physical page out (via swapping, or by paging them out if they are file system pages), and reclaiming the 2MB contiguous page when that is done. 4KB pages that were forced out will come back in as needed (swapped back in on demand, or paged back in on demand).
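To see how much of a process is actually being backed by 2MB mappings at any given moment, the AnonHugePages counters are the usual place to look. A sketch with illustrative values (12345 is a placeholder pid):

$ grep AnonHugePages /proc/meminfo
AnonHugePages:   6291456 kB

$ grep AnonHugePages /proc/12345/smaps | awk '$2 > 0'    # per-mapping breakdown, non-zero entries only
AnonHugePages:    524288 kB
AnonHugePages:      2048 kB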
>>> >>
>>> >> Defragmentation/compaction with THP can happen in two places:
>>> >>
>>> >> 1. First, there is a background defragmenter (a process called "khugepaged") that goes around and compacts 2MB physical pages by pushing their 4KB pages out when possible. This background defragger could potentially cause pages to be swapped out if swapping is enabled, even with no swapping pressure in place.
>>> >>
>>> >> 2. "Synchronous compaction": in some cases, an on-demand page fault (e.g. when first accessing a newly allocated 4KB page created via mmap() or malloc()) could end up trying to compact memory in order to fault into a 2MB physical page instead of a 4KB page (this can be seen in the stack trace discussed in this posting, for example: https://access.redhat.com/solutions/1560893). When this happens, a single 4KB allocation could end up waiting for an attempt to compact an entire "zone" of pages, even if those are compacted purely through in-memory moves with no I/O. It can also be blocked waiting for disk I/O, as seen in some stack traces in related discussions.
>>> >>
>>> >> More details can be found in places like this:
>>> >> http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt
>>> >> http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf
>>> >>
>>> >> And examples of avoiding thrashing by disabling THP in RHEL 6.2 are around:
>>> >> http://oaktable.net/content/linux-6-transparent-huge-pages-and-hadoop-workloads
>>> >> http://techaticpsr.blogspot.com/2012/04/its-official-we-have-no-love-for.html
>>> >>
>>> >> BOTTOM LINE: Transparent Huge Pages is a well-intended idea that helps compact physical memory and use more optimal mappings in the kernel, but it can come with some significant (and often surprising) latency impacts. I recommend we turn it off by default in Zing installations, and it appears that many other software packages (including most DBs, and many Java-based apps) recommend the same.
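For anyone following that recommendation: echoing into sysfs does not survive a reboot. The usual way to make it stick is the kernel command line, e.g. (the grub file location and the config-regeneration command are distro-dependent):

# /etc/default/grub
GRUB_CMDLINE_LINUX="... transparent_hugepage=never"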