To highlight the problematic path in current THP kernel implementations, here is an example call trace that can happen (pulled from the discussion linked below). It shows that a simple on-demand page fault in regular anonymous memory (e.g. when a normal malloc call manipulates a malloc-managed 2MB area, or when the resulting malloc'ed struct is written to) can end up compacting an entire zone (which can be the vast majority of system memory) in a single call, using the faulting thread. This specific stack trace was taken from a situation where the fault took so long (on a NUMA system) that a soft lockup was triggered, showing the call took longer than 22 seconds (!!!). But even without the NUMA or migrate_pages aspects, compaction of a single zone can take 100s of msec or more.
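Short of disabling THP system-wide, an application can opt individual regions out of this fault path with madvise(MADV_NOHUGEPAGE). Here is a minimal sketch in Python (the 4MB size is illustrative; mmap.MADV_NOHUGEPAGE is only exposed on Linux builds of Python 3.8+, hence the guard):

```python
# Sketch: opt one anonymous mapping out of THP, so a first-touch fault
# in it can only be a plain 4KB fault and never enters synchronous
# compaction hunting for a free 2MB page. Assumes Linux, Python 3.8+.
import mmap

length = 4 * 1024 * 1024            # illustrative 4MB region
buf = mmap.mmap(-1, length)         # anonymous private mapping

advised = False
if hasattr(mmap, "MADV_NOHUGEPAGE"):    # absent on non-Linux builds
    try:
        buf.madvise(mmap.MADV_NOHUGEPAGE)
        advised = True
    except OSError:
        pass                        # e.g. kernel built without THP

buf[:4096] = b"\x00" * 4096         # first touch of the region
first_page = buf[:4096]
buf.close()
```

The same opt-out is available to C programs via the madvise(2) call itself; conversely, with THP's "madvise" mode, only regions explicitly marked MADV_HUGEPAGE are eligible for huge pages at all.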
Browsing through the current kernel code (e.g. http://elixir.free-electrons.com/linux/latest/source/mm/compaction.c#L1722) seems to show that this is still the likely path taken when no free 2MB pages are found in current kernels :-( And this situation will naturally occur under all sorts of common timing conditions: i/o fragmenting free memory into 4KB (but not 2MB) units, background compaction/defrag falling behind during some heavy kernel-driven i/o spike, and some unlucky thread doing a malloc when the 2MB physical free list is exhausted.

kernel: Call Trace:
kernel: [<ffffffff81179d8f>] compaction_alloc+0x1cf/0x240
kernel: [<ffffffff811b15ce>] migrate_pages+0xce/0x610
kernel: [<ffffffff81179bc0>] ? isolate_freepages_block+0x380/0x380
kernel: [<ffffffff8117abb9>] compact_zone+0x299/0x400
kernel: [<ffffffff8117adbc>] compact_zone_order+0x9c/0xf0
kernel: [<ffffffff8117b171>] try_to_compact_pages+0x121/0x1a0
kernel: [<ffffffff815ff336>] __alloc_pages_direct_compact+0xac/0x196
kernel: [<ffffffff81160758>] __alloc_pages_nodemask+0x788/0xb90
kernel: [<ffffffff810b11c0>] ? task_numa_fault+0x8d0/0xbb0
kernel: [<ffffffff811a24aa>] alloc_pages_vma+0x9a/0x140
kernel: [<ffffffff811b674b>] do_huge_pmd_anonymous_page+0x10b/0x410
kernel: [<ffffffff81182334>] handle_mm_fault+0x184/0xd60
kernel: [<ffffffff8160f1e6>] __do_page_fault+0x156/0x520
kernel: [<ffffffff8118a945>] ? change_protection+0x65/0xa0
kernel: [<ffffffff811a0dbb>] ? change_prot_numa+0x1b/0x40
kernel: [<ffffffff810adb86>] ? task_numa_work+0x266/0x300
kernel: [<ffffffff8160f5ca>] do_page_fault+0x1a/0x70
kernel: [<ffffffff81013b0c>] ? do_notify_resume+0x9c/0xb0
kernel: [<ffffffff8160b808>] page_fault+0x28/0x30

On Monday, August 7, 2017 at 10:25:50 AM UTC-7, Gil Tene wrote:

> THP certainly sits in my "just don't do it" list of tuning things due to its fundamental dramatic latency disruption in current implementations, seen as occasional 10s to 100s of msec (and sometimes even 1sec+) stalls on something as simple and common as a 32 byte malloc. THP is a form of in-kernel GC. And the current THP implementation involves potential and occasional synchronous, stop-the-world compaction done at allocation time, on or by any application thread that does an mmap or a malloc.
>
> I dug up an e-mail I wrote on the subject (to a recipient on this list) back in Jan 2013 [see below]. While it has some specific links (including a stack trace showing the kernel de-fragging the whole system on a single mmap call), note that this material is now 4.5 years old, and things *might* have changed or improved to some degree. While I've seen no recent first-hand evidence of efforts to improve things on the don't-dramatically-stall-malloc (or other mappings) front, I haven't been following it very closely (I just wrote it off as "let's check again in 5 years"). If someone else here knows of some actual improvements to this picture in recent years, or of efforts or discussions in the Linux kernel community on this subject, please point to them.
>
> IMO, the notion of THP is not flawed. The implementation is. And I believe that the implementation of THP *can* be improved to be much more robust and to avoid forcing occasional huge latency artifacts on memory-allocating threads:
>
> 1. The first (huge) step in improving things would be to never-ever-ever-ever have a mapping thread spend any time performing any kind of defragmentation, and to simply accept 4KB mappings when no 2MB physical pages are available.
> Let background defragmentation do all the work (including converting 4KB-allocated-but-2MB-contiguous ranges to 2MB mappings).
>
> 2. The second level (much needed, but at an order of magnitude of 10s of milliseconds rather than the current 100s of msec or more) would be to make background defragmentation work without stalling foreground access to a currently-being-defragmented 2MB region. I.e. don't stall access for the duration of a 2MB defrag operation (which can take several msec).
>
> While both of these are needed for a "don't worry about it" mode of use (which something called "transparent" really should aim for), #1 is a much easier step than #2. Without it, THP can cause application pauses (to any Linux app) that are often worse than e.g. HotSpot Java GC pauses. Which is ironic.
>
> -- Gil.
>
> -------------------------------
>
> The problem is not the background defrag operation. The problem is synchronous defragging done on allocation, where THP on means a 2MB allocation will attempt to allocate a 2MB contiguous page, and if it can't find one, it may end up defragging an entire zone before the allocation completes. The /sys/kernel/mm/transparent_hugepage/defrag setting only controls the background...
>
> Here is something I wrote up on it internally after much investigation:
>
> Transparent huge pages (THP) is a feature Red Hat championed and introduced in RHEL 6.x, and it got into the upstream kernel around the ~2.6.38 time frame; it generally exists in all Linux 3.x kernels and beyond (so it exists in both SLES 11 SP2 and in Ubuntu 12.04 LTS). With transparent huge pages, the kernel *attempts* to use 2MB page mappings to map contiguous and aligned memory ranges of that size (which are quite common for many program scenarios), but will break those into 4KB mappings when needed (e.g. when it cannot satisfy the mapping with 2MB pages, or when it needs to swap or page out the memory, since paging is done 4KB at a time).
> With such a mixed approach, some sort of a "defragmenter" or "compactor" is required to exist, because without it simple fragmentation will (over time) make 2MB contiguous physical pages a rare thing, and performance will tend to degrade over time. As a result, and in order to support THP, Linux kernels will attempt to defragment (or "compact") memory and memory zones. This can be done either by unmapping pages, copying their contents to a new compacted space, and mapping them in the new location, or by potentially forcing individually mapped 4KB pages in a 2MB physical page out (via swapping, or by paging them out if they are file system pages), and reclaiming the 2MB contiguous page when that is done. 4KB pages that were forced out will come back in as needed (swapped back in on demand, or paged back in on demand).
>
> Defragmentation/compaction with THP can happen in two places:
>
> 1. First, there is a background defragmenter (a process called "khugepaged") that goes around and compacts 2MB physical pages by pushing their 4KB pages out when possible. This background defragger could potentially cause pages to be swapped out if swapping is enabled, even with no swapping pressure in place.
>
> 2. "Synchronous Compaction": In some cases, an on-demand page fault (e.g. when first accessing a newly allocated 4KB page created via mmap() or malloc()) could end up trying to compact memory in order to fault into a 2MB physical page instead of a 4KB page (this can be seen in the stack trace discussed in this posting, for example: https://access.redhat.com/solutions/1560893). When this happens, a single 4KB allocation could end up waiting for an attempt to compact an entire "zone" of pages, even if those are compacted purely through in-memory moves with no I/O. It can also be blocked waiting for disk I/O, as seen in some stack traces in related discussions.
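Whether threads on a given box have been hitting this synchronous ("direct") compaction path can be estimated from the kernel's /proc/vmstat counters. A minimal sketch in Python; the counter names are the upstream ones (compact_stall counts entries into direct compaction, thp_fault_fallback counts THP faults that fell back to 4KB pages) and may be absent depending on kernel version and config:

```python
# Sketch: read the compaction- and THP-related event counters from
# /proc/vmstat. Assumes Linux; missing counters (e.g. a kernel built
# without CONFIG_COMPACTION) are simply reported as absent.
watch = (
    "compact_stall",        # a thread entered direct (synchronous) compaction
    "compact_fail",         # direct compaction ran and failed
    "compact_success",      # direct compaction ran and succeeded
    "thp_fault_alloc",      # a fault was satisfied with a huge page
    "thp_fault_fallback",   # THP fell back to 4KB pages on a fault
)

counters = {}
try:
    with open("/proc/vmstat") as f:
        for line in f:
            name, _, value = line.partition(" ")
            if name in watch:
                counters[name] = int(value)
except FileNotFoundError:
    pass                    # not a Linux /proc; leave counters empty

for name in watch:
    print(name, counters.get(name, "<not present on this kernel>"))
```

A compact_stall count that keeps climbing while an application is running is a sign that allocating threads are paying for compaction in line with their faults.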
> More details can be found in places like this:
>
> http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt
> http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf
>
> And examples of cases of avoiding thrashing by disabling THP on RHEL 6.2 are around:
>
> http://oaktable.net/content/linux-6-transparent-huge-pages-and-hadoop-workloads
> http://techaticpsr.blogspot.com/2012/04/its-official-we-have-no-love-for.html
>
> *BOTTOM LINE: Transparent Huge Pages is a well-intended idea that helps compact physical memory and use more optimal mappings in the kernel, but it can come with some significant (and often surprising) latency impacts. I recommend we turn it off by default in Zing installations, and it appears that many other software packages (including most DBs, and many Java-based apps) recommend the same.*

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group. To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.