On Saturday, August 12, 2017 at 3:01:31 AM UTC-7, Alexandr Nikitin wrote:
>
> I played with Transparent Hugepages some time ago and I want to share some numbers based on real-world high-load applications.
> We have a JVM application: a high-load TCP server based on Netty. There is no single clear bottleneck; CPU, memory and network are all heavily loaded. The amount of work depends on request content.
> The following numbers are based on normal server load, ~40% of the maximum number of requests one server can handle.
>
> *When THP is off:*
> End-to-end application latency in microseconds:
> "p50" : 718.891,
> "p95" : 4110.26,
> "p99" : 7503.938,
> "p999" : 15564.827,
>
> perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
> ...
> ... 25,164,369 iTLB-load-misses
> ... 81,154,170 dTLB-load-misses
> ...
>
> *When THP is always on:*
> End-to-end application latency in microseconds:
> "p50" : 601.196,
> "p95" : 3260.494,
> "p99" : 7104.526,
> "p999" : 11872.642,
>
> perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
> ...
> ... 21,400,513 dTLB-load-misses
> ... 4,633,644 iTLB-load-misses
> ...
>
> As you can see, the THP performance impact is measurable and too significant to ignore: 4.1 ms vs 3.3 ms at p95, and ~100M vs ~25M TLB misses.
> I also used SystemTap to measure a few kernel functions such as collapse_huge_page, clear_huge_page and split_huge_page. There were no significant spikes with THP enabled.
> AFAIR that was the 3.10 kernel, which is 4 years old now. I can repeat the experiments with newer kernels if there's interest. (I don't know what has changed there, though.)
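For anyone who wants to reproduce this kind of measurement, a minimal sketch using the standard Linux sysfs/procfs paths and the same perf invocation as above (JVM_PID is a placeholder for the server's process id):

# Which THP mode is the kernel in? The bracketed value is the active one.
cat /sys/kernel/mm/transparent_hugepage/enabled

# Is the process actually backed by huge pages? A non-zero total means yes.
grep AnonHugePages /proc/JVM_PID/smaps | awk '{sum += $2} END {print sum " kB"}'

# TLB miss counts, sampled once per second, as in the numbers above.
perf stat -e dTLB-load-misses,iTLB-load-misses -p JVM_PID -I 1000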
Unfortunately, just because you didn't run into a huge spike during your test doesn't mean it won't hit you in the future... The stack trace example I posted earlier represents the path that will be taken if an on-demand allocation page fault on a THP-allocated region happens when no free 2MB page is available in the system. Inducing that behavior is not that hard; e.g. just do a bunch of high-volume journaling or logging, and you'll probably trigger it eventually. And when it does take that path, it will be your thread de-fragging the entire system's physical memory, one 2MB page at a time. When that happens, you're probably not talking 10-20 msec; more like several hundred msec, growing with the system's physical memory size (the specific stack trace is taken from a RHEL issue that reported >22 seconds).

If that occasional outlier is something you are fine with, then turning THP on for the speed benefits you may be seeing makes sense. But if you can't accept the occasional ~0.5+ sec freezes, turn it off (the relevant sysfs knobs are sketched after the quoted question below).

> On Monday, August 7, 2017 at 6:42:21 PM UTC+3, Peter Veentjer wrote:
>>
>> Hi Everyone,
>>
>> I'm failing to understand the problem with transparent huge pages.
>>
>> I 'understand' how normal pages work. A page is typically 4KB in a virtual address space; each process has its own.
>>
>> I understand how the TLB fits in: a cache providing a mapping of virtual to physical addresses to speed up address translation.
>>
>> I understand that using a large page, e.g. 2MB instead of 4KB, can reduce pressure on the TLB.
>>
>> So far huge pages seem to make a lot of sense; of course at the expense of wasting memory if only a small section of a page is being used.
>>
>> The first part I don't understand is: why is it called transparent huge pages? What is transparent about it?
>>
>> The second part I'm failing to understand is: why can it cause problems? There are quite a few applications that recommend disabling THP, and I recently helped a customer that was helped by disabling it. It seems there is more going on behind the scenes than an increased page size. Is it caused by fragmentation? That is, if a new huge page is needed and memory is fragmented (due to smaller pages), those small pages need to be compacted before the huge page can be allocated? But if that were the only thing, it shouldn't be a problem once all pages for the application have been touched and all pages are retained.
>>
>> So I'm probably missing something simple.
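For reference, regarding the "turn it off" advice above: a minimal sketch of the standard Linux THP sysfs knobs (run as root; the exact set of accepted defrag values varies by kernel version):

# See the current mode; the bracketed value is active (always / madvise / never).
cat /sys/kernel/mm/transparent_hugepage/enabled

# Disable THP entirely, so page faults never take the huge-page allocation path.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Middle ground: only mappings that opt in via madvise(MADV_HUGEPAGE) get huge pages.
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# Counters showing how often THP faults fell back, or pages were collapsed/split
# (exact counter names differ slightly across kernel versions).
egrep 'thp_(fault|collapse|split)' /proc/vmstat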