On Wed, 5 Dec 2018, Mel Gorman wrote:

> > This is a single MADV_HUGEPAGE usecase, there is nothing special about
> > it. It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE),
> > and faulted the memory with a fragmented local node and then measured
> > the remote access latency to the remote hugepage that occurs without
> > setting __GFP_THISNODE. You can also measure the remote allocation
> > latency by fragmenting the entire system and then faulting.
>
> I'll make the same point as before: the form the fragmentation takes
> matters, as well as the types of pages that are resident and whether
> they are active or not. It affects the level of work the system does
> as well as the overall success rate of operations (be it reclaim, THP
> allocation, compaction, whatever). This is why a reproduction case
> that is representative of the problem you're facing on the real
> workload would have been helpful: any alternative proposal could then
> have taken your workload into account during testing.
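(For reference, the userspace half of the scenario you describe is
simple to write. This is a minimal sketch with error handling elided;
the move_pages() placement check is my addition for illustration and
needs -lnuma.)

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <numaif.h>	/* move_pages() */

	#define HPAGE_SIZE	(2UL << 20)
	#define NR_HPAGES	512

	int main(void)
	{
		size_t len = NR_HPAGES * HPAGE_SIZE;
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		int i;

		madvise(p, len, MADV_HUGEPAGE);
		memset(p, 1, len);	/* fault the memory */

		/* report which node each 2MB region actually landed on */
		for (i = 0; i < NR_HPAGES; i++) {
			void *addr = (char *)p + i * HPAGE_SIZE;
			int node;

			move_pages(0, 1, &addr, NULL, &node, 0);
			printf("%p: node %d\n", addr, node);
		}
		return 0;
	}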
We know from Andrea's report that compaction is failing, and failing
repeatedly, because otherwise we would not need excessive swapping to
make it work. That can mean one of two things: (1) a general
low-on-memory situation that causes us to repeatedly fall under the
watermarks that deem compaction suitable (isolate_freepages() will be
too painful), or (2) compaction has the memory that it needs but is
failing to make a hugepage available because not all pages within a
pageblock can be migrated.

If (1), perhaps in the presence of an antagonist that quickly allocates
the memory before compaction can pass its watermark checks, further
reclaim is not beneficial: the allocation is becoming too expensive and
there is no guarantee that compaction can find this reclaimed memory in
isolate_freepages(). (The watermark check in question is paraphrased at
the end of this mail.)

I chose to duplicate (2) by synthetically introducing fragmentation
locally (high-order slab allocations, freeing every other one; a sketch
of the generator is also at the end of this mail) to test the patch
that does not set __GFP_THISNODE. The result is a remote transparent
hugepage, and we do not even need to get to the point of local
compaction for that fallback to happen. This is where I measure the
13.9% access latency regression, for the lifetime of the binary, as a
result of this patch.

If local compaction works the first time, great! But that is not what
is happening in Andrea's report, and as a result of not setting
__GFP_THISNODE we are *guaranteed* worse access latency and may see
even worse allocation latency if the remote memory is fragmented as
well. So while I'm only testing the functional behavior of the patch
itself, I cannot speak to the nature of the local fragmentation on
Andrea's systems.
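To make case (1) concrete, the gate that compaction must pass is the
order-0 watermark test in __compaction_suitable(). Paraphrasing the
check in mm/compaction.c from memory (approximate, not verbatim):

	/*
	 * Compaction is only deemed suitable if the zone sits above the
	 * low watermark plus a gap of 2UL << order free pages, so that
	 * there are migration targets for isolate_freepages() to find.
	 */
	watermark = low_wmark_pages(zone) + compact_gap(order);
	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
				 ALLOC_CMA, wmark_target))
		return COMPACT_SKIPPED;

An antagonist that consumes the reclaimed memory between reclaim and
this check makes it fail over and over, which is why more reclaim does
not help here.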
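And for case (2), the fragmentation generator looks roughly like the
following (a simplified sketch, not the exact code I ran; NR_OBJS is
illustrative and would be sized to the capacity of the local node):
allocate enough order-2 slab memory that most pageblocks contain at
least one unmovable slab page, then free every other object so the
zone shows plenty of free memory but compaction can assemble almost
no hugepages from it.

	#include <linux/module.h>
	#include <linux/slab.h>
	#include <linux/vmalloc.h>

	#define NR_OBJS	(1 << 18)	/* illustrative count */

	static void **objs;

	static int __init frag_init(void)
	{
		int i;

		objs = vzalloc(NR_OBJS * sizeof(void *));
		if (!objs)
			return -ENOMEM;

		/* order-2 (16KB) slab allocations across pageblocks */
		for (i = 0; i < NR_OBJS; i++)
			objs[i] = kmalloc(16 << 10, GFP_KERNEL);

		/*
		 * Free every other object: plenty of free memory, but
		 * nearly every pageblock is pinned by unmovable slab
		 * pages that compaction cannot migrate.
		 */
		for (i = 0; i < NR_OBJS; i += 2) {
			kfree(objs[i]);	/* kfree(NULL) is a no-op */
			objs[i] = NULL;
		}
		return 0;
	}

	static void __exit frag_exit(void)
	{
		int i;

		for (i = 0; i < NR_OBJS; i++)
			kfree(objs[i]);
		vfree(objs);
	}

	module_init(frag_init);
	module_exit(frag_exit);
	MODULE_LICENSE("GPL");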