On Wed, 5 Dec 2018, Andrea Arcangeli wrote:

> > I must have said this at least six or seven times: fault latency is 
> 
> In your original regression report in this thread to Linus:
> 
> https://lkml.kernel.org/r/[email protected]
> 
> you said "On a fragmented host, the change itself showed a 13.9%
> access latency regression on Haswell and up to 40% allocation latency
>                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> regression. This is more substantial on Naples and Rome.  I also
> ^^^^^^^^^^
> measured similar numbers to this for Haswell."
> 
> > secondary to the *access* latency.  We want to try hard for MADV_HUGEPAGE 
> > users to do synchronous compaction and try to make a hugepage available.  
> 
> I'm glad you said it six or seven times now, because you forgot to
> mention in the above email that the "40% allocation/fault latency
> regression" you reported above, is actually a secondary concern because
> those must be long lived allocations and we can't yet generate
> compound pages for free after all..
> 

I've been referring to the long history of this discussion, namely my 
explicit Nacked-by in https://marc.info/?l=linux-kernel&m=153868420126775 
two months ago stating the 13.9% access latency regression.  The patch was 
nonetheless still merged and I proposed the revert for the same chief 
complaint, and it was reverted.

I brought up the access latency issue three months ago in
https://marc.info/?l=linux-kernel&m=153661012118046 and said allocation 
latency was a secondary concern, specifically that our users of 
MADV_HUGEPAGE are willing to accept the increased allocation latency for 
local hugepages.
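
As a rough illustration of what such a MADV_HUGEPAGE user does from 
userspace, here is a minimal Python sketch (not code from our workloads).  
The helper name map_with_hugepage_hint and the 2 MiB huge page size are 
assumptions made for the example; mmap.madvise() and mmap.MADV_HUGEPAGE 
need Python 3.8+ on Linux.  The hint only marks the range: the hugepage 
allocation, and whatever compaction the defrag setting allows, happens 
when the memory is first touched.

```python
import mmap

# Assumption: 2 MiB huge pages (the x86-64 default); other platforms differ.
HPAGE_SIZE = 2 * 1024 * 1024


def map_with_hugepage_hint(length):
    """Map private anonymous memory and hint that it should use hugepages.

    Hypothetical helper for illustration. Returns (region, advised), where
    advised is False if the hint was unavailable or rejected by the kernel.
    """
    # Round the length up to a whole number of (assumed) huge pages.
    length = (length + HPAGE_SIZE - 1) // HPAGE_SIZE * HPAGE_SIZE
    region = mmap.mmap(-1, length,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

    advised = False
    madv_hugepage = getattr(mmap, "MADV_HUGEPAGE", None)
    if madv_hugepage is not None:
        try:
            # Only a hint: the allocation itself is attempted at fault time.
            region.madvise(madv_hugepage)
            advised = True
        except OSError:
            # For example, a kernel built without transparent hugepages.
            pass
    return region, advised
```

Note that Python gives no control over the alignment of the mapping, so 
whether the range can actually be backed by hugepages is left to the 
kernel.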

> BTW, I never bothered to ask yet, but, did you enable NUMA balancing
> in your benchmarks? NUMA balancing would fix the access latency very
> easily too, so that 13.9% access latency must quickly disappear if you
> correctly have NUMA balancing enabled in a NUMA system.
> 

No, we do not have CONFIG_NUMA_BALANCING enabled.  The __GFP_THISNODE 
behavior for hugepages was added in 4.0 for the PPC usecase, not by me.  
That had nothing to do with the madvise mode: the initial documentation 
referred to the mode as a way to prevent an increase in rss for configs 
where "enabled" was set to madvise.  The allocation policy was never about 
MADV_HUGEPAGE in any 4.x kernel, it was only an indication for certain 
defrag settings to determine how much work should be done to allocate 
*local* hugepages at fault.
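
For anyone wanting to check which "enabled" and "defrag" settings a given 
host is running with, the active value is the bracketed token in the 
files under /sys/kernel/mm/transparent_hugepage.  A small Python sketch 
follows; the function names are made up for the example, and it says 
nothing about the allocation policy itself.

```python
THP_SYSFS = "/sys/kernel/mm/transparent_hugepage"


def active_thp_mode(line):
    """Return the bracketed (active) token of a THP sysfs line, or None."""
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_setting(name, base=THP_SYSFS):
    """Read the active mode of a THP setting such as 'enabled' or 'defrag'.

    Returns None if the file cannot be read (for example, no THP support).
    """
    try:
        with open(f"{base}/{name}") as f:
            return active_thp_mode(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    for setting in ("enabled", "defrag"):
        print(setting, "=", read_thp_setting(setting))
```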

If you are saying that the change in allocator policy was made in a patch 
from Aneesh almost four years ago and has gone unreported by anybody up until a 
few months ago, I can understand the frustration.  I do, however, support 
the __GFP_THISNODE change he made because his data shows the same results 
as mine.

I've suggested a very simple extension, specifically a prctl() mode that 
is inherited across fork, that would allow a workload to specify that it 
prefers remote allocations over local compaction/reclaim because it is too 
large to fit on a single node.  I'd value your feedback for that 
suggestion to fix your usecase.
