On Fri, 2019-08-02 at 10:28 -0700, Alexander Duyck wrote:
> On Fri, 2019-08-02 at 12:19 -0400, Nitesh Narayan Lal wrote:
> > On 8/2/19 11:13 AM, Alexander Duyck wrote:
> > > On Fri, 2019-08-02 at 10:41 -0400, Nitesh Narayan Lal wrote:
> > > > On 8/1/19 6:24 PM, Alexander Duyck wrote:
> > > > > 

<snip>

> > > > > One side effect of these patches is that the guest becomes much more
> > > > > resilient in terms of NUMA locality. With the pages being freed and
> > > > > then reallocated when used it allows for the pages to be much closer
> > > > > to the active thread, and as a result there can be situations where
> > > > > this patch set will out-perform the stock kernel when the guest
> > > > > memory is not local to the guest vCPUs.
> > > > Was this the reason you were seeing better results for page_fault1
> > > > earlier?
> > > Yes, I think so. What I have found is that in the case where the
> > > patches are not applied on the guest it takes a few runs for the numbers
> > > to stabilize. What I think was going on is that I was running memhog to
> > > initially fill the guest, and that was placing all the pages on one node
> > > or the other, which caused additional variability as the pages were
> > > slowly migrated over to the other node to rebalance the workload. One
> > > way I tested it was by trying the unpatched case with a direct-assigned
> > > device, since that forces the memory to be pinned. In that case I was
> > > consistently getting bad results, as all the memory was forced to come
> > > from one node during the pre-allocation process.
> > > 
> > 
> > I have also seen that the page_fault1 values take some time to
> > stabilize on an unmodified kernel.
> > What I am wondering is whether, on a single-NUMA guest, doing the
> > following would give a better picture:
> > 
> > 1. Pin the guest to a single NUMA node.
> > 2. Run memhog so that it touches all the guest memory.
> > 3. Run will-it-scale/page_fault1.
> > 
> > Compare/observe the values for the last core (assuming the other core
> > values don't differ drastically).
> 
> I'll rerun the test with qemu affinitized to one specific socket. It will
> cut the core/thread count down to 8/16 on my test system. Also I will try
> with THP and page shuffling enabled.
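
For reference, the prefill in step 2 of that procedure only needs to be a
single pass that writes to every page so all of guest memory is actually
populated. A minimal sketch of that kind of memhog-style touch (the 4 GB
size below is just a placeholder; a real run should cover essentially all
of the guest's memory):

/*
 * Touch-all-memory prefill sketch: map a large anonymous region and
 * write to every page once so it is populated before the benchmark.
 * SIZE is a placeholder and should be adjusted to the guest size.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE (4UL * 1024 * 1024 * 1024)	/* 4 GB, adjust to guest memory */

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (unsigned long off = 0; off < SIZE; off += page)
		buf[off] = 1;	/* fault in and dirty every page */

	puts("prefill done");
	return 0;
}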

Okay, results for page_fault1 with 8 cores/16 threads all affinitized to
one socket, THP enabled, and page shuffling enabled:

With page reporting disabled in the hypervisor there wasn't much
difference. I saw a range of 0.69% to -1.35% versus baseline, with an
average improvement of 0.16%, so effectively no change.

With page reporting enabled I saw a range of -2.10% to -4.50%, with an
average regression of -3.05%. This is much closer to what I would expect
for this patch set, since the page faulting, the double zeroing (once in
the host and once in the guest), and the hinting process itself should all
add some overhead.
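
For context, page_fault1's workers are basically a tight loop of map ->
touch every page -> unmap, so every pass allocates fresh anonymous pages.
A rough sketch of that pattern (the 128 MB chunk size and pass count are
assumptions for illustration, not the exact will-it-scale parameters):

/*
 * page_fault1-style worker sketch: repeatedly map an anonymous region,
 * write one byte per page to fault everything in, then unmap it again.
 * With page reporting active, each unmap lets the guest report the freed
 * pages and the host discard them, so the next pass pays for the extra
 * faulting and zeroing work described above.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK  (128UL * 1024 * 1024)	/* size of each mapping */
#define PASSES 100			/* fixed count instead of a timed run */

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned long touched = 0;

	for (int pass = 0; pass < PASSES; pass++) {
		char *buf = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* one write per page => one anonymous page fault per page */
		for (unsigned long off = 0; off < CHUNK; off += page) {
			buf[off] = 1;
			touched++;
		}

		munmap(buf, CHUNK);	/* frees the pages back to the guest */
	}

	printf("touched %lu pages\n", touched);
	return 0;
}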
