Re: [PATCH v6 00/18] Transparent Contiguous PTEs for User Mappings

2024-02-15  Mark Rutland
On Thu, Feb 15, 2024 at 10:31:47AM +, Ryan Roberts wrote:
> Hi All,
> 
> This is a series to opportunistically and transparently use contpte mappings
> (set the contiguous bit in ptes) for user memory when those mappings meet the
> requirements. The change benefits arm64, but there is some (very) minor
> refactoring for x86 to enable its integration with core-mm.

I've looked over each of the arm64-specific patches, and those all seem good to
me. I've thrown my local Syzkaller instance at the series, and I'll shout if
that hits anything that's not clearly a latent issue prior to this series.

The other bits also look good to me, so FWIW, for the series as a whole:

Acked-by: Mark Rutland 

Mark.


[PATCH v6 00/18] Transparent Contiguous PTEs for User Mappings

2024-02-15  Ryan Roberts
Hi All,

This is a series to opportunistically and transparently use contpte mappings
(set the contiguous bit in ptes) for user memory when those mappings meet the
requirements. The change benefits arm64, but there is some (very) minor
refactoring for x86 to enable its integration with core-mm.

It is part of a wider effort to improve performance by allocating and mapping
variable-sized blocks of memory (folios). One aim is for the 4K kernel to
approach the performance of the 16K kernel, but without breaking compatibility
and without the associated increase in memory use. Another aim is to benefit the 16K
and 64K kernels by enabling 2M THP, since this is the contpte size for those
kernels. We have good performance data that demonstrates both aims are being met
(see below).

Of course this is only one half of the change. We require the mapped physical
memory to be the correct size and alignment for this to actually be useful (i.e.
64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs, ...) will
allocate large folios up to the PMD size today, and more filesystems are coming.
And for anonymous memory, "multi-size THP" is now upstream.
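
As a rough illustration of the mechanism (my sketch, not code from the
series), a block of ptes is a candidate for the contiguous bit when the
virtual address and the mapped pfn are both aligned to the contpte block and
every entry in the block maps consecutive pfns with identical attributes.
CONT_PTES, CONT_PTE_MASK, pte_valid() and pte_pfn() are real arm64
definitions (see asm/pgtable-hwdef.h); contpte_block_aligned() is a made-up
name, and the real checks are more involved:

static inline bool contpte_block_aligned(unsigned long addr, pte_t pte)
{
	/*
	 * The VA and PA must both sit on a contpte boundary (64K for 4K
	 * pages, 2M for 16K/64K pages) before the CONT_PTES entries can
	 * be treated by the hardware as a single translation.
	 */
	return pte_valid(pte) &&
	       (addr & ~CONT_PTE_MASK) == 0 &&
	       (pte_pfn(pte) & (CONT_PTES - 1)) == 0;
}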


Patch Layout
============

In this version, I've split the patches to better show each optimization:

  - 1-2:    mm prep: misc code and docs cleanups
  - 3-6:    mm,arm64,x86 prep: Add pte_advance_pfn() and make pte_next_pfn() a
            generic wrapper around it (see the sketch after this list)
  - 7-11:   arm64 prep: Refactor ptep helpers into new layer
  - 12:     functional contpte implementation
  - 13-18:  various optimizations on top of the contpte implementation
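
For reference, the core-mm shape of patches 3-6 is roughly the following
(a simplified sketch; the real patches also deal with per-arch overrides).
pte_next_pfn() becomes a thin generic wrapper around the new
pte_advance_pfn():

#ifndef pte_advance_pfn
static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
{
	/* Advance the pfn encoded in the pte by nr pages. */
	return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
}
#endif

#ifndef pte_next_pfn
#define pte_next_pfn(pte)	pte_advance_pfn(pte, 1)
#endif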


Testing
=======

I've tested this series on both Ampere Altra (bare metal) and Apple M2 (VM):
  - mm selftests (inc new tests written for multi-size THP); no regressions
  - Speedometer JavaScript benchmark in the Chromium web browser; no issues
  - Kernel compilation; no issues
  - Various tests under high memory pressure with swap enabled; no issues


Performance
===========

High Level Use Cases
--------------------

First some high-level use cases (kernel compilation and Speedometer JavaScript
benchmarks). These were run on Ampere Altra (I've seen similar improvements on
Android/Pixel 6).

baseline:                   mm-unstable (mTHP switched off)
mTHP:                       + enable 16K, 32K, 64K mTHP sizes "always"
mTHP + contpte:             + this series
mTHP + contpte + exefolio:  + patch at [6], which this series supports

Kernel Compilation with -j8 (negative is faster):

| kernel| real-time | kern-time | user-time |
|---|---|---|---|
| baseline  |  0.0% |  0.0% |  0.0% |
| mTHP  | -5.0% |-39.1% | -0.7% |
| mTHP + contpte| -6.0% |-41.4% | -1.5% |
| mTHP + contpte + exefolio | -7.8% |-43.1% | -3.4% |

Kernel Compilation with -j80 (negative is faster):

| kernel| real-time | kern-time | user-time |
|---|---|---|---|
| baseline  |  0.0% |  0.0% |  0.0% |
| mTHP  | -5.0% |-36.6% | -0.6% |
| mTHP + contpte| -6.1% |-38.2% | -1.6% |
| mTHP + contpte + exefolio | -7.4% |-39.2% | -3.2% |

Speedometer (positive is faster):

| kernel| runs_per_min |
|:--|--|
| baseline  | 0.0% |
| mTHP  | 1.5% |
| mTHP + contpte| 3.2% |
| mTHP + contpte + exefolio | 4.5% |


Micro Benchmarks
----------------

The following microbenchmarks are intended to demonstrate that the performance
of fork() and munmap() does not regress. I'm showing results for order-0 (4K)
mappings, and for order-9 (2M) PTE-mapped THP. Thanks to David for sharing his
benchmarks.

baseline:       mm-unstable + batch zap [7] series
contpte-basic:  + patches 0-19; functional contpte implementation
contpte-batch:  + patches 20-23; implement new batched APIs (sketched below)
contpte-inline: + patch 24; __always_inline to help compiler
contpte-fold:   + patch 25; fold contpte mapping when sensible
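
To show what the batched APIs buy us, here is a sketch of the idea (my
illustration, not code from the series): paths like fork()'s copy loop
operate on a whole run of ptes mapping consecutive pfns in one call, so
arm64 can manage the contiguous bit once per block instead of once per pte.
wrprotect_ptes() is one of the real batched helpers; pte_batch_len() is a
hypothetical stand-in for the batch-detection logic:

static void sketch_wrprotect_range(struct mm_struct *mm, unsigned long addr,
				   unsigned long end, pte_t *ptep)
{
	unsigned int nr;

	for (; addr < end; addr += nr * PAGE_SIZE, ptep += nr) {
		/* How many consecutive ptes can be handled as one unit? */
		nr = pte_batch_len(ptep, addr, end);	/* hypothetical */
		/* One call covers the whole batch. */
		wrprotect_ptes(mm, addr, ptep, nr);
	}
}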

The primary platform is Ampere Altra bare metal. I'm also showing results for
an M2 VM (on top of macOS) for reference, although experience suggests this
might not be the most reliable platform for performance numbers of this sort:

| FORK           |     order-0    |     order-9    |
| Ampere Altra   |----------------|----------------|
| (pte-map)      |   mean | stdev |   mean | stdev |
|----------------|--------|-------|--------|-------|
| baseline       |   0.0% |  2.7% |   0.0% |  0.2% |
| contpte-basic  |   6.3% |  1.4% |1948.7% |  0.2% |
|