Re: [PATCH v3 0/2] arm64/mm: Enable color zero pages

2020-09-28 Thread Gavin Shan


Hi Catalin,

On 9/29/20 1:22 AM, Catalin Marinas wrote:
> On Mon, Sep 28, 2020 at 05:22:54PM +1000, Gavin Shan wrote:
>> Testing
>> =======
>> [1] The experiment reveals how heavily (L1) data cache misses impact
>>     the overall application's performance. The machine where the test
>>     is carried out has the following L1 data cache topology, and the
>>     host kernel has the following configuration.
>>
>>     The test case allocates contiguous page frames through HugeTLBfs
>>     and reads 4 bytes of data from the same offset (0x0) in each of
>>     these (N) contiguous page frames. N is equal to 8 or 9 in the
>>     following two test cases. This is repeated one million times.
>>
>>     Note that 8 is the number of L1 data cache ways. The experiment is
>>     meant to cause L1 cache thrashing on one particular set.
>>
>>     Host:      CONFIG_ARM64_PAGE_SHIFT=12
>>                DEFAULT_HUGE_PAGE_SIZE=2MB
>>     L1 dcache: cache-line-size=64
>>                number-of-sets=64
>>                number-of-ways=8
>>
>>                             N=8           N=9
>>     ------------------------------------------------------------
>>     cache-misses:           43,429        9,038,460
>>     L1-dcache-load-misses:  43,429        9,038,460
>>     seconds time elapsed:   0.299206372   0.722253140   (2.41 times)
>>
>> [2] The experiment should have been carried out on a machine where the
>>     L1 data cache capacity of one particular way is larger than 4KB.
>>     However, I'm unable to find such a machine, so I have to
>>     evaluate the performance impact caused by L2 data cache thrashing
>>     instead. The experiment is carried out on a machine with the
>>     following L1/L2 data cache topology. The host kernel configuration
>>     is the same as in [1].
>>
>>     The corresponding test program allocates contiguous page frames
>>     through HugeTLBfs and builds VMAs backed by zero pages. The
>>     contiguous pages are read sequentially from a fixed offset (0) in
>>     steps of 32KB, 8 times. After that, the VMAs backed by zero pages
>>     are read sequentially in steps of 4KB, once. This is repeated
>>     8 million times.
>>
>>     Note that 32KB is the capacity of one L2 data cache way and 8 is
>>     the number of L2 data cache ways. This experiment is meant to cause
>>     L2 data cache thrashing on one particular set.
>>
>>     L1 dcache:
>>     L2 dcache:  cache-line-size=64
>>                 number-of-sets=512
>>                 number-of-ways=8
>>
>>     ---
>>     cache-references:       1,427,213,737   1,421,394,472
>>     cache-misses:           35,804,552      42,636,698
>>     L1-dcache-load-misses:  35,804,552      42,636,698
>>     seconds time elapsed:   2.602511671     2.098198172  (+19.3%)
>
> No-one is denying a performance improvement in a very specific way but
> what's missing here is explaining how these artificial benchmarks relate
> to real-world applications.



Thanks for your comments. It depends on how often the zero pages are
read. The idea is to distribute reads of the zero page(s) across
multiple cache sets. Otherwise, the cache sets corresponding to the
zero page(s) carry more load and are prone to cache thrashing,
depending on the workload pattern though.
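
The distribution idea can be sketched as follows. This is a minimal illustration, not code from the series: it assumes the zero page is selected from the faulting virtual address with a power-of-two mask, which is how I understand other architectures with colored zero pages to do it. The names `zero_page_color`, `nr_colors` and the sample addresses are hypothetical.

```python
PAGE_SHIFT = 12                # 4KB pages, matching CONFIG_ARM64_PAGE_SHIFT=12
PAGE_SIZE = 1 << PAGE_SHIFT


def zero_page_color(vaddr: int, nr_colors: int) -> int:
    """Return the index of the zero page that would back `vaddr`.

    Assumption: nr_colors is a power of two, so a simple mask works.
    """
    assert nr_colors > 0 and nr_colors & (nr_colors - 1) == 0
    return (vaddr >> PAGE_SHIFT) & (nr_colors - 1)


# Eight consecutive virtual pages (hypothetical addresses).
vaddrs = [0x7F0000000000 + i * PAGE_SIZE for i in range(8)]

# One zero page: every read lands on the same physical page.
single = [zero_page_color(v, 1) for v in vaddrs]

# Eight colored zero pages: neighbouring virtual pages use different pages.
colored = [zero_page_color(v, 8) for v in vaddrs]

print(single)
print(colored)
```

With a single zero page every virtual page maps to index 0, while with eight colors the indices cycle through all eight pages, so the reads are spread over more cache sets.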

As discussed on v1, there are two use cases in the kernel code: (1)
/proc/vmcore and (2) DAX. (1) is only valid on x86, where the
non-RAM-resident pages are mapped and backed by zero page(s). For (2),
I was expecting to set up xfs and DAX on brd (the RAM block device).
Unfortunately, DAX support for brd was removed two years ago, so
I'm unable to enable xfs and DAX on it. DAX is only supported on
limited hardware, and I don't have any around.

   # mknod /dev/ramdisk b 1 20
   # mkfs.xfs /dev/ramdisk
   # mkdir -p /tmp/ramdisk
   # mount -txfs -odax /dev/ramdisk /tmp/ramdisk
   # dmesg | tail -n 4
   [ 3721.848830] brd: module loaded
   [ 3772.015934] XFS (ram20): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
   [ 3772.023423] XFS (ram20): DAX unsupported by block device. Turning off DAX.
   [ 3772.030285] XFS (ram20): DAX and reflink cannot be used together!

The feature just needs a couple of extra pages, so that wouldn't be a
concern. However, the caching behavior for reads of the zero page(s)
changes because the cache lines for the zero pages are distributed
across sets. The effect depends on how frequently the zero page(s) are
accessed. Also, I tried building the kernel image and no performance
change was detected.

   command:           make -j 80 clean; time make -j 80
                      (executed 3 times)
   without the patch: 3m29.084s 3m29.265s 3m30.806s
   with the patch:    3m28.954s 3m29.819s 3m30.180s
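
As a quick sanity check on the build timings, the averages and the run-to-run spread can be computed like this. It is only a sketch over the six numbers quoted above; the helper `to_seconds` is a name introduced for illustration.

```python
from statistics import mean


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style string such as '3m29.084s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


without_patch = [to_seconds(t) for t in ("3m29.084s", "3m29.265s", "3m30.806s")]
with_patch = [to_seconds(t) for t in ("3m28.954s", "3m29.819s", "3m30.180s")]

# Difference of the means, and the spread between runs of the same kernel.
delta = mean(with_patch) - mean(without_patch)
relative = delta / mean(without_patch) * 100
spread = max(without_patch) - min(without_patch)

print(f"mean without: {mean(without_patch):.3f}s")
print(f"mean with:    {mean(with_patch):.3f}s")
print(f"delta:        {delta:+.3f}s ({relative:+.3f}%)")
print(f"spread:       {spread:.3f}s")
```

The difference between the two averages is smaller than the spread between runs of the same kernel, which is consistent with "no performance change detected".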

Cheers,
Gavin

Re: [PATCH v3 0/2] arm64/mm: Enable color zero pages

2020-09-28 Thread Catalin Marinas

Hi Gavin,

On Mon, Sep 28, 2020 at 05:22:54PM +1000, Gavin Shan wrote:
> Testing
> =======
> [1] The experiment reveals how heavily (L1) data cache misses impact
>     the overall application's performance. The machine where the test
>     is carried out has the following L1 data cache topology, and the
>     host kernel has the following configuration.
>
>     The test case allocates contiguous page frames through HugeTLBfs
>     and reads 4 bytes of data from the same offset (0x0) in each of
>     these (N) contiguous page frames. N is equal to 8 or 9 in the
>     following two test cases. This is repeated one million times.
>
>     Note that 8 is the number of L1 data cache ways. The experiment is
>     meant to cause L1 cache thrashing on one particular set.
>
>     Host:      CONFIG_ARM64_PAGE_SHIFT=12
>                DEFAULT_HUGE_PAGE_SIZE=2MB
>     L1 dcache: cache-line-size=64
>                number-of-sets=64
>                number-of-ways=8
>
>                             N=8           N=9
>     ------------------------------------------------------------
>     cache-misses:           43,429        9,038,460
>     L1-dcache-load-misses:  43,429        9,038,460
>     seconds time elapsed:   0.299206372   0.722253140   (2.41 times)
> 
> [2] The experiment should have been carried out on a machine where the
>     L1 data cache capacity of one particular way is larger than 4KB.
>     However, I'm unable to find such a machine, so I have to
>     evaluate the performance impact caused by L2 data cache thrashing
>     instead. The experiment is carried out on a machine with the
>     following L1/L2 data cache topology. The host kernel configuration
>     is the same as in [1].
>
>     The corresponding test program allocates contiguous page frames
>     through HugeTLBfs and builds VMAs backed by zero pages. The
>     contiguous pages are read sequentially from a fixed offset (0) in
>     steps of 32KB, 8 times. After that, the VMAs backed by zero pages
>     are read sequentially in steps of 4KB, once. This is repeated
>     8 million times.
>
>     Note that 32KB is the capacity of one L2 data cache way and 8 is
>     the number of L2 data cache ways. This experiment is meant to cause
>     L2 data cache thrashing on one particular set.
>
>     L1 dcache:
>     L2 dcache:  cache-line-size=64
>                 number-of-sets=512
>                 number-of-ways=8
>
>     ---
>     cache-references:       1,427,213,737   1,421,394,472
>     cache-misses:           35,804,552      42,636,698
>     L1-dcache-load-misses:  35,804,552      42,636,698
>     seconds time elapsed:   2.602511671     2.098198172  (+19.3%)

No-one is denying a performance improvement in a very specific way but
what's missing here is explaining how these artificial benchmarks relate
to real-world applications.

-- 
Catalin

[PATCH v3 0/2] arm64/mm: Enable color zero pages

2020-09-28 Thread Gavin Shan

Color zero pages aren't enabled on arm64, meaning all read-only
(anonymous) VM areas are backed by the same zero page. This puts
pressure on the data cache when reading data from them. In the extreme
case, a single data cache set could experience high pressure and
thrashing. This series enables color zero pages to resolve the issue.

PATCH[1/2] decouples the zero PGD table from zero page
PATCH[2/2] allocates the needed zero pages according to L1 cache size
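
To illustrate how the number of zero pages could follow from the cache geometry, here is a small sketch. It assumes (this is my reading, not something the patch description states) that one zero page is wanted per page-sized slice of a cache way, i.e. way size divided by page size. The function name `nr_zero_page_colors` is hypothetical.

```python
def nr_zero_page_colors(line_size: int, nr_sets: int, page_size: int = 4096) -> int:
    """Number of page-sized slices in one cache way (at least one).

    Assumption: one zero page per slice is enough to cover every set.
    """
    way_size = line_size * nr_sets
    return max(1, way_size // page_size)


# L1 dcache geometry from experiment [1]: 64-byte lines, 64 sets.
l1_colors = nr_zero_page_colors(64, 64)

# L2 dcache geometry from experiment [2]: 64-byte lines, 512 sets.
l2_colors = nr_zero_page_colors(64, 512)

print(l1_colors, l2_colors)
```

Under this assumption a cache whose way size equals the page size yields a single color, which would explain why experiment [2] asks for a way larger than 4KB.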

Testing
=======
[1] The experiment reveals how heavily (L1) data cache misses impact
    the overall application's performance. The machine where the test
    is carried out has the following L1 data cache topology, and the
    host kernel has the following configuration.

    The test case allocates contiguous page frames through HugeTLBfs
    and reads 4 bytes of data from the same offset (0x0) in each of
    these (N) contiguous page frames. N is equal to 8 or 9 in the
    following two test cases. This is repeated one million times.

    Note that 8 is the number of L1 data cache ways. The experiment is
    meant to cause L1 cache thrashing on one particular set.

    Host:      CONFIG_ARM64_PAGE_SHIFT=12
               DEFAULT_HUGE_PAGE_SIZE=2MB
    L1 dcache: cache-line-size=64
               number-of-sets=64
               number-of-ways=8

                            N=8           N=9
    ------------------------------------------------------------
    cache-misses:           43,429        9,038,460
    L1-dcache-load-misses:  43,429        9,038,460
    seconds time elapsed:   0.299206372   0.722253140   (2.41 times)
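
The jump in misses between N=8 and N=9 can be reproduced with a toy model of the L1 geometry above. This is a sketch only: it assumes true LRU replacement (real hardware may use pseudo-LRU or random replacement) and uses a reduced iteration count; `count_misses` is a name made up for the illustration.

```python
from collections import OrderedDict

LINE_SIZE, NR_SETS, NR_WAYS = 64, 64, 8   # L1 dcache topology from [1]
PAGE_SIZE = 4096                          # CONFIG_ARM64_PAGE_SHIFT=12


def count_misses(nr_pages: int, iterations: int) -> int:
    """Read offset 0x0 of `nr_pages` contiguous pages, `iterations` times."""
    sets = [OrderedDict() for _ in range(NR_SETS)]   # one LRU list per set
    misses = 0
    for _ in range(iterations):
        for page in range(nr_pages):
            line = (page * PAGE_SIZE) // LINE_SIZE   # cache line number
            lru = sets[line % NR_SETS]               # always set 0 here
            if line in lru:
                lru.move_to_end(line)                # hit: mark most recent
            else:
                misses += 1
                if len(lru) == NR_WAYS:
                    lru.popitem(last=False)          # evict least recent
                lru[line] = True
    return misses


misses_n8 = count_misses(8, 1000)
misses_n9 = count_misses(9, 1000)
print(misses_n8, misses_n9)
```

In this model N=8 only takes the eight cold misses, while N=9 misses on every access because the ninth line always evicts the line that is needed next. Scaled to one million iterations that is about nine million misses, in line with the 9,038,460 measured.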

[2] The experiment should have been carried out on a machine where the
    L1 data cache capacity of one particular way is larger than 4KB.
    However, I'm unable to find such a machine, so I have to
    evaluate the performance impact caused by L2 data cache thrashing
    instead. The experiment is carried out on a machine with the
    following L1/L2 data cache topology. The host kernel configuration
    is the same as in [1].

    The corresponding test program allocates contiguous page frames
    through HugeTLBfs and builds VMAs backed by zero pages. The
    contiguous pages are read sequentially from a fixed offset (0) in
    steps of 32KB, 8 times. After that, the VMAs backed by zero pages
    are read sequentially in steps of 4KB, once. This is repeated
    8 million times.

    Note that 32KB is the capacity of one L2 data cache way and 8 is
    the number of L2 data cache ways. This experiment is meant to cause
    L2 data cache thrashing on one particular set.

    L1 dcache:
    L2 dcache:  cache-line-size=64
                number-of-sets=512
                number-of-ways=8

    ---
    cache-references:       1,427,213,737   1,421,394,472
    cache-misses:           35,804,552      42,636,698
    L1-dcache-load-misses:  35,804,552      42,636,698
    seconds time elapsed:   2.602511671     2.098198172  (+19.3%)
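
The access pattern in [2] can be checked against the L2 geometry with a few lines of arithmetic. This is a sketch under stated assumptions: the cache is physically indexed, the huge page and the zero pages start at way-size-aligned physical addresses, and the colored zero pages are physically contiguous. The base addresses and the name `set_index` are hypothetical.

```python
LINE_SIZE, NR_SETS, NR_WAYS = 64, 512, 8   # L2 dcache topology from [2]
WAY_SIZE = LINE_SIZE * NR_SETS             # 32KB per way
PAGE_SIZE = 4096


def set_index(paddr: int) -> int:
    """Cache set selected by a physical address."""
    return (paddr // LINE_SIZE) % NR_SETS


huge_base = 0x40000000   # hypothetical, 2MB-aligned huge page
zero_base = 0x80000000   # hypothetical, way-size-aligned zero page(s)

# Eight reads in steps of 32KB: all of them fall into one set.
hot_sets = {set_index(huge_base + i * WAY_SIZE) for i in range(NR_WAYS)}

# Eight zero-page reads with a single zero page versus eight colored ones.
single_sets = {set_index(zero_base) for _ in range(8)}
colored_sets = {set_index(zero_base + i * PAGE_SIZE) for i in range(8)}

print(sorted(hot_sets), sorted(single_sets), sorted(colored_sets))
```

Under these assumptions the 32KB-stride reads fill all eight ways of one set. A single zero page sends every zero-page read to one set as well, whereas eight colored pages spread them over eight sets, 64 sets apart.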

Changes since v2:

   * Rebased to latest upstream kernel (5.9-rc6)     (Gavin)
   * Improved commit log                             (Gavin)
   * Provide performance data in the cover letter    (Catalin)


Gavin Shan (2):
  arm64/mm: Introduce zero PGD table
  arm64/mm: Enable color zero pages

 arch/arm64/include/asm/cache.h       |  3 ++
 arch/arm64/include/asm/mmu_context.h |  6 +--
 arch/arm64/include/asm/pgtable.h     | 11 -
 arch/arm64/kernel/cacheinfo.c        | 67
 arch/arm64/kernel/setup.c            |  2 +-
 arch/arm64/kernel/vmlinux.lds.S      |  4 ++
 arch/arm64/mm/init.c                 | 37 +++
 arch/arm64/mm/mmu.c                  |  7 ---
 arch/arm64/mm/proc.S                 |  2 +-
 drivers/base/cacheinfo.c             |  3 +-
 include/linux/cacheinfo.h            |  6 +++
 11 files changed, 132 insertions(+), 16 deletions(-)

-- 
2.23.0
