Re: [PATCH] x86: Accelerate copy_page with non-temporal in X86

2021-04-13 Thread Kemeng Shi



on 2021/4/13 22:53, Borislav Petkov wrote:
> I thought "should be better" too last time when I measured rep; movs vs
> NT stores but actual measurements showed no real difference.
Maybe the NT stores make a difference when storing to slow DIMMs, like the
persistent memory I just tested. Also, they likely reduce unnecessary cache
loads and flushes, which benefits running processes that have data cached.
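
For readers following along, the kind of copy being discussed can be sketched
in userspace C. This is an illustration only, not the kernel code from the
patch: copy_page_nt_sketch and SKETCH_PAGE_SIZE are hypothetical names, the
_mm_stream_si64 intrinsic (which compiles to movnti on x86-64) stands in for
the hand-written assembly, and the trailing _mm_sfence() is an assumption
about how a caller would order the weakly-ordered stores. It builds only for
x86-64.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_stream_si64, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical userspace analogue of an NT-store page copy (x86-64 only). */
static void copy_page_nt_sketch(void *dst, const void *src)
{
    /* The 8-byte accesses below assume both buffers are 8-byte aligned. */
    assert(((uintptr_t)dst % 8) == 0 && ((uintptr_t)src % 8) == 0);

    long long *d = dst;
    const long long *s = src;

    /* Ordinary loads, non-temporal stores: the store carries a hint that
     * the destination line should not be brought into the cache. */
    for (size_t i = 0; i < SKETCH_PAGE_SIZE / sizeof(long long); i++)
        _mm_stream_si64(&d[i], s[i]);

    /* NT stores are weakly ordered; fence before others rely on the data. */
    _mm_sfence();
}
```

Whether this beats a plain memcpy() or "rep movsq" depends on the destination
memory and on whether the copied data is read again soon, which is exactly
what the measurements in this thread disagree about.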

-- 
Best wishes
Kemeng Shi


Re: Re: [PATCH] x86: Accelerate copy_page with non-temporal in X86

2021-04-13 Thread Borislav Petkov
On Tue, Apr 13, 2021 at 08:54:55PM +0800, Kemeng Shi wrote:
> Yes. And NT stores should be better for copy_page, especially when copying a
> lot of pages, as only part of each copied page will be accessed soon.

I thought "should be better" too last time when I measured rep; movs vs
NT stores but actual measurements showed no real difference.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re:Re: [PATCH] x86: Accelerate copy_page with non-temporal in X86

2021-04-13 Thread Kemeng Shi



on 2021/4/13 19:01, Borislav Petkov wrote:
> + linux-nvdimm
> 
> Original mail at 
> https://lkml.kernel.org/r/3f28adee-8214-fa8e-b368-eaf8b1934...@huawei.com
> 
> On Tue, Apr 13, 2021 at 02:25:58PM +0800, Kemeng Shi wrote:
>> I'm using AEP with the dax_kmem driver, and AEP is exported as a NUMA node in
> 
> What is AEP?
> 
AEP is a type of persistent memory produced by Intel. It's slower than
normal memory but is persistent.
>> my system. I move cold pages from the DRAM node to the AEP node with the
>> move_pages system call. With the old "rep movsq", it takes 2030 ms to move
>> 1 GB of pages. With "movnti", it takes only about 890 ms.
> 
> So there's __copy_user_nocache() which does NT stores.
> 
>> -ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
>> +ALTERNATIVE_2 "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD, \
>> +  "jmp copy_page_nt", X86_FEATURE_XMM2
> 
> This makes every machine which has sse2 do NT stores now. Which means
> *every* machine practically.
> 
Yes. And NT stores should be better for copy_page, especially when copying a
lot of pages, as only part of each copied page will be accessed soon.
> The folks on linux-nvdimm@ should be able to give you a better idea what
> to do.
> 
> HTH.
> 
Thanks for the response and the help.
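
The workload described in the quoted text (migrating pages of the calling
process to another NUMA node) could look roughly like the sketch below. It is
not taken from the original test program, which was not posted: list_pages and
migrate_to_node are hypothetical helper names, the raw syscall() form is used
only to avoid depending on libnuma, and the target node number is whatever the
AEP node happens to be on a given system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)   /* same value as in <linux/mempolicy.h> */
#endif

/* Fill pages[] with the address of every page in [base, base + len). */
static size_t list_pages(void *base, size_t len, size_t page_size,
                         void **pages, size_t max_pages)
{
    size_t n = 0;

    for (size_t off = 0; off < len && n < max_pages; off += page_size)
        pages[n++] = (char *)base + off;
    return n;
}

/*
 * Ask the kernel to migrate pages of the calling process to target_node.
 * Returns the raw move_pages(2) result: 0 on success, a positive number of
 * pages that were not migrated, or -1 on error. status[i] receives the node
 * each page ended up on, or a negative errno value for that page.
 */
static long migrate_to_node(void **pages, size_t count, int target_node,
                            int *status)
{
    int *nodes = malloc(count * sizeof(*nodes));

    if (!nodes)
        return -1;
    for (size_t i = 0; i < count; i++)
        nodes[i] = target_node;

    long ret = syscall(SYS_move_pages, 0 /* pid 0 = self */,
                       (unsigned long)count, pages, nodes, status,
                       MPOL_MF_MOVE);
    free(nodes);
    return ret;
}
```

A caller would touch the buffer first so the pages exist, build the list with
list_pages(), and time the migrate_to_node() call; pages that were never
faulted in are reported through status[] rather than moved.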


Re: [PATCH] x86: Accelerate copy_page with non-temporal in X86

2021-04-13 Thread Borislav Petkov
+ linux-nvdimm

Original mail at 
https://lkml.kernel.org/r/3f28adee-8214-fa8e-b368-eaf8b1934...@huawei.com

On Tue, Apr 13, 2021 at 02:25:58PM +0800, Kemeng Shi wrote:
> I'm using AEP with the dax_kmem driver, and AEP is exported as a NUMA node in

What is AEP?

> my system. I move cold pages from the DRAM node to the AEP node with the
> move_pages system call. With the old "rep movsq", it takes 2030 ms to move
> 1 GB of pages. With "movnti", it takes only about 890 ms.

So there's __copy_user_nocache() which does NT stores.

> - ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
> + ALTERNATIVE_2 "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD, \
> +  "jmp copy_page_nt", X86_FEATURE_XMM2

This makes every machine which has sse2 do NT stores now. Which means
*every* machine practically.
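
To illustrate the point: X86_FEATURE_XMM2 is the kernel's name for the SSE2
feature bit, and SSE2 is part of the x86-64 baseline, so a check like the one
below is expected to succeed on any 64-bit x86 CPU. This is a userspace
sketch, not a kernel interface: cpu_has_sse2 and BUILT_FOR_X86_64 are names
made up here, and __builtin_cpu_supports is the GCC/Clang builtin.

```c
#include <assert.h>   /* for the assertion shown in the note below */
#include <stdbool.h>

#if defined(__x86_64__)
#define BUILT_FOR_X86_64 1
#else
#define BUILT_FOR_X86_64 0
#endif

/* Hypothetical helper: true if the running CPU reports SSE2. */
static bool cpu_has_sse2(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();   /* safe to call explicitly before the query */
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;           /* not an x86 build, so no SSE2 */
#endif
}
```

On a 64-bit x86 build, assert(!BUILT_FOR_X86_64 || cpu_has_sse2()) should
never fire, which is why an SSE2 gate does not narrow the set of affected
machines in practice.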

The folks on linux-nvdimm@ should be able to give you a better idea what
to do.

HTH.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


[PATCH] x86: Accelerate copy_page with non-temporal in X86

2021-04-13 Thread Kemeng Shi
I'm using AEP with the dax_kmem driver, and AEP is exported as a NUMA node in
my system. I move cold pages from the DRAM node to the AEP node with the
move_pages system call. With the old "rep movsq", it takes 2030 ms to move
1 GB of pages. With "movnti", it takes only about 890 ms.
I also tested moving 1 GB of pages from the AEP node to the DRAM node, but the
result was unexpected: "rep movsq" takes about 372 ms while "movnti" takes
about 477 ms. As stated for x86, "movnti" should avoid "polluting the caches"
in this situation. I don't know whether this is a general result or specific
to my machine. The hardware information is as follows:
CPU:
Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
DRAM:
Memory Device
Array Handle: 0x0035
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 64 GB
Form Factor: DIMM
Set: None
Locator: DIMM130 J40
Bank Locator: _Node1_Channel3_Dimm0
Type: DDR4
Type Detail: Synchronous Registered (Buffered)
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 03B71EB0
Asset Tag: 1950
Part Number: M393A8G40MB2-CVF
Rank: 2
Configured Memory Speed: 2666 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: 
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 64 GB
Cache Size: None
Logical Size: None
AEP:
Memory Device
Array Handle: 0x0035
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 128 GB
Form Factor: DIMM
Set: None
Locator: DIMM131 J41
Bank Locator: _Node1_Channel3_Dimm1
Type: Logical non-volatile device
Type Detail: Synchronous Non-Volatile LRDIMM
Speed: 2666 MT/s
Manufacturer: Intel
Serial Number: 6803
Asset Tag: 1949
Part Number: NMA1XXD128GPS
Rank: 1
Configured Memory Speed: 2666 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: Intel persistent memory
Memory Operating Mode Capability: Volatile memory Byte-accessible persistent memory
Firmware Version: 5355
Module Manufacturer ID: Bank 1, Hex 0x89
Module Product ID: 0x0556
Memory Subsystem Controller Manufacturer ID: Bank 1, Hex 0x89
Memory Subsystem Controller Product ID: 0x097A
Non-Volatile Size: 126 GB
Volatile Size: None
Cache Size: None
Logical Size: None
Memory DIMM topology:
AEP
 |
DRAM DRAM DRAM
 |    |    |
 |----|----|
     CPU
 |----|----|
 |    |    |
DRAM DRAM DRAM
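
For comparison, the timings reported above can be converted to throughput.
The helpers below are a hypothetical sketch and assume "1 GB" means 1024 MiB.
Under that assumption, DRAM to AEP goes from about 504 MiB/s ("rep movsq") to
about 1151 MiB/s ("movnti"), roughly a 2.3x speedup, while AEP to DRAM drops
from about 2753 MiB/s to about 2147 MiB/s, a ratio of roughly 0.78.

```c
#include <assert.h>

/* Throughput in MiB/s for `mib` mebibytes moved in `ms` milliseconds. */
static double mib_per_second(double mib, double ms)
{
    assert(ms > 0.0);
    return mib * 1000.0 / ms;
}

/* Ratio above 1 means the candidate is faster than the baseline. */
static double speedup(double baseline_ms, double candidate_ms)
{
    assert(candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}
```

For example, speedup(2030, 890) covers the DRAM to AEP direction and
speedup(372, 477) the AEP to DRAM direction, with "rep movsq" as the baseline
in both cases.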

Signed-off-by: Kemeng Shi 
---
 arch/x86/lib/copy_page_64.S | 73 -
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/arch/x86/lib/copy_page_64.S b/arch/x86/lib/copy_page_64.S
index 2402d4c489d2..69389b4aeeed 100644
--- a/arch/x86/lib/copy_page_64.S
+++ b/arch/x86/lib/copy_page_64.S
@@ -14,7 +14,8 @@
  */
    ALIGN
 SYM_FUNC_START(copy_page)
-   ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
+   ALTERNATIVE_2 "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD, \
+  "jmp copy_page_nt", X86_FEATURE_XMM2
    movl    $4096/8, %ecx
    rep movsq
    ret
@@ -87,3 +88,73 @@ SYM_FUNC_START_LOCAL(copy_page_regs)
    addq    $2*8, %rsp
    ret
 SYM_FUNC_END(copy_page_regs)
+
+SYM_FUNC_START_LOCAL(copy_page_nt)
+   subq    $2*8,   %rsp
+   movq    %rbx,   (%rsp)
+   movq    %r12,   1*8(%rsp)
+
+   movl    $(4096/64)-5, %ecx
+   .p2align 4
+.LoopNT64:
+   decl    %ecx
+
+   movq    0x8*0(%rsi), %rax
+   movq    0x8*1(%rsi), %rbx
+   movq    0x8*2(%rsi), %rdx
+   movq    0x8*3(%rsi), %r8
+   movq    0x8*4(%rsi), %r9
+   movq    0x8*5(%rsi), %r10
+   movq    0x8*6(%rsi), %r11
+   movq    0x8*7(%rsi), %r12
+
+   prefetcht0 5*64(%rsi)
+
+   movnti  %rax, 0x8*0(%rdi)
+   movnti  %rbx, 0x8*1(%rdi)
+   movnti  %rdx, 0x8*2(%rdi)
+   movnti  %r8,  0x8*3(%rdi)
+   movnti  %r9,  0x8*4(%rdi)
+   movnti  %r10, 0x8*5(%rdi)
+   movnti  %r11, 0x8*6(%rdi)
+   movnti  %r12, 0x8*7(%rdi)
+
+   leaq    64(%rdi), %rdi
+   leaq    64(%rsi), %rsi
+   jnz .LoopNT64
+
+   movl    $5, %ecx
+   .p2align 4
+.LoopNT2:
+   decl    %ecx
+
+   movq    0x8*0(%rsi), %rax
+   movq    0x8*1(%rsi), %rbx
+   movq    0x8*2(%rsi), %rdx
+   movq    0x8*3(%rsi),