[go-nuts] Re: memclr optimazation does worse?

T L Thu, 15 Dec 2016 19:40:20 -0800


On Friday, December 16, 2016 at 12:36:47 AM UTC+8, rd wrote:
>
> TL,
>
> As peterGo, I was unable to reproduce your findings:
>
> uname -a
> Linux 4.8.0-30-generic #32-Ubuntu SMP Fri Dec 2 03:43:27 UTC 2016 x86_64 
> x86_64 x86_64 GNU/Linux
>
> go version
> go version go1.7.4 linux/amd64
>
> cat /proc/cpuinfo
> CPU Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
>
> go test -bench=. 
> [...]
> BenchmarkMemclr_2000000-4           3000        421532 ns/op
> BenchmarkLoop_2000000-4               2000        791318 ns/op
>
> So memclr is ~2x faster on my machine.
>
> In order to see what actually happens, lets use the pprof tool:
> go test -bench=. -cpuprofile test.prof
>
> Then `go tool pprof test.prof`, and `top 5` (sanity check):
>       flat  flat%   sum%        cum   cum%
>      1.69s 57.88% 57.88%      1.69s 57.88%  _/tmp/goperf.memsetLoop
>      1.22s 41.78% 99.66%      1.22s 41.78%  runtime.memclr
>
> So far so good, memsetloop and the _runtime_ memclr are being called.
>
> Going down the rabbit hole, lets look at the assembly:
>  (pprof) disasm memsetLoop
> Total: 2.92s
> ROUTINE ======================== _/tmp/goperf.memsetLoop
>      1.69s      1.69s (flat, cum) 57.88% of Total
>          .          .     46d770: MOVQ 0x10(SP), AX
>          .          .     46d775: MOVQ 0x8(SP), CX
>          .          .     46d77a: MOVL 0x20(SP), DX
>          .          .     46d77e: XORL BX, BX
>          .          .     46d780: CMPQ AX, BX
>          .          .     46d783: JGE 0x46d790
>      400ms      400ms     46d785: MOVL DX, 0(CX)(BX*4)
>      1.14s      1.14s     46d788: INCQ BX
>      150ms      150ms     46d78b: CMPQ AX, BX
>          .          .     46d78e: JL 0x46d785
>
> Standard loop, and definitively not using vectorized instructions 
> (explains the difference on my CPU)
>
> For comparison, the finely hand-tuned memclr implementation is at 
> https://golang.org/src/runtime/memclr_amd64.s (my computer being fairly 
> recent, it takes full advantage of the large registers available).
>
> Can you try to perform the same exercise on your hardware? It will likely 
> shed some lights on the peculiar results you are experiencing.
>
> Regards
> RD
>


Thanks for the guide, RD.

Here is my system info:

$ uname -a
Linux debian8 4.2.0-1-amd64 #1 SMP Debian 4.2.6-3 (2015-12-06) x86_64 
GNU/Linux
$ go version
go version go1.7.4 linux/amd64
$ cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 42
model name    : Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz
stepping    : 7
microcode    : 0x29
cpu MHz        : 995.828
cache size    : 3072 KB
physical id    : 0
siblings    : 4
core id        : 0
cpu cores    : 2
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est 
tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt 
tsc_deadline_timer xsave avx lahf_lm arat epb pln pts dtherm tpr_shadow 
vnmi flexpriority ept vpid xsaveopt
bugs        :
bogomips    : 4589.65
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:
... (three more identical entries, one per logical CPU)


Here is the result for MyInt == int:

$ go test -bench=. -cpuprofile test.prof
testing: warning: no tests to run
BenchmarkLoop_2000000-4             1000       2026451 ns/op
BenchmarkMemclr_2000000-4           1000       2075557 ns/op
PASS
ok      _/tmp    4.546s
$ go tool pprof test.prof
Entering interactive mode (type "help" for commands)
(pprof) top 5
4.55s of 4.55s total (  100%)
Showing top 5 nodes out of 8 (cum >= 2.31s)
      flat  flat%   sum%        cum   cum%
     2.31s 50.77% 50.77%      2.31s 50.77%  runtime.memclr
     2.24s 49.23%   100%      2.24s 49.23%  _/tmp.memsetLoop
         0     0%   100%      2.24s 49.23%  _/tmp.BenchmarkLoop_2000000
         0     0%   100%      2.31s 50.77%  _/tmp.BenchmarkMemclr_2000000
         0     0%   100%      2.31s 50.77%  _/tmp.memclr
(pprof) disasm memsetLoop
Total: 4.55s
ROUTINE ======================== _/tmp.memsetLoop
     2.24s      2.24s (flat, cum) 49.23% of Total
         .          .     46d770: MOVQ 0x10(SP), AX
         .          .     46d775: MOVQ 0x8(SP), CX
         .          .     46d77a: MOVQ 0x20(SP), DX
         .          .     46d77f: XORL BX, BX
         .          .     46d781: CMPQ AX, BX
         .          .     46d784: JGE 0x46d792
         .          .     46d786: MOVQ DX, 0(CX)(BX*8)
     2.24s      2.24s     46d78a: INCQ BX
         .          .     46d78d: CMPQ AX, BX
         .          .     46d790: JL 0x46d786


And here is the result for MyInt == int32:

$ go test -bench=. -cpuprofile test.prof
testing: warning: no tests to run
BenchmarkLoop_2000000-4             1000       1128167 ns/op
BenchmarkMemclr_2000000-4           2000       1031849 ns/op
PASS
ok      _/tmp    3.460s
$ go tool pprof test.prof
Entering interactive mode (type "help" for commands)
(pprof) top 5   
3.44s of 3.45s total (99.71%)
Dropped 10 nodes (cum <= 0.02s)
Showing top 5 nodes out of 8 (cum >= 2.18s)
      flat  flat%   sum%        cum   cum%
     2.18s 63.19% 63.19%      2.18s 63.19%  runtime.memclr
     1.26s 36.52% 99.71%      1.26s 36.52%  _/tmp.memsetLoop
         0     0% 99.71%      1.26s 36.52%  _/tmp.BenchmarkLoop_2000000
         0     0% 99.71%      2.18s 63.19%  _/tmp.BenchmarkMemclr_2000000
         0     0% 99.71%      2.18s 63.19%  _/tmp.memclr
(pprof) 
(pprof) disasm memsetLoop
Total: 3.45s
ROUTINE ======================== _/tmp.memsetLoop
     1.26s      1.26s (flat, cum) 36.52% of Total
         .          .     46d770: MOVQ 0x10(SP), AX
         .          .     46d775: MOVQ 0x8(SP), CX
         .          .     46d77a: MOVL 0x20(SP), DX
         .          .     46d77e: XORL BX, BX
         .          .     46d780: CMPQ AX, BX
         .          .     46d783: JGE 0x46d790
         .          .     46d785: MOVL DX, 0(CX)(BX*4)
     1.25s      1.25s     46d788: INCQ BX
      10ms       10ms     46d78b: CMPQ AX, BX
         .          .     46d78e: JL 0x46d785
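
For anyone following along without the benchmark source at hand, here is a minimal sketch of the two functions being compared. It is reconstructed from the symbol names and the disassembly in the profiles above, so treat it as an assumption rather than the exact code from the original post: the signatures, the MyInt alias, and the slice length are guesses. The range-and-zero loop in memclr is the idiom the compiler is expected to turn into a call to runtime.memclr, while memsetLoop stays an ordinary store loop.

```go
package main

import "fmt"

// MyInt is the element type under test. The thread compares int and int32;
// int32 is chosen here only for illustration.
type MyInt int32

// memsetLoop stores v into every element with an explicit index loop.
// This corresponds to the MOVL/INCQ/CMPQ/JL loop in the disassembly.
func memsetLoop(a []MyInt, v MyInt) {
	for i := 0; i < len(a); i++ {
		a[i] = v
	}
}

// memclr zeroes the slice with the range idiom that the compiler
// recognizes and lowers to a runtime memory-clear call.
func memclr(a []MyInt) {
	for i := range a {
		a[i] = 0
	}
}

func main() {
	// 2000000 elements is assumed from the benchmark names.
	a := make([]MyInt, 2000000)
	memsetLoop(a, 1)
	memclr(a)
	fmt.Println(a[0], a[len(a)-1])
}
```

In a _test.go file, each function would be wrapped in a Benchmark function that calls it b.N times on a preallocated slice, which is the shape that `go test -bench=.` expects.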


 
