go version go1.7.4 linux/amd64

BenchmarkMemclr_100-36       500000000        31.5 ns/op
BenchmarkLoop_100-36         200000000        71.7 ns/op
BenchmarkMemclr_1000-36       50000000         257 ns/op
BenchmarkLoop_1000-36         20000000         612 ns/op
BenchmarkMemclr_10000-36       5000000        2675 ns/op
BenchmarkLoop_10000-36         2000000        6280 ns/op
BenchmarkMemclr_100000-36       500000       39956 ns/op
BenchmarkLoop_100000-36         200000       66346 ns/op
BenchmarkMemclr_200000-36       200000       79805 ns/op
BenchmarkLoop_200000-36         100000      132527 ns/op
BenchmarkMemclr_300000-36       200000      119613 ns/op
BenchmarkLoop_300000-36         100000      198872 ns/op
BenchmarkMemclr_400000-36       100000      160355 ns/op
BenchmarkLoop_400000-36          50000      265406 ns/op
BenchmarkMemclr_500000-36       100000      199190 ns/op
BenchmarkLoop_500000-36          50000      331522 ns/op
BenchmarkMemclr_1000000-36       50000      398051 ns/op
BenchmarkLoop_1000000-36         20000      663510 ns/op
BenchmarkMemclr_2000000-36       20000      796084 ns/op
BenchmarkLoop_2000000-36         10000     1326865 ns/op
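The benchmark source itself isn't quoted in this thread, but a minimal reconstruction consistent with the benchmark names and with the 4-byte `MOVL ... (BX*4)` store in the disassembly below might look like the sketch here (the element type `int32` and the function names `memclrRange`/`memsetLoop` are assumptions, though `memsetLoop` does appear in the pprof output later in the thread). The point is that the gc compiler recognizes the exact `for i := range a { a[i] = 0 }` pattern and lowers it to `runtime.memclr`, while the indexed loop compiles to a scalar store loop:

```go
package main

import "fmt"

// memclrRange clears a slice with the range-to-zero idiom. The gc
// compiler recognizes this exact pattern and lowers it to
// runtime.memclr, which uses wide vector stores on amd64.
func memclrRange(a []int32) {
	for i := range a {
		a[i] = 0
	}
}

// memsetLoop writes v element by element with an indexed loop. In
// go1.7 this compiles to the scalar MOVL/INCQ/CMPQ loop shown in
// the disassembly quoted below, not to a vectorized routine.
func memsetLoop(a []int32, v int32) {
	for i := 0; i < len(a); i++ {
		a[i] = v
	}
}

func main() {
	a := make([]int32, 8)
	memsetLoop(a, 7)
	fmt.Println(a[0], a[7]) // 7 7
	memclrRange(a)
	fmt.Println(a[0], a[7]) // 0 0
}
```

The actual benchmarks presumably wrap these two bodies in `testing.B` loops, one pair per slice length (100 through 2000000), which would produce the `BenchmarkMemclr_N` / `BenchmarkLoop_N` names above.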
Uniformly better on my AWS test system:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                36
On-line CPU(s) list:   0-35
Thread(s) per core:    2
Core(s) per socket:    9
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
Stepping:              2
CPU MHz:               3199.968
BogoMIPS:              6101.39
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-8,18-26
NUMA node1 CPU(s):     9-17,27-35

On Thu, Dec 15, 2016 at 12:18 AM, rd <rd6...@gmail.com> wrote:
> TL,
>
> Like peterGo, I was unable to reproduce your findings:
>
> uname -a
> Linux 4.8.0-30-generic #32-Ubuntu SMP Fri Dec 2 03:43:27 UTC 2016 x86_64
> x86_64 x86_64 GNU/Linux
>
> go version
> go version go1.7.4 linux/amd64
>
> cat /proc/cpuinfo
> CPU Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
>
> go test -bench=.
> [...]
> BenchmarkMemclr_2000000-4    3000    421532 ns/op
> BenchmarkLoop_2000000-4      2000    791318 ns/op
>
> So memclr is ~2x faster on my machine.
>
> In order to see what actually happens, let's use the pprof tool:
> go test -bench=. -cpuprofile test.prof
>
> Then `go tool pprof test.prof`, and `top 5` (sanity check):
>      flat  flat%   sum%        cum   cum%
>     1.69s 57.88% 57.88%      1.69s 57.88%  _/tmp/goperf.memsetLoop
>     1.22s 41.78% 99.66%      1.22s 41.78%  runtime.memclr
>
> So far so good, memsetLoop and the _runtime_ memclr are being called.
>
> Going down the rabbit hole, let's look at the assembly:
> (pprof) disasm memsetLoop
> Total: 2.92s
> ROUTINE ======================== _/tmp/goperf.memsetLoop
>     1.69s      1.69s (flat, cum) 57.88% of Total
>         .          .     46d770: MOVQ 0x10(SP), AX
>         .          .     46d775: MOVQ 0x8(SP), CX
>         .          .     46d77a: MOVL 0x20(SP), DX
>         .          .     46d77e: XORL BX, BX
>         .          .     46d780: CMPQ AX, BX
>         .          .     46d783: JGE 0x46d790
>     400ms      400ms     46d785: MOVL DX, 0(CX)(BX*4)
>     1.14s      1.14s     46d788: INCQ BX
>     150ms      150ms     46d78b: CMPQ AX, BX
>         .          .     46d78e: JL  0x46d785
>
> A standard loop, and definitely not using vectorized instructions
> (which explains the difference on my CPU).
>
> For comparison, the finely hand-tuned memclr implementation is at
> https://golang.org/src/runtime/memclr_amd64.s (my computer being fairly
> recent, it takes full advantage of the large registers available).
>
> Can you try to perform the same exercise on your hardware? It will likely
> shed some light on the peculiar results you are experiencing.
>
> Regards
> RD
>
> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

--
Michael T. Jones
michael.jo...@gmail.com