[go-nuts] Re: memclr optimazation does worse?

2016-12-16 Thread rd6137
Hi Sokolov,

Interesting idea, but it does not seem that cache size is the issue here — note that slices of more than 524288 integers do not fit in my laptop's cache:

 1. The runtime is linear in the size of the array (no "cliff" — see the 
attached picture).

 2. The page misses are ridiculously low (expected, since the benchmark 
lends itself to contiguous allocation and processing):

 Performance counter stats for '/usr/local/go/bin/go test -bench=.':

      20882.464799  task-clock (msec)     #    0.998 CPUs utilized
             4,947  context-switches      #    0.237 K/sec
               239  cpu-migrations        #    0.011 K/sec
            23,583  page-faults           #    0.001 M/sec
    64,142,963,898  cycles                #    3.072 GHz
    37,246,686,615  instructions          #    0.58  insn per cycle
     4,753,144,504  branches              #  227.614 M/sec
         2,870,298  branch-misses         #    0.06% of all branches

I am running the following excerpt from the original code:
package P

import (
	"strconv"
	"testing"
)

func memclr(a []int) {
	for i := range a {
		a[i] = 0
	}
}

func BenchmarkMemclr(b *testing.B) {
	for i := 10; i < 40960; i *= 2 {
		b.Run("bench"+strconv.Itoa(i), func(b *testing.B) {
			var a = make([]int, i)
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				memclr(a)
			}
		})
	}
}
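The benchmark above only times the memclr pattern. To compare the clear loop against a non-zero set loop at each size from one standalone program, the same sweep can be driven with testing.Benchmark (a sketch; clearRange and setRange are hypothetical names for the two loop shapes discussed in this thread):

```go
package main

import (
	"fmt"
	"testing"
)

// clearRange is the pattern the compiler can lower to runtime.memclr.
func clearRange(a []int) {
	for i := range a {
		a[i] = 0
	}
}

// setRange stores a non-zero value, so it stays an ordinary loop.
func setRange(a []int, v int) {
	for i := range a {
		a[i] = v
	}
}

func main() {
	for n := 1024; n <= 1<<20; n *= 16 {
		a := make([]int, n)
		clr := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				clearRange(a)
			}
		})
		set := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				setRange(a, 1)
			}
		})
		fmt.Printf("n=%-8d clear=%-8d set=%-8d ns/op\n", n, clr.NsPerOp(), set.NsPerOp())
	}
}
```

Printing both numbers side by side for each size makes any crossover point (where the plain loop catches up with memclr) easy to spot.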





-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[go-nuts] Re: memclr optimazation does worse?

2016-12-15 Thread Sokolov Yura
I suppose prefetch instructions in the AVX loop (prefetching the block after 
the current one) could solve this issue.



[go-nuts] Re: memclr optimazation does worse?

2016-12-15 Thread Sokolov Yura
Memory is slow. While the slice fits in cache, memclr is measurably faster.
When the slice doesn't fit in cache, memclr is at least not significantly faster.

I've heard that adaptive prefetching kicks in after 3 consecutive accesses 
to the same cache line in increasing address order. So perhaps the optimised 
SSE/AVX zeroing doesn't trigger the adaptive prefetcher because it issues fewer 
memory accesses. It may also vary a lot by CPU model: newer models may improve 
adaptive prefetching, so that memclr is great again.



[go-nuts] Re: memclr optimazation does worse?

2016-12-15 Thread T L


On Friday, December 16, 2016 at 12:36:47 AM UTC+8, rd wrote:
>
> TL,
>
> Like peterGo, I was unable to reproduce your findings:
>
> uname -a
> Linux 4.8.0-30-generic #32-Ubuntu SMP Fri Dec 2 03:43:27 UTC 2016 x86_64 
> x86_64 x86_64 GNU/Linux
>
> go version
> go version go1.7.4 linux/amd64
>
> cat /proc/cpuinfo
> CPU Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
>
> go test -bench=. 
> [...]
> BenchmarkMemclr_200-4   3000421532 ns/op
> BenchmarkLoop_200-4   2000791318 ns/op
>
> So memclr is ~2x faster on my machine.
>
> To see what actually happens, let's use the pprof tool:
> go test -bench=. -cpuprofile test.prof
>
> Then `go tool pprof test.prof`, and `top 5` (sanity check):
>   flat  flat%   sum%cum   cum%
>  1.69s 57.88% 57.88%  1.69s 57.88%  _/tmp/goperf.memsetLoop
>  1.22s 41.78% 99.66%  1.22s 41.78%  runtime.memclr
>
> So far so good: memsetLoop and the _runtime_ memclr are being called.
>
> Going down the rabbit hole, let's look at the assembly:
>  (pprof) disasm memsetLoop
> Total: 2.92s
> ROUTINE  _/tmp/goperf.memsetLoop
>  1.69s  1.69s (flat, cum) 57.88% of Total
>  .  . 46d770: MOVQ 0x10(SP), AX
>  .  . 46d775: MOVQ 0x8(SP), CX
>  .  . 46d77a: MOVL 0x20(SP), DX
>  .  . 46d77e: XORL BX, BX
>  .  . 46d780: CMPQ AX, BX
>  .  . 46d783: JGE 0x46d790
>  400ms  400ms 46d785: MOVL DX, 0(CX)(BX*4)
>  1.14s  1.14s 46d788: INCQ BX
>  150ms  150ms 46d78b: CMPQ AX, BX
>  .  . 46d78e: JL 0x46d785
>
> A standard loop, definitely not using vectorized instructions 
> (which explains the difference on my CPU).
>
> For comparison, the finely hand-tuned memclr implementation is at 
> https://golang.org/src/runtime/memclr_amd64.s (my computer being fairly 
> recent, it takes full advantage of the large registers available).
>
> Can you try to perform the same exercise on your hardware? It will likely 
> shed some light on the peculiar results you are experiencing.
>
> Regards
> RD
>

Thanks for the guide, RD.

Here is my system info:

$ uname -a
Linux debian8 4.2.0-1-amd64 #1 SMP Debian 4.2.6-3 (2015-12-06) x86_64 
GNU/Linux
$ go version
go version go1.7.4 linux/amd64
$ cat /proc/cpuinfo
processor: 0
vendor_id: GenuineIntel
cpu family: 6
model: 42
model name: Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz
stepping: 7
microcode: 0x29
cpu MHz: 995.828
cache size: 3072 KB
physical id: 0
siblings: 4
core id: 0
cpu cores: 2
apicid: 0
initial apicid: 0
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est 
tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt 
tsc_deadline_timer xsave avx lahf_lm arat epb pln pts dtherm tpr_shadow 
vnmi flexpriority ept vpid xsaveopt
bugs:
bogomips: 4589.65
clflush size: 64
cache_alignment: 64
address sizes: 36 bits physical, 48 bits virtual
power management:
...(3 more identical entries)


Here is the result for MyInt==int

$ go test -bench=. -cpuprofile test.prof
testing: warning: no tests to run
BenchmarkLoop_200-4 1000   2026451 ns/op
BenchmarkMemclr_200-4   1000   2075557 ns/op
PASS
ok  _/tmp  4.546s
$ go tool pprof test.prof
Entering interactive mode (type "help" for commands)
(pprof) top 5
4.55s of 4.55s total (  100%)
Showing top 5 nodes out of 8 (cum >= 2.31s)
  flat  flat%   sum%cum   cum%
 2.31s 50.77% 50.77%  2.31s 50.77%  runtime.memclr
 2.24s 49.23%   100%  2.24s 49.23%  _/tmp.memsetLoop
 0 0%   100%  2.24s 49.23%  _/tmp.BenchmarkLoop_200
 0 0%   100%  2.31s 50.77%  _/tmp.BenchmarkMemclr_200
 0 0%   100%  2.31s 50.77%  _/tmp.memclr
(pprof) disasm memsetLoop
Total: 4.55s
ROUTINE  _/tmp.memsetLoop
 2.24s  2.24s (flat, cum) 49.23% of Total
 .  . 46d770: MOVQ 0x10(SP), AX
 .  . 46d775: MOVQ 0x8(SP), CX
 .  . 46d77a: MOVQ 0x20(SP), DX
 .  . 46d77f: XORL BX, BX
 .  . 46d781: CMPQ AX, BX
 .  . 46d784: JGE 0x46d792
 .  . 46d786: MOVQ DX, 0(CX)(BX*8)
 2.24s  2.24s 46d78a: INCQ BX
 .  . 46d78d: CMPQ AX, BX
 .  . 46d790: JL 0x46d786


And the result for MyInt==int32

$ go test -bench=. -cpuprofile test.prof
testing: warning: no tests to run
BenchmarkLoop_200-4   

Re: [go-nuts] Re: memclr optimazation does worse?

2016-12-15 Thread Michael Jones
go version go1.7.4 linux/amd64

BenchmarkMemclr_100-36 531.5 ns/op
BenchmarkLoop_100-36   271.7 ns/op
BenchmarkMemclr_1000-36   5000   257 ns/op
BenchmarkLoop_1000-36 2000   612 ns/op
BenchmarkMemclr_1-36   500  2675 ns/op
BenchmarkLoop_1-36 200  6280 ns/op
BenchmarkMemclr_10-36  50 39956 ns/op
BenchmarkLoop_10-36    20     66346 ns/op
BenchmarkMemclr_20-36  20 79805 ns/op
BenchmarkLoop_20-36    10    132527 ns/op
BenchmarkMemclr_30-36  20    119613 ns/op
BenchmarkLoop_30-36    10    198872 ns/op
BenchmarkMemclr_40-36  10    160355 ns/op
BenchmarkLoop_40-36     5    265406 ns/op
BenchmarkMemclr_50-36  10    199190 ns/op
BenchmarkLoop_50-36     5    331522 ns/op
BenchmarkMemclr_100-36  5    398051 ns/op
BenchmarkLoop_100-36    2    663510 ns/op
BenchmarkMemclr_200-36  2    796084 ns/op
BenchmarkLoop_200-36    1   1326865 ns/op

Uniformly better on my AWS test system:

Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):36
On-line CPU(s) list:   0-35
Thread(s) per core:2
Core(s) per socket:9
Socket(s): 2
NUMA node(s):  2
Vendor ID: GenuineIntel
CPU family:6
Model: 63
Model name:Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
Stepping:  2
CPU MHz:   3199.968
BogoMIPS:  6101.39
Hypervisor vendor: Xen
Virtualization type:   full
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  25600K
NUMA node0 CPU(s): 0-8,18-26
NUMA node1 CPU(s): 9-17,27-35

On Thu, Dec 15, 2016 at 12:18 AM, rd  wrote:

> TL,
>
> Like peterGo, I was unable to reproduce your findings:
>
> uname -a
> Linux 4.8.0-30-generic #32-Ubuntu SMP Fri Dec 2 03:43:27 UTC 2016 x86_64
> x86_64 x86_64 GNU/Linux
>
> go version
> go version go1.7.4 linux/amd64
>
> cat /proc/cpuinfo
> CPU Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz
>
> go test -bench=.
> [...]
> BenchmarkMemclr_200-4   3000421532 ns/op
> BenchmarkLoop_200-4   2000791318 ns/op
>
> So memclr is ~2x faster on my machine.
>
> To see what actually happens, let's use the pprof tool:
> go test -bench=. -cpuprofile test.prof
>
> Then `go tool pprof test.prof`, and `top 5` (sanity check):
>   flat  flat%   sum%cum   cum%
>  1.69s 57.88% 57.88%  1.69s 57.88%  _/tmp/goperf.memsetLoop
>  1.22s 41.78% 99.66%  1.22s 41.78%  runtime.memclr
>
> So far so good: memsetLoop and the _runtime_ memclr are being called.
>
> Going down the rabbit hole, let's look at the assembly:
>  (pprof) disasm memsetLoop
> Total: 2.92s
> ROUTINE  _/tmp/goperf.memsetLoop
>  1.69s  1.69s (flat, cum) 57.88% of Total
>  .  . 46d770: MOVQ 0x10(SP), AX
>  .  . 46d775: MOVQ 0x8(SP), CX
>  .  . 46d77a: MOVL 0x20(SP), DX
>  .  . 46d77e: XORL BX, BX
>  .  . 46d780: CMPQ AX, BX
>  .  . 46d783: JGE 0x46d790
>  400ms  400ms 46d785: MOVL DX, 0(CX)(BX*4)
>  1.14s  1.14s 46d788: INCQ BX
>  150ms  150ms 46d78b: CMPQ AX, BX
>  .  . 46d78e: JL 0x46d785
>
> A standard loop, definitely not using vectorized instructions
> (which explains the difference on my CPU).
>
> For comparison, the finely hand-tuned memclr implementation is at
> https://golang.org/src/runtime/memclr_amd64.s (my computer being fairly
> recent, it takes full advantage of the large registers available).
>
> Can you try to perform the same exercise on your hardware? It will likely
> shed some light on the peculiar results you are experiencing.
>
> Regards
> RD
>



-- 
Michael T. Jones
michael.jo...@gmail.com



[go-nuts] Re: memclr optimazation does worse?

2016-12-15 Thread rd
TL,

Like peterGo, I was unable to reproduce your findings:

uname -a
Linux 4.8.0-30-generic #32-Ubuntu SMP Fri Dec 2 03:43:27 UTC 2016 x86_64 
x86_64 x86_64 GNU/Linux

go version
go version go1.7.4 linux/amd64

cat /proc/cpuinfo
CPU Intel(R) Core(TM) i7-6560U CPU @ 2.20GHz

go test -bench=. 
[...]
BenchmarkMemclr_200-4   3000421532 ns/op
BenchmarkLoop_200-4   2000791318 ns/op

So memclr is ~2x faster on my machine.

To see what actually happens, let's use the pprof tool:
go test -bench=. -cpuprofile test.prof

Then `go tool pprof test.prof`, and `top 5` (sanity check):
  flat  flat%   sum%cum   cum%
 1.69s 57.88% 57.88%  1.69s 57.88%  _/tmp/goperf.memsetLoop
 1.22s 41.78% 99.66%  1.22s 41.78%  runtime.memclr

So far so good: memsetLoop and the _runtime_ memclr are being called.

Going down the rabbit hole, let's look at the assembly:
 (pprof) disasm memsetLoop
Total: 2.92s
ROUTINE  _/tmp/goperf.memsetLoop
 1.69s  1.69s (flat, cum) 57.88% of Total
 .  . 46d770: MOVQ 0x10(SP), AX
 .  . 46d775: MOVQ 0x8(SP), CX
 .  . 46d77a: MOVL 0x20(SP), DX
 .  . 46d77e: XORL BX, BX
 .  . 46d780: CMPQ AX, BX
 .  . 46d783: JGE 0x46d790
 400ms  400ms 46d785: MOVL DX, 0(CX)(BX*4)
 1.14s  1.14s 46d788: INCQ BX
 150ms  150ms 46d78b: CMPQ AX, BX
 .  . 46d78e: JL 0x46d785

A standard loop, definitely not using vectorized instructions (which explains 
the difference on my CPU).

For comparison, the finely hand-tuned memclr implementation is at 
https://golang.org/src/runtime/memclr_amd64.s (my computer being fairly 
recent, it takes full advantage of the large registers available).

Can you try to perform the same exercise on your hardware? It will likely 
shed some light on the peculiar results you are experiencing.

Regards
RD



[go-nuts] Re: memclr optimazation does worse?

2016-12-15 Thread T L


On Thursday, December 15, 2016 at 6:50:14 AM UTC+8, not...@google.com wrote:
>
> Be wary of slice size, as caching is going to have an extremely strong 
> effect on the results.  I submitted a CL that made append only clear 
> memory that was not going to be overwritten ( 
> https://github.com/golang/go/commit/c1e267cc734135a66af8a1a5015e572cbb598d44 
> ).  I thought this would have a much larger impact, but it only had a small 
> one.  memclr would zero the memory, but it also brought it into the 
> cache, where it was hot for being overwritten.
>
> Have you tried running with perf to see dcache misses for each benchmark?
>
>
How can I check cache misses with go pprof?

> On Wednesday, December 14, 2016 at 6:12:08 AM UTC-8, T L wrote:
>>
>> I just read this issue thread: https://github.com/golang/go/issues/5373
>> and this https://codereview.appspot.com/137880043
>> which says:
>>
>> for i := range a { 
>> a[i] = [zero val] 
>> }
>>
>> will be replaced with memclr. 
>> I made some benchmarks, but the results are disappointing. 
>> When the length of slice/array is very large, memclr is slower.
>>
>> Result
>>
>> BenchmarkMemclr_100-4   137.2 ns/op
>> BenchmarkLoop_100-4 170.7 ns/op
>> BenchmarkMemclr_1000-4  2000   351 ns/op
>> BenchmarkLoop_1000-41000   464 ns/op
>> BenchmarkMemclr_1-4  100  3623 ns/op
>> BenchmarkLoop_1-4100  4940 ns/op
>> BenchmarkMemclr_10-4  10 49230 ns/op
>> BenchmarkLoop_10-410 58761 ns/op
>> BenchmarkMemclr_20-4   5 98165 ns/op
>> BenchmarkLoop_20-4 5115833 ns/op
>> BenchmarkMemclr_30-4   3170617 ns/op
>> BenchmarkLoop_30-4 2190193 ns/op
>> BenchmarkMemclr_40-4   2275676 ns/op
>> BenchmarkLoop_40-4 2288729 ns/op
>> BenchmarkMemclr_50-4   1410280 ns/op
>> BenchmarkLoop_50-4 1416195 ns/op
>> BenchmarkMemclr_100-4   5000   1025504 ns/op
>> BenchmarkLoop_100-4 5000   1012198 ns/op
>> BenchmarkMemclr_200-4   2000   2071861 ns/op
>> BenchmarkLoop_200-4 2000   2032703 ns/op
>>
>> test code:
>>
>> package main
>>
>> import "testing"
>>
>> func memclr(a []int) {
>> for i := range a {
>> a[i] = 0
>> }
>> }
>>
>> func memsetLoop(a []int, v int) {
>> for i := range a {
>> a[i] = v
>> }
>> }
>>
>> var i = 0
>>
>> func BenchmarkMemclr_100(b *testing.B) {
>> var a = make([]int, 100)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_100(b *testing.B) {
>> var a = make([]int, 100)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_1000(b *testing.B) {
>> var a = make([]int, 1000)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_1000(b *testing.B) {
>> var a = make([]int, 1000)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_1(b *testing.B) {
>> var a = make([]int, 1)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_1(b *testing.B) {
>> var a = make([]int, 1)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_10(b *testing.B) {
>> var a = make([]int, 10)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_10(b *testing.B) {
>> var a = make([]int, 10)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_20(b *testing.B) {
>> var a = make([]int, 20)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_20(b *testing.B) {
>> var a = make([]int, 20)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_30(b *testing.B) {
>> var a = make([]int, 30)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_30(b *testing.B) {
>> var a = make([]int, 30)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_40(b *testing.B) {
>> var a = make([]int, 40)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func 

[go-nuts] Re: memclr optimazation does worse?

2016-12-14 Thread T L


On Thursday, December 15, 2016 at 10:18:57 AM UTC+8, T L wrote:
>
> But if I changed the line
> type MyInt int32
> to 
> type MyInt int
> then again, the memclr version becomes slower, or no advantage, for cases 
> of slice lengths larger than 200.
>

I tried other types; it looks like the situation is more likely to happen for 
types with a value size of 8 bytes (on amd64).
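One way to probe this observation is to clear equal byte sizes through int32 and int64 slices, so only the element width differs (a sketch; the helper names are made up, and the 16 MiB working set is an assumption meant to exceed typical L3 caches):

```go
package main

import (
	"fmt"
	"testing"
)

// Both loops match the pattern the compiler rewrites to runtime.memclr.
func clear32(a []int32) {
	for i := range a {
		a[i] = 0
	}
}

func clear64(a []int64) {
	for i := range a {
		a[i] = 0
	}
}

func main() {
	const size = 1 << 24 // 16 MiB working set, same byte count for both slices
	a32 := make([]int32, size/4)
	a64 := make([]int64, size/8)
	r32 := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			clear32(a32)
		}
	})
	r64 := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			clear64(a64)
		}
	})
	fmt.Printf("int32: %d ns/op, int64: %d ns/op\n", r32.NsPerOp(), r64.NsPerOp())
}
```

If the effect really depends on element width rather than total bytes cleared, the two numbers should diverge even though both loops touch the same amount of memory.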

 

>
> On Thursday, December 15, 2016 at 10:05:23 AM UTC+8, T L wrote:
>>
>>
>>
>> On Wednesday, December 14, 2016 at 10:12:08 PM UTC+8, T L wrote:
>>>
>>> I just read this issue thread: https://github.com/golang/go/issues/5373
>>> and this https://codereview.appspot.com/137880043
>>> which says:
>>>
>>> for i := range a { 
>>> a[i] = [zero val] 
>>> }
>>>
>>> will be replaced with memclr. 
>>> I made some benchmarks, but the results are disappointing. 
>>> When the length of slice/array is very large, memclr is slower.
>>>
>>> Result
>>>
>>> BenchmarkMemclr_100-4   137.2 ns/op
>>> BenchmarkLoop_100-4 170.7 ns/op
>>> BenchmarkMemclr_1000-4  2000   351 ns/op
>>> BenchmarkLoop_1000-41000   464 ns/op
>>> BenchmarkMemclr_1-4  100  3623 ns/op
>>> BenchmarkLoop_1-4100  4940 ns/op
>>> BenchmarkMemclr_10-4  10 49230 ns/op
>>> BenchmarkLoop_10-410 58761 ns/op
>>> BenchmarkMemclr_20-4   5 98165 ns/op
>>> BenchmarkLoop_20-4 5115833 ns/op
>>> BenchmarkMemclr_30-4   3170617 ns/op
>>> BenchmarkLoop_30-4 2190193 ns/op
>>> BenchmarkMemclr_40-4   2275676 ns/op
>>> BenchmarkLoop_40-4 2288729 ns/op
>>> BenchmarkMemclr_50-4   1410280 ns/op
>>> BenchmarkLoop_50-4 1416195 ns/op
>>> BenchmarkMemclr_100-4   5000   1025504 ns/op
>>> BenchmarkLoop_100-4 5000   1012198 ns/op
>>> BenchmarkMemclr_200-4   2000   2071861 ns/op
>>> BenchmarkLoop_200-4 2000   2032703 ns/op
>>>
>>> test code:
>>>
>>> package main
>>>
>>> import "testing"
>>>
>>> func memclr(a []int) {
>>> for i := range a {
>>> a[i] = 0
>>> }
>>> }
>>>
>>> func memsetLoop(a []int, v int) {
>>> for i := range a {
>>> a[i] = v
>>> }
>>> }
>>>
>>> var i = 0
>>>
>>> func BenchmarkMemclr_100(b *testing.B) {
>>> var a = make([]int, 100)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memclr(a)
>>> }
>>> }
>>>
>>> func BenchmarkLoop_100(b *testing.B) {
>>> var a = make([]int, 100)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memsetLoop(a, i)
>>> }
>>> }
>>>
>>> func BenchmarkMemclr_1000(b *testing.B) {
>>> var a = make([]int, 1000)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memclr(a)
>>> }
>>> }
>>>
>>> func BenchmarkLoop_1000(b *testing.B) {
>>> var a = make([]int, 1000)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memsetLoop(a, i)
>>> }
>>> }
>>>
>>> func BenchmarkMemclr_1(b *testing.B) {
>>> var a = make([]int, 1)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memclr(a)
>>> }
>>> }
>>>
>>> func BenchmarkLoop_1(b *testing.B) {
>>> var a = make([]int, 1)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memsetLoop(a, i)
>>> }
>>> }
>>>
>>> func BenchmarkMemclr_10(b *testing.B) {
>>> var a = make([]int, 10)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memclr(a)
>>> }
>>> }
>>>
>>> func BenchmarkLoop_10(b *testing.B) {
>>> var a = make([]int, 10)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memsetLoop(a, i)
>>> }
>>> }
>>>
>>> func BenchmarkMemclr_20(b *testing.B) {
>>> var a = make([]int, 20)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memclr(a)
>>> }
>>> }
>>>
>>> func BenchmarkLoop_20(b *testing.B) {
>>> var a = make([]int, 20)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memsetLoop(a, i)
>>> }
>>> }
>>>
>>> func BenchmarkMemclr_30(b *testing.B) {
>>> var a = make([]int, 30)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memclr(a)
>>> }
>>> }
>>>
>>> func BenchmarkLoop_30(b *testing.B) {
>>> var a = make([]int, 30)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memsetLoop(a, i)
>>> }
>>> }
>>>
>>> func BenchmarkMemclr_40(b *testing.B) {
>>> var a = make([]int, 40)
>>> b.ResetTimer()
>>> for i := 0; i < b.N; i++ {
>>> memclr(a)
>>> }
>>> }
>>>
>>> func BenchmarkLoop_40(b *testing.B) {
>>> var a = make([]int, 40)
>>>  

[go-nuts] Re: memclr optimazation does worse?

2016-12-14 Thread T L
But if I change the line
type MyInt int32
to
type MyInt int
then again the memclr version becomes slower, or shows no advantage, for 
slice lengths larger than 200.

On Thursday, December 15, 2016 at 10:05:23 AM UTC+8, T L wrote:
>
>
>
> On Wednesday, December 14, 2016 at 10:12:08 PM UTC+8, T L wrote:
>>
>> I just read this issue thread: https://github.com/golang/go/issues/5373
>> and this https://codereview.appspot.com/137880043
>> which says:
>>
>> for i := range a { 
>> a[i] = [zero val] 
>> }
>>
>> will be replaced with memclr. 
>> I made some benchmarks, but the results are disappointing. 
>> When the length of slice/array is very large, memclr is slower.
>>
>> Result
>>
>> BenchmarkMemclr_100-4   137.2 ns/op
>> BenchmarkLoop_100-4 170.7 ns/op
>> BenchmarkMemclr_1000-4  2000   351 ns/op
>> BenchmarkLoop_1000-41000   464 ns/op
>> BenchmarkMemclr_1-4  100  3623 ns/op
>> BenchmarkLoop_1-4100  4940 ns/op
>> BenchmarkMemclr_10-4  10 49230 ns/op
>> BenchmarkLoop_10-410 58761 ns/op
>> BenchmarkMemclr_20-4   5 98165 ns/op
>> BenchmarkLoop_20-4 5115833 ns/op
>> BenchmarkMemclr_30-4   3170617 ns/op
>> BenchmarkLoop_30-4 2190193 ns/op
>> BenchmarkMemclr_40-4   2275676 ns/op
>> BenchmarkLoop_40-4 2288729 ns/op
>> BenchmarkMemclr_50-4   1410280 ns/op
>> BenchmarkLoop_50-4 1416195 ns/op
>> BenchmarkMemclr_100-4   5000   1025504 ns/op
>> BenchmarkLoop_100-4 5000   1012198 ns/op
>> BenchmarkMemclr_200-4   2000   2071861 ns/op
>> BenchmarkLoop_200-4 2000   2032703 ns/op
>>
>> test code:
>>
>> package main
>>
>> import "testing"
>>
>> func memclr(a []int) {
>> for i := range a {
>> a[i] = 0
>> }
>> }
>>
>> func memsetLoop(a []int, v int) {
>> for i := range a {
>> a[i] = v
>> }
>> }
>>
>> var i = 0
>>
>> func BenchmarkMemclr_100(b *testing.B) {
>> var a = make([]int, 100)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_100(b *testing.B) {
>> var a = make([]int, 100)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_1000(b *testing.B) {
>> var a = make([]int, 1000)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_1000(b *testing.B) {
>> var a = make([]int, 1000)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_1(b *testing.B) {
>> var a = make([]int, 1)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_1(b *testing.B) {
>> var a = make([]int, 1)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_10(b *testing.B) {
>> var a = make([]int, 10)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_10(b *testing.B) {
>> var a = make([]int, 10)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_20(b *testing.B) {
>> var a = make([]int, 20)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_20(b *testing.B) {
>> var a = make([]int, 20)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_30(b *testing.B) {
>> var a = make([]int, 30)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_30(b *testing.B) {
>> var a = make([]int, 30)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_40(b *testing.B) {
>> var a = make([]int, 40)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_40(b *testing.B) {
>> var a = make([]int, 40)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memsetLoop(a, i)
>> }
>> }
>>
>> func BenchmarkMemclr_50(b *testing.B) {
>> var a = make([]int, 50)
>> b.ResetTimer()
>> for i := 0; i < b.N; i++ {
>> memclr(a)
>> }
>> }
>>
>> func BenchmarkLoop_50(b *testing.B) {
>> var a = make([]int, 50)
>> 

[go-nuts] Re: memclr optimazation does worse?

2016-12-14 Thread sheepbao
I get the same result on a Mac with Go 1.7.1:
```
BenchmarkMemclr_100-4   1 22.8 ns/op
BenchmarkLoop_100-4 3000 47.1 ns/op
BenchmarkMemclr_1000-4  1000   181 ns/op
BenchmarkLoop_1000-4 500   365 ns/op
BenchmarkMemclr_1-4   50   2777 ns/op
BenchmarkLoop_1-4 30   4003 ns/op
BenchmarkMemclr_10-4   5 38993 ns/op
BenchmarkLoop_10-4 3 43893 ns/op
BenchmarkMemclr_20-4   2 79159 ns/op
BenchmarkLoop_20-4 2 87533 ns/op
BenchmarkMemclr_30-4   1 127745 ns/op
BenchmarkLoop_30-4 1 140770 ns/op
BenchmarkMemclr_40-4   1 217689 ns/op
BenchmarkLoop_40-4 1 234632 ns/op
BenchmarkMemclr_50-45000 344265 ns/op
BenchmarkLoop_50-4  2000 535585 ns/op
BenchmarkMemclr_100-4   1000   1130508 ns/op
BenchmarkLoop_100-4 2000 889592 ns/op
BenchmarkMemclr_200-4   1000   2071970 ns/op
BenchmarkLoop_200-4 1000   1758001 ns/op
PASS
ok  _/Users/bao/program/go/learn/goTour/memclr 37.313s
```



[go-nuts] Re: memclr optimazation does worse?

2016-12-14 Thread notcarl via golang-nuts
Be wary of slice size, as caching is going to have an extremely strong 
effect on the results.  I submitted a CL that made append only clear 
memory that was not going to be overwritten 
( https://github.com/golang/go/commit/c1e267cc734135a66af8a1a5015e572cbb598d44 
).  I thought this would have a much larger impact, but it only had a small 
one.  memclr would zero the memory, but it also brought it into the 
cache, where it was hot for being overwritten.

Have you tried running with perf to see dcache misses for each benchmark?
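The CL above matters because a grown backing array gets its prefix overwritten by the copy of the old elements, so only the tail beyond the old length needs zeroing; for pointer-containing element types the GC must never observe stale words. A minimal sketch of the growth pattern in question (the code is illustrative, not taken from the CL):

```go
package main

import "fmt"

func main() {
	// Growing a slice of pointers forces reallocations; on each growth the
	// runtime copies the old elements into the new backing array and must
	// ensure the remaining capacity never exposes uninitialized words to
	// the garbage collector.
	s := make([]*int, 0, 1)
	for i := 0; i < 100; i++ {
		v := i
		s = append(s, &v)
	}
	fmt.Println(len(s), *s[99]) // prints: 100 99
}
```

After the CL, the zeroing work on each growth is proportional to cap minus the old length rather than the whole new array, though as noted the measured impact was small because the zeroed memory was already cache-hot when overwritten.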

On Wednesday, December 14, 2016 at 6:12:08 AM UTC-8, T L wrote:
>
> I just read this issue thread: https://github.com/golang/go/issues/5373
> and this https://codereview.appspot.com/137880043
> which says:
>
> for i := range a { 
> a[i] = [zero val] 
> }
>
> will be replaced with memclr. 
> I made some benchmarks, but the results are disappointing. 
> When the length of slice/array is very large, memclr is slower.
>
> Result
>
> BenchmarkMemclr_100-4   137.2 ns/op
> BenchmarkLoop_100-4 170.7 ns/op
> BenchmarkMemclr_1000-4  2000   351 ns/op
> BenchmarkLoop_1000-41000   464 ns/op
> BenchmarkMemclr_1-4  100  3623 ns/op
> BenchmarkLoop_1-4100  4940 ns/op
> BenchmarkMemclr_10-4  10 49230 ns/op
> BenchmarkLoop_10-410 58761 ns/op
> BenchmarkMemclr_20-4   5 98165 ns/op
> BenchmarkLoop_20-4 5115833 ns/op
> BenchmarkMemclr_30-4   3170617 ns/op
> BenchmarkLoop_30-4 2190193 ns/op
> BenchmarkMemclr_40-4   2275676 ns/op
> BenchmarkLoop_40-4 2288729 ns/op
> BenchmarkMemclr_50-4   1410280 ns/op
> BenchmarkLoop_50-4 1416195 ns/op
> BenchmarkMemclr_100-4   5000   1025504 ns/op
> BenchmarkLoop_100-4 5000   1012198 ns/op
> BenchmarkMemclr_200-4   2000   2071861 ns/op
> BenchmarkLoop_200-4 2000   2032703 ns/op
>
> test code:
>
> package main
>
> import "testing"
>
> func memclr(a []int) {
> for i := range a {
> a[i] = 0
> }
> }
>
> func memsetLoop(a []int, v int) {
> for i := range a {
> a[i] = v
> }
> }
>
> var i = 0
>
> func BenchmarkMemclr_100(b *testing.B) {
> var a = make([]int, 100)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memclr(a)
> }
> }
>
> func BenchmarkLoop_100(b *testing.B) {
> var a = make([]int, 100)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memsetLoop(a, i)
> }
> }
>
> func BenchmarkMemclr_1000(b *testing.B) {
> var a = make([]int, 1000)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memclr(a)
> }
> }
>
> func BenchmarkLoop_1000(b *testing.B) {
> var a = make([]int, 1000)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memsetLoop(a, i)
> }
> }
>
> func BenchmarkMemclr_1(b *testing.B) {
> var a = make([]int, 1)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memclr(a)
> }
> }
>
> func BenchmarkLoop_1(b *testing.B) {
> var a = make([]int, 1)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memsetLoop(a, i)
> }
> }
>
> func BenchmarkMemclr_10(b *testing.B) {
> var a = make([]int, 10)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memclr(a)
> }
> }
>
> func BenchmarkLoop_10(b *testing.B) {
> var a = make([]int, 10)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memsetLoop(a, i)
> }
> }
>
> func BenchmarkMemclr_20(b *testing.B) {
> var a = make([]int, 20)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memclr(a)
> }
> }
>
> func BenchmarkLoop_20(b *testing.B) {
> var a = make([]int, 20)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memsetLoop(a, i)
> }
> }
>
> func BenchmarkMemclr_30(b *testing.B) {
> var a = make([]int, 30)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memclr(a)
> }
> }
>
> func BenchmarkLoop_30(b *testing.B) {
> var a = make([]int, 30)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memsetLoop(a, i)
> }
> }
>
> func BenchmarkMemclr_40(b *testing.B) {
> var a = make([]int, 40)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memclr(a)
> }
> }
>
> func BenchmarkLoop_40(b *testing.B) {
> var a = make([]int, 40)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memsetLoop(a, i)
> }
> }
>
> func BenchmarkMemclr_50(b *testing.B) {
> var a = make([]int, 50)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> 

[go-nuts] Re: memclr optimazation does worse?

2016-12-14 Thread peterGo
TL,

To paraphrase: There are lies, damned lies, and benchmarks [statistics].

Let's use another machine.

The results of your benchmarks.

$ go version
go version devel +96414ca Wed Dec 14 19:36:20 2016 +0000 linux/amd64
$ go test -bench=. -cpu=4
BenchmarkMemclr_100-4   113.0 ns/op
BenchmarkLoop_100-4 500034.2 ns/op
BenchmarkMemclr_1000-4  2000   110 ns/op
BenchmarkLoop_1000-4 500   262 ns/op
BenchmarkMemclr_1-4  100  1080 ns/op
BenchmarkLoop_1-4 50  2861 ns/op
BenchmarkMemclr_10-4  10 16137 ns/op
BenchmarkLoop_10-4 5 31763 ns/op
BenchmarkMemclr_20-4   5 31774 ns/op
BenchmarkLoop_20-4 2 63448 ns/op
BenchmarkMemclr_30-4   3 47662 ns/op
BenchmarkLoop_30-4 2 95335 ns/op
BenchmarkMemclr_40-4   2 63424 ns/op
BenchmarkLoop_40-4 1127160 ns/op
BenchmarkMemclr_50-4   2 81460 ns/op
BenchmarkLoop_50-4 1159163 ns/op
BenchmarkMemclr_100-4  1204890 ns/op
BenchmarkLoop_100-4 5000327647 ns/op
BenchmarkMemclr_200-4   2000733899 ns/op
BenchmarkLoop_200-4 2000885830 ns/op
PASS
ok  tl  36.282s
$

Memclr is 17.15% faster than Loop for the largest slice: (885830 − 733899) / 885830 ≈ 17.15%.

Peter

On Wednesday, December 14, 2016 at 10:38:43 AM UTC-5, peterGo wrote:
>
> TL,
>
> For your results, it's a small increase of 1.9%.
>
> For my results, it's a small decrease of −2.03%.
>
> $ go version
> go version devel +232991e Wed Dec 14 05:51:01 2016 + linux/amd64
> $ go test -bench=.
> BenchmarkMemclr_100-4   2000   115 ns/op
> BenchmarkLoop_100-4  500   244 ns/op
> BenchmarkMemclr_1000-4   100  1026 ns/op
> BenchmarkLoop_1000-4 100  1387 ns/op
> BenchmarkMemclr_1-4   20 10521 ns/op
> BenchmarkLoop_1-4 10 14285 ns/op
> BenchmarkMemclr_10-4   1146268 ns/op
> BenchmarkLoop_10-4 1168871 ns/op
> BenchmarkMemclr_20-45000291458 ns/op
> BenchmarkLoop_20-4  5000344252 ns/op
> BenchmarkMemclr_30-43000494498 ns/op
> BenchmarkLoop_30-4  2000602575 ns/op
> BenchmarkMemclr_40-42000734921 ns/op
> BenchmarkLoop_40-4  2000779482 ns/op
> BenchmarkMemclr_50-42000981884 ns/op
> BenchmarkLoop_50-4  2000   1008058 ns/op
> BenchmarkMemclr_100-4   1000   2073439 ns/op
> BenchmarkLoop_100-4 1000   2093744 ns/op
> BenchmarkMemclr_200-4300   3932547 ns/op
> BenchmarkLoop_200-4  300   4132627 ns/op
> PASS
> ok  tl34.872s
> $ 
>
> Peter
>
>
> On Wednesday, December 14, 2016 at 9:12:08 AM UTC-5, T L wrote:
>>
>> I just read this issue thread: https://github.com/golang/go/issues/5373
>> and this https://codereview.appspot.com/137880043
>> which says:
>>
>> for i := range a { 
>> a[i] = [zero val] 
>> }
>>
>> will be replaced with memclr. 
>> I made some benchmarks, but the results are disappointing. 
>> When the length of slice/array is very large, memclr is slower.
>>
>> Result
>>
>> BenchmarkMemclr_100-4   137.2 ns/op
>> BenchmarkLoop_100-4 170.7 ns/op
>> BenchmarkMemclr_1000-4  2000   351 ns/op
>> BenchmarkLoop_1000-41000   464 ns/op
>> BenchmarkMemclr_1-4  100  3623 ns/op
>> BenchmarkLoop_1-4100  4940 ns/op
>> BenchmarkMemclr_10-4  10 49230 ns/op
>> BenchmarkLoop_10-410 58761 ns/op
>> BenchmarkMemclr_20-4   5 98165 ns/op
>> BenchmarkLoop_20-4 5115833 ns/op
>> BenchmarkMemclr_30-4   3170617 ns/op
>> BenchmarkLoop_30-4 2190193 ns/op
>> BenchmarkMemclr_40-4   2275676 ns/op
>> BenchmarkLoop_40-4 2288729 ns/op
>> BenchmarkMemclr_50-4   1410280 ns/op
>> BenchmarkLoop_50-4 1416195 ns/op
>> BenchmarkMemclr_100-4   5000   1025504 ns/op
>> BenchmarkLoop_100-4 5000   1012198 ns/op
>> BenchmarkMemclr_200-4   2000   2071861 ns/op
>> BenchmarkLoop_200-4 2000   2032703 ns/op
>>
>> test code:
>>
>> package main
>>
>> import "testing"
>>
>> func memclr(a []int) {
>>

[go-nuts] Re: memclr optimazation does worse?

2016-12-14 Thread peterGo
TL,

For your results, it's a small increase of 1.9%.

For my results, it's a small decrease of −2.03%.

$ go version
go version devel +232991e Wed Dec 14 05:51:01 2016 + linux/amd64
$ go test -bench=.
BenchmarkMemclr_100-4   2000   115 ns/op
BenchmarkLoop_100-4  500   244 ns/op
BenchmarkMemclr_1000-4   100  1026 ns/op
BenchmarkLoop_1000-4 100  1387 ns/op
BenchmarkMemclr_1-4   20 10521 ns/op
BenchmarkLoop_1-4 10 14285 ns/op
BenchmarkMemclr_10-4   1146268 ns/op
BenchmarkLoop_10-4 1168871 ns/op
BenchmarkMemclr_20-45000291458 ns/op
BenchmarkLoop_20-4  5000344252 ns/op
BenchmarkMemclr_30-43000494498 ns/op
BenchmarkLoop_30-4  2000602575 ns/op
BenchmarkMemclr_40-42000734921 ns/op
BenchmarkLoop_40-4  2000779482 ns/op
BenchmarkMemclr_50-42000981884 ns/op
BenchmarkLoop_50-4  2000   1008058 ns/op
BenchmarkMemclr_100-4   1000   2073439 ns/op
BenchmarkLoop_100-4 1000   2093744 ns/op
BenchmarkMemclr_200-4300   3932547 ns/op
BenchmarkLoop_200-4  300   4132627 ns/op
PASS
ok  tl  34.872s
$ 

Peter
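
For what it's worth, the asymmetry between the two helpers is expected: the compiler rewrite applies only to a range loop that stores the element type's zero value, so memclr is eligible while memsetLoop(a, v) with a caller-supplied v remains an ordinary loop. A minimal runnable check of what the two helpers from the quoted code actually do (behavior only, not timing):

```go
package main

import "fmt"

// memclr matches the pattern the compiler rewrites to a runtime
// memory clear: a range loop storing the zero value.
func memclr(a []int) {
	for i := range a {
		a[i] = 0
	}
}

// memsetLoop stores an arbitrary value, so it is not eligible
// for the memclr rewrite and stays a plain element-by-element loop.
func memsetLoop(a []int, v int) {
	for i := range a {
		a[i] = v
	}
}

func main() {
	a := make([]int, 4)
	memsetLoop(a, 9)
	fmt.Println(a) // [9 9 9 9]
	memclr(a)
	fmt.Println(a) // [0 0 0 0]
}
```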


On Wednesday, December 14, 2016 at 9:12:08 AM UTC-5, T L wrote:
>
> I just read this issue thread: https://github.com/golang/go/issues/5373
> and this https://codereview.appspot.com/137880043
> which says:
>
> for i := range a { 
> a[i] = [zero val] 
> }
>
> will be replaced with memclr. 
> I made some benchmarks, but the results are disappointing. 
> When the length of slice/array is very large, memclr is slower.
>
> Result
>
> BenchmarkMemclr_100-4   137.2 ns/op
> BenchmarkLoop_100-4 170.7 ns/op
> BenchmarkMemclr_1000-4  2000   351 ns/op
> BenchmarkLoop_1000-41000   464 ns/op
> BenchmarkMemclr_1-4  100  3623 ns/op
> BenchmarkLoop_1-4100  4940 ns/op
> BenchmarkMemclr_10-4  10 49230 ns/op
> BenchmarkLoop_10-410 58761 ns/op
> BenchmarkMemclr_20-4   5 98165 ns/op
> BenchmarkLoop_20-4 5115833 ns/op
> BenchmarkMemclr_30-4   3170617 ns/op
> BenchmarkLoop_30-4 2190193 ns/op
> BenchmarkMemclr_40-4   2275676 ns/op
> BenchmarkLoop_40-4 2288729 ns/op
> BenchmarkMemclr_50-4   1410280 ns/op
> BenchmarkLoop_50-4 1416195 ns/op
> BenchmarkMemclr_100-4   5000   1025504 ns/op
> BenchmarkLoop_100-4 5000   1012198 ns/op
> BenchmarkMemclr_200-4   2000   2071861 ns/op
> BenchmarkLoop_200-4 2000   2032703 ns/op
>
> test code:
>
> package main
>
> import "testing"
>
> func memclr(a []int) {
> for i := range a {
> a[i] = 0
> }
> }
>
> func memsetLoop(a []int, v int) {
> for i := range a {
> a[i] = v
> }
> }
>
> var i = 0
>
> func BenchmarkMemclr_100(b *testing.B) {
> var a = make([]int, 100)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memclr(a)
> }
> }
>
> func BenchmarkLoop_100(b *testing.B) {
> var a = make([]int, 100)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memsetLoop(a, i)
> }
> }
>
> func BenchmarkMemclr_1000(b *testing.B) {
> var a = make([]int, 1000)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memclr(a)
> }
> }
>
> func BenchmarkLoop_1000(b *testing.B) {
> var a = make([]int, 1000)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memsetLoop(a, i)
> }
> }
>
> func BenchmarkMemclr_1(b *testing.B) {
> var a = make([]int, 1)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memclr(a)
> }
> }
>
> func BenchmarkLoop_1(b *testing.B) {
> var a = make([]int, 1)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memsetLoop(a, i)
> }
> }
>
> func BenchmarkMemclr_10(b *testing.B) {
> var a = make([]int, 10)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memclr(a)
> }
> }
>
> func BenchmarkLoop_10(b *testing.B) {
> var a = make([]int, 10)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memsetLoop(a, i)
> }
> }
>
> func BenchmarkMemclr_20(b *testing.B) {
> var a = make([]int, 20)
> b.ResetTimer()
> for i := 0; i < b.N; i++ {
> memclr(a)
> }
> }
>
> func BenchmarkLoop_20(b *testing.B) {
> var a = make([]int, 20)
>