Re: A microbenchmarking library

2018-10-02 Thread lemonboy
Thanks to @dm1try's latest PR it now works fine on OSX!


Re: A microbenchmarking library

2018-10-02 Thread mratsim
I've left it running for longer; it's definitely stuck. The scale factor is 1.0.


Re: A microbenchmarking library

2018-10-02 Thread lemonboy
Two minutes tops on a very busy machine. I guess there's something wrong in the
`getMonotonicTime` implementation for OSX then... What is the computed
`scaleFactor` in `timer.nim`?
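
For context, monotonic time on macOS is usually obtained from
mach_absolute_time, whose raw ticks must be scaled by the mach_timebase_info
ratio. The sketch below shows the general idea; it is not necessarily what
`timer.nim` actually does:

type
  MachTimebaseInfo {.importc: "mach_timebase_info_data_t",
                     header: "<mach/mach_time.h>".} = object
    numer, denom: uint32

proc mach_absolute_time(): uint64 {.importc, header: "<mach/mach_time.h>".}
proc mach_timebase_info(info: var MachTimebaseInfo): cint
  {.importc, header: "<mach/mach_time.h>".}

proc getMonotonicNs(): uint64 =
  # a scale factor of 1.0 means the raw ticks are already nanoseconds
  # (numer == denom)
  var info: MachTimebaseInfo
  discard mach_timebase_info(info)
  let scaleFactor = info.numer.float64 / info.denom.float64
  uint64(mach_absolute_time().float64 * scaleFactor)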


Re: A microbenchmarking library

2018-10-02 Thread mratsim
How long is it supposed to run? It's also freezing for me on MacOS; I killed it
after a minute.


Re: A microbenchmarking library

2018-10-02 Thread lemonboy
Good news, criterion is now available on nimble!

I also need your help: if you're on Windows or OSX, can you please check whether
the `tfib.nim` example runs fine for you? I tried setting up Travis to run the
test suite on OSX too, but got a timeout after 10 minutes, so either the timing
code is wrong for that OS or the machine is very slow. Thank you in advance!
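
For anyone trying it out, installation is a single command (assuming a working
nimble setup):

$ nimble install criterion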


Re: A microbenchmarking library

2018-08-26 Thread lemonboy
Oh I see what you mean. As usual one must be smart enough to set the CPU
governor to 'performance' and pin the benchmarking thread to a single core to
prevent measurement errors due to SMP migration.
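
On Linux that amounts to something like the following (a sketch; the core index
and the binary name `./bench` are placeholders):

$ sudo cpupower frequency-set -g performance
$ taskset -c 2 ./bench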

Benchmarking is twice as hard as writing code :)


Re: A microbenchmarking library

2018-08-26 Thread Stefan_Salewski
Indeed, gcc 8.2 generates perfectly optimized code using the cmov instruction
for the "GOTO label" variant, as [https://www.godbolt.org/](https://www.godbolt.org/)
proves with compiler option -O3:


#include <stdint.h>

int w1(int64_t a, int64_t b) {
    int r = 0;
    if (a == b)
        r = 10;
    else
        r = 17;
    return r;
}

int w2(int64_t a, int64_t b) {
    int r = 10;
    if (a == b)
        goto LW2;
    r = 17;
LW2:
    return r;
}

w1(long, long):
  cmp rdi, rsi
  mov edx, 17
  mov eax, 10
  cmovne eax, edx
  ret
w2(long, long):
  cmp rdi, rsi
  mov edx, 10
  mov eax, 17
  cmove eax, edx
  ret




Re: A microbenchmarking library

2018-08-25 Thread mratsim
I don't see an issue with the gotos; those are direct jumps, so there is no cost
there.

The real question is the if branching vs cmov.

I do not know the impact of the Meltdown, Spectre and L1TF mitigations on
Haswell branch predictors though, as one of their main goals was fixing
speculative execution, and much of Haswell's performance came from its very
impressive predictors.

Here is a benchmark suite for branch vs branchless: 
[https://github.com/xiadz/cmov](https://github.com/xiadz/cmov).

On past architectures (circa 2010) if/else was faster than cmov for predictable
branches (like testing for nil); that seems to have changed with Skylake (the
security patches might change it again).

Also ARM and MIPS architectures, and even AMD processors, might have a
completely different behaviour.

Finally, cmov only makes sense for assignments like this:

let foo = if bool_a: 10 else: 20


Re: A microbenchmarking library

2018-08-25 Thread Stefan_Salewski
Yes, as I already said, 100 cycles is fine for me.

mratsim, what do you generally think about all the GOTOs? I had the feeling they
make it a bit hard for the C compiler to optimize the code 100%, for example it
may be difficult to apply cmov instructions to get branchless code. I think it
will make no difference in practice, and of course no one intends to change the
code generator, but I still wonder whether my assumption is completely wrong.


Re: A microbenchmarking library

2018-08-25 Thread mratsim
100 cycles is the cost of a cache miss, so allocating and then comparing 2
strings in 100 cycles seems pretty reasonable.


Re: A microbenchmarking library

2018-08-25 Thread Stefan_Salewski
For your example, my output is


$ nim c t.nim
$ ./t
Benchmark: fib5()
Collected 241 samples
Warning: Found 6 mild and 15 extreme outliers in the time measurements
Warning: Found 8 mild and 10 extreme outliers in the cycles measurements
Time
  Mean:  251.7599ns (249.2240ns .. 255.4843ns)
  Std:   24.0680ns (7.8357ns .. 37.5615ns)
  Slope: 249.0758ns (248.7008ns .. 249.5158ns)
  r^2:   1. (1. .. 1.)
Cycles
  Mean:  647cycles (642cycles .. 654cycles)
  Std:   46cycles (15cycles .. 72cycles)
  Slope: 645cycles (644cycles .. 646cycles)
  r^2:   1. (1. .. 1.)



Note the ratio of cycles to ns: it is in the range 1..10 as expected, since that
ratio is just the CPU clock in GHz. On your github page it seems to be
approximately 1000? Is the scaling wrong for me, or am I missing something?
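
(Quick check with the numbers above: 647 cycles / 251.76 ns ≈ 2.57 cycles/ns,
i.e. a clock of roughly 2.6 GHz, squarely in that range.)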


Re: A microbenchmarking library

2018-08-25 Thread lemonboy
> The example output on your github page is still wrong, the cycles count is 
> much too high.

Is it? I've probably benchmarked it without the release switch.

As usual, beware of the optimizer: make sure the benchmarked function doesn't
elide the comparison completely. Introducing an argument using measureArg is a
nice way to prevent this class of problems.
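
Independent of the library's API, a general way to keep the optimizer honest is
to make the result observable, for example through a volatile global. A minimal
sketch (not criterion-specific; `blackhole` is a made-up name):

var blackhole {.volatile.}: bool

proc cmpStrings() =
  var a = "Rust"
  var b = "Nim"
  # storing the result to a volatile location forces the comparison to be kept
  blackhole = (a != b)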


Re: A microbenchmarking library

2018-08-25 Thread Stefan_Salewski
OK, here is the assembler listing...

# nim c -d:release t.nim

# gcc.options.speed = "-save-temps -march=native -O3 -fno-strict-aliasing"


$ cat t.nim
proc t(a, b: string): bool =
  a == b

proc main =
  var a = "Rust"
  var b = "Nim"
  echo t(a, b)

main()





t_IxGYsz1VoA2HIiGBY5mgGw:
.LFB20:
    .cfi_startproc
    movl    $1, %eax
    cmpq    %rsi, %rdi
    je      .L32
    testq   %rdi, %rdi
    je      .L35
    movq    (%rdi), %rdx
    testq   %rsi, %rsi
    je      .L36
    xorl    %eax, %eax
    cmpq    (%rsi), %rdx
    je      .L37
.L32:
    ret
    .p2align 4,,10
    .p2align 3
.L35:
    testq   %rsi, %rsi
    je      .L32
    cmpq    $0, (%rsi)
    sete    %al
    ret
    .p2align 4,,10
    .p2align 3
.L37:
    testq   %rdx, %rdx
    je      .L28
    subq    $8, %rsp
    .cfi_def_cfa_offset 16
    addq    $16, %rsi
    addq    $16, %rdi
    call    memcmp@PLT
    testl   %eax, %eax
    sete    %al
    addq    $8, %rsp
    .cfi_def_cfa_offset 8
    ret
    .p2align 4,,10
    .p2align 3
.L36:
    testq   %rdx, %rdx
    sete    %al
    ret
.L28:
.L21:
.L24:
    movl    $1, %eax
    ret
    .cfi_endproc




Re: A microbenchmarking library

2018-08-25 Thread Araq
> C code looks like

And what does the produced assembler code look like?


Re: A microbenchmarking library

2018-08-25 Thread Stefan_Salewski
Nice. The example output on your github page is still wrong, the cycles count is
much too high.

The minimum cycle count seems to be 4 for most simple procs, but that is no
problem. I have just tested a plain string comparison -- about 100 cycles, as
expected from its C code. So string comparison is not a very cheap operation in
Nim -- not too surprising, as the special nil case has to be considered.


import criterion

var cfg = newDefaultConfig()

benchmark cfg:

  proc t0() {.measure.} =
    var a = "Rust"
    var b = "Nim"
    doAssert a != b

  proc t1() {.measure.} =
    var a = "Rust"
    var b = "Nim"
    doAssert a > b




$ ./t
Benchmark: t0()
Collected 277 samples
Warning: Found 4 mild and 8 extreme outliers in the time measurements
Warning: Found 4 mild and 6 extreme outliers in the cycles measurements
Time
  Mean:  42.4753ns (41.2772ns .. 43.8444ns)
  Std:   11.6368ns (7.6329ns .. 16.4890ns)
  Slope: 42.2136ns (41.7204ns .. 42.9674ns)
  r^2:   0.9977 (0.9960 .. 0.9988)
Cycles
  Mean:  108cycles (105cycles .. 111cycles)
  Std:   25cycles (18cycles .. 33cycles)
  Slope: 109cycles (108cycles .. 111cycles)
  r^2:   0.9977 (0.9958 .. 0.9988)

Benchmark: t1()
Collected 275 samples
Warning: Found 21 mild and 5 extreme outliers in the time measurements
Warning: Found 2 mild and 3 extreme outliers in the cycles measurements
Time
  Mean:  42.1040ns (41.2619ns .. 42.9831ns)
  Std:   7.4068ns (5.1946ns .. 10.0585ns)
  Slope: 48.4274ns (46.7242ns .. 49.9679ns)
  r^2:   0.9934 (0.9891 .. 0.9968)
Cycles
  Mean:  107cycles (105cycles .. 109cycles)
  Std:   15cycles (13cycles .. 17cycles)
  Slope: 125cycles (121cycles .. 129cycles)
  r^2:   0.9934 (0.9886 .. 0.9966)




C code looks like


static N_INLINE(NIM_BOOL, eqStrings)(NimStringDesc* a, NimStringDesc* b) {
    NIM_BOOL result;
    NI alen;
    NI blen;
    {
        result = (NIM_BOOL)0;
        {
            if (!(a == b)) goto LA3_;
            result = NIM_TRUE;
            goto BeforeRet_;
        }
        LA3_: ;
        {
            if (!(a == NIM_NIL)) goto LA7_;
            alen = ((NI) 0);
        }
        goto LA5_;
        LA7_: ;
        {
            alen = (*a).Sup.len;
        }
        LA5_: ;
        {
            if (!(b == NIM_NIL)) goto LA12_;
            blen = ((NI) 0);
        }
        goto LA10_;
        LA12_: ;
        {
            blen = (*b).Sup.len;
        }
        LA10_: ;
        {
            if (!(alen == blen)) goto LA17_;
            {
                if (!(alen == ((NI) 0))) goto LA21_;
                result = NIM_TRUE;
                goto BeforeRet_;
            }
            LA21_: ;
            result = equalMem_fmeFeLBvgmAHG9bC8ETS9bYQt(((void*) ((*a).data)),
                ((void*) ((*b).data)), ((NI) (alen)));
            goto BeforeRet_;
        }
        LA17_: ;
    }
    BeforeRet_: ;
    return result;
}





Re: A microbenchmarking library

2018-08-20 Thread lemonboy
That's now fixed! Thanks for the heads up!


Re: A microbenchmarking library

2018-08-20 Thread dataman
Cool! Where is the dip module? :) (it's referenced in statistics.nim)


A microbenchmarking library

2018-08-20 Thread lemonboy
Dear Nimmers,

[Here's](https://github.com/LemonBoy/criterion.nim) a nice little library for
all your microbenchmarking needs. Benchmarking is hard, and the aim of this
library is to abstract away most of the complexity and pitfalls and provide the
user with (hopefully) meaningful results. If you've ever used Haskell's
[criterion](http://www.serpentine.com/criterion) package then you may already
be familiar with how this library works.

I've been chipping away at the API in order to make the library as ergonomic as
possible, and now I think it's about time to ask for feedback from some
third-party users.
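
For a first taste, here is a minimal example in the style of the snippets shown
elsewhere in this thread (a sketch; see the repository for the full API):

import criterion

proc fib(n: int): int =
  # naive recursion, just to have something non-trivial to measure
  if n < 2: n else: fib(n - 1) + fib(n - 2)

var cfg = newDefaultConfig()

benchmark cfg:
  proc fib5() {.measure.} =
    # the doAssert consumes the result, so the call cannot be optimized away
    doAssert fib(5) == 5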

* * *

Have fun and keep on hacking,

LemonBoy