Re: Creating a new seq is not that fast

2017-04-18 Thread cblake
If zeroed memory is required semantically, then it's better to compare against 
`calloc` not `malloc` in an apples-to-apples sense.

Also, at least on Linux on you need to be careful in benchmarking this sort of 
thing. As you get to larger allocations the OS can actually give the process 
many linked copies of a copy-on-write 4096 byte page of all zero bytes. Because 
of the C-O-W, the initial allocation alone can seem much cheaper than it really 
is. The work is deferred to being done lazily as memory is written to.

This zero page business can also cause confusion with calloc-never-write 
read-only workloads in benchmarks (say `calloc` and then sum all the ints) 
because on x86 the cache is physically mapped and that single zero page can be 
fully L1 resident. Such L1 residence can give a "false" speed advantage 
compared to more aged/populated memory. [ This latter point is not relevant 
here, but seemed worth pointing out anyway -- it was how I personally 
discovered Linux was finally doing this optimization. ]


Re: Creating a new seq is not that fast

2017-04-18 Thread Jehan
For me, a `newSeq[int]()` takes about 23 nanoseconds and a 
`newSeqOfCap[int](128)` takes about 90 nanoseconds (2.5 GHz Intel Core i7), 
including the time for garbage collection, averaged over 100,000,000 
allocations.

This sounds about right. You have some unavoidable costs doing the allocation 
proper, and then for `newSeqOfCap[int](128)`, you have to also initialize about 
1K of additional memory.


Re: Creating a new seq is not that fast

2017-04-18 Thread qqtop
On my machine I find that



s = newSeqOfCap[int](128)


is about 10 times slower than



s = newSeq[int]()
 


Re: Creating a new seq is not that fast

2017-04-18 Thread Varriount
Probably. I wouldn't be surprised if part of the reason Nim's allocator is 
slightly slower is due to zeroing memory too.


Re: Creating a new seq is not that fast

2017-04-18 Thread bpr
> Neither comes close to stack allocation, but again, that's expected.

@Varriount, I added a benchmark with 
[alloca](https://github.com/bpr/nim-vla/blob/master/vla.nim) and using your 
compiler args it was consistently slightly faster than stack allocation via 
array for those sizes. Maybe because alloca doesn't zero the array on 
initialization and array does?


Re: Creating a new seq is not that fast

2017-04-17 Thread Varriount
Unfortunately, creating good benchmarks is hard. The benchmark above has some 
subtle faults that lessen its effectiveness.

First off, MyBuffer is a tuple type: 


type
  MyBuffer = tuple
d: array[128, int]
len: int


Tuple types are object types, which means they can be allocated on the stack. 
The only time object types are not allocated on the stack is when the object 
type is part of a reference type.

Because MyBuffer is a tuple, all tx() has to do is allocate ~129 integers worth 
of memory from the stack, which is a simple bump allocation (all the program 
has to do is move the current stack pointer up by X). A better way to test 
Nim's allocator is to compare it with the system malloc implementation.

Below is my version of the above benchmark. I've not tested it for Windows 
users - the worst that might happen is that you get a rather large file called 
'nul' full of numbers.


import times, random

proc malloc(size: uint): pointer {.header: "", importc: "malloc".}
proc free(p: pointer) {.header: "", importc: "free".}

const bufferSize = 128

type
  MyBuffer = array[bufferSize, int]

proc testStackAllocation(): int =
  var s: MyBuffer
  result = cast[int](addr s)

proc testSequenceAllocation(): int =
  var s = newSeqOfCap[int](bufferSize)
  result = cast[int](addr s)

proc testMallocAllocation(): int =
  var res = malloc(uint(sizeof(MyBuffer)))
  free(res)
  result = cast[int](res)

proc testRandom(): int =
  result = random(7)


proc main =
  # Use writing to /dev/null to prevent compiler optimizations
  when defined(posix):
let nullfh = open("/dev/null", fmReadWrite)
  else:
let nullfh = open("nul") # untested!
  
  var baseline: float
  
  # Establish a baseline time of a really simple operation + writing to 
stdout
  # This way we can esentially measure how fast stdout can be written to, 
and
  # factor that out of other measurements.
  let z = cpuTime()
  for i in 0..10_000_000:
nullfh.write(i)
  baseline = cpuTime() - z
  echo "Baseline time:", baseline
  
  # Template to run test procedures.
  template runProc(testProc: typed, testName: string): untyped =
let t = cpuTime()
for _ in 0..10_000_000:
  var i = testProc()
  nullfh.write(i)
echo "Time for ", testName, ": ", (cpuTime() - t) - baseline
  
  # Test:
  #  - Stack allocation (which is usually a bump allocator)
  #  - Malloc allocation (system defined)
  #  - Nim allocation
  #  - Random number generation
  runProc(testStackAllocation, "stack allocation test")
  runProc(testSequenceAllocation, "sequence allocation test")
  runProc(testMallocAllocation, "malloc allocation test")
  runProc(testRandom, "random number generation test")

main()


Output using various GC backends (Mac OS Sierra, 2.5 GHz Intel Core i7) : 


# nim c -d:release --passC:"-flto" --passL:"-flto" --gc:markAndSweep 
benchmark.nim && ./benchmark
Baseline time: 1.072926
Time for stack allocation test: 0.16431101
Time for sequence allocation test: 1.142116
Time for malloc allocation test: 0.81331399
Time for random number generation test: -0.14664202

# nim c -d:release --passC:"-flto" --passL:"-flto" benchmark.nim && 
./benchmark
Baseline time: 1.053752
Time for stack allocation test: 0.14567701
Time for sequence allocation test: 1.227938
Time for malloc allocation test: 0.82476394
Time for random number generation test: -0.10941602

# Version of the benchmark with mark and sweep cycle collection disabled
# nim c -d:release --passC:"-flto" --passL:"-flto" benchmark.nim && 
./benchmark
Baseline time: 1.075432
Time for stack allocation test: 0.146002
Time for sequence allocation test: 1.189319
Time for malloc allocation test: 0.76232396
Time for random number generation test: -0.17652802


As you can see, Nim's allocator is slower than the system malloc allocator, but 
not by much (I chalk some of this up to the compiler being able to use better 
inlining and intrinsics for malloc). Neither comes close to stack allocation, 
but again, that's expected.


Re: Creating a new seq is not that fast

2017-04-17 Thread cdunn2001
This is a problem for me too. My solution is to re-use seqs as often as 
possible.

I have this (in my Ubuntu14 VM): 


10 * test: 2.598e-05
10 * tx: 2.049e-06
10 * t0: 2.965e-06


But in your benchmark, you need to use `i` so the compiler cannot elide any 
code. When I accumulate i: 


10 * test: 4.787e-05
10 * tx: 1.832e-06
10 * t0: 2.265e-06
i=21



Creating a new seq is not that fast

2017-04-17 Thread Stefan_Salewski
Of course not, there is much allocation work involved. But about 500 ns (a few 
thousand CPU cycles) is indeed more as my expectations. I discovered that when 
tuning my chess engine -- using one more tiny seq noticeable reduced 
performance. Of course seq's are convenient, because of existing add() 
operation. A plain array allocated on the stack is faster, fixed size is OK, 
because I have to store only the moves or captures for current position, but I 
would have to track current add position. Chess is recursive, so I can not use 
one single global seq. My current idea is to use a global seq as as first 
buffer for collecting captures, and only swap() that buffer with a new 
allocated seq when it turns out that there are possible captures at all 
available. (I guess swap will not copy data, but only pointers, so it may be 
fast) Or I may use my own defined container which is allocated on the stack. 
For that case I would have to define my own add() and sort() operations. I 
don't think that I will get trouble with stack size...


import times, random

type
  MyBuffer = tuple
d: array[128, int]
len: int

proc test(): int =
  var s = newSeqOfCap[int](128)
  s.len

proc tx(): int =
  var s: MyBuffer
  s.len

proc t0(): int =
  random(7)

proc main =
  var t: float # cpuTime()
  var i: int
  
  t = cpuTime()
  i = test()
  i = test()
  i = test()
  i = test()
  i = test()
  i = test()
  i = test()
  i = test()
  i = test()
  i = test()
  echo "10 * test: ", cpuTime() - t
  
  t = cpuTime()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  i = tx()
  echo "10 * tx: ", cpuTime() - t
  
  t = cpuTime()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  i = t0()
  echo "10 * t0: ", cpuTime() - t

main()

# nim c -d:release t.nim
# 10 * test: 5.013e-06
# 10 * tx: 0.0
# 10 * t0: 0.0