Re: Creating a new seq is not that fast
If zeroed memory is required semantically, then it's better to compare against `calloc` not `malloc` in an apples-to-apples sense. Also, at least on Linux on you need to be careful in benchmarking this sort of thing. As you get to larger allocations the OS can actually give the process many linked copies of a copy-on-write 4096 byte page of all zero bytes. Because of the C-O-W, the initial allocation alone can seem much cheaper than it really is. The work is deferred to being done lazily as memory is written to. This zero page business can also cause confusion with calloc-never-write read-only workloads in benchmarks (say `calloc` and then sum all the ints) because on x86 the cache is physically mapped and that single zero page can be fully L1 resident. Such L1 residence can give a "false" speed advantage compared to more aged/populated memory. [ This latter point is not relevant here, but seemed worth pointing out anyway -- it was how I personally discovered Linux was finally doing this optimization. ]
Re: Creating a new seq is not that fast
For me, a `newSeq[int]()` takes about 23 nanoseconds and a `newSeqOfCap[int](128)` takes about 90 nanoseconds (2.5 GHz Intel Core i7), including the time for garbage collection, averaged over 100,000,000 allocations. This sounds about right. You have some unavoidable costs doing the allocation proper, and then for `newSeqOfCap[int](128)`, you have to also initialize about 1K of additional memory.
Re: Creating a new seq is not that fast
On my machine I find that s = newSeqOfCap[int](128) is about 10 times slower than s = newSeq[int]()
Re: Creating a new seq is not that fast
Probably. I wouldn't be surprised if part of the reason Nim's allocator is slightly slower is due to zeroing memory too.
Re: Creating a new seq is not that fast
> Neither comes close to stack allocation, but again, that's expected. @Varriount, I added a benchmark with [alloca](https://github.com/bpr/nim-vla/blob/master/vla.nim) and using your compiler args it was consistently slightly faster than stack allocation via array for those sizes. Maybe because alloca doesn't zero the array on initialization and array does?
Re: Creating a new seq is not that fast
Unfortunately, creating good benchmarks is hard. The benchmark above has some subtle faults that lessen its effectiveness. First off, MyBuffer is a tuple type: type MyBuffer = tuple d: array[128, int] len: int Tuple types are object types, which means they can be allocated on the stack. The only time object types are not allocated on the stack is when the object type is part of a reference type. Because MyBuffer is a tuple, all tx() has to do is allocate ~129 integers worth of memory from the stack, which is a simple bump allocation (all the program has to do is move the current stack pointer up by X). A better way to test Nim's allocator is to compare it with the system malloc implementation. Below is my version of the above benchmark. I've not tested it for Windows users - the worst that might happen is that you get a rather large file called 'nul' full of numbers. import times, random proc malloc(size: uint): pointer {.header: "", importc: "malloc".} proc free(p: pointer) {.header: "", importc: "free".} const bufferSize = 128 type MyBuffer = array[bufferSize, int] proc testStackAllocation(): int = var s: MyBuffer result = cast[int](addr s) proc testSequenceAllocation(): int = var s = newSeqOfCap[int](bufferSize) result = cast[int](addr s) proc testMallocAllocation(): int = var res = malloc(uint(sizeof(MyBuffer))) free(res) result = cast[int](res) proc testRandom(): int = result = random(7) proc main = # Use writing to /dev/null to prevent compiler optimizations when defined(posix): let nullfh = open("/dev/null", fmReadWrite) else: let nullfh = open("nul") # untested! var baseline: float # Establish a baseline time of a really simple operation + writing to stdout # This way we can esentially measure how fast stdout can be written to, and # factor that out of other measurements. let z = cpuTime() for i in 0..10_000_000: nullfh.write(i) baseline = cpuTime() - z echo "Baseline time:", baseline # Template to run test procedures. template runProc(testProc: typed, testName: string): untyped = let t = cpuTime() for _ in 0..10_000_000: var i = testProc() nullfh.write(i) echo "Time for ", testName, ": ", (cpuTime() - t) - baseline # Test: # - Stack allocation (which is usually a bump allocator) # - Malloc allocation (system defined) # - Nim allocation # - Random number generation runProc(testStackAllocation, "stack allocation test") runProc(testSequenceAllocation, "sequence allocation test") runProc(testMallocAllocation, "malloc allocation test") runProc(testRandom, "random number generation test") main() Output using various GC backends (Mac OS Sierra, 2.5 GHz Intel Core i7) : # nim c -d:release --passC:"-flto" --passL:"-flto" --gc:markAndSweep benchmark.nim && ./benchmark Baseline time: 1.072926 Time for stack allocation test: 0.16431101 Time for sequence allocation test: 1.142116 Time for malloc allocation test: 0.81331399 Time for random number generation test: -0.14664202 # nim c -d:release --passC:"-flto" --passL:"-flto" benchmark.nim && ./benchmark Baseline time: 1.053752 Time for stack allocation test: 0.14567701 Time for sequence allocation test: 1.227938 Time for malloc allocation test: 0.82476394 Time for random number generation test: -0.10941602 # Version of the benchmark with mark and sweep cycle collection disabled # nim c -d:release --passC:"-flto" --passL:"-flto" benchmark.nim && ./benchmark Baseline time: 1.075432 Time for stack allocation test: 0.146002 Time for sequence allocation test: 1.189319 Time for malloc allocation test: 0.76232396 Time for random number generation test: -0.17652802 As you can see, Nim's allocator is slower than the system malloc allocator, but not by much (I chalk some of this up to the compiler being able to use better inlining and intrinsics for malloc). Neither comes close to stack allocation, but again, that's expected.
Re: Creating a new seq is not that fast
This is a problem for me too. My solution is to re-use seqs as often as possible. I have this (in my Ubuntu14 VM): 10 * test: 2.598e-05 10 * tx: 2.049e-06 10 * t0: 2.965e-06 But in your benchmark, you need to use `i` so the compiler cannot elide any code. When I accumulate i: 10 * test: 4.787e-05 10 * tx: 1.832e-06 10 * t0: 2.265e-06 i=21
Creating a new seq is not that fast
Of course not, there is much allocation work involved. But about 500 ns (a few thousand CPU cycles) is indeed more as my expectations. I discovered that when tuning my chess engine -- using one more tiny seq noticeable reduced performance. Of course seq's are convenient, because of existing add() operation. A plain array allocated on the stack is faster, fixed size is OK, because I have to store only the moves or captures for current position, but I would have to track current add position. Chess is recursive, so I can not use one single global seq. My current idea is to use a global seq as as first buffer for collecting captures, and only swap() that buffer with a new allocated seq when it turns out that there are possible captures at all available. (I guess swap will not copy data, but only pointers, so it may be fast) Or I may use my own defined container which is allocated on the stack. For that case I would have to define my own add() and sort() operations. I don't think that I will get trouble with stack size... import times, random type MyBuffer = tuple d: array[128, int] len: int proc test(): int = var s = newSeqOfCap[int](128) s.len proc tx(): int = var s: MyBuffer s.len proc t0(): int = random(7) proc main = var t: float # cpuTime() var i: int t = cpuTime() i = test() i = test() i = test() i = test() i = test() i = test() i = test() i = test() i = test() i = test() echo "10 * test: ", cpuTime() - t t = cpuTime() i = tx() i = tx() i = tx() i = tx() i = tx() i = tx() i = tx() i = tx() i = tx() i = tx() echo "10 * tx: ", cpuTime() - t t = cpuTime() i = t0() i = t0() i = t0() i = t0() i = t0() i = t0() i = t0() i = t0() i = t0() i = t0() echo "10 * t0: ", cpuTime() - t main() # nim c -d:release t.nim # 10 * test: 5.013e-06 # 10 * tx: 0.0 # 10 * t0: 0.0