If zeroed memory is required semantically, then the apples-to-apples comparison 
is against `calloc`, not `malloc`.
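To illustrate, here is a minimal sketch (the helper name `zeroed_alloc` is my own) of what a `malloc`-based allocation has to do to match `calloc`'s guarantees, including the size-overflow check `calloc` performs:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: a malloc-based equivalent of calloc(n, sz).
 * calloc also guards against n * sz overflowing, so this does too. */
void *zeroed_alloc(size_t n, size_t sz)
{
    if (sz != 0 && n > (size_t)-1 / sz)
        return NULL;              /* n * sz would overflow size_t */
    void *p = malloc(n * sz);
    if (p != NULL)
        memset(p, 0, n * sz);     /* the zeroing calloc may avoid via zero pages */
    return p;
}
```

Note the `memset` here always touches every byte, whereas a large `calloc` may hand back lazily-zeroed pages, which is exactly why the two are not interchangeable in a benchmark.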

Also, at least on Linux, you need to be careful when benchmarking this sort of 
thing. For larger allocations the OS can give the process many virtual pages 
all mapped copy-on-write to a single 4096-byte page of zero bytes. Because of 
the CoW, the initial allocation alone can seem much cheaper than it really is: 
the work is deferred and done lazily as the memory is written to.
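One way to account for that deferred work in a timed region is to write one byte per page so the copy-on-write faults are actually taken. A sketch (`touch_pages` is my own name, and 4096 is an assumed page size; `sysconf(_SC_PAGESIZE)` reports the real one):

```c
#include <stddef.h>

#define ASSUMED_PAGE_SIZE 4096  /* assumption; query sysconf(_SC_PAGESIZE) in real code */

/* Write one byte into each page so the kernel must take the CoW fault
 * and back the page with real, private memory.  Returns pages touched. */
size_t touch_pages(unsigned char *buf, size_t len)
{
    size_t pages = 0;
    for (size_t off = 0; off < len; off += ASSUMED_PAGE_SIZE) {
        buf[off] = 1;   /* any write forces the copy-on-write */
        pages++;
    }
    return pages;
}
```

Running this inside the timed region moves the lazily-deferred fault cost back into the measurement, rather than letting it leak into whatever code happens to write the memory first.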

This zero-page business can also cause confusion in benchmarks with 
calloc-then-never-write read-only workloads (say, `calloc` a buffer and then 
sum all the ints), because on x86 the cache is physically mapped and that 
single zero page can be fully L1-resident. Such L1 residence can give a 
"false" speed advantage compared to more aged/populated memory. [This latter 
point is not relevant here, but seemed worth pointing out anyway; it was how 
I personally discovered Linux was finally doing this optimization.]
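To make that pitfall concrete, here is a sketch of such a read-only workload, plus a mitigation (both function names are my own): writing each element first, even with zero, forces every page to be privately backed instead of aliasing the one cache-hot zero page.

```c
#include <stdlib.h>

/* Sum a calloc'd int buffer.  If the buffer was never written, every
 * virtual page may alias the kernel's single shared zero page, which
 * can sit entirely in L1 and make the reads look unrealistically fast. */
long sum_ints(const int *p, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Mitigation: write each element before the timed read.  The kernel does
 * not inspect the value, so storing zero still triggers the CoW fault
 * and gives each page its own physical backing. */
void force_private_pages(int *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = 0;
}
```

A benchmark that calls `force_private_pages` (or otherwise dirties the buffer) before timing `sum_ints` measures memory that behaves like aged/populated memory, not like one L1-resident page repeated.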
