If zeroed memory is required semantically, then the apples-to-apples comparison is against `calloc`, not bare `malloc` (or against `malloc` followed by an explicit `memset`).
Also, at least on Linux, you need to be careful benchmarking this sort of thing. For larger allocations the kernel can satisfy the request by mapping many virtual pages onto a single shared, copy-on-write 4096-byte page of all zero bytes. Because of the COW, the initial allocation alone can look much cheaper than it really is: the work is deferred and done lazily as the memory is first written to.

This zero-page business can also cause confusion in calloc-then-never-write, read-only benchmarks (say, `calloc` a buffer and then sum all the ints): on x86 the caches are physically tagged, so that single zero page can be fully L1-resident. Such L1 residence can give a "false" speed advantage compared to more aged/populated memory. [This latter point is not relevant here, but seemed worth pointing out anyway -- it was how I personally discovered Linux was finally doing this optimization.]