http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50257
--- Comment #6 from Justin SB <justin at fathomdb dot com> 2011-09-01 17:33:32 UTC --- Thanks so much for the help. I created a test case (attached) and tried out the patch (the second version); it is a big improvement. Here are the timings: g++ -g -O3 --std=c++0x test.cc Without patch: user 0m13.201s With patch: user 0m10.237s --- The special casing is obviously a bit of a hack, so I wondered if we could do better by doing a linear scan if the value is less than a threshold (which would have a much nicer memory pattern). It was a small improvement, but nowhere near what the special-casing of the value 10 does: With linear scan for small n: user 0m12.561s I therefore believe the real win is because the special case allows the compiler to optimize away the entire lookup in __prime_list, because the function is inlined and the value of n is known. --- There is a comment in hashtable_policy: // XXX This is a hack. There's no good reason for any of // _Prime_rehash_policy's member functions to be inline. However, I think that (with the patch) this isn't true any more. There's a big performance win by avoiding touching __prime_list, so at least the special case should remain inline. --- Based on your malloc comment, I'm using tcmalloc which is a bit faster than the default allocator, which is probably why it showed up; the patch still knocks the same ~3 seconds off the time: g++ -g -O3 --std=c++0x test.cc -ltcmalloc Without patch, with -ltcmalloc user 0m10.173s With patch, with -ltcmalloc user 0m7.288s --- In terms of the (non cumulative) cpu cycles spent in the lookup vs new/delete: With -ltcmalloc: Without patch: 44% iter() / 48% memory. With patch: 34% iter() / 63% memory With stock malloc: Without patch: 38% iter() / 58% memory With patch: 18% iter() / 79% memory So memory is indeed a big cost, but as I was using tcmalloc I was able to see that the constructor was surprisingly expensive (spending about as much time in the constructor as I was in allocation or deallocation) --- TLDR: The patch works for me, and I think it's a significant win, because it allows the compiler to optimize __prime_list away entirely.