On Friday 18 March 2005 21:28, Andi Kleen wrote: > On Fri, Mar 18, 2005 at 07:00:06AM -0800, Christoph Lameter wrote: > > On Fri, 18 Mar 2005, Denis Vlasenko wrote: > > > > > NT stores are not about 5% increase. 200%-300%. Provided you are ok with > > > the fact that zeroed page ends up evicted from cache. Luckily, this is > > > exactly > > > what you want with prezeroing. > > > > These are pretty significant results. Maybe its best to use non-temporal > > The differences are actually less. I do not know what Denis benchmarked, > but in my tests the difference was never more than ~10%. He got a zero > too much?
No. See attached. # gcc -O2 0main.c # ./a.out Page clear/copy benchmark program. buffer size: 1 Mb Each test tried 64 times, max and min CPU cycles per page are reported. Please disregard max values. They are due to system interference only. clear_page() tests: normal_clear_page - took 44214 max,12615 min cycles per page normal_clear_page - took 18969 max,12649 min cycles per page repstosl_clear_page - took 19897 max,12655 min cycles per page movq_clear_page - took 39391 max,10782 min cycles per page movntq_clear_page - took 21612 max, 4779 min cycles per page copy_page() tests: .... I'm basically saying that 'microbenchmark-visible' performance of NT stores is 200-300% higher than 'normal' stores. BTW: cache eviction is not an intrisic property of non-temporal stores. It's merely how they're implemented in current CPUs: if NT stores hit cached line, invalidate it and push stores to bus. Else just push stores to bus without reading cacheline from RAM first. It is possible that some future CPU won't evict cacheline if NT stores happened to hit it: "if NT stores hit cached line, MODIFY it and push stores to bus". -- vda
page_asm.tar.bz2
Description: application/tbz