On Friday 18 March 2005 21:28, Andi Kleen wrote:
> On Fri, Mar 18, 2005 at 07:00:06AM -0800, Christoph Lameter wrote:
> > On Fri, 18 Mar 2005, Denis Vlasenko wrote:
> > 
> > > NT stores are not about 5% increase. 200%-300%. Provided you are ok with
> > > the fact that zeroed page ends up evicted from cache. Luckily, this is 
> > > exactly
> > > what you want with prezeroing.
> > 
> > These are pretty significant results. Maybe its best to use non-temporal
> 
> The differences are actually less. I do not know what Denis benchmarked,
> but in my tests the difference was never more than ~10%.  He got a zero
> too much? 

No. See attached.

# gcc -O2 0main.c
# ./a.out
Page clear/copy benchmark program.
buffer size: 1 Mb
Each test tried 64 times, max and min CPU cycles per page are reported.
Please disregard max values. They are due to system interference only.
clear_page() tests:
               normal_clear_page - took 44214 max,12615 min cycles per page
               normal_clear_page - took 18969 max,12649 min cycles per page
             repstosl_clear_page - took 19897 max,12655 min cycles per page
                 movq_clear_page - took 39391 max,10782 min cycles per page
               movntq_clear_page - took 21612 max, 4779 min cycles per page

copy_page() tests:
....

I'm basically saying that 'microbenchmark-visible'
performance of NT stores is 200-300% higher than 'normal' stores.

BTW: cache eviction is not an intrisic property of non-temporal
stores. It's merely how they're implemented in current CPUs:
if NT stores hit cached line, invalidate it and
push stores to bus. Else just push stores to bus
without reading cacheline from RAM first.

It is possible that some future CPU won't evict cacheline
if NT stores happened to hit it: "if NT stores hit cached line,
MODIFY it and push stores to bus".
--
vda

Attachment: page_asm.tar.bz2
Description: application/tbz

Reply via email to