On Sat, Sep 29, 2012 at 04:37:37PM +0200, Andrea Arcangeli wrote:
> But I agree we need to verify it before taking a decision, and that
> the numbers are better than theory, or to rephrase it "let's check the
> theory is right" :)

Okay, microbenchmark:

% cat test_memcmp.c 
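#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MB (1024ul * 1024ul)
#define GB (1024ul * MB)

int main(void)
{
        char *p;
        int i;

        /* 2M alignment so the range can be backed by huge pages */
        if (posix_memalign((void **)&p, 2 * MB, 8 * GB))
                return 1;

        /* the buffer is never written, so all reads hit zero pages */
        for (i = 0; i < 100; i++) {
                assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                /* compiler barrier: keep memcmp() calls from being merged */
                asm volatile ("": : :"memory");
        }
        return 0;
}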

huge zero page (initial implementation):

 Performance counter stats for './test_memcmp' (5 runs):

      32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
                 0 CPU-migrations            #    0.000 K/sec
             4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
    76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
    36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
     1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
   134,355,715,816 instructions              #    1.75  insns per cycle
                                             #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
    13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
         1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]

      32.413866442 seconds time elapsed                                          ( +-  0.13% )

virtual huge zero page (the second implementation):

 Performance counter stats for './test_memcmp' (5 runs):

      30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
                 0 CPU-migrations            #    0.000 K/sec
             4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
    71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
    31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
       773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
   134,982,215,437 instructions              #    1.88  insns per cycle
                                             #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
    13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
         1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]

      30.381324695 seconds time elapsed                                          ( +-  0.13% )

On Westmere-EX, the virtual huge zero page is ~6.7% faster (32.41 s vs.
30.38 s elapsed).
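
For reference, the ~6.7% figure can be reproduced from the two elapsed
times reported by perf stat. This is only a sketch of the arithmetic; the
helper names are made up for illustration and are not part of the
benchmark. Note that the same pair of numbers can also be read as a
reduction in wall-clock time, which gives a slightly smaller percentage.

```c
#include <stdio.h>

/* Hypothetical helper: extra throughput of the faster run, in percent. */
static double speedup_percent(double slow_s, double fast_s)
{
        return (slow_s / fast_s - 1.0) * 100.0;
}

/* Hypothetical helper: wall-clock time saved by the faster run, in percent. */
static double time_saved_percent(double slow_s, double fast_s)
{
        return (slow_s - fast_s) / slow_s * 100.0;
}

/* Prints both figures for the elapsed times from the perf stat runs. */
static void print_comparison(void)
{
        const double initial_s = 32.413866442; /* huge zero page */
        const double virtual_s = 30.381324695; /* virtual huge zero page */

        printf("speedup:    %.1f%%\n", speedup_percent(initial_s, virtual_s));
        printf("time saved: %.1f%%\n", time_saved_percent(initial_s, virtual_s));
}
```

Calling print_comparison() from main() should give roughly 6.7% by the
first measure and roughly 6.3% by the second.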

-- 
 Kirill A. Shutemov
