Note that single-threaded performance doesn't tell you much here,
because when only one core is active, Nehalem automatically overclocks
that core (Turbo Boost). A very nasty feature for benchmarking.
My experience is that a Shanghai has to run near 4.0 GHz to keep up
with a nominal 3.2 GHz Nehalem, because of that single-core
overclocking. So seeing a 9% higher IPC is not very weird.
Thanks,
Vincent
On Jan 16, 2009, at 3:25 PM, Joe Landman wrote:
Hi folks:
Thought you might like to see this. I rewrote the interior loop
for our Riemann Zeta Function (rzf) example for SSE2, and ran it on
a Nehalem and on a Shanghai. This code is compute intensive. The
inner loop, which had been written as follows (with some small hand
optimization, loop unrolling, etc.):
  l[0] = (double)(inf-1 - 0);
  l[1] = (double)(inf-1 - 1);
  l[2] = (double)(inf-1 - 2);
  l[3] = (double)(inf-1 - 3);
  p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
  for (k = start_index; k > end_index; k -= unroll)
    {
      d_pow[0] = l[0];
      d_pow[1] = l[1];
      d_pow[2] = l[2];
      d_pow[3] = l[3];
      for (m = n; m > 1; m--)
        {
          d_pow[0] *= l[0];
          d_pow[1] *= l[1];
          d_pow[2] *= l[2];
          d_pow[3] *= l[3];
        }
      p_sum[0] += one/d_pow[0];
      p_sum[1] += one/d_pow[1];
      p_sum[2] += one/d_pow[2];
      p_sum[3] += one/d_pow[3];
      l[0] -= four;
      l[1] -= four;
      l[2] -= four;
      l[3] -= four;
    }
  sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3];
has been rewritten as
  __m128d __P_SUM = _mm_set_pd1(0.0);          // __P_SUM[0 ... VLEN-1] = 0
  __m128d __ONE   = _mm_set_pd1(1.0);          // __ONE[0 ... VLEN-1] = 1
  __m128d __DEC   = _mm_set_pd1((double)VLEN);
  __m128d __L     = _mm_load_pd(l);            // l must be 16-byte aligned
  __m128d __D_POW;
  for (k = start_index; k > end_index; k -= unroll)
    {
      __D_POW = __L;
      for (m = n; m > 1; m--)
        {
          __D_POW = _mm_mul_pd(__D_POW, __L);
        }
      __P_SUM = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));
      __L = _mm_sub_pd(__L, __DEC);
    }
  _mm_store_pd(p_sum, __P_SUM);
  for (k = 0; k < VLEN; k++)
    {
      sum += p_sum[k];
    }
The two codes were run on a Nehalem 3.2 GHz (desktop) processor
and a Shanghai 2.3 GHz (desktop) processor. Here are the results:
Code       CPU       Freq (GHz)   Wall clock (s)
---------  --------  ----------   --------------
base       Nehalem   3.2          20.5
optimized  Nehalem   3.2           6.72
SSE-ized   Nehalem   3.2           3.37
base       Shanghai  2.3          30.3
optimized  Shanghai  2.3           7.36
SSE-ized   Shanghai  2.3           3.68
These are single-thread, single-core runs. The code scales very well
(it is one of our example codes for the HPC/programming/
parallelization classes we teach).
I found it interesting that the baseline code performance roughly
tracked the ratio of clock speeds: the Nehalem has a 39% faster
clock and showed 48% faster performance, about 9% more than clock
speed alone would account for.
The SSE-ized code performance differs by about the same 9%.
I am sure lots of interesting points could be drawn from this
(though, being only one test, and not the most typical test/use case
either, such points may be of dubious value).
I am working on a CUDA version of the above as well, and will try
to compare it to the threaded versions. I am curious what we
can achieve.
Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [email protected]
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf