Hi folks:

Thought you might like to see this. I rewrote the inner loop of our Riemann zeta function (rzf) example for SSE2, and ran it on a Nehalem and on a Shanghai. This code is compute-intensive. The inner loop, which had been written like this (with some small hand optimization, loop unrolling, etc.):

    l[0]=(double)(inf-1 - 0);
    l[1]=(double)(inf-1 - 1);
    l[2]=(double)(inf-1 - 2);
    l[3]=(double)(inf-1 - 3);
    p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
    for(k=start_index;k>end_index;k-=unroll)
       {
          d_pow[0] = l[0];
          d_pow[1] = l[1];
          d_pow[2] = l[2];
          d_pow[3] = l[3];

          for (m=n;m>1;m--)
           {
             d_pow[0] *=  l[0];
             d_pow[1] *=  l[1];
             d_pow[2] *=  l[2];
             d_pow[3] *=  l[3];
           }
          p_sum[0] += one/d_pow[0];
          p_sum[1] += one/d_pow[1];
          p_sum[2] += one/d_pow[2];
          p_sum[3] += one/d_pow[3];

          l[0]-=four;
          l[1]-=four;
          l[2]-=four;
          l[3]-=four;
      }
    sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3] ;

has been rewritten as

    __m128d __P_SUM = _mm_set_pd1(0.0);        // __P_SUM[0 ... VLEN-1] = 0
    __m128d __ONE   = _mm_set_pd1(1.0);        // __ONE[0 ... VLEN-1] = 1
    __m128d __DEC   = _mm_set_pd1((double)VLEN);
    __m128d __L     = _mm_load_pd(l);
    __m128d __D_POW;

    for(k=start_index;k>end_index;k-=unroll)
       {
          __D_POW       = __L;

          for (m=n;m>1;m--)
           {
             __D_POW    = _mm_mul_pd(__D_POW, __L);
           }

          __P_SUM       = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));

          __L           = _mm_sub_pd(__L, __DEC);

      }

    _mm_store_pd(p_sum,__P_SUM);

    for(k=0;k<VLEN;k++)
     {
       sum += p_sum[k];
     }

The three codes were run on a Nehalem 3.2 GHz (desktop) processor and a Shanghai 2.3 GHz desktop processor. Here are the results:

        Code            CPU     Freq (GHz)      Wall clock (s)
        ------          ------- -------------   --------------

        base            Nehalem 3.2             20.5            
        optimized       Nehalem 3.2             6.72            
        SSE-ized        Nehalem 3.2             3.37

        base            Shanghai 2.3            30.3
        optimized       Shanghai 2.3            7.36            
        SSE-ized        Shanghai 2.3            3.68
        
These are single-thread, single-core runs. The code scales very well (it is one of our example codes for the HPC/programming/parallelization classes we teach).

I found it interesting that the baseline code performance roughly tracked the ratio of clock speeds: the Nehalem has a 39% faster clock and showed 48% faster baseline performance, which is about 9% more than clock speed alone can account for. The gap on the SSE code is likewise about 9%.

I am sure lots of interesting points could be made out of this, though being only one test, and not the most typical use case either, such points may be of dubious value.

I am working on a CUDA version of the above as well, and will try to compare it to the threaded versions. I am curious what we can achieve.

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [email protected]
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, [email protected]