Note that single-threaded performance doesn't tell you much here,
because when only one core is active, Nehalem automatically overclocks
that core (Turbo Boost). A very nasty feature for benchmarking.
My experience is that a Shanghai has to run near 4.0 GHz to keep up
with a nominal 3.2 GHz Nehalem, because of that single-core
overclocking. So seeing a 9% higher IPC is not very weird.
Thanks,
Vincent
On Jan 16, 2009, at 3:25 PM, Joe Landman wrote:
Hi folks:
Thought you might like to see this. I rewrote the interior loop
for our Riemann Zeta Function (rzf) example for SSE2, and ran it on
a Nehalem and on a Shanghai. This code is compute intensive. The
inner loop, which had been written as follows (with some small hand
optimization, loop unrolling, etc.):
  l[0] = (double)(inf-1 - 0);
  l[1] = (double)(inf-1 - 1);
  l[2] = (double)(inf-1 - 2);
  l[3] = (double)(inf-1 - 3);
  p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
  for (k = start_index; k > end_index; k -= unroll)
    {
      d_pow[0] = l[0];
      d_pow[1] = l[1];
      d_pow[2] = l[2];
      d_pow[3] = l[3];
      for (m = n; m > 1; m--)
        {
          d_pow[0] *= l[0];
          d_pow[1] *= l[1];
          d_pow[2] *= l[2];
          d_pow[3] *= l[3];
        }
      p_sum[0] += one/d_pow[0];
      p_sum[1] += one/d_pow[1];
      p_sum[2] += one/d_pow[2];
      p_sum[3] += one/d_pow[3];
      l[0] -= four;
      l[1] -= four;
      l[2] -= four;
      l[3] -= four;
    }
  sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3];
has been rewritten as
  __m128d __P_SUM = _mm_set_pd1(0.0);          // __P_SUM[0 ... VLEN-1] = 0
  __m128d __ONE   = _mm_set_pd1(1.0);          // __ONE[0 ... VLEN-1] = 1
  __m128d __DEC   = _mm_set_pd1((double)VLEN);
  __m128d __L     = _mm_load_pd(l);            // l must be 16-byte aligned
  __m128d __D_POW;
  for (k = start_index; k > end_index; k -= unroll)
    {
      __D_POW = __L;
      for (m = n; m > 1; m--)
        {
          __D_POW = _mm_mul_pd(__D_POW, __L);
        }
      __P_SUM = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));
      __L = _mm_sub_pd(__L, __DEC);
    }
  _mm_store_pd(p_sum, __P_SUM);
  for (k = 0; k < VLEN; k++)
    {
      sum += p_sum[k];
    }
The two codes were run on a Nehalem 3.2 GHz (desktop) processor
and a Shanghai 2.3 GHz (desktop) processor. Here are the results:
Code       CPU       Freq (GHz)   Wall clock (s)
---------  --------  ----------   --------------
base       Nehalem   3.2          20.5
optimized  Nehalem   3.2           6.72
SSE-ized   Nehalem   3.2           3.37
base       Shanghai  2.3          30.3
optimized  Shanghai  2.3           7.36
SSE-ized   Shanghai  2.3           3.68
These are single-thread, single-core runs. The code scales very well
(it is one of our example codes for the HPC/programming/
parallelization classes we teach).
I found it interesting that the baseline code performance roughly
tracked the ratio of clock speeds: the Nehalem has a 39% faster
clock and showed 48% faster performance, about 9% more than clock
speed alone would account for.
The SSE-ized code performance differs by about the same 9%.
I am sure lots of interesting points could be drawn from this
(though, being only one test, and not the most typical test/use case
either, such points may be of dubious value).
I am working on a CUDA version of the above as well, and will try
to compare it to the threaded versions. I am curious what we
can achieve.
Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [email protected]
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf