Hi,

On Thu 28 Feb 2002 22:19, Brian J. Beesley wrote:

>
> > Back to the subject, I'm wondering how fast we can do two L-L tests
> > in parallel using these SSE2 extensions. Basically, I'm thinking of using
> > two nearby exponents with the same FFT length. The memory access in the FFT
> > phase would be the same, the trig data also the same; the most difficult
> > part would be the carry-and-normalization pass.  Most of the code for a
> > single L-L test could be reused with small modifications, using the basic
> > float type (double,double) for SSE2 instead of (double) for normal code.
> > As a result, we would get one L-L result and, some seconds (iterations)
> > later, a second L-L result.
>
> I'd need some convincing that this would be any better than George's
> method.
>

At the moment it is only an idea.  I know George's method is good, very good. I
thought of this because SSE2 will be standard on PCs in a year or so.

> What you're saying is that, with two parallel streams, you'd run one
> assignment in one stream and a second assignment in the other. What the
> Prime95 SSE2 code does is to run only one assignment at a time, with odd
> numbered FFT elements being processed in one stream and adjacent even
> numbered FFT elements being processed in the other.
>
> The difference here is that your method generates memory bus traffic at
> twice the rate. George's method takes advantage of the fact that (with
> properly aligned operands) fetching the "odd" element data automatically
> fetches the adjacent "even" element data.
>
The streams would be interleaved: stream0_data(n), stream1_data(n),
stream0_data(n+1), stream1_data(n+1), ...

So when fetching data(n) for one stream we also get data(n) for the other.
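
To make the layout concrete, here is a minimal sketch in plain C. It is not Prime95 code; the names pair_t, interleave and scale_shared are hypothetical and only illustrate the idea that element n of both streams sits in one 16-byte slot (the width of an SSE2 register) and that a single trig factor is applied to both.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: element n of each stream, adjacent in memory. */
typedef struct {
    double s0;  /* stream 0: element n for the first exponent  */
    double s1;  /* stream 1: element n for the second exponent */
} pair_t;

/* Interleave two equal-length arrays into one array of pairs. */
void interleave(const double *a, const double *b, pair_t *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i].s0 = a[i];
        out[i].s1 = b[i];
    }
}

/* Multiply both streams by the same factor w[i].
   One load of w[i] serves both streams. */
void scale_shared(pair_t *x, const double *w, size_t n)
{
    assert(x != NULL && w != NULL);
    for (size_t i = 0; i < n; i++) {
        x[i].s0 *= w[i];
        x[i].s1 *= w[i];
    }
}
```

In a real SSE2 version each pair_t would be loaded into one register and the multiply would be a single packed instruction; the scalar loop above only shows the data arrangement.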

> Memory bandwidth is a serious constraint here. I think you need to
> demonstrate that your suggested method has some _big_ advantage, because
> something major is going to be needed to offset the inefficiency caused by
> the memory bottleneck.
>

The memory bottleneck was the first thing I thought of, and I was close to
discarding the idea when I realized that the trig data would be the same, so
the required memory access would be less than double that of the single-stream
scheme. If a double-stream version costs less than twice as much as the single
one, then we can speed up the project a bit.
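
As a rough way to see the "less than double" argument, assume (my simplification, not a measurement) that one iteration of a single stream moves D bytes of FFT data and T bytes of trig data. Two interleaved streams then move 2D + T, so the ratio to one stream is (2D + T) / (D + T), which stays below 2 whenever T > 0. The function name traffic_ratio is hypothetical.

```c
#include <assert.h>

/* Ratio of memory traffic: two interleaved streams vs. one stream.
   data_bytes: per-stream FFT data moved per iteration (assumed).
   trig_bytes: trig data moved per iteration, shared by both streams. */
double traffic_ratio(double data_bytes, double trig_bytes)
{
    assert(data_bytes + trig_bytes > 0.0);
    return (2.0 * data_bytes + trig_bytes) / (data_bytes + trig_bytes);
}
```

For example, if the trig traffic were as large as the data traffic the ratio would be 1.5; with no trig traffic at all it would be exactly 2. How large T really is compared to D, and how much of it stays in cache, is what needs investigating.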
 
Obviously, it requires more investigation.

Regards.

Guillermo.
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
