Hi,

On Thu 28 Feb 2002 22:19, Brian J. Beesley wrote:
> >
> > Back to the subject, I'm wondering how fast we could do two L-L tests
> > in parallel using the SSE2 extensions. Basically, I'm thinking of using
> > two nearby exponents with the same FFT length. The memory access in the
> > FFT phase would be the same, the trig data also the same; the most
> > difficult part would be the carry-and-normalization pass. Most of the
> > code for a single L-L test could be reused with small modifications,
> > using the basic float type (double,double) for SSE2 instead of (double)
> > for normal code. As a result, we would get one L-L result and, some
> > seconds (iterations) later, a second L-L result.
>
> I'd need some convincing that this would be any better than George's
> method.
>
At the moment it is only an idea. I know George's method is good, very
good. I thought of it because SSE2 will be standard on PCs in a year or so.

> What you're saying is that, with two parallel streams, you'd run one
> assignment in one stream and a second assignment in the other. What the
> Prime95 SSE2 code does is to run only one assignment at a time, with odd
> numbered FFT elements being processed in one stream and adjacent even
> numbered FFT elements being processed in the other.
>
> The difference here is that your method generates memory bus traffic at
> twice the rate. George's method takes advantage of the fact that (with
> properly aligned operands) fetching the "odd" element data automatically
> fetches the adjacent "even" element data.
>
The streams would be interleaved: stream0_data(n), stream1_data(n),
stream0_data(n+1), stream1_data(n+1), ... When fetching data(n) for one
stream we also get data(n) for the other.

> Memory bandwidth is a serious constraint here. I think you need to
> demonstrate that your suggested method has some _big_ advantage, because
> something major is going to be needed to offset the inefficiency caused by
> the memory bottleneck.
>
The memory bottleneck was the first thing I thought of, and I was close to
discarding the idea when I realized that the trig data would be the same,
so the required memory access would be less than double that of the
single-stream scheme. If a double-stream version costs less than twice the
single-stream one, then we can speed up the project a bit. Obviously, it
requires more investigation.

Regards,
Guillermo.
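
To make the interleaved layout concrete, here is a minimal sketch in portable C rather than real SSE2 code. The names `cpair` and `twiddle_both` are hypothetical and exist only for this illustration. It is not how Prime95 works, and it covers only a twiddle-multiply step, not the carry-and-normalization pass. It assumes both exponents use the same FFT length, so one trig table serves both streams.

```c
#include <assert.h>

#define STREAMS 2   /* two L-L tests run side by side (assumption) */

/* One complex value per stream, laid out so that element n of stream 0
   and element n of stream 1 sit next to each other in memory. On SSE2
   each re[] / im[] pair would fit one 128-bit register. */
typedef struct {
    double re[STREAMS];
    double im[STREAMS];
} cpair;

/* Multiply every element by its twiddle factor (wr[k], wi[k]).
   Each twiddle is read once and then applied to both streams. */
static void twiddle_both(cpair *x, const double *wr, const double *wi, int n)
{
    assert(n >= 0);
    for (int k = 0; k < n; k++) {
        const double c = wr[k];   /* shared trig data, loaded once */
        const double s = wi[k];
        for (int t = 0; t < STREAMS; t++) {
            const double re = x[k].re[t] * c - x[k].im[t] * s;
            const double im = x[k].re[t] * s + x[k].im[t] * c;
            x[k].re[t] = re;
            x[k].im[t] = im;
        }
    }
}
```

In this toy pass each step reads two trig doubles plus two data doubles per stream, so two streams read 6 doubles where a single stream reads 4. Whether a real FFT sees a similar ratio depends on cache behaviour and would need measuring.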
