At 12:52 AM 9/6/2006, Jason wrote:

>| |     |****************|            | |     |****************|
>|c|     | block i after  | propagate  |c|     | block i+1 after|
>|a|     | itwiddle, IFFT | carries    |a|     | itwiddle, IFFT |
>|r|-->  |    and IDWT    | for ea --> |r|-->  |    and IDWT    | --> ...
>|r|     |    weights     | row        |r|     |    weights     |
>|y|     |****************|            |y|     |****************|
>
>The point is that carries can be propagated while each block is being
>processed, until you finish the last block. Then all the carries have
>to "carriage return" to the next row, the last carry wraps around to
>the first row like before, and the carry vector propagates 10 elements
>into the first block

This is the scheme prime95 currently uses - and as you noted it introduces
a data dependence.  In pass 2, the blocks are truly independent and I can
dole a data block to each thread as needed.  Very easy and efficient.
In pass 1, with the data dependence introduced by carry propagation,
I can see two choices (as an example say there are 1024 data blocks
and 8 threads):

1)  Give each thread a contiguous set of data blocks to crunch through.
Thread 1 gets blocks 0 - 127, thread 2 gets 128 - 255, etc.  Two major
disadvantages:  a) if one CPU core is busy with something else then
7 threads finish and 6 then sit idle waiting for the 8th thread to finish.
b) more memory for carry buffers and more overhead propagating final
carries.

2)  Have the 8 threads start on data blocks 0-7.  Put a lock around the
carry code such that it insures that data blocks are processed in order.
The theory is that all 8 threads race to the carry code, but then the lock
"spaces them apart" such that thread 1 does carries for block 0 and starts
data block 9, then thread 2 starts carries for data block 1 and then starts
data block 10.
The major negative for this method is that there is a limit to the number of
threads that can be kept busy.  If the code is spending 3 units of time in
the FFT code and 1 unit of time in the carry code, then only 4 threads
can be kept busy this way.

I lean toward the second solution.  Does anyone have any real-world
experiences or alternative suggestions? 

_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime

Reply via email to