Some news, and a question to consider.

I managed to speed up the ATI chain generation by close to 30%.
A single 5850 card can currently calculate ~650 chains per second,
meaning my dual-card setup can complete a table in less than 2.5 days.
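
As a rough sanity check on those figures, here is a small sketch of the
arithmetic. It assumes a table is made from the 270M start points
mentioned further down and that both cards run at the same rate; the
names are only for illustration.

```python
SECONDS_PER_DAY = 86400.0

def days_per_table(chains_per_second_per_card, cards, chains_per_table):
    """Days needed to compute every chain of one table."""
    total_rate = chains_per_second_per_card * cards
    return chains_per_table / total_rate / SECONDS_PER_DAY

# 650 chains/s per 5850, two cards, 270M chains (assumed table size)
print(round(days_per_table(650, 2, 270_000_000), 2))
```

That comes out a little under the 2.5 days quoted above.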

I did this by unrolling my loop a little and, rather than shifting the
output, using indexed writes to cached memory. Since ALU clauses run
in parallel with memory fetches in the GPU threading engine, this is
almost a pure gain.

I still have some more tricks up my sleeve, but first a question for
Karsten or whoever would like to do the maths:

I am thinking about "merge-free table generation", and the procedure
goes like this:

Start with 270M points, calculate the first round only, and write the
results to disk. Then read that output and bucket sort the DP1s,
eliminating any merges. For the non-merges, calculate the second round
and write to disk. Repeat this for each of the 32 rounds, keeping fewer
and fewer chains, and you will have produced a table whose only merges
come from the 32nd round.
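
To make the procedure concrete, here is a toy sketch in Python. It is
not the real generator: the step function is a SHA-256 stand-in for the
cipher, the state space is tiny, a dict replaces the disk file and the
bucket sort, and I assume one chain of each merged group is kept. Unlike
the description above, it also removes merges after the final round.
All names and constants are made up.

```python
import hashlib

STATE_BITS = 16      # toy state space (hypothetical), small so merges show up
DP_BITS = 4          # a point is "distinguished" when its low DP_BITS are zero
ROUNDS = 32
MAX_STEPS = 1 << 10  # give up on a chain that loops without reaching a DP

def step(x, rnd):
    """Toy stand-in for one step of the chain function in round rnd."""
    data = x.to_bytes(8, "little") + bytes([rnd])
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "little") & ((1 << STATE_BITS) - 1)

def run_round(x, rnd):
    """Walk from x to the next distinguished point, or None if none is found."""
    for _ in range(MAX_STEPS):
        x = step(x, rnd)
        if x & ((1 << DP_BITS) - 1) == 0:
            return x
    return None

def merge_free_generation(start_points, rounds=ROUNDS):
    chains = [(s, s) for s in start_points]   # (start point, current point)
    rounds_computed = 0                       # work counter, in chain-rounds
    survivors = []                            # chains left after each round
    for rnd in range(rounds):
        seen = {}                             # DP -> start point of first chain
        for start, cur in chains:
            rounds_computed += 1
            dp = run_round(cur, rnd)
            if dp is not None and dp not in seen:
                seen[dp] = start              # later arrivals at dp are merges
        chains = [(start, dp) for dp, start in seen.items()]
        survivors.append(len(chains))
    return chains, rounds_computed, survivors

if __name__ == "__main__":
    table, work, survivors = merge_free_generation(range(1, 301))
    print("chains left after each round:", survivors)
    print("chain-rounds computed:", work, "of", 300 * ROUNDS)
```

The work counter is the number of single-round computations actually
performed, which is the quantity the merge elimination is meant to cut.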

Clearly this is faster, as disk access is much quicker than calculating
the rounds, but the real question is how much work you can eliminate
this way. What speedup will you get?
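
One way to frame the question, as a sketch: if you know how many chains
enter each round, the speedup is the work of running every chain through
every round divided by the work actually done. This ignores the disk I/O
and sorting cost, and the counts below are invented placeholders, not
measurements.

```python
def speedup(chains_entering_round):
    """chains_entering_round[k] is the number of chains computed in round k."""
    total_rounds = len(chains_entering_round)
    naive_work = chains_entering_round[0] * total_rounds  # no early elimination
    merge_free_work = sum(chains_entering_round)          # only survivors continue
    return naive_work / merge_free_work

# Hypothetical 4-round example with made-up survivor counts
print(speedup([1000, 800, 700, 650]))
```

The open part is what those per-round counts look like for real tables.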

f




_______________________________________________
A51 mailing list
[email protected]
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51