>> BTW, it occurs to me that we can approximate the efficiency of
>> parallelization by taking execution counts from a profiler and
>> post-processing them. I should do that before buying a new GPU. :-)
>I wonder what you mean by that.

If you run your program on a sequential machine and count statement
executions, you can analyze what happens around branches to estimate how
well your program will parallelize.

Simple example:

    if (foo) {      // 3200 hits
        Bar();      // 3190 hits
    } else {
        Baz();      // 10 hits
    }

If the 3200 executions were divided into 100 warps of 32, then the 10 hits
on Baz() would most likely land in 9 or 10 different warps. A warp whose
threads take both branches executes the two branches one after the other,
so the Baz code stalls about 10 out of 100 warps. The effect is as if Baz
were executed on 10% of trials rather than 0.3%.

You can calculate the cost of this if-statement on a GPU as 100% of Bar()
+ 10% of Baz(). Note that the CPU cost is 99.7% of Bar() + 0.3% of Baz().

If Baz takes 1000 cycles and Bar takes 10 cycles, then the GPU costs 110
cycles per warp and the CPU costs 13 cycles per trial. Factor in 32x
parallelism and the GPU uses fewer cycles (about 3.4 per trial versus 13).
But in this case the CPU's much faster cycles will outweigh the
parallelism.

Regarding the rest of your post: I think you are onto something. The GPU
is different from the CPU, so maybe a different design is a better match.
You need to express the playout policy as data rather than code. The
downside is that the memory hierarchy on a GPU imposes huge latencies.
Maybe the caches in Fermi will make a big difference here.

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
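P.S. The branch estimate above could be post-processed from profiler counts roughly like this. It is only a sketch under stated assumptions: the function names (expected_warps_hit, cpu_cost, gpu_cost) are made up for illustration, the rare hits are assumed to fall on trials uniformly at random, and a diverged warp is assumed to pay for both branches in full.

```python
import math

WARP_SIZE = 32  # threads per warp, as in the example above


def expected_warps_hit(total_trials, rare_hits, warp_size=WARP_SIZE):
    """Expected number of warps containing at least one rare-branch trial.

    Assumption: the rare hits are spread over the trials uniformly at
    random, and total_trials is a multiple of warp_size.
    """
    num_warps = total_trials // warp_size
    # Probability that one particular warp contains none of the rare hits:
    # all rare hits must come from the trials outside that warp.
    p_clean = (math.comb(total_trials - warp_size, rare_hits)
               / math.comb(total_trials, rare_hits))
    return num_warps * (1.0 - p_clean)


def cpu_cost(p_rare, common_cycles, rare_cycles):
    """Average cycles per trial on a sequential machine."""
    return (1.0 - p_rare) * common_cycles + p_rare * rare_cycles


def gpu_cost(warp_fraction_rare, common_cycles, rare_cycles):
    """Average cycles per warp, assuming every warp runs the common branch
    and a diverged warp additionally runs the rare branch."""
    return common_cycles + warp_fraction_rare * rare_cycles


if __name__ == "__main__":
    warps_hit = expected_warps_hit(3200, 10)
    print("expected warps hit by Baz:", warps_hit)
    print("CPU cycles per trial:", cpu_cost(10 / 3200, 10, 1000))
    per_warp = gpu_cost(0.10, 10, 1000)
    print("GPU cycles per warp:", per_warp)
    print("GPU cycles per trial:", per_warp / WARP_SIZE)
```

With the counts from the example, expected_warps_hit(3200, 10) comes out between 9 and 10, which is where the "9 or 10 different warps" figure comes from. To compare real run times you would still have to scale each side by its own clock rate, which this sketch leaves out.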