>> BTW, it occurs to me that we can approximate the efficiency of
>> parallelization by taking execution counts from a profiler and
>> post-processing them. I should do that before buying a new GPU. :-)

>I wonder what you mean by that.

If you run your program on a sequential machine and count statement
executions with a profiler, you can analyze what happens around branches to
calculate how your system will parallelize.

Simple example:

if (foo) {   // 3200 hits
  Bar();     // 3190 hits
} else {
  Baz();     //   10 hits
}

If the 3200 executions were divided into 100 warps of 32, then the 10 hits
on Baz() would be distributed into 9 or 10 different warps. So the Baz code
stalls 10 out of 100 warps. The effect would be as if the Baz code were
executed on 10% of trials rather than 0.3%.

You can calculate the cost of this if-statement as 100% of Bar() + 10% of
Baz() on a GPU. Note that the CPU cost is 99.7% of Bar() + 0.3% of Baz(). If
Baz takes 1000 cycles and Bar takes 10 cycles, then the GPU costs 110 cycles
and the CPU costs 13 cycles. Factor in 32x parallelism and the GPU uses
fewer cycles per trial. But in this case the CPU's much faster clock will
probably outweigh the parallelism.
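The arithmetic above amounts to a small post-processing step over the profiler counts. A minimal sketch (Python just for illustration; the hit counts and per-call cycle costs are the ones from the example):

```python
# Post-process profiler hit counts into estimated per-trial costs.
hits_bar, hits_baz = 3190, 10
total = hits_bar + hits_baz          # 3200 trials
warp_size = 32
n_warps = total // warp_size         # 100 warps

cost_bar, cost_baz = 10, 1000        # cycles per call

# CPU: each trial pays only for the branch it actually takes.
cpu_cost = (hits_bar * cost_bar + hits_baz * cost_baz) / total   # ~13.1

# GPU: every warp executes Bar(), and a warp containing even one
# Baz() hit also pays for Baz(). Assume the 10 hits land in 10
# distinct warps, as argued above.
stalled_warps = min(hits_baz, n_warps)                           # 10
gpu_cost = cost_bar + (stalled_warps / n_warps) * cost_baz       # 110.0
```

The same post-processing could be run over every hot branch in the profile to estimate the overall divergence penalty before committing to hardware.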

Regarding the rest of your post: I think you are onto something. The GPU is
different from the CPU, so maybe a different design is a better match.

You need to express the playout policy as data rather than code. The
downside there is that the memory hierarchy on a GPU imposes huge latencies.
Maybe the caches in Fermi will make a big difference here.
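Purely as a hypothetical sketch of "policy as data": score each candidate move by a table lookup keyed on a small local pattern, so every thread executes the same instructions and the branchiness moves into memory traffic instead. The 16-bit pattern index and uniform weights here are made-up placeholders, not anyone's actual policy:

```python
# Hypothetical "policy as data": one weight per local board pattern, so
# the playout policy is a uniform table lookup instead of branchy code.
PATTERN_WEIGHTS = [1.0] * (1 << 16)  # placeholder: one weight per 16-bit pattern

def move_weight(pattern_index):
    # Same instructions for every thread: a single indexed load,
    # no data-dependent branches.
    return PATTERN_WEIGHTS[pattern_index]
```

On a GPU this trades divergence for a memory access per move, which is exactly where the cache question comes in.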



_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
