In my own GPU experiment (light playouts), registers and memory were the
limiting factors on simulation speed. I expected branching to hurt more,
but as long as the other side of each branch is a null branch (one that
does nothing, rather than one that does other work), total execution takes
only as long as the longest branch, which turns out to be not that bad.
Someone here also ran a more recent GPU Go experiment on a bigger card and
it worked even better, which I expect will be the trend for a while.
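
A rough way to picture the null-branch point is a toy cost model, sketched
here in Python. It assumes a warp pays for a side of an if/else only when at
least one of its threads takes that side; the function name and the cycle
counts are made up for illustration, not measured on any card.

```python
def warp_branch_cost(takes_then, then_cost, else_cost):
    """Cycles one warp spends on a single if/else in a toy SIMT model.

    takes_then: one boolean per thread, True if that thread takes the
    "then" side. A side is paid for once if any thread needs it.
    """
    cost = 0
    if any(takes_then):        # at least one thread runs the "then" side
        cost += then_cost
    if not all(takes_then):    # at least one thread runs the "else" side
        cost += else_cost
    return cost


# Half the warp takes the branch, the other half does not.
mixed = [True] * 16 + [False] * 16

null_else = warp_branch_cost(mixed, then_cost=10, else_cost=0)
real_else = warp_branch_cost(mixed, then_cost=10, else_cost=8)
print(null_else, real_else)
```

With a null else side the divergent warp costs the same as a uniform one;
the penalty only shows up when both sides do real work.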

~

Chase Albert

On Fri, Oct 23, 2009 at 12:07, Brian Sheppard <sheppar...@aol.com> wrote:

> >I just wondered if this new Fermi GPU solves the issues for Go
> >playouts, or doesn't really make any difference?
>
> My first impression of Fermi is very positive. Fermi contains a lot of
> features that make general-purpose computing on a GPU much easier and
> better performing.
>
> However, it remains the case that all threads in a warp must execute the
> same instruction on each cycle. When executing if/then/else logic, this
> implies that if *any* thread needs to execute a branch, then *all* threads
> in the warp must wait for those instructions to complete.
>
> Playout policies have a lot of if/then/else logic. Sequential processors
> handle such code quite well, because most of it isn't executed. But when
> you have 32 playouts executing in parallel, there is a high chance that
> both branches will be needed. This really cuts into the potential gain.
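
To put a rough number on "high chance": a back-of-the-envelope sketch,
assuming (and this is only an assumption) that each of the 32 playouts takes
a given branch independently with the same probability p. The function name
is made up for illustration.

```python
def p_both_sides(p, n=32):
    """Chance that n independent threads do not all agree on a branch.

    p is the per-thread probability of taking the branch. The warp avoids
    executing both sides only if all n threads take it (p**n) or none of
    them do ((1 - p)**n).
    """
    return 1.0 - p ** n - (1.0 - p) ** n


for p in (0.01, 0.05, 0.25):
    print(f"p = {p:.2f}: both sides needed with probability "
          f"{p_both_sides(p):.3f}")
```

Under that assumption, even a branch that a single playout takes only 5% of
the time ends up executed on both sides in roughly four out of five warps.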
>
> Amdahl's law is a factor, as well. Amdahl's law says that the gain from
> parallelization is limited when some aspects of the solution execute
> sequentially. For example, the CPU has to generate positions and transfer
> them to the GPU for playout. Generation and transfer are sequential.
> Because of such overhead, massively parallel programs generally need very
> large gains from the parallel portion to come out ahead.
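
For reference, Amdahl's law written out as a small sketch. The 10% serial
fraction below is a made-up figure standing in for position generation and
transfer, not a measurement.

```python
def amdahl_speedup(serial_fraction, n_parallel):
    """Overall speedup when only the non-serial part is sped up n-fold."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_parallel)


# Hypothetical: 10% of the work stays sequential.
for n in (8, 32, 512):
    print(f"{n:4d}-way parallel: {amdahl_speedup(0.10, n):.2f}x overall")
```

However wide the parallel part gets, the overall speedup in this example
can never exceed 1 / 0.10 = 10x.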
>
> Clock speed is also a factor. CPUs execute at over 3 GHz, and because of
> speculative execution they often execute more than one instruction per
> clock. The GPU generally has a clock rate ~ 1 GHz, and most general purpose
> instructions require multiple clocks. So you must have a large parallel
> speedup just to break even. (Unless you can exploit some of the specialized
> graphics instructions, such as texture mapping, that equate to dozens of
> sequential instructions yet execute as a single instruction on the GPU. I
> don't think computer Go has that possibility.)
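
The break-even arithmetic can be sketched like this. Only the 3 GHz and
1 GHz clock rates come from the text above; the instructions-per-clock
figures are placeholders picked for illustration.

```python
def break_even_speedup(cpu_hz, cpu_ipc, gpu_hz, gpu_ipc):
    """Parallel speedup the GPU needs just to match one CPU core.

    The *_ipc arguments are instructions per clock for a single
    sequential stream of work on each device.
    """
    cpu_rate = cpu_hz * cpu_ipc   # CPU instructions per second
    gpu_rate = gpu_hz * gpu_ipc   # GPU instructions per second, per thread
    return cpu_rate / gpu_rate


# Assumed: CPU retires 1.5 instructions per clock, and a GPU instruction
# takes 4 clocks (0.25 instructions per clock).
needed = break_even_speedup(3e9, 1.5, 1e9, 0.25)
print(f"break-even parallel speedup: {needed:.0f}x")
```

With those placeholder figures the GPU needs an 18-fold parallel gain
before it is any faster than one CPU core.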
>
> So I am not convinced yet, but Fermi is a big step (really many small
> steps) in the right direction.
>
> _______________________________________________
> computer-go mailing list
> computer-go@computer-go.org
> http://www.computer-go.org/mailman/listinfo/computer-go/
>
