Steve,

assuming that you can find something meaningful for the GPU to do, the overhead of this should be relatively negligible.

Option one:
- Copy the board grid to the GPU. Since this is at most 21 x 21 integers if the CPU maintains it in uncompressed form, the copy should take an unnoticeable amount of time
- Rediscover the strings/blocks on the GPU. Each thread does this in parallel, and since they all work on the same board, they execute the same instructions without divergence (i.e. you get 100% SIMD parallelism)
- Do your work

This should take very little time at runtime. It will be a slight pain to program: in "rediscover the strings/blocks" you cannot use recursion, as it is not supported on the GPU.
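The "rediscover the strings/blocks" step can replace recursion with an explicit stack. A minimal CPU-side sketch (function name and board encoding are my own, hypothetical choices; on the GPU the same loop would run per thread over the shared-memory board):

```c
#include <string.h>

#define MAXPTS 441  /* 21 x 21 */

/* Collect the string (connected block) containing `start` without
   recursion, using an explicit stack. Hypothetical encoding: the
   board is a flat n*n char array of colour marks; `out` receives
   the point indices and the return value is the string's size. */
int find_string(const char *board, int n, int start, int *out) {
    char color = board[start];
    int stack[MAXPTS];
    char visited[MAXPTS];
    memset(visited, 0, sizeof visited);
    int sp = 0, count = 0;
    stack[sp++] = start;
    visited[start] = 1;
    while (sp > 0) {
        int p = stack[--sp];
        out[count++] = p;
        int r = p / n, c = p % n;
        const int dr[4] = {-1, 1, 0, 0}, dc[4] = {0, 0, -1, 1};
        for (int i = 0; i < 4; i++) {
            int nr = r + dr[i], nc = c + dc[i];
            if (nr < 0 || nr >= n || nc < 0 || nc >= n) continue;
            int q = nr * n + nc;
            if (!visited[q] && board[q] == color) {
                visited[q] = 1;
                stack[sp++] = q;
            }
        }
    }
    return count;
}
```

Since every thread starts from the identical board, all threads push and pop the same points in the same order, which is why this stays divergence-free.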

Option two:
- Maintain the "current" board and its information in GPU global memory at all times
- Call a GPU "makeMove" function before calling your GPU routines. This updates the global board and must be locked down to use exactly one thread.
- Your GPU routines copy the global-memory board to their shared memory and work as usual.
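The data flow of option two might look like this CPU-side sketch (all names hypothetical; on the GPU, make_move would be a kernel launched with a single thread, and start_playout's copy would land in each block's shared memory):

```c
#include <string.h>

#define BOARD_PTS 441  /* 21 x 21 */

/* Stands in for the master board kept in GPU global memory. */
static int g_board[BOARD_PTS];

/* Single-writer update of the master board. A real version would
   also remove captured strings and update the ko point. */
void make_move(int point, int color) {
    g_board[point] = color;
}

/* Each playout begins from a private copy of the master board,
   the analogue of the global-to-shared copy. */
void start_playout(int *local_board) {
    memcpy(local_board, g_board, sizeof g_board);
}
```

Restricting make_move to one writer avoids any need for atomics or locking on the master board; the playout threads only ever read it.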

Option two is probably easier to implement. I don't know which one is faster, but I suspect neither will take much time, given the task to be performed.

If you can find a way of integrating the information the GPU passes back, there is good news: the GPU runs in parallel with the CPU. Pattern matching using the texture memory may be a good use of its time, if that helps your engine...

Christian


On 09/09/2009 19:14, steve uurtamo wrote:
thanks for taking the time to do these experiments and to report your
results.  it has saved a nightmarish-sounding investment of time
spent learning the order of the speedup for this particular problem.

how much penalty do you estimate there is to pass a board from, say, a
program running on the CPU over to the GPU and have the GPU operate on
the partially-filled board as its starting board?  it seems that a
search that has narrowed down the set of next board moves to something
small could use the GPU to help make very fine distinctions between
those few board moves by offloading the heavy lifting.

s.

2009/9/9 <dhillism...@netscape.net>:
Interesting stuff. Thanks for reporting your results.

- Dave Hillis


-----Original Message-----
From: Christian Nentwich <christ...@modeltwozero.com>
To: computer-go <computer-go@computer-go.org>
Sent: Wed, Sep 9, 2009 11:54 am
Subject: [computer-go] CUDA and GPU Performance

I did quite a bit of testing earlier this year on running playout algorithms
on GPUs. Unfortunately, I am too busy to write up a tech report on it, but I
finally brought myself to take the time to write this e-mail at least. See
bottom for conclusions.

For performance testing, I used my CPU board representation, and a CUDA port
of the same (with adjustments), to test the following algorithm:
  - Clear the board
  - Fill the board according to a uniform random policy
  - Avoid filling eyes, according to simple neighbour check
  - Avoid simple ko
  - Count the score and determine the winner
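The "simple neighbour check" for eye avoidance is usually a test of the four orthogonal neighbours; a sketch of one common version (my guess at the rule used here, not necessarily the exact one in the tested code):

```c
/* Simple eye test: a vacant point whose four orthogonal neighbours
   are all stones of `color` (off-board neighbours count as friendly).
   This is the cheap check light playouts use to avoid filling their
   own eyes; it can misjudge false eyes. Board encoding (hypothetical):
   a flat n*n char array, '.' = empty. */
int is_simple_eye(const char *board, int n, int r, int c, char color) {
    if (board[r * n + c] != '.') return 0;
    const int dr[4] = {-1, 1, 0, 0}, dc[4] = {0, 0, -1, 1};
    for (int i = 0; i < 4; i++) {
        int nr = r + dr[i], nc = c + dc[i];
        if (nr < 0 || nr >= n || nc < 0 || nc >= n) continue; /* border: friendly */
        if (board[nr * n + nc] != color) return 0;
    }
    return 1;
}
```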

In other words: no tree search is involved, and this is the lightest
possible playout. The raw numbers are as follows:
  - CPU Search: 47,000 playouts per CPU core per second, on an Intel Core 2 Duo E6600
  - GPU Search: 170,000 playouts per second, on an NVIDIA GeForce GTX 285 card

The algorithm running on the GPU is a straight port, with several
optimisations then made to severely restrict memory access. This means the
algorithm is a "naive" sort of parallel algorithm, parallel on a per-board
level like the CPU implementation, rather than per-intersection or some
other sort of highly parallel algorithm.

Memory access other than on-chip shared memory carries a severe penalty on
the GPU, so all threads running on the GPU at any one time have to
make do with a fast shared memory of 16384 bytes. Consequently:
  - The board was compressed into a bit board, using 2*21 unsigned ints per
thread
  - The count of empty, white and black intersections and the ko position was
also in shared memory per thread
  - String/group/block type information was in global memory, as there was no
way to store it in 16384 bytes
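One plausible layout for the "2*21 unsigned ints per thread" is a 21-row bit-plane per colour, one bit per column; the e-mail does not give the exact encoding, so this is an assumption:

```c
/* Hypothetical bit-board packing: two planes of 21 unsigned ints,
   one per colour, with bit `col` of row `row` marking a stone.
   Total: 2 * 21 * 4 = 168 bytes per thread. */
typedef struct {
    unsigned int black[21];
    unsigned int white[21];
} BitBoard;

void set_stone(BitBoard *b, int row, int col, int is_white) {
    unsigned int bit = 1u << col;  /* col < 21, fits in 32 bits */
    if (is_white) { b->white[row] |= bit;  b->black[row] &= ~bit; }
    else          { b->black[row] |= bit;  b->white[row] &= ~bit; }
}

/* Returns 0 for black, 1 for white, -1 for empty. */
int stone_at(const BitBoard *b, int row, int col) {
    unsigned int bit = 1u << col;
    if (b->black[row] & bit) return 0;
    if (b->white[row] & bit) return 1;
    return -1;
}
```

At 168 bytes per thread, 80 threads consume about 13 KB of the 16 KB shared memory for boards alone, which is consistent with the "almost exhausted" observation below.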

Optimal speed was at 80 threads per block, with 256 blocks. The card had
only 9% processor occupancy, due to the shared memory being almost
exhausted. However, branch divergence was at only 2%, which is not bad at
all, suggesting that this form of parallelism may not be a problem. This may
be because the "usual" case (a point either being illegal to play, or a
simple play with no need to merge or remove strings) is by far the most
common one.
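The 9% figure is consistent with a quick warp count, assuming (my assumptions, not stated in the e-mail) that only one 80-thread block fits per multiprocessor because shared memory is nearly exhausted, and that the GTX 285 (compute capability 1.3) allows 32 resident warps per multiprocessor:

```c
/* Back-of-the-envelope occupancy: resident warps / maximum warps. */
double occupancy_pct(int threads_per_block, int blocks_per_sm,
                     int max_warps_per_sm) {
    int warp_size = 32;
    /* partial warps still occupy a full warp slot */
    int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    return 100.0 * blocks_per_sm * warps_per_block / max_warps_per_sm;
}
```

With 80 threads (3 warps) and one block per multiprocessor, this gives 3/32, roughly the 9% reported.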

Conclusions:

I see these results as broadly negative with the current generation of
technology. Per-board parallelism on a GPU is not worth it compared to the
CPU speed and the severe drawbacks of working on a GPU (testing is hard,
unfamiliar environment for most programmers, lots of time to spend on
optimisation, etc).

The problems would be severely compounded by trying to integrate any tree
search, or heavy playouts. Trees are almost impossible to construct on a GPU
because pointers cannot be transferred from the host to the GPU. They could
still be represented using arrays, but the random nature of tree access
would cause huge penalties as it would prevent coalesced memory access.
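The array representation mentioned above is straightforward in itself: children are linked by indices into a flat node pool rather than by pointers, so the whole tree could be shipped to the device in one transfer. A minimal sketch (layout and names are hypothetical):

```c
/* Pointer-free tree: first-child / next-sibling links stored as
   indices into a flat array; -1 means "no node". */
typedef struct {
    int move;         /* move leading to this node */
    int firstChild;
    int nextSibling;
} Node;

#define MAX_NODES 1024
static Node node_pool[MAX_NODES];
static int node_count = 0;

int new_node(int move) {
    node_pool[node_count].move = move;
    node_pool[node_count].firstChild = -1;
    node_pool[node_count].nextSibling = -1;
    return node_count++;
}

/* Prepend a child to `parent`'s child list; returns the child index. */
int add_child(int parent, int move) {
    int c = new_node(move);
    node_pool[c].nextSibling = node_pool[parent].firstChild;
    node_pool[parent].firstChild = c;
    return c;
}
```

The representation is easy; the penalty described above comes from tree descent touching nodes scattered across the pool, which defeats coalesced global-memory access.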

Highly parallel algorithms (e.g. one thread per intersection) can still be
investigated, but my (unproven!) intuition is that it is not worth it, as
most intersections will be idle on any given move, wasting processor
occupancy time.

My feeling is that GPUs may have some potential in this area, but possibly
in a supplementary role such as running additional pattern matching in the
background, or driving machine learning components.

This e-mail is a bit hurried, so questions are welcome!

Christian

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
