Re: [computer-go] CUDA and GPU Performance

2009-09-13 Thread Petr Baudis
On Sun, Sep 13, 2009 at 01:02:40AM +0200, Vincent Diepeveen wrote:
 
 On Sep 10, 2009, at 12:55 AM, Michael Williams wrote:
 
 Very interesting stuff.  One glimmer of hope is that the memory
 situations should improve over time since memory grows but Go
 boards stay the same size.
 
 
 Well, you first have to figure out how fast or slow shifting is on
 the Nvidia cards before actually going for it :)

Just read the nVidia docs. Shifting has the same cost as addition.

P.S.: The PTX assembler is also documented. Or did you mean some
secret undocumented instruction goodies?
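
One quick way to check what actually gets emitted, as a rough sketch (the
file and kernel names are just placeholders): compile a trivial kernel to
PTX with nvcc and look at the instructions.

// shiftcost.cu -- compile with:  nvcc -ptx shiftcost.cu -o shiftcost.ptx
// then inspect the instructions emitted for the two statements below.
__global__ void shift_vs_add(unsigned int *out, unsigned int x, unsigned int s)
{
    out[0] = x >> s;   // should show up as a single shift (shr) instruction
    out[1] = x + s;    // should show up as a single add instruction
}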

Petr "Pasky" Baudis


Re: [computer-go] CUDA and GPU Performance

2009-09-13 Thread Vincent Diepeveen


On Sep 13, 2009, at 10:19 AM, Petr Baudis wrote:


Just read the nVidia docs. Shifting has the same cost as addition.



Document number and url?






Re: [computer-go] CUDA and GPU Performance

2009-09-13 Thread Petr Baudis
On Sun, Sep 13, 2009 at 10:48:12AM +0200, Vincent Diepeveen wrote:
 
 On Sep 13, 2009, at 10:19 AM, Petr Baudis wrote:
 Just read the nVidia docs. Shifting has the same cost as addition.
 
 
 Document number and url?

http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf

 P.S.: The PTX assembler is also documented. Or did you mean some
 secret undocumented instruction goodies?

http://www.nvidia.com/object/io_1213955209837.html

-- 
Petr "Pasky" Baudis
"A lot of people have my books on their bookshelves.
That's the problem, they need to read them." -- Don Knuth


Re: [computer-go] CUDA and GPU Performance

2009-09-13 Thread Vincent Diepeveen

A document from two weeks ago where they at least write *something*;
not bad from Nvidia, knowing they soon have to give lessons to
top coders :)

It's not really a systematic approach, though. We want a list of all
instructions with the latency and throughput that belong to each of
them. Lookup times for the caches, shared memory and RAM would also be
great to know, even if it were only bandwidth numbers.

I do see references to sections B and C for a multiplication
instruction and a memory fetch instruction that seem to exist,
but I can't find those sections anywhere.

Yet nowhere in the document do I see which hardware instructions the
Nvidia hardware supports.

Mind giving a page number?

Vincent

On Sep 13, 2009, at 11:43 AM, Petr Baudis wrote:


http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf








Re: [computer-go] CUDA and GPU Performance

2009-09-12 Thread Vincent Diepeveen

Thanks for sharing this, Christian;
my comments are inline below.

On Sep 9, 2009, at 5:54 PM, Christian Nentwich wrote:



I did quite a bit of testing earlier this year on running playout  
algorithms on GPUs. Unfortunately, I am too busy to write up a tech  
report on it, but I finally brought myself to take the time to  
write this e-mail at least. See bottom for conclusions.


For performance testing, I used my CPU board representation, and a  
CUDA port of the same (with adjustments), to test the following  
algorithm:

 - Clear the board
 - Fill the board according to a uniform random policy
 - Avoid filling eyes, according to simple neighbour check
 - Avoid simple ko
 - Count the score and determine the winner

In other words: no tree search is involved, and this is the  
lightest possible playout. The raw numbers are as follows:
 - CPU Search: 47,000 playouts per CPU core per second, on an Intel  
6600 Core-2 Duo
 - GPU Search: 170,000 playouts per second, on an NVidia Geforce  
285 card
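
To make the structure concrete, here is a rough sketch (not the code
described above) of what a one-playout-per-thread kernel can look like.
The Go-specific parts -- legality and eye checks, captures, ko, scoring --
are left as stub device functions, so this only illustrates the
thread-to-playout mapping, the per-thread RNG and the uniform random move
loop, not a working engine:

#include <curand_kernel.h>

#define BOARD_POINTS (19 * 19)
#define MAX_MOVES    (4 * BOARD_POINTS)   // crude cap on playout length

// Go-specific parts deliberately stubbed out; a real playout needs them.
struct Board { unsigned int black[21]; unsigned int white[21]; int ko; };

__device__ void clear_board(Board *b) { /* stub */ }
__device__ bool is_legal_and_not_own_eye(const Board *b, int point, int color) { return true; /* stub */ }
__device__ void play_move(Board *b, int point, int color) { /* stub */ }
__device__ int  score_winner(const Board *b) { return 0; /* stub: 0 = white wins, 1 = black wins */ }

// One playout per thread: each thread owns one board and one RNG state.
__global__ void playout_kernel(int *winners, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    curandState rng;
    curand_init(seed, tid, 0, &rng);

    Board board;                 // in the real thing this sits in shared memory
    clear_board(&board);

    int color = 1;               // black moves first
    for (int i = 0; i < MAX_MOVES; ++i) {
        int point = curand(&rng) % BOARD_POINTS;        // uniform random candidate
        if (is_legal_and_not_own_eye(&board, point, color))
            play_move(&board, point, color);
        color = 1 - color;
    }

    winners[tid] = score_winner(&board);   // result goes to global memory
}

In the version described above, the board and its counters live in the
16 KB of per-multiprocessor shared memory rather than in per-thread local
memory, which is what the memory discussion below is about.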




It would be interesting to know how many nodes per second this amounts
to hammering on the local shared memory, and what IPC you got overall.
The 9% figure might not be very accurate; read further for that.


Nvidia is of course currently the weakest of all GPU manufacturers if
you run on the consumer GPUs rather than on a Tesla. Nvidia is of course
also the most famous, as they were the first to bring GPGPU to the big
masses and to promote it. Note that in the 90s I already stumbled upon
implementations of chess programs on graphics cards, done by hardware
designers as a hobby; at the time those cards were in many cases clocked
higher than CPUs.

Nvidia is not really revealing which instructions the GPUs have, so this
is probably a major bottleneck in your program. To start with, it is
quite possible that Nvidia's entire 'gflops' figure is based on combined
instructions, such as multiply-add.

So a stream core can in principle do at most 1 instruction per cycle,
yet the numbers Nvidia slams around with are based on operations that
get counted as 2 instructions (so 8 flops), whereas in reality it is 1
instruction on the GPU.

If you are not using such combined instructions in your program, that
already hammers down the theoretical gain one can achieve on Nvidia.
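
As a back-of-the-envelope illustration of that counting style (the core
count and clock below are placeholders, not vendor figures):

#include <stdio.h>

int main(void)
{
    /* Marketing peak flops are typically stream_cores * shader_clock *
       flops_counted_per_core_per_cycle, where a multiply-add issues as one
       instruction but gets counted as 2 flops. Placeholder numbers only. */
    double cores            = 240.0;
    double shader_clock_ghz = 1.4;

    printf("1 op per cycle counted as 1 flop : %.0f GFLOPS\n",
           cores * shader_clock_ghz * 1.0);
    printf("same op counted as a 2-flop MAD  : %.0f GFLOPS\n",
           cores * shader_clock_ghz * 2.0);
    return 0;
}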


Other manufacturers seem more open here.

Note that approaching Nvidia privately didn't help either. That was for
a potential pilot project a while ago for simulation software
(simulation of anything from generic systems up to fighter jet
software). Obviously a pilot project is a pilot project and not some
massive deal. Nvidia indicated it would only *perhaps* reveal something
in case of a paper deal for really a lot of Tesla cards.

As we know those are a bit pricey (not for the simulator clients,
though), so a deal for hundreds or thousands of pre-ordered cards was of
course not going to work out. A pilot project is there to convince
people that something is possible.

We know that the bandwidth of the consumer GPUs for GPGPU work is a lot
worse than that of Nvidia's Tesla versions, which is the problem
throughout your story. However, I see solutions there.


Also, your result is not so bad. Most likely you wrote quite efficient
CUDA code. So basically what limited your nps, so to speak, is that your
program is probably doing very little work per node.

When using more knowledge it is quite possible that you would hardly
slow down, as latency to the memory is simply what you keep hitting. So
maybe you can make the program more knowledgeable at no cost.

What I am also missing from your side is some technical data on how many
bytes you look up in shared RAM for each stream core at each node. For
example (I might be confusing this with device RAM here), I thought the
cache line size is 256 bytes per read for each core.

So if you use considerably less than that, it is of course seriously
slower.


The algorithm running on the GPU is a straight port, with several  
optimisations then made to severely restrict memory access. This  
means the algorithm is a naive sort of parallel algorithm,  
parallel on a per-board level like the CPU implementation, rather  
than per-intersection or some other sort of highly parallel algorithm.


Memory access other than shared processor memory carries a severe  
penalty on the GPU. Instead, all threads running on the GPU at any  
one time have to make do with a fast shared memory of 16384 bytes. So:
 - The board was compressed into a bit board, using 2*21 unsigned  
ints per thread


2 * 21 * 32 = 1344 bits
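
Extending that arithmetic to the shared-memory budget, a rough sketch
(the four extra words for the empty/white/black counts and the ko
position are an assumed layout, not a figure from the mail):

#include <stdio.h>

int main(void)
{
    /* Per-thread state as described: 2 * 21 unsigned ints for the bitboard,
       plus (assumed) 4 ints for the colour counts and the ko position. */
    size_t bitboard_bytes = 2 * 21 * sizeof(unsigned int);  /* 168 bytes */
    size_t counter_bytes  = 4 * sizeof(int);                /* assumption */
    size_t per_thread     = bitboard_bytes + counter_bytes;
    size_t shared_mem     = 16 * 1024;                      /* 16 KB per multiprocessor */

    printf("bytes per thread     : %zu\n", per_thread);              /* 184   */
    printf("threads that fit     : %zu\n", shared_mem / per_thread); /* 89    */
    printf("64 threads would use : %zu bytes\n", 64 * per_thread);   /* 11776 */
    return 0;
}

So a 64-thread block of 19x19 boards fits in the 16 KB, with little room
left per thread for anything else.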

Maybe the GPU is simply very slow at bit manipulation; these things
have not been designed to do that kind of cryptography-style bit
twiddling.


 - The count of empty, white and black intersections and the ko  
position was also in shared memory per thread


Doesn't it speed things up near the edges to do things cheap and dirty,
allowing illegal moves, and just gamble that the illegal results don't
propagate back to the root? (This since the playouts' playing strength
is really weak anyway.)

Re: [computer-go] CUDA and GPU Performance

2009-09-12 Thread Vincent Diepeveen


On Sep 9, 2009, at 11:57 PM, Christian Nentwich wrote:


Mark,

let me try to add some more context to answer your questions. When  
I say in my conclusion that it's not worth it, I mean it's not  
worth using the GPU to run playout algorithms of the sort that are  
in use today. There may be many other algorithms that form part of  
Go engines where the GPU can provide an order-of-magnitude speedup.  
Still more where the GPU can run in parallel with the CPU to help.


In my experiments, a CPU core got 47,000 playouts per second and  
the GPU 170,000. But:
  - My computer has two cores (so it gets 94,000 playouts with 2  
threads)


Later-generation quad cores hardly have a higher IPC than the Core 2;
the Core 2 Duo has a brilliant IPC.

Nehalem is hardly faster than Phenom II or Core 2 for Diep. It's really
the compiler quality and the turbo-boost tricks that lure the audience
(for example, the experimental test machine at the test sites gets
cooled down to far under 20C, the entire machine, all components, as
there is a power increase of 10% when moving from 25C to 50C; ever seen
a machine at home that runs cooler than 50C? Additionally, the turbo
boost gets manually overclocked to +600 MHz, and the RAM is a type of
RAM you can't really afford, and so on). Of course these are
multi-billion companies and every single one of them tries to outdo the
others to look better.

So really you should compare 1 to 1 in terms of power.

The GTX 285 is then not so impressive: it is on par with quad-core
Nehalems in terms of gflops per watt. I wouldn't say it is an outdated
GPU, as it is a fast GPU, but for GPGPU it obviously is slow.

The latest AMD GPU is, however, 4 times better here.

So your result is at most a factor of 2 off against the Core 2 playouts
there, on a chip that in other areas is on par with Nehalem. You beat it
by a factor of 2 there.

An 8-core machine is not a fair comparison, as those have 2 sockets.
You should compare that with the 4-Tesla setup, which has 960 stream
cores.

The only fair Nvidia comparison against quad cores is when using a
Tesla. Now, I realize it is nearly $1800 a piece, which is a lot for a
GPU on steroids, yet that is the fair comparison, to be honest.

If we compare things, let's compare fairly. An 8-core Nehalem is the
maximum number of cores Intel can deliver as a fast single machine; I'm
skipping the 24-core box with a single memory controller from a year
ago (Dunnington). The 8-core setup you really should compare with the
Tesla times 4, so that's 960 cores.

In reality you buy a 300-euro card now. What is there for 300 euros
from Intel or AMD? Not a 3.2 GHz i7-965, that's for sure, as that thing
costs 1000+ euros.

So effectively you lose at most a factor of 2, and your Nvidia version
is then still scaling better.

As for the parallel speedup one would get out of game tree search with
so many threads versus 4 fast cores, that is a real issue. Yes, that is
not so efficient yet.

However, there is a solution there to get a good speedup (at least for
chess) that I figured out on paper. If there is one solution, I bet
there are more, and also solutions for computer Go. The problem, as
always, is getting funded to carry out something like that, as software
on a GPU doesn't sell, of course.

  - My computer's processor (intel core duo 6600) is 3 years old,  
and far from state of the art
  - My graphics card (Geforce 285) on the other hand, is recently  
purchased and one of the top graphics cards


That means that my old CPU already gets more than twice the speed  
of the GPU. An Intel Nehalem processor would surely beat it, let  
alone an 8-core system. Bearing in mind the severe drawbacks of the  
GPU - these are not general purpose processors, there is much you  
can't do on them - this limits their usefulness with this  
algorithm. Compare this speedup to truly highly parallel  
algorithms: random number generation, matrix multiplication, monte- 
carlo simulation of options (which are highly parallel because  
there is no branching and little data); you see speedups of 10x to  
100x over the CPU with those.




Matrix multiplication is not THAT easy to solve well. Initial attempts
were 20% efficient on Nvidia GPUs when using faster approximations (so
not the naive approach, which is very slow, but FFT-style), and that was
for ideally sized matrices. If it is already that inefficient in the
lab, that means there are problems everywhere.

It is maturing rapidly now, however.


Re: [computer-go] CUDA and GPU Performance

2009-09-12 Thread Vincent Diepeveen


On Sep 10, 2009, at 12:55 AM, Michael Williams wrote:

Very interesting stuff.  One glimmer of hope is that the memory  
situations should improve over time since memory grows but Go  
boards stay the same size.




Well, you first have to figure out how fast or slow shifting is on the
Nvidia cards before actually going for it :)

Remember, these are graphics processors, not CPUs. Even on some CPUs
shifting used to be expensive; I remember that on the P4 Prescott a
right shift took 7 cycles or more.


Vincent



Re: [computer-go] CUDA and GPU Performance

2009-09-10 Thread Christian Nentwich

Rene,

you're absolutely right, it's completely fishy! But don't worry, your
work is not in vain :) I noticed this morning, when I read your mail,
that I had included the 9x9 results in my original mail instead of 
19x19! Indeed, for 19x19 the results are even worse. Here's a complete 
rundown:


- 9x9 CPU: 47,000 playouts per core per second
- 9x9 GPU: 170,000 playouts per second

- 19x19 CPU: 9,800 playouts per core per second
- 19x19 GPU: 11,000 playouts per second

I did mention in another mail that the performance difference for 9x9 
should be larger, I think. What I didn't realise was that I had reported 
the 9x9 numbers by mistake!


Additional statistics:
 - Processor occupancy for 19x19 was 6% instead of 9%
 - Branch divergence was less than half a percent. It was 2% for 9x9. 
This is perhaps because of the larger board size causing more moves to 
fall onto empty intersections, or fewer moves requiring merges/captures.


Christian




Re: [computer-go] CUDA and GPU Performance

2009-09-10 Thread Heikki Levanto

Interesting stuff. I don't have the skills or the time to run such
experiments myself, but here is a simple idea:

When using a bitmap representation of the board, it is quite possible to find
all eye-like points with a constant number of bit-shifting operations. That
should reduce the number of branches.

It might even be possible to check liberties and remove captured stones
with a constant number of loop iterations, at least for most reasonable
board positions (the number would have to be unreasonably large for
pathological positions with very long and narrow groups).
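
A rough sketch of both ideas on a toy 7x7 board packed into a single
64-bit word (8 columns per row, the eighth column left empty as padding
so shifts cannot wrap between rows). A real 19x19 board needs an array
of words, but the shift pattern is the same; all names here are made up
for the example:

#include <stdint.h>
#include <stdio.h>

/* Toy 7x7 board in one 64-bit word: point (row, col) is bit row*8 + col,
   column 7 is never used so east/west shifts cannot leak between rows. */
#define COLS 8
static const uint64_t ONBOARD = 0x007F7F7F7F7F7F7FULL;   /* 7 rows x 7 cols */

/* Translate a whole set of points one step in a direction, staying on board. */
static uint64_t north(uint64_t b) { return (b >> COLS) & ONBOARD; }
static uint64_t south(uint64_t b) { return (b << COLS) & ONBOARD; }
static uint64_t west (uint64_t b) { return (b >> 1)    & ONBOARD; }
static uint64_t east (uint64_t b) { return (b << 1)    & ONBOARD; }

/* Empty points whose every on-board neighbour is a friendly stone
   ("eye-like" in the simple playout sense, no false-eye check):
   a fixed number of shifts, ANDs and ORs, no branches. */
static uint64_t eye_like(uint64_t own, uint64_t empty)
{
    uint64_t edge_n = ONBOARD & ~south(ONBOARD);   /* points with no north neighbour */
    uint64_t edge_s = ONBOARD & ~north(ONBOARD);
    uint64_t edge_w = ONBOARD & ~east(ONBOARD);
    uint64_t edge_e = ONBOARD & ~west(ONBOARD);

    return empty
        & (south(own) | edge_n)    /* north neighbour friendly, or on the edge */
        & (north(own) | edge_s)
        & (east(own)  | edge_w)
        & (west(own)  | edge_e);
}

/* The group containing 'seed', grown by repeated one-step dilation inside its
   own colour. The iteration count is fixed (board size), so there is no
   data-dependent branching; a much smaller cap works for everything except
   pathological snake-like groups. */
static uint64_t group_of(uint64_t seed, uint64_t same_colour)
{
    uint64_t g = seed & same_colour;
    for (int i = 0; i < 49; ++i)
        g |= (north(g) | south(g) | east(g) | west(g)) & same_colour;
    return g;
}

static uint64_t liberties(uint64_t group, uint64_t empty)
{
    return (north(group) | south(group) | east(group) | west(group)) & empty;
}

int main(void)
{
    /* Smoke test: a lone black stone at (3,3) on an otherwise empty board. */
    uint64_t black = 1ULL << (3 * COLS + 3);
    uint64_t empty = ONBOARD & ~black;

    printf("group size %d, liberties %d, eye-like points for black %d\n",
           __builtin_popcountll(group_of(black, black)),
           __builtin_popcountll(liberties(group_of(black, black), empty)),
           __builtin_popcountll(eye_like(black, empty)));   /* 1, 4, 0 */
    return 0;
}

Capture removal then amounts to: if the liberty set of the opponent group
adjacent to the move comes out empty, clear that group's bits from the
opponent bitboard.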

This still leaves a few places where the code will have to branch, most
notably choosing a random legal move. And the ending of the simulation, of
course.

If we solve all these, it should be possible to run simple playouts in
parallel, taking full advantage of the SIMD nature of the GPU. Of course,
each playout would be slower than necessary, but maybe that would even
out with enough of them running in parallel.

-H

-- 
Heikki Levanto   In Murphy We Turst heikki (at) lsd (dot) dk



Re: [computer-go] CUDA and GPU Performance

2009-09-09 Thread René van de Veerdonk
Christian,
Would you care to provide some more detail on your implementation for the
playouts? Your results are very impressive. At 19x19 Go using bit-boards,
your implementation is roughly 7x as fast as the bitboard implementation I
presented just a few weeks back, and also outperforms libEgo by about a
factor of two.

René

On Wed, Sep 9, 2009 at 2:57 PM, Christian Nentwich 
christ...@modeltwozero.com wrote:

 Mark,

 let me try to add some more context to answer your questions. When I say in
 my conclusion that it's not worth it, I mean it's not worth using the GPU
 to run playout algorithms of the sort that are in use today. There may be
 many other algorithms that form part of Go engines where the GPU can provide
 an order-of-magnitude speedup. Still more where the GPU can run in parallel
 with the CPU to help.

 In my experiments, a CPU core got 47,000 playouts per second and the GPU
 170,000. But:
  - My computer has two cores (so it gets 94,000 playouts with 2 threads)
  - My computer's processor (intel core duo 6600) is 3 years old, and far
 from state of the art
  - My graphics card (Geforce 285) on the other hand, is recently purchased
 and one of the top graphics cards

 That means that my old CPU already gets more than twice the speed of the
 GPU. An Intel Nehalem processor would surely beat it, let alone an 8-core
 system. Bearing in mind the severe drawbacks of the GPU - these are not
 general purpose processors, there is much you can't do on them - this limits
 their usefulness with this algorithm. Compare this speedup to truly highly
 parallel algorithms: random number generation, matrix multiplication,
 monte-carlo simulation of options (which are highly parallel because there
 is no branching and little data); you see speedups of 10x to 100x over the
 CPU with those.
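
For contrast, a minimal sketch of the branch-free kind of workload meant
here: a Monte Carlo estimate of a European call payoff, one price path
per thread. The parameter names and values are placeholders; averaging
the payoffs on the host gives the price estimate.

#include <curand_kernel.h>

// One geometric-Brownian-motion path per thread; the only "conditional",
// the payoff max(S_T - K, 0), is a branch-free fmaxf.
__global__ void mc_european_call(float *payoffs, unsigned long long seed,
                                 float S0, float K, float r,
                                 float sigma, float T)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    curandState rng;
    curand_init(seed, tid, 0, &rng);

    float z  = curand_normal(&rng);                     // one standard normal draw
    float ST = S0 * expf((r - 0.5f * sigma * sigma) * T + sigma * sqrtf(T) * z);
    payoffs[tid] = expf(-r * T) * fmaxf(ST - K, 0.0f);  // discounted payoff
}

Every thread executes exactly the same instruction sequence, which is why
this sort of kernel gets the 10x to 100x speedups and the playouts do not.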

 The 9% occupancy may be puzzling but there is little that can be done about
 that. This, and the talk about threads and blocks would take a while to
 explain, because GPUs don't work like general purpose CPUs. They are SIMD
 processors meaning that each processor can run many threads in parallel on
 different items of data but only if *all threads are executing the same
 instruction*. There is only one instruction decoding stage per processor
 cycle. If any if statements or loops diverge, threads will be serialised
 until they join again. The 9% occupancy is a function of the amount of data
 needed to perform the task, and the branch divergence (caused by the
 playouts being different). There is little that can be done about it other
 than use a completely different algorithm.

 If you google "CUDA block threads" you will find out more. In
 GPU runs like a grid cluster. In each block, 64 threads run in parallel,
 conceptually. On the actual hardware, in each processor 16 threads from one
 block will execute followed by 16 from another (half-warps). If any
 threads are blocked (memory reads costs ~400 cycles!) then threads from
 another block are scheduled instead. So the answer is: yes, there are 64 *
 80 threads conceptually but they're not always scheduled at the same time.
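
Concretely, the launch geometry described above looks like this (a
made-up example, not code from the thread); the kernel here just records
each thread's global index, standing in for one playout per thread:

#include <cstdio>

__global__ void record_tid(int *out)
{
    // Global thread index: which of the 80 * 64 conceptual threads am I?
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = tid;
}

int main()
{
    const int blocks = 80, threads = 64;              // 5120 conceptual threads
    int *d_out;
    cudaMalloc((void **)&d_out, blocks * threads * sizeof(int));

    // The hardware runs these in (half-)warps and swaps blocks in whenever a
    // warp stalls on a ~400-cycle global memory read.
    record_tid<<<blocks, threads>>>(d_out);

    int last;
    cudaMemcpy(&last, d_out + blocks * threads - 1, sizeof(int),
               cudaMemcpyDeviceToHost);
    printf("last thread index: %d\n", last);          // 5119

    cudaFree(d_out);
    return 0;
}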

 Comments on specific questions below.

 If parallelism is what you're looking for, why not have one thread per
 move candidate? Use that to collect AMAF statistics. 16Kb is not a lot
 to work with, so the statistics may have to be shared.


 One thread per move candidate is feasible with the architecture I used,
 since every thread has its own board. I have not implemented AMAF, so I
 cannot comment on the statistics bit, but the output of your algorithm is
 typically not in the 16k shared memory anyway. You'd write that to global
 memory (1GB). Would uniform random playouts be good enough for this though?
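
A minimal sketch of the "write it to global memory" idea: per-intersection
win/visit counters bumped with atomic adds. How a playout would hand over
its move list (the moves_played and won arrays) is an assumption of the
sketch, and whether uniform random playouts make such AMAF numbers useful
is exactly the question above.

#define BOARD_POINTS (19 * 19)

// One finished playout per thread; counters live in global memory, so the
// 16 KB of shared memory is not needed for the output at all.
__global__ void accumulate_amaf(unsigned int *wins, unsigned int *visits,
                                const int *moves_played, const int *won,
                                int moves_per_playout)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < moves_per_playout; ++i) {
        int p = moves_played[tid * moves_per_playout + i];
        atomicAdd(&visits[p], 1u);
        if (won[tid])
            atomicAdd(&wins[p], 1u);
    }
}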

 Another question I'd have is whether putting two graphics cards would
 double the capacity.


 Yes it would. It would pretty much precisely double it (the grid to
 schedule over just gets larger, but there is no additional overhead).

  Did you try this for 9x9 or 19x19?


 I used 19x19. If you do it for 9x9, you can probably run 128 threads per
 block because of the smaller board representation. The speedup would be
 correspondingly larger (4x or more). I chose 19x19 because of the severe
 memory limitations of the architecture; it seemed that 9x9 would just make
 my life a bit too easy for comfort...


 Christian


___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/