Re: [computer-go] CUDA and GPU Performance
On Sun, Sep 13, 2009 at 01:02:40AM +0200, Vincent Diepeveen wrote:
> On Sep 10, 2009, at 12:55 AM, Michael Williams wrote:
>> Very interesting stuff. One glimmer of hope is that the memory
>> situation should improve over time, since memory grows but Go boards
>> stay the same size.
>
> Well, you first have to figure out how fast or slow shifting is on the
> nvidias before actually going for it :)

Just read the nVidia docs. Shifting has the same cost as addition.

P.S.: The PTX assembler is also documented. Or did you mean some secret
undocumented instruction goodies?

Petr "Pasky" Baudis
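[The shift-equals-add claim is easy to check for yourself. Below is a
minimal sketch (the kernel is illustrative, not from this thread):
compile with `nvcc -ptx` and the generated PTX should show a single
`shl.b32` for the shift and a single `add.u32` for the addition, with
neither expanding into a multi-instruction sequence.]

    // shift_vs_add.cu -- illustrative; inspect with `nvcc -ptx shift_vs_add.cu`.
    // The shift and the add each compile to one PTX instruction.
    __global__ void shift_vs_add(unsigned int *out, unsigned int x, unsigned int n)
    {
        unsigned int s = x << n;   // becomes shl.b32
        unsigned int a = x + n;    // becomes add.u32
        out[threadIdx.x] = s ^ a;  // keep both results live
    }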
Re: [computer-go] CUDA and GPU Performance
On Sep 13, 2009, at 10:19 AM, Petr Baudis wrote:
> On Sun, Sep 13, 2009 at 01:02:40AM +0200, Vincent Diepeveen wrote:
>> Well, you first have to figure out how fast or slow shifting is on the
>> nvidias before actually going for it :)
>
> Just read the nVidia docs. Shifting has the same cost as addition.

Document number and URL?

> P.S.: The PTX assembler is also documented. Or did you mean some secret
> undocumented instruction goodies?
>
> Petr "Pasky" Baudis
Re: [computer-go] CUDA and GPU Performance
On Sun, Sep 13, 2009 at 10:48:12AM +0200, Vincent Diepeveen wrote:
> On Sep 13, 2009, at 10:19 AM, Petr Baudis wrote:
>> Just read the nVidia docs. Shifting has the same cost as addition.
>
> Document number and URL?

http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf

>> P.S.: The PTX assembler is also documented. Or did you mean some secret
>> undocumented instruction goodies?

http://www.nvidia.com/object/io_1213955209837.html

--
Petr "Pasky" Baudis

A lot of people have my books on their bookshelves.
That's the problem, they need to read them. -- Don Knuth
Re: [computer-go] CUDA and GPU Performance
A document from two weeks ago where they at least write *something*; not
bad from Nvidia, knowing they soon have to give lessons to top coders :)

It's not really a systematic approach, though. What we want is a list of
all instructions, with the latency and throughput that belong to each.
Lookup times for the caches, shared memory and RAM would also be great to
know, even if they were only bandwidth numbers.

I do see references to sections B and C for a multiplication instruction
and a memory fetch instruction that seem to exist, but I can't find those
sections at all. Nowhere in the document do I see which hardware
instructions the Nvidia hardware supports. Mind giving a page number?

Vincent

On Sep 13, 2009, at 11:43 AM, Petr Baudis wrote:
> On Sun, Sep 13, 2009 at 10:48:12AM +0200, Vincent Diepeveen wrote:
>> Document number and URL?
>
> http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf
>
> http://www.nvidia.com/object/io_1213955209837.html
Re: [computer-go] CUDA and GPU Performance
Thanks for sharing this, Christian; my comments are inline.

On Sep 9, 2009, at 5:54 PM, Christian Nentwich wrote:
> I did quite a bit of testing earlier this year on running playout
> algorithms on GPUs. Unfortunately, I am too busy to write up a tech
> report on it, but I finally brought myself to take the time to write
> this e-mail at least. See bottom for conclusions.
>
> For performance testing, I used my CPU board representation, and a CUDA
> port of the same (with adjustments), to test the following algorithm:
> - Clear the board
> - Fill the board according to a uniform random policy
> - Avoid filling eyes, according to a simple neighbour check
> - Avoid simple ko
> - Count the score and determine the winner
>
> In other words: no tree search is involved, and this is the lightest
> possible playout. The raw numbers are as follows:
> - CPU search: 47,000 playouts per CPU core per second, on an Intel 6600
>   Core 2 Duo
> - GPU search: 170,000 playouts per second, on an NVidia GeForce 285 card

It would be interesting to know how many nodes per second this comes to
in terms of hammering the local shared memory, and what IPC you got
overall. That 9% occupancy figure might not be very accurate; read
further for that.

Nvidia is of course the weakest of all GPU manufacturers currently if you
run on the consumer GPUs rather than the Tesla. Nvidia is most famous
because they were the first to promote GPGPU for the big masses. Note
that in the 90s I already stumbled upon implementations of chess programs
on graphics cards, done by hardware designers as a hobby; at the time
those cards were higher clocked than CPUs on many occasions.

Nvidia is not really revealing which instructions the GPUs have, so this
is probably a major bottleneck for your program. To start with, it is
quite possible that Nvidia's entire 'gflops' claim is based on combined
instructions such as multiply-add. A stream core can in principle execute
at most one instruction per cycle, yet the numbers Nvidia throws around
count such combined operations as two (so 8 flops) whereas in reality it
is one instruction on the GPU. If you are not using multiply-add in your
program, that alone hammers down the theoretical gain one can achieve on
Nvidia. Other manufacturers seem more open here.

Note that approaching Nvidia privately didn't help either. This was for a
potential pilot project a while ago for simulation software (simulation
of anything from generics up to fighter-jet software). Obviously a pilot
is a pilot project, not some massive deal. Nvidia indicated they would
only *perhaps* reveal something in the case of a paper deal for really a
lot of 'Tesla' cards. As we know those are a bit pricey (not for the
simulator clients), so a deal of hundreds or thousands of pre-ordered
cards would not work, of course. A pilot project is there to convince
someone that something is possible.

We know that the bandwidth of the consumer GPUs for GPGPU work is a lot
worse than that of Nvidia's Tesla versions, which is the problem
throughout your story. However, I see solutions there. Also, your result
is not so bad; most likely you wrote quite efficient software in CUDA.

So what limited your nps, so to speak, is basically that your program is
doing very little work per node. With more knowledge added, it is quite
possible that you would hardly slow down, as the latency to memory is
what you keep hitting anyway. So maybe you can make the program more
knowledgeable without loss.

Also, what I am missing from your side is some technical data on how many
bytes you look up in shared RAM for each stream core at each node. For
example (I might be confusing this with device RAM), I thought the cache
line size is 256 bytes per read for each core. So if you use considerably
less than that, it is of course seriously slower.

> The algorithm running on the GPU is a straight port, with several
> optimisations then made to severely restrict memory access. This means
> the algorithm is a naive sort of parallel algorithm, parallel on a
> per-board level like the CPU implementation, rather than
> per-intersection or some other sort of highly parallel algorithm.
> Memory access other than shared processor memory carries a severe
> penalty on the GPU. Instead, all threads running on the GPU at any one
> time have to make do with a fast shared memory of 16384 bytes. So:
> - The board was compressed into a bit board, using 2*21 unsigned ints
>   per thread (2 * 21 * 32 = 1344 bits)

Maybe the GPU is just very slow at bit manipulation; these things have
not been designed to do some sort of squared cryptography.

> - The count of empty, white and black intersections and the ko position
>   was also in shared memory, per thread

Wouldn't it speed things up near the edges to do things dirt cheap and
illegal, and just gamble that the illegal results don't backtrack to the
root? (This since their playing strength is still really weak anyway.)
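[To make the memory numbers above concrete, here is a minimal sketch of
how such a per-thread board can sit in shared memory. This is not
Christian's actual code; the names, the layout and the 64-threads-per-
block figure are assumptions based on the numbers quoted in the thread.]

    // playout_sketch.cu -- illustrative only, not the code discussed above.
    // 64 boards * 2 colours * 21 words * 4 bytes = 10752 bytes, which fits
    // in the 16384-byte shared memory of a compute 1.x multiprocessor.
    #define THREADS_PER_BLOCK 64
    #define WORDS_PER_COLOUR  21   // 21 * 32 = 672 bits per colour

    __global__ void playout_kernel(unsigned int *result, unsigned int seed)
    {
        __shared__ unsigned int black[THREADS_PER_BLOCK][WORDS_PER_COLOUR];
        __shared__ unsigned int white[THREADS_PER_BLOCK][WORDS_PER_COLOUR];

        int t = threadIdx.x;
        for (int w = 0; w < WORDS_PER_COLOUR; w++)   // clear the board
            black[t][w] = white[t][w] = 0;

        // ... fill the board with uniform random moves, skipping eye-like
        // points and simple ko, then count the score; omitted here ...
        result[blockIdx.x * blockDim.x + t] = seed;  // placeholder output
    }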
Re: [computer-go] CUDA and GPU Performance
On Sep 9, 2009, at 11:57 PM, Christian Nentwich wrote:
> Mark, let me try to add some more context to answer your questions.
>
> When I say in my conclusion that "it's not worth it", I mean it's not
> worth using the GPU to run playout algorithms of the sort that are in
> use today. There may be many other algorithms that form part of Go
> engines where the GPU can provide an order-of-magnitude speedup. Still
> more where the GPU can run in parallel with the CPU to help.
>
> In my experiments, a CPU core got 47,000 playouts per second and the
> GPU 170,000. But:
> - My computer has two cores (so it gets 94,000 playouts with 2 threads)

Later-generation quadcores hardly have a higher IPC than the Core 2; the
Core 2 duo has a brilliant IPC. Nehalem is hardly faster than Phenom II
or Core 2 for Diep. It's really compiler quality and turbo-boost tricks
that lure the audience (for instance, the experimental test machine at
test sites gets cooled down to far under 20C, the entire machine, all
components, as there is a power increase of 10% when moving from 25C to
50C; ever seen a machine at home that runs cooler than 50C? Additionally,
the turbo boost gets manually overclocked to +600MHz, and the RAM is of a
type you can't really afford, bla bla bla). Of course these are
multibillion companies, and every single one of them tries to outdo the
others to look better.

So really you should compare 1 to 1 power-wise. The GTX 285 is then not
so impressive: it's on par with quadcore Nehalems in terms of gflops per
watt. I wouldn't say it's an outdated GPU, as it is a fast GPU, but for
GPGPU it obviously is slow. The latest AMD GPU, however, is 4 times
better here. So your result is at most a factor of 2 off against Core 2
playouts, on a chip that in other areas is on par with Nehalem; you beat
it by a factor of 2 there.

An 8-core machine is not a fair comparison, as those have 2 sockets; you
should compare that with a 4-Tesla setup, which has 960 stream cores. The
only fair Nvidia comparison against quadcores is using Tesla. Now, I
realize it is nearly $1800 apiece, which is a lot for a GPU on steroids,
yet that is a fair comparison, to be honest. If we compare things, let's
compare fairly. An 8-core Nehalem is the maximum number of cores Intel
can deliver in a single fast machine (I'm skipping the 24-core box with a
single memory controller from a year ago, Dunnington). The 8-core setup
you really should compare against 4 Teslas, so 960 cores. In reality you
buy a 300-euro card now; and what is there for 300 euro from Intel or
AMD? Not a 3.2GHz i7-965, that's for sure, as that thing costs 1000+
euro. So effectively you lose at most a factor of 2, and your Nvidia
version still scales better then.

As for the parallel speedup one would get out of game-tree search with so
many threads versus 4 fast cores: yes, that is not so efficient yet.
However, there is a solution there to get a good speedup (at least for
chess) that I figured out on paper. If there is one solution, I bet there
are more, including solutions for computer Go. The problem, as always, is
getting funded to carry out something like that, as software on a GPU
doesn't sell, of course.

> - My computer's processor (Intel Core 2 Duo 6600) is 3 years old, and
>   far from state of the art
> - My graphics card (GeForce 285), on the other hand, was recently
>   purchased and is one of the top graphics cards
>
> That means that my old CPU already gets more than twice the speed of
> the GPU. An Intel Nehalem processor would surely beat it, let alone an
> 8-core system.
>
> Bearing in mind the severe drawbacks of the GPU - these are not general
> purpose processors, there is much you can't do on them - this limits
> their usefulness with this algorithm. Compare this speedup to truly
> highly parallel algorithms: random number generation, matrix
> multiplication, Monte Carlo simulation of options (which are highly
> parallel because there is no branching and little data); you see
> speedups of 10x to 100x over the CPU with those.

Matrix multiplication is not THAT easy to solve well. Initial attempts
were 20% efficient on Nvidia GPUs when using faster approximations (so
not the stupidly simple manner, which is ugly slow, but FFT-wise), and
that was for ideally sized matrices. If it is already that inefficient in
the lab, that means there are problems everywhere. It's maturing rapidly
now, however.

> The 9% occupancy may be puzzling, but there is little that can be done
> about that. This, and the talk about threads and blocks, would take a
> while to explain, because GPUs don't work like general purpose CPUs.
> They are SIMD processors, meaning that each processor can run many
> threads in parallel on different items of data, but only if *all
> threads are executing the same instruction*. There is only one
> instruction decoding stage per processor cycle. If any if statements or
> loops diverge, threads will be serialised until they join again.
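[To see what "serialised until they join again" means in code, here is a
tiny illustrative kernel; it is not from the thread.]

    // divergence.cu -- illustrative only. Threads of one warp that take
    // different branches are serialised: the hardware executes each branch
    // path in turn with the non-participating threads masked off, so the
    // warp pays the cost of both paths.
    __global__ void divergent(int *data)
    {
        int t = threadIdx.x;
        if (t % 2 == 0)        // even threads take this path first...
            data[t] *= 3;
        else                   // ...then odd threads take this one
            data[t] += 7;
    }

[In a playout kernel the branches depend on the random moves, so each
board diverges differently; that is what the branch-divergence
percentages quoted later in this thread measure.]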
Re: [computer-go] CUDA and GPU Performance
On Sep 10, 2009, at 12:55 AM, Michael Williams wrote:
> Very interesting stuff. One glimmer of hope is that the memory
> situation should improve over time, since memory grows but Go boards
> stay the same size.

Well, you first have to figure out how fast or slow shifting is on the
nvidias before actually going for it :) Remember these are graphics
processors, not CPUs. Even on some CPUs shifting was expensive; I
remember that on the P4 Prescott a right shift took 7 cycles or more.

Vincent
Re: [computer-go] CUDA and GPU Performance
Rene, you're absolutely right, it's completely fishy! But don't worry,
your work is not in vain :) I noticed this morning, when I read your
mail, that I had included the 9x9 results in my original mail instead of
19x19! Indeed, for 19x19 the results are even worse. Here's a complete
rundown:

- 9x9 CPU: 47,000 playouts per core per second
- 9x9 GPU: 170,000 playouts per second
- 19x19 CPU: 9,800 playouts per core per second
- 19x19 GPU: 11,000 playouts per second

I did mention in another mail that the performance difference for 9x9
should be larger, I think. What I didn't realise was that I had reported
the 9x9 numbers by mistake!

Additional statistics:
- Processor occupancy for 19x19 was 6% instead of 9%
- Branch divergence was less than half a percent, versus 2% for 9x9.
  This is perhaps because the larger board size causes more moves to
  fall onto empty intersections, or fewer moves to require
  merges/captures.

Christian

René van de Veerdonk wrote:
> Christian,
>
> Would you care to provide some more detail on your implementation of
> the playouts? Your results are very impressive. At 19x19 Go using
> bitboards, your implementation is roughly 7x as fast as the bitboard
> implementation I presented just a few weeks back, and also outperforms
> libEgo by about a factor of two.
>
> René
>
> [earlier quoted text trimmed; Christian's full reply to Mark appears
> later in this digest]
Re: [computer-go] CUDA and GPU Performance
Interesting stuff. I have neither the skills nor the time to make such
experiments myself, but here is a simple idea:

When using a bitmap representation of the board, it is quite possible to
find all eye-like points with a constant number of bit-shifting
operations. That should reduce the number of branches. It might even be
possible to check liberties and remove captured stones with a constant
number of loops, at least for most reasonable board positions (the number
would have to be unreasonably large for pathological positions with very
long and narrow groups).

This still leaves a few places where the code will have to branch, most
notably choosing a random legal move. And the ending of the simulation,
of course. If we solve all these, it should be possible to run simple
playouts in parallel, taking full advantage of the SIMD nature of the
GPU. Of course, each playout would be slower than necessary, but maybe
that would even out with enough of them running in parallel.

-H

--
Heikki Levanto   "In Murphy We Turst"   heikki (at) lsd (dot) dk
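[One way to realise the eye-detection part of this idea; a sketch of my
own, under the assumption of one machine word per board row with unused
high bits cleared in empty[]. None of this is from Heikki's or
Christian's code.]

    /* eye_like.c -- illustrative sketch. Finds "eye-like" points for black
     * in the simple-neighbour-check sense: empty points whose orthogonal
     * neighbours are all black (off-board neighbours count as friendly).
     * black[r] / empty[r]: bit c set iff point (r,c) is black / empty;
     * bits at column `cols` and above are assumed zero in empty[]. */
    typedef unsigned int row_t;

    void eye_like_black(const row_t *black, const row_t *empty,
                        row_t *eyes, int rows, int cols)
    {
        row_t edge_l = 1;                       /* off-board left of col 0  */
        row_t edge_r = (row_t)1 << (cols - 1);  /* off-board right of last  */
        for (int r = 0; r < rows; r++) {
            row_t left  = (black[r] << 1) | edge_l;  /* left neighbour black  */
            row_t right = (black[r] >> 1) | edge_r;  /* right neighbour black */
            row_t above = (r > 0)        ? black[r - 1] : ~(row_t)0;
            row_t below = (r < rows - 1) ? black[r + 1] : ~(row_t)0;
            eyes[r] = empty[r] & left & right & above & below;
        }
    }

[The only conditions in the loop depend on the row index, which is the
same for every board, so parallel threads running this never diverge on
board content; that is exactly the property Heikki is after.]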
Re: [computer-go] CUDA and GPU Performance
Christian,

Would you care to provide some more detail on your implementation of the
playouts? Your results are very impressive. At 19x19 Go using bitboards,
your implementation is roughly 7x as fast as the bitboard implementation
I presented just a few weeks back, and also outperforms libEgo by about a
factor of two.

René

On Wed, Sep 9, 2009 at 2:57 PM, Christian Nentwich
<christ...@modeltwozero.com> wrote:
> Mark, let me try to add some more context to answer your questions.
>
> When I say in my conclusion that "it's not worth it", I mean it's not
> worth using the GPU to run playout algorithms of the sort that are in
> use today. There may be many other algorithms that form part of Go
> engines where the GPU can provide an order-of-magnitude speedup. Still
> more where the GPU can run in parallel with the CPU to help.
>
> In my experiments, a CPU core got 47,000 playouts per second and the
> GPU 170,000. But:
> - My computer has two cores (so it gets 94,000 playouts with 2 threads)
> - My computer's processor (Intel Core 2 Duo 6600) is 3 years old, and
>   far from state of the art
> - My graphics card (GeForce 285), on the other hand, was recently
>   purchased and is one of the top graphics cards
>
> That means that my old CPU already gets more than twice the speed of
> the GPU. An Intel Nehalem processor would surely beat it, let alone an
> 8-core system. Bearing in mind the severe drawbacks of the GPU - these
> are not general purpose processors, there is much you can't do on them
> - this limits their usefulness with this algorithm. Compare this
> speedup to truly highly parallel algorithms: random number generation,
> matrix multiplication, Monte Carlo simulation of options (which are
> highly parallel because there is no branching and little data); you see
> speedups of 10x to 100x over the CPU with those.
>
> The 9% occupancy may be puzzling, but there is little that can be done
> about that. This, and the talk about threads and blocks, would take a
> while to explain, because GPUs don't work like general purpose CPUs.
> They are SIMD processors, meaning that each processor can run many
> threads in parallel on different items of data, but only if *all
> threads are executing the same instruction*. There is only one
> instruction decoding stage per processor cycle. If any if statements or
> loops diverge, threads will be serialised until they join again. The 9%
> occupancy is a function of the amount of data needed to perform the
> task, and of the branch divergence (caused by the playouts being
> different). There is little that can be done about it other than using
> a completely different algorithm.
>
> If you google "CUDA blocks threads" you will find out more. In short,
> the GPU runs like a grid cluster. In each block, 64 threads run in
> parallel, conceptually. On the actual hardware, in each processor 16
> threads from one block will execute, followed by 16 from another
> (half-warps). If any threads are blocked (memory reads cost ~400
> cycles!), threads from another block are scheduled instead. So the
> answer is: yes, there are 64 * 80 threads conceptually, but they're not
> always scheduled at the same time.
>
> Comments on specific questions below.
>
>> If parallelism is what you're looking for, why not have one thread per
>> move candidate? Use that to collect AMAF statistics. 16KB is not a lot
>> to work with, so the statistics may have to be shared.
>
> One thread per move candidate is feasible with the architecture I used,
> since every thread has its own board. I have not implemented AMAF, so I
> cannot comment on the statistics bit, but the output of your algorithm
> is typically not in the 16K shared memory anyway; you'd write that to
> global memory (1GB). Would uniform random playouts be good enough for
> this, though?
>
>> Another question I'd have is whether putting in two graphics cards
>> would double the capacity.
>
> Yes it would. It would pretty much precisely double it (the grid to
> schedule over just gets larger, but there is no additional overhead).
>
>> Did you try this for 9x9 or 19x19?
>
> I used 19x19. If you do it for 9x9, you can probably run 128 threads
> per block because of the smaller board representation. The speedup
> would be correspondingly larger (4x or more). I chose 19x19 because of
> the severe memory limitations of the architecture; it seemed that 9x9
> would just make my life a bit too easy for comfort...
>
> Christian
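[On the two-cards question: with the CUDA runtime of that era, doubling
up means one host thread per device. Below is a minimal sketch of the
pattern; it is illustrative only, since the thread never shows
Christian's host code.]

    // multi_gpu.cu -- illustrative sketch of splitting playouts across two
    // cards. Each host thread binds to one device with cudaSetDevice() and
    // launches its own copy of the playout kernel.
    #include <cuda_runtime.h>
    #include <pthread.h>

    static void *run_on_device(void *arg)
    {
        int dev = *(int *)arg;
        cudaSetDevice(dev);   // bind this host thread to one GPU
        // allocate device buffers, launch the playout kernel, and copy
        // the results back; each card handles half of the playouts
        return 0;
    }

    int main(void)
    {
        pthread_t t[2];
        int ids[2] = { 0, 1 };
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], 0, run_on_device, &ids[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], 0);
        return 0;
    }

[Since the two devices share nothing, the combined playout rate is simply
the sum of the two cards, which is why the capacity doubles with no extra
overhead.]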