Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization
I used stochastic sampling at internal nodes because of this:

> During the forward simulation phase of SEARCH, the action at each node x is selected by sampling a ~ π̄(·|x).
> As a result, the full imaginary trajectory is generated consistently according to policy π̄.
> In this section, we establish our main claim, namely that AlphaZero's action selection criteria can be interpreted as approximating the solution to a regularized policy-optimization objective.

I think they say UCT and PUCT are approximations of direct π̄ sampling, but I haven't understood section 3 well.

--
Kensuke Matsuzaki
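For anyone who wants to experiment, here is a minimal sketch of how π̄ can be computed at a node, based on my reading of the paper: π̄ maximizes q·π − λ_N KL(π_θ, π) over the simplex, which gives π̄(a) proportional to λ_N π_θ(a) / (α − q(a)), with the normalizer α found by bisection. The exact form of λ_N, the c_puct constant, and the interface are assumptions, not the authors' code.

import numpy as np

def pi_bar(q, prior, visits, c_puct=1.25, iters=50):
    # Regularized policy: pi_bar(a) = lam * prior(a) / (alpha - q(a)),
    # with alpha >= max(q + lam * prior) chosen by bisection so that the
    # probabilities sum to 1. Assumes strictly positive priors and at
    # least one visit; the schedule for lam is an assumption.
    n = max(visits.sum(), 1)
    lam = c_puct * np.sqrt(n) / (n + len(q))
    lo = (q + lam * prior).max()   # at this alpha, sum(pi_bar) >= 1
    hi = q.max() + lam             # at this alpha, sum(pi_bar) <= 1
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        if (lam * prior / (alpha - q)).sum() > 1.0:
            lo = alpha
        else:
            hi = alpha
    p = lam * prior / (alpha - q)
    return p / p.sum()

# During SEARCH one would then sample, instead of taking an argmax:
# a = np.random.choice(len(q), p=pi_bar(q, prior, visits))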
Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization
@Kensuke I suppose all the proposed algorithms ACT, SEARCH and LEARN are meant to be used during training, no? I think I understand ACT and LEARN, but I am not sure about SEARCH, for which they say this:

> During search, we propose to stochastically sample actions according to π̄ instead of the deterministic action selection rule of Eq. 1.

This sounds much like the random selection done at the root with temperature, but this time applied at internal nodes. Does it mean the pUCT formula is not used? Why does the selection have to be stochastic now? On selection, you compute π̄ every time from (q, π_theta, n_visits), so I suppose π̄ has everything it needs to balance exploration and exploitation.
Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization
I imagine that at low visits at least, "ACT" behaves similarly to Leela Zero's "LCB" move selection, which also has the effect of sometimes selecting a move that is not the max-visits move, if its value estimate has recently been found to be sufficiently larger to balance the fact that it is lower-prior and lower-visits (at least, typically, this is why the move wouldn't have been the max-visits move in the first place). It also scales in an interesting way with the empirically observed playout-by-playout variance of moves, but I think by far the most important part is that it can use a sufficiently confident high value to override max-visits.

The gain from "LCB" in match play, as I recall, is on the very rough order of 100 Elo, although it could be less or more depending on match conditions, the neural net used, and other things. So for LZ at least, "ACT"-like behavior at low visits is not new.
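Since LCB selection comes up repeatedly, here is a minimal sketch of the idea; this is not Leela Zero's actual code, and the z constant and variance bookkeeping are assumptions:

import math

def select_move_lcb(moves, z=2.0):
    # moves: list of dicts with per-move 'visits', 'mean' (value estimate),
    # and 'var' (empirical playout-by-playout variance of the value).
    # Start from the max-visits move, but let another move override it
    # when its value lower confidence bound beats the incumbent's.
    def lcb(m):
        if m["visits"] < 2:
            return -float("inf")          # too few samples to trust
        stderr = math.sqrt(m["var"] / m["visits"])
        return m["mean"] - z * stderr
    best = max(moves, key=lambda m: m["visits"])   # default: max visits
    challenger = max(moves, key=lcb)
    return challenger if lcb(challenger) > lcb(best) else best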
Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization
Hi,

I couldn't improve Leela Zero's strength by implementing SEARCH and ACT.
https://github.com/zakki/leela-zero/commits/regularized_policy

--
Kensuke Matsuzaki
Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization
This looks very interesting.

From a quick glance, it seems the improvement is mainly when the number of playouts is small. Also, they don't test on the game of Go. Has anybody tried it?

I will take a deeper look later.
[Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization
https://old.reddit.com/r/MachineLearning/comments/hrzooh/r_montecarlo_tree_search_as_regularized_policy/

--
Honesty is a very expensive gift. So, don't expect it from cheap people. - Warren Buffett
http://tayek.com/
Re: [Computer-go] monte carlo search; all valid moves?
Exactly, you don't check superko in the playouts, and you generally stop playouts at 3 * boardsize * boardsize moves. But once you start adding more logic to the playouts (like not playing in single-point eyes), they tend not to hit this limit.

Sent from my iPhone

On 28.03.2015 at 11:53, hughperkins2 hughperki...@gmail.com wrote:

> Oh wait, superko check is not that cheap, but it is so rare that you can probably ignore it in playouts, and just check before submitting a move to the server. If it's superko, then just pass, perhaps.
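For concreteness, a sketch of the kind of playout loop being described, with the 3 * boardsize * boardsize cap and single-point-eye avoidance; the Board methods (legal_moves, is_single_point_eye, play, score) are hypothetical helpers, not a real engine's API:

import random

def light_playout(board, to_move, max_moves=None):
    # No superko check inside the playout: only simple ko and suicide are
    # excluded by legal_moves(). The move cap guards against rare loops.
    size = board.size
    cap = max_moves or 3 * size * size
    passes = 0
    for _ in range(cap):
        candidates = [m for m in board.legal_moves(to_move)
                      if not board.is_single_point_eye(m, to_move)]
        if not candidates:
            passes += 1
            if passes == 2:
                break                      # both sides passed: game over
        else:
            passes = 0
            board.play(random.choice(candidates), to_move)
        to_move = -to_move                 # switch colors
    return board.score()                   # e.g. Tromp-Taylor count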
Re: [Computer-go] monte carlo search; all valid moves?
Well, you could check whether a point is empty or not, but you could also check whether you're putting stones in a dead area, or whether it would be suicide, etc.

On Sat, Mar 28, 2015 at 10:15:32 +0100, Álvaro Begué wrote:
> I am not sure I understand the question. The only thing that is typically not checked in the playouts is superko. What other validity checks are you performing?

--
Folkert van Heusden
Re: [Computer-go] monte carlo search; all valid moves?
20 playouts is a bit slow, Folkert. But I've been there too! Now my bot is at around 1,000 playouts per second on 19x19 and 5,000 on 9x9 (but I also check to not play into single-point eyes). That's still quite slow, and further optimisation will be required. But OTOH, on a 4-core machine without UCT (only the AMAF heuristic) I get around 1250 Elo on CGOS 13x13 (http://cgos.boardspace.net/13x13/cross/Imrscl-017-A-N.html).

Urban

On Sat, Mar 28, 2015 at 11:30, Petr Baudis pa...@ucw.cz wrote:
> On Sat, Mar 28, 2015 at 10:15:32 +0100, Álvaro Begué wrote:
>> I am not sure I understand the question. The only thing that is typically not checked in the playouts is superko. What other validity checks are you performing?
>
> There are a few:
>
> (i) No single-stone suicide. I can't think of how to play out without that; you wouldn't know when to stop playing.
>
> (ii) No multi-stone suicide. Sometimes in the fastest implementations this rule is relaxed (i.e. you proclaim that you are using Tromp-Taylor or New Zealand rules). But in real-world engines, you need to check things like self-atari or 2-liberty semeai, and then a suicide check naturally falls out from that too.
>
> (iii) No ko. This is a pretty cheap check, and without it, again, your playouts may end up looping.
>
> BTW, the fastest playouts are probably done by libego, and that's open source. But as I said in the past: consider not optimizing first; write some reasonable playout heuristics first and *then* profile your code.
>
> Petr Baudis

--
Blog: http://bettong.net/
Twitter: https://twitter.com/ujh
Homepage: http://www.urbanhafner.com/
Re: [Computer-go] monte carlo search; all valid moves?
20 playouts per second is pretty slow. Are you sure you're not just looping around endlessly or something? Playing in a dead area is legal, as is filling your own eyes, etc. Only suicide and ko are illegal. Even superko is a pretty cheap check.
Re: [Computer-go] monte carlo search; all valid moves?
> Because I read here that people can do 25k playouts per second while my program can only do ~20 per second when doing full validity checks on all steps.
>
> Do you have a reference, some context for the 25K playouts?

Look for the topic "What's a good playout speed?" in the mailing list archives: http://computer-go.org/pipermail/computer-go/

--
Folkert van Heusden
Re: [Computer-go] monte carlo search; all valid moves?
On Sat, Mar 28, 2015 at 08:51 +0100, folkert wrote:
> Hi,
>
> For a Monte-Carlo search, are only valid moves performed? Or does it work from beginning to the end of a playout using whatever free position is available on the board?
>
> Because I read here that people can do 25k playouts per second while my program can only do ~20 per second when doing full validity checks on all steps.

I am also interested in the question. Do you have a reference, some context for the 25K playouts?

holger

--
about me: http://holgerkrekel.net/about-me/
contracting: http://merlinux.eu
Re: [Computer-go] monte carlo search; all valid moves?
I am not sure I understand the question. The only thing that is typically not checked in the playouts is superko. What other validity checks are you performing?

Álvaro.
[Computer-go] monte carlo search; all valid moves?
Hi,

For a Monte-Carlo search, are only valid moves performed? Or does it work from beginning to the end of a playout using whatever free position is available on the board? Because I read here that people can do 25k playouts per second while my program can only do ~20 per second when doing full validity checks on all steps.

--
Folkert van Heusden
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com
Re: [Computer-go] monte carlo search; all valid moves?
Oh wait, superko check is not that cheap, but it is so rare that you can probably ignore it in playouts, and just check before submitting a move to the server. If it's superko, then just pass, perhaps.
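For reference, the usual way to make the root-level check cheap is a Zobrist hash per position plus a set of hashes seen so far in the game; a minimal sketch (the 9x9 size and the omitted capture handling are simplifications):

import random

random.seed(0)
SIZE = 9
BLACK, WHITE = 0, 1
# One random 64-bit key per (point, color). The hash of a position is the
# XOR of the keys of all occupied points, so it updates in O(1) per stone
# placed or removed.
KEYS = [[random.getrandbits(64) for _ in range(2)] for _ in range(SIZE * SIZE)]

def toggle(h, point, color):
    return h ^ KEYS[point][color]

# Before submitting a move: toggle in the new stone, toggle out any
# captures (omitted here), then a superko violation is just a set lookup.
def violates_superko(hash_after_move, seen_hashes):
    return hash_after_move in seen_hashes

# Usage sketch:
h = toggle(0, 40, BLACK)           # black plays tengen on an empty board
print(violates_superko(h, {0}))    # False: position not seen before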
Re: [computer-go] monte carlo
What method are you guys using for the Monte-Carlo search? What do you do? I pick a move at random, then for the resulting board that comes out of that I pick another move, and so on. Then, after that, I evaluate the number of stones etc. What do you guys look at to select a move using Monte Carlo?

On Sat, Oct 17, 2009 at 17:02:33 +0200, Folkert van Heusden wrote:
> People, I'm trying to implement a Monte-Carlo algorithm in my Go program. Now the results are dramatic: the Elo rating of my Go program drops from 1150 to below 700. I tried:
> - evaluating the number of captured stones
> - evaluating strategic elements (without MC this strategic eval gives that 1150 Elo)
> Currently my program can evaluate 500 positions per second, and I let it think for 5 seconds. What could be the cause of these dramatic results? Wrong evaluation? Not enough nodes processed?

Folkert van Heusden
Re: [computer-go] monte carlo
On Sun, Oct 25, 2009 at 18:52:56 +0100, Folkert van Heusden wrote:
> What method are you guys using for the Monte-Carlo search? What do you do? I pick a move at random, then for the resulting board that comes out of that I pick another move, and so on. Then, after that, I evaluate the number of stones etc.

Do you mean you count the score? Make sure you actually count the stones right, and (quite important) that you are picking moves really at random, not skewed in any way.

> What do you guys look at to select a move using Monte Carlo?

People use heuristics to prefer captures, escaping atari, playing local moves that match certain 3x3 patterns, etc. There are plenty of papers covering this; I recommend you look at them. It's fairly important to have all the heuristics somewhat randomized (depending on other parts of my engine, I've found any value between 60% and 90% probability of applying each heuristic optimal).

--
Petr "Pasky" Baudis
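A sketch of the randomized-heuristics structure Petr describes: each heuristic fires only with some probability (60-90% worked for him), falling back to a uniformly random legal move. The helper names and the probabilities are placeholders, not any particular engine's values:

import random

HEURISTICS = [
    (0.8, "capture_moves"),
    (0.8, "atari_escapes"),
    (0.7, "pattern_moves"),   # local 3x3 pattern matches near the last move
]

def choose_playout_move(board, color, helpers):
    # helpers maps a heuristic name to a function returning candidate moves.
    for prob, name in HEURISTICS:
        if random.random() < prob:        # heuristic is skipped 1-p of the time
            candidates = helpers[name](board, color)
            if candidates:
                return random.choice(candidates)
    return random.choice(board.legal_moves(color))   # uniform fallback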
Re: [computer-go] monte carlo
> > What method are you guys using for the Monte-Carlo search? What do you do? I pick a move at random, then for the resulting board that comes out of that I pick another move, and so on. Then, after that, I evaluate the number of stones etc.
>
> Do you mean you count the score? Make sure you actually count the stones right, and (quite important) that you are picking moves really at random, not skewed in any way.

Yes, the score indeed. Well, only the number of stones removed from the board. I did not implement an algorithm to count areas belonging to a color yet.

> People use heuristics to prefer captures, escaping atari, playing local moves that match certain 3x3 patterns, etc. ... It's fairly important to have all the heuristics somewhat randomized (depending on other parts of my engine, I've found any value between 60% and 90% probability of applying each heuristic optimal).

Yeah, I rated most of them equally. Some are weighted depending on applicability.

May I ask: how many board settings (one move + all evaluation) do your programs calculate per second?

Folkert van Heusden
Re: [computer-go] monte carlo
On Sun, Oct 25, 2009 at 22:16:58 +0100, Folkert van Heusden wrote:
> Yes, the score indeed. Well, only the number of stones removed from the board. I did not implement an algorithm to count areas belonging to a color yet.

That would seem to have only quite limited correlation with winning the game.

> May I ask: how many board settings (one move + all evaluation) do your programs calculate per second?

Depends on the hardware. ;-) We usually count the number of complete playouts per second. On 9x9, libego can play, I think, about 100k playouts per second on a decent CPU, but it is very light. Pachi can do 35k light playouts per second and 15k heavy playouts per second on the same CPU (single-threaded). Both Pachi and libego are written in a compiled language; when your speed is in the right order of magnitude, I've found playout move selection improvements much more profitable than further optimizing for speed.

--
Petr "Pasky" Baudis
RE: [computer-go] monte carlo
Many Faces of Go is currently much slower. On 9x9 it does about 18k playouts per second on two CPU cores, so about 9k per thread. The average 9x9 game is about 92 moves, so that's about 1.6 million board settings per second. The playouts are far from random: many moves are local responses to the last move, and when there is no forced local response, the global moves are not fully random either. My original light code did about 55k playouts per second on one CPU core, but was far weaker.

David
Re: [computer-go] Monte-Carlo Simulation Balancing
I admit I had trouble understanding the details of the paper. What I think is the biggest problem for applying this to bigger (up to 19x19) games is that you somehow need access to the true value of a move, i.e. whether it's a win or a loss. On the 5x5 board they used, this might be approximated pretty well, but there's no chance of doing so on 19x19.

On 13.08.2009 at 05:14, Michael Williams wrote:
> After about the 5th reading, I'm concluding that this is an excellent paper. Is anyone (besides the authors) doing research based on this? There is a lot to do.
Re: [computer-go] Monte-Carlo Simulation Balancing
In future papers they should avoid using a strong authority like Fuego for the training, and instead force it to learn from a naive uniform-random playout policy (with 100x or 1000x more playouts), and then build on that with an iterative approach (as was suggested in the paper).

I also had another thought. Since they are training the policy to maximize the balance and not the win rate, wouldn't you be able to extract more information from each trial by using the score instead of the game result? The normal pitfalls of doing so do not apply here.
Re: [computer-go] Monte-Carlo Simulation Balancing
A web search turned up a 2-page and an 8-page version. I read the short one. I agree that it's promising work that requires some follow-up research. Now that you've read it so many times, what excites you about it? Can you envision a way to scale it to larger patterns and boards on modern hardware?

Sent from my iPhone

On Aug 12, 2009, at 23:14, Michael Williams wrote:
> After about the 5th reading, I'm concluding that this is an excellent paper. Is anyone (besides the authors) doing research based on this? There is a lot to do.
[computer-go] Monte-Carlo Simulation Balancing
> Is anyone (besides the authors) doing research based on this?

Well, Pebbles does apply reinforcement learning (RL) to improve its playout policy, but not in the manner described in that paper. There are practical obstacles to directly applying it.

To directly apply that paper, you must have a CrazyStone playout design, wherein you maintain 3x3 neighborhoods around each point. Pebbles has a Mogo playout design, where you check for patterns only around the last move (or two). To directly pursue this would require a rewrite. Right now there is no published evidence that the Mogo design is inferior; in fact, two of the world's best programs use the Mogo design (Mogo, Fuego). So I am unwilling to make that commitment.

I would also have to research how to scale that paper to realistic conditions, including:
1) 9x9 boards at a minimum.
2) Self-play, instead of assuming an oracle.
3) Playout after a UCT/RAVE search rather than pure MC.
4) Pattern sets that have ~1 million parameters.
5) Pattern sets that have more general geometry than 3x3, perhaps.

My guess is that all of these research problems are solvable. But that's a lot of work to do. If I had to face this task list, I would put it off until later, because there is always an easier way to make progress.
Re: [computer-go] Monte-Carlo Simulation Balancing
> Pebbles has a Mogo playout design, where you check for patterns only around the last move (or two).

In MoGo, it's not only around the last move (at least with some probability, and when there are empty spaces on the board); this is the "fill board" modification. This provides a big improvement in 19x19 with big numbers of simulations; see http://www.lri.fr/~rimmel/publi/EK_explo.pdf, Fig. 3, page 8. It is not only a quantitative improvement: according to players, it is also a qualitative improvement in the way MoGo plays.

Best regards,
Olivier
[computer-go] Monte-Carlo Simulation Balancing
> > Pebbles has a Mogo playout design, where you check for patterns only around the last move (or two).
>
> In MoGo, it's not only around the last move (at least with some probability and when there are empty spaces on the board); this is the "fill board" modification.

Just to clarify: I was not saying that Mogo's policy consists *solely* of looking for patterns around the last move, merely that it does not look for patterns around *every* point, which other playout policies (e.g., CrazyStone, if I understand Remi's papers correctly) appear to do. The RL paper seems to require that playout design. If you want to prioritize patterns around every point, then you need a more sophisticated board representation than Pebbles uses.

BTW, FillBoard seems to help Pebbles too: a few percent better on 9x9 games. No testing on larger boards. YMMV, and like everything about computer Go, all predictions are guaranteed to be wrong, or your money back.
Re: [computer-go] Monte-Carlo Simulation Balancing
> Just to clarify: I was not saying that Mogo's policy consists *solely* of looking for patterns around the last move, merely that it does not look for patterns around *every* point, which other playout policies (e.g., CrazyStone, if I understand Remi's papers correctly) appear to do. The RL paper seems to require that playout design.

Fine!

> BTW, FillBoard seems to help Pebbles too: a few percent better on 9x9 games. No testing on larger boards. YMMV, and like everything about computer Go, all predictions are guaranteed to be wrong, or your money back.

For us the improvement is essential in 19x19, and it speaks to the generality of fill board if it helps for you also :-) The loss of diversity due to patterns is really clear in some situations, so the problem solved by fill board is understandable, and I believe it should work for you too. But, as you say, all predictions in computer Go are almost guaranteed to be wrong :-)

Olivier
Re: [computer-go] Monte-Carlo Simulation Balancing
After about the 5th reading, I'm concluding that this is an excellent paper. Is anyone (besides the authors) doing research based on this? There is a lot to do.

David Silver wrote:
> Hi everyone,
>
> Please find attached my ICML paper with Gerry Tesauro on automatically learning a simulation policy for Monte-Carlo Go. Our preliminary results show a 200+ Elo improvement over previous approaches, although our experiments were restricted to simple Monte-Carlo search with no tree on small boards.
>
> Abstract
>
> In this paper we introduce the first algorithms for efficiently learning a simulation policy for Monte-Carlo search. Our main idea is to optimise the balance of a simulation policy, so that an accurate spread of simulation outcomes is maintained, rather than optimising the direct strength of the simulation policy. We develop two algorithms for balancing a simulation policy by gradient descent. The first algorithm optimises the balance of complete simulations, using a policy gradient algorithm; whereas the second algorithm optimises the balance over every two steps of simulation. We compare our algorithms to reinforcement learning and supervised learning algorithms for maximising the strength of the simulation policy. We test each algorithm in the domain of 5x5 and 6x6 Computer Go, using a softmax policy that is parameterised by weights for a hundred simple patterns. When used in a simple Monte-Carlo search, the policies learnt by simulation balancing achieved significantly better performance, with half the mean squared error of a uniform random policy, and equal overall performance to a sophisticated Go engine.
>
> -Dave
[computer-go] Monte-Carlo Simulation Balancing
Has anyone tried this algorithm improvement on bigger boards and can give us a result?

Link to original message: http://computer-go.org/pipermail/computer-go/2009-April/018159.html

Thanks,
ibd

> > So maybe I could get 600 more Elo points with your method. And even more on 19x19. I noticed that, in general, changes in the playout policy have a much bigger impact on larger boards than on smaller boards.
>
> As to whether we can extrapolate, I hope so :-) I share the same feeling that improving the simulation policy will be more impactful on bigger boards with longer simulations. On the other hand, I've been surprised by many things in MC Go, so nothing is certain until we try it!
>
> -Dave
[computer-go] Monte Carlo on GPU
See the April 30, 2009 posting: http://www.tobiaspreis.de/
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi,

We used alpha = 0.1. There may well be a better setting of alpha, but this appeared to work nicely in our experiments.

-Dave

On 3-May-09, at 2:01, elife wrote:
> Hi Dave,
>
> In your experiments, what's the constant value alpha you set? Thanks.
>
> 2009/5/1 David Silver wrote:
>> Yes, in our experiments they were just constant numbers M=N=100.
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Yamato,

> If M and N are the same, is there any reason to run M simulations and N simulations separately? What happens if you combine them and calculate V and g in a single loop?

I think it gives the wrong answer to do it in a single loop. Note that the simulation outcomes z are used to compute both V and g, so that they are quite strongly correlated. In general, if random variables X and Y are correlated, then E[X]E[Y] != E[XY].

A simple example of this problem would be sampling E[X]E[X] for some random variable X. Let's say X is actually just noise with mean zero. If you take one sample x1, then x1*x1 is always positive, and you would estimate E[X]E[X] = E[X^2] > 0. But if you take two independent samples x1 and x2, then x1*x2 would be positive half the time and negative half the time, and you would correctly estimate E[X]E[X] = 0.

-Dave
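The bias is easy to check numerically; a quick sketch using standard normal noise, so E[X]E[X] = 0 but E[X^2] = 1:

import random

random.seed(1)
n = 100000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [random.gauss(0, 1) for _ in range(n)]
same = sum(x * x for x in xs) / n                 # reused samples: ~1, biased
indep = sum(x * y for x, y in zip(xs, ys)) / n    # independent samples: ~0
print(same, indep)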
Re: [computer-go] Monte-Carlo Simulation Balancing
David Silver wrote:
> 2. Run another N simulations; average the value of psi(s,a) over all positions and moves in games that black won (call this g).

This is strange: you do not take lost playouts into consideration. I believe there is a problem with your estimation of the gradient. Suppose, for instance, that you count z = +1 for a win and z = -1 for a loss. Then you would take lost playouts into consideration. This makes me a little suspicious.

The fundamental problem here may be that your estimate of the gradient is biased by the playout policy. You should probably sample X(s) uniformly at random to have an unbiased estimator. Maybe this can be fixed with importance sampling, and then you may get a formula that is symmetrical regarding wins and losses. I don't have time to do it now, but it may be worth taking a look.

Rémi
Re: [computer-go] Monte-Carlo Simulation Balancing
Rémi Coulom wrote:
> The fundamental problem here may be that your estimate of the gradient is biased by the playout policy. You should probably sample X(s) uniformly at random to have an unbiased estimator. Maybe this can be fixed with importance sampling, and then you may get a formula that is symmetrical regarding wins and losses. I don't have time to do it now, but it may be worth taking a look.

More precisely: you should estimate the value of N playouts as

  Sum_i p_i z_i / Sum_i p_i

instead of Sum_i z_i. Then take the gradient of Sum_i p_i z_i / Sum_i p_i. This would be better.

Maybe Sum_i p_i z_i / Sum_i p_i would be better for MCTS, too?

Rémi
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Remi,

> This is strange: you do not take lost playouts into consideration. I believe there is a problem with your estimation of the gradient. Suppose, for instance, that you count z = +1 for a win and z = -1 for a loss. Then you would take lost playouts into consideration. This makes me a little suspicious.

Sorry, I should have made it clear that this assumes we are treating black wins as z = 1 and white wins as z = 0. In this special case, the gradient is the average over games in which black won. But yes, more generally you need to include games won by both sides. The algorithms in the paper still cover this case; I was just trying to simplify their description to make the ideas easy to understand.

> The fundamental problem here may be that your estimate of the gradient is biased by the playout policy. You should probably sample X(s) uniformly at random to have an unbiased estimator. Maybe this can be fixed with importance sampling, and then you may get a formula that is symmetrical regarding wins and losses. I don't have time to do it now, but it may be worth taking a look.

The gradient already compensates for the playout policy (equation 9), so in fact it would bias the gradient to sample uniformly at random! In equation 9, the gradient is taken with respect to the playout policy parameters. Using the product rule (third line), the gradient is equal to the playout policy probabilities multiplied by the sum of likelihood ratios multiplied by the simulation outcomes z. This gradient can be computed by sampling playouts instead of multiplying by the playout policy probabilities. This is also why games with outcomes of z = 0 can be ignored: they don't affect this gradient computation.

> More precisely: you should estimate the value of N playouts as Sum_i p_i z_i / Sum_i p_i instead of Sum_i z_i. Then take the gradient of Sum_i p_i z_i / Sum_i p_i. This would be better.

I think a similar point applies here. We care about the expected value of the playout policy, which can be sampled directly from playouts, instead of multiplying by the policy probabilities. You would only need importance sampling if you were using a different playout policy from the one you are evaluating. But I guess I'm not seeing any good reason to do this?

-Dave
Re: [computer-go] Monte-Carlo Simulation Balancing
David Silver wrote:
> Sorry, I should have made it clear that this assumes we are treating black wins as z = 1 and white wins as z = 0. In this special case, the gradient is the average over games in which black won. But yes, more generally you need to include games won by both sides. The algorithms in the paper still cover this case; I was just trying to simplify their description to make the ideas easy to understand.

I understood this. What I find strange is that using -1/1 should be equivalent to using 0/1, but your algorithm behaves differently: it ignores lost games with 0/1, and uses them with -1/1.

Imagine you add a big constant to z. One million, say. This does not change the problem: you get either 1,000,000 or 1,000,001 as the outcome of a playout. But then your estimate of the gradient becomes complete noise. So maybe using -1/1 is better than 0/1?

Since your algorithm depends so much on the definition of the reward, there must be an optimal way to set the reward. Or there must be a better way to define an algorithm that would not depend on an offset in the reward.

> The gradient already compensates for the playout policy (equation 9), so in fact it would bias the gradient to sample uniformly at random!

Yes, you are right.

There is still something wrong that I don't understand. There may be a way to quantify the amount of noise in the unbiased gradient estimate, and it would depend on the average reward. Probably setting the average reward to zero is what would minimize noise in the gradient estimate. This is just an intuitive guess.

Rémi
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Remi,

> I understood this. What I find strange is that using -1/1 should be equivalent to using 0/1, but your algorithm behaves differently: it ignores lost games with 0/1, and uses them with -1/1. [...] Probably setting the average reward to zero is what would minimize noise in the gradient estimate. This is just an intuitive guess.

Okay, now I understand your point :-) It's a good question, and I think you're right. In REINFORCE, any baseline can be subtracted from the reward without affecting the expected gradient, but possibly reducing its variance. The baseline leading to the best estimate is indeed the average reward. So it should be the case that {-1,+1} would estimate the gradient g more efficiently than {0,1}, assuming that we see similar numbers of black wins and white wins across the training set.

So to answer your question, we can safely modify the algorithm to use (z - b) instead of z, where b is the average reward. This would then make the {0,1} and {-1,+1} cases equivalent (with appropriate scaling of the step size). I don't think this would have affected the results we presented (because all of the learning algorithms converged anyway, at least approximately, during training), but it could be an important modification for larger boards.

-Dave
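As a sketch of the (z - b) substitution in plain REINFORCE form; the running-average baseline and step sizes here are illustrative only, and in Algorithm 1 the same substitution would be applied inside the estimate of g:

def reinforce_update(theta, psi, z, b, alpha=0.1, decay=0.99):
    # E[(z - b) * grad log pi] = E[z * grad log pi] for any constant b,
    # because E[grad log pi] = 0; subtracting the average reward merely
    # reduces the variance of the estimate.
    b = decay * b + (1 - decay) * z           # running average reward
    for k in range(len(theta)):
        theta[k] += alpha * (z - b) * psi[k]  # psi ~ grad log pi(a|s)
    return theta, b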
Re: [computer-go] Monte-Carlo Simulation Balancing
I wish I was smart :(

David Silver wrote:
> Okay, now I understand your point :-) It's a good question, and I think you're right. [...]
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Yamato,

> Thanks for the detailed explanation. M, N and alpha are constant numbers, right? What did you set them to?

You're welcome! Yes, in our experiments they were just constant numbers, M = N = 100.

The feature vector is the set of patterns you use, with value 1 if a pattern is matched and 0 otherwise. The simulation policy selects actions in proportion to the exponentiated, weighted sum of all matching patterns. For example, let's say move a matches patterns 1 and 2, move b matches patterns 1 and 3, and move c matches patterns 2 and 4. Then move a would be selected with probability e^(theta1 + theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 + theta4)). The theta values are the weights on the patterns, which we would like to learn. They are the log of the Elo ratings in Remi Coulom's approach.

> OK, I guess it is formula 5 in the paper.

Yes, exactly.

> I still don't understand this. Is it formula 6? Could you please give me an example like the above?

Yes, that's right, this is equation 6. Okay, let's continue the example above. Let's say that in position s, using the current theta, moves a, b and c will be selected with probability 0.5, 0.3 and 0.2 respectively, and that move a was actually selected. Now consider pattern 1: it is matched after a, but the probability of matching was (0.5*1 + 0.3*1 + 0.2*0) = 0.8, so psi_1(s,a) = 1 - 0.8 = 0.2. For pattern 2, psi_2(s,a) = 1 - (0.5*1 + 0.3*0 + 0.2*1) = 0.3, etc. So in this example the vector psi(s,a) = [0.2, 0.3, -0.3, -0.2]. In other words, psi tells us whether each pattern was actually matched more or less than we would have expected.

Hope this helps.
-Dave
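Dave's worked example translates directly into code. A sketch of the softmax-over-patterns policy (his formula 5) and psi (formula 6), with 0-based pattern indices and the numbers mirroring his example:

import math

theta = [0.0, 0.0, 0.0, 0.0]                 # one weight per pattern
features = {                                 # move -> matched patterns
    "a": [0, 1],
    "b": [0, 2],
    "c": [1, 3],
}

def policy(theta, features):
    scores = {m: math.exp(sum(theta[k] for k in pats))
              for m, pats in features.items()}
    z = sum(scores.values())
    return {m: s / z for m, s in scores.items()}

def psi(theta, features, chosen):
    # psi = observed features minus expected features under the policy.
    pi = policy(theta, features)
    expected = [0.0] * len(theta)
    for m, pats in features.items():
        for k in pats:
            expected[k] += pi[m]
    observed = [1.0 if k in features[chosen] else 0.0
                for k in range(len(theta))]
    return [o - e for o, e in zip(observed, expected)]

# With uniform theta the policy is uniform (1/3 each), so psi("a") is
# [1/3, 1/3, -1/3, -1/3]; with probabilities (0.5, 0.3, 0.2) it would
# reproduce Dave's [0.2, 0.3, -0.3, -0.2].
print(psi(theta, features, "a"))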
Re: [computer-go] Monte-Carlo Simulation Balancing
IMO, other people's equations/code/ideas/papers always seem smarter than your own. The stuff you understand and do yourself just seems like common sense, and the stuff you don't always has a mystical air of complexity, at least until you understand it too :-)

On 30-Apr-09, at 13:59, Michael Williams wrote:
> I wish I was smart :(
Re: [computer-go] Monte-Carlo Simulation Balancing
David Silver wrote:
> Yes, in our experiments they were just constant numbers M=N=100.

If M and N are the same, is there any reason to run M simulations and N simulations separately? What happens if you combine them and calculate V and g in a single loop?

> Okay, let's continue the example above. [...] In other words, psi tells us whether each pattern was actually matched more or less than we would have expected.

I understood what psi was. I am not sure how it works, but anyway I can see your algorithm now. Thanks.

--
Yamato
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Yamato,

> Could you give us the source code which you used? Your algorithm is too complicated, so it would be very helpful if possible.

Actually, I think the source code would be much harder to understand! It is written inside RLGO, and makes use of a substantial existing framework that would take a lot of effort to understand. (On a separate note, I am considering making RLGO open source at some point, but I'd prefer to spend some effort cleaning it up before making it public.)

But I think maybe Algorithm 1 is much easier than you think:

A. Estimate the value V* of every position in a training set, using deep rollouts.
B. Repeat, for each position in the training set:
  1. Run M simulations, and estimate the value of the position (call this V).
  2. Run another N simulations, and average the value of psi(s,a) over all positions and moves in games that black won (call this g).
  3. Adjust the parameters by alpha * (V* - V) * g.

The feature vector is the set of patterns you use, with value 1 if a pattern is matched and 0 otherwise. The simulation policy selects actions in proportion to the exponentiated, weighted sum of all matching patterns. For example, let's say move a matches patterns 1 and 2, move b matches patterns 1 and 3, and move c matches patterns 2 and 4. Then move a would be selected with probability e^(theta1 + theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 + theta4)). The theta values are the weights on the patterns, which we would like to learn. They are the log of the Elo ratings in Remi Coulom's approach.

The only tricky part is computing the vector psi(s,a). Each component of psi(s,a) corresponds to a particular pattern, and is the difference between the observed feature (i.e. whether the pattern actually occurred after move a in position s) and the expected feature (the average value of the pattern, weighted by the probability of selecting each action). It's also very important to be careful about signs and the colour to play; it's easy to make a mistake and follow the gradient in the wrong direction.

Is that any clearer?
-Dave
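Putting the pieces together, a compact sketch of the A/B loop as described, for the z in {0,1} case; simulate() and the training-set format are hypothetical stand-ins, not RLGO's code:

def simulation_balancing(training_set, theta, simulate,
                         M=100, N=100, alpha=0.1):
    # training_set pairs each start position with its deep-rollout
    # estimate v_star. simulate(position, theta) plays one playout and
    # returns (z, steps), where z is 1 if black won and steps is the
    # list of psi(s, a) vectors for the moves of that playout.
    for position, v_star in training_set:
        # Step 1: estimate the position's value V from M playouts.
        v = sum(simulate(position, theta)[0] for _ in range(M)) / M
        # Step 2: accumulate psi over N more playouts, keeping only
        # games that black won (z = 1).
        g = [0.0] * len(theta)
        for _ in range(N):
            z, steps = simulate(position, theta)
            if z == 1:
                for psi_vec in steps:
                    for k, p in enumerate(psi_vec):
                        g[k] += p
        g = [x / N for x in g]
        # Step 3: move theta along the balancing gradient.
        for k in range(len(theta)):
            theta[k] += alpha * (v_star - v) * g[k]
    return theta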
Re: [computer-go] Monte-Carlo Simulation Balancing
David Silver wrote: Hi Michael, But one thing confuses me: You are using the value from Fuego's 10k simulations as an approximation of the actual value of the position. But isn't the actual value of the position either a win or a loss? On such small boards, can't you assume that Fuego is able to correctly determine who is winning and round its evaluation to the nearest win/loss? i.e. if it evaluates the position to 0.674, that gets rounded to 1. If such an assumption about Fuego's ability to read the position on a small board is valid, then it should improve the results of your balanced simulation strategy, right? Or am I missing something? It's true that 5x5 Go is solved, so in principle we could have used the true minimax values. However we chose to use an approach that can scale to larger boards, which means that we should treat the expert evaluations as approximate. And in fact Fuego was not always accurate on 6x6 boards, as we used only 10k simulations in our training set. Also, I think it really helps to have soft rather than hard expert evaluations. We want a simulation policy that helps differentiate e.g. a 90% winning position from an 85% winning position. Rounding all the expert evaluations to 0 or 1 would lose much of this advantage. -Dave By this argument (your last paragraph), you need some magical number of simulations for the training data: not enough simulations and you have too much noise, while infinite simulations give you hard 0 or 1 results. But I can't argue with your results. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
[computer-go] Re: [computer go] Monte-Carlo Simulation Balancing
Hi Remi, What komi did you use for 5x5 and 6x6 ? I used 7.5 komi for both board sizes. I find it strange that you get only 70 Elo points from supervised learning over uniform random. Don't you have any feature for atari extension ? This one alone should improve strength immensely (extend string in atari because of the previous move). Actually no. The features are very simple, and know how to capture but not how to defend ataris. I'm sure that a better set of features could improve by more than 70 Elo, but I expect we would still see a benefit to balancing the weights correctly. For example, the Fuego policy defends ataris and follows several other common-sense rules, but the results in 5x5 and 6x6 show that it is not well balanced on small boards. Let us extrapolate: I got more than 200 Elo points of improvements from my patterns in 9x9 over uniform random (I never really measured that, it may be even more than 200). I guess you got more than 200 Elo on 9x9; in Mogo (Gelly et al. 2006) the improvement over uniform random was at least 400 Elo, and I think your simulation policy is probably at least as strong. By the way, I was sad to hear you're not working on Crazystone any more. Is it because you are busy with other projects? So maybe I could get 600 more Elo points with your method. And even more on 19x19. I noticed that, in general, changes in the playout policy have a much bigger impact on larger boards than on smaller boards. As to whether we can extrapolate, I hope so :-) I share the same feeling that improving the simulation policy will be more impactful on bigger boards with longer simulations. On the other hand I've been surprised by many things in MC Go, so nothing is certain until we try it! -Dave ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Simulation Balancing
David Silver wrote:

A: Estimate value V* of every position in a training set, using deep rollouts.
B: Repeat, for each position in the training set:
   1. Run M simulations, estimate value of position (call this V)
   2. Run another N simulations, average the value of psi(s,a) over all positions and moves in games that black won (call this g)
   3. Adjust parameters by alpha * (V* - V) * g

Thanks for the detailed explanation. M, N and alpha are constant numbers, right? What did you set them to? The feature vector is the set of patterns you use, with value 1 if a pattern is matched and 0 otherwise. The simulation policy selects actions in proportion to the exponentiated, weighted sum of all matching patterns. For example let's say move a matches patterns 1 and 2, move b matches patterns 1 and 3, and move c matches patterns 2 and 4. Then move a would be selected with probability e^(theta1 + theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 + theta4)). The theta values are the weights on the patterns which we would like to learn. They are the log of the Elo ratings in Remi Coulom's approach. OK, I guess it is formula 5 in the paper. The only tricky part is computing the vector psi(s,a). Each component of psi(s,a) corresponds to a particular pattern, and is the difference between the observed feature (i.e. whether the pattern actually occurred after move a in position s) and the expected feature (the average value of the pattern, weighted by the probability of selecting each action). I still don't understand this. Is it formula 6? Could you please give me an example like the above? -- Yamato ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
[computer-go] Monte-Carlo Simulation Balancing
Hi Yamato, I like your idea, but why do you use only 5x5 and 6x6 Go?

1. Our second algorithm, two-ply simulation balancing, requires a training set of two-ply rollouts. Rolling out every position from a complete two-ply search is very expensive on larger board sizes, so we would probably have to consider some subset of leaf positions. We wanted to analyse the full algorithm first, before we started making approximations.
2. We can generate a lot more data on small boards, to give high confidence in the results we report.
3. IMO it's important to do the science to understand the underlying principles first, and then scale up to bigger boards, more complex Monte-Carlo searches, etc.

I don't think the 200+ Elo improvement is so impressive I agree that it would be much more impressive to report positive results on larger boards. But perhaps it is already interesting that tuning the balance of the simulation policy can make a big difference on small boards? Also, larger boards mean longer simulations, and so the importance of simulation balancing should become even more exaggerated. because the previous approaches were not optimized for such small boards. I'm not sure what you mean here? The supervised learning and reinforcement learning approaches that we compared against are both trained on the small boards, i.e. the pattern weights are specifically optimised for that size of board. I agree that the handcrafted policy from Fuego was not optimised for small boards, which is why it performed poorly. But perhaps this is also interesting, i.e. it suggests that a handcrafted policy for 9x9 Go may be significantly suboptimal when used in 19x19 Go. So automatically learning a simulation policy may ultimately prove to be very beneficial. I'm looking forward to your results on larger boards. Me too :-) Coming soon, will let you know how it goes. But I hope that others will try these ideas too, it's always much better to compare multiple implementations of the same algorithm. -Dave ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Simulation Balancing
David Silver wrote: I don't think the 200+ Elo improvement is so impressive I agree that it would be much more impressive to report positive results on larger boards. But perhaps it is already interesting that tuning the balance of the simulation policy can make a big difference on small boards? Also, larger boards mean longer simulations, and so the importance of simulation balancing should become even more exaggerated. What komi did you use for 5x5 and 6x6 ? I find it strange that you get only 70 Elo points from supervised learning over uniform random. Don't you have any feature for atari extension ? This one alone should improve strength immensely (extend string in atari because of the previous move). Let us extrapolate: I got more than 200 Elo points of improvements from my patterns in 9x9 over uniform random (I never really measured that, it may be even more than 200). So maybe I could get 600 more Elo points with your method. And even more on 19x19. I noticed that, in general, changes in the playout policy have a much bigger impact on larger boards than on smaller boards. Rémi ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Simulation Balancing
I noticed that, in general, changes in the playout policy have a much bigger impact on larger boards than on smaller boards. Rémi I think rating differences are amplified on larger boards. This is easy to see if you think about it this way: somehow a 19x19 board is like 4 9x9 boards. Let us define a new game that I would call 4-Go, where instead of playing one game, you play 4 games simultaneously and determine the winner by calculating the sum of the scores of the four games. Certainly rating differences would be bigger with 4-Go than with Go (given the same two players). This explains why rating differences are bigger on 19x19 than 9x9. Ivan ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
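Ivan's argument can be made precise with a back-of-envelope Gaussian sketch (my formalisation, not from the thread). Let S_i be the stronger player's score margin on one board, with mean \mu > 0 and standard deviation \sigma, and assume the four boards are independent:

\[
S_{\text{4-Go}} = \sum_{i=1}^{4} S_i, \qquad
\mathbb{E}[S_{\text{4-Go}}] = 4\mu, \qquad
\operatorname{sd}(S_{\text{4-Go}}) = 2\sigma,
\]
\[
\Pr(\text{win one board}) \approx \Phi\!\left(\frac{\mu}{\sigma}\right)
\quad\text{vs.}\quad
\Pr(\text{win 4-Go}) \approx \Phi\!\left(\frac{4\mu}{2\sigma}\right) = \Phi\!\left(\frac{2\mu}{\sigma}\right),
\]

where \Phi is the standard normal CDF. The effective skill gap doubles, which is the amplification Ivan and Rémi observe.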
Re: [computer-go] Monte-Carlo Simulation Balancing
A simplistic model that helps explain this is golf. On a single hole, even a casual golfer has a realistic chance of out-golfing Tiger Woods. Tiger occasionally shoots a 1 over par on some hole and even weak amateurs occasionally par or even birdie a hole. It's not going to happen a lot, but it's not ridiculous either. Years ago I taught a player how to golf, and on his third time out with me, he hit a hole in one on a short par 3. If Tiger Woods had been playing with us, he would have lost that hole to this beginner. But in a 9 hole match, the odds go down enormously - for all practical purposes there is no chance. I kind of think of GO like that, even though it's a pretty simplistic model. Each move is like a hole of golf, it can be a good shot or a bad one. With GO, however, probably a LOT of your moves are just as good as the moves of a good player. But it's the ones that fall short that kill you. Go on a big board is like 18 holes of golf compared to just 1 or 2 holes of golf. The better player is far more likely to win the 18 hole match than the 1 hole match. - Don On Tue, Apr 28, 2009 at 1:53 PM, Ivan Dubois dubois_i...@yahoo.fr wrote: I noticed that, in general, changes in the playout policy have a much bigger impact on larger boards than on smaller boards. Rémi I think rating differences are amplified on larger boards. This is easy to see if you think about it this way: somehow a 19x19 board is like 4 9x9 boards. Let us define a new game that I would call 4-Go, where instead of playing one game, you play 4 games simultaneously and determine the winner by calculating the sum of the scores of the four games. Certainly rating differences would be bigger with 4-Go than with Go (given the same two players). This explains why rating differences are bigger on 19x19 than 9x9. Ivan ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Simulation Balancing
also, i'm not sure that a lot of most amateurs' moves are very good. the spectrum of bad moves is wide, it's just that it takes someone many stones stronger to severely punish small differences between good and nearly-good moves. among players of relatively similar strength, these differences will go unnoticed and unpunished. s. 2009/4/28 Don Dailey dailey@gmail.com: A simplistic model that helps explain this is golf. On a single hole, even a casual golfer has a realistic chance of out-golfing Tiger Woods. Tiger occasionally shoots a 1 over par on some hole and even weak amateurs occasionally par or even birdie a hole. It's not going to happen a lot, but it's not ridiculous either. Years ago I taught a player how to golf, and on his third time out with me, he hit a hole in one on a short par 3. If Tiger Woods had been playing with us, he would have lost that hole to this beginner. But in a 9 hole match, the odds go down enormously - for all practical purposes there is no chance. I kind of think of GO like that, even though it's a pretty simplistic model. Each move is like a hole of golf, it can be a good shot or a bad one. With GO, however, probably a LOT of your moves are just as good as the moves of a good player. But it's the ones that fall short, that kill you. Go on a big board is like 18 holes of golf compared to just 1 or 2 holes of golf. The better player is far more likely to win the 18 hole match than the 1 hole match. - Don On Tue, Apr 28, 2009 at 1:53 PM, Ivan Dubois dubois_i...@yahoo.fr wrote: I noticed that, in general, changes in the playout policy have a much bigger impact on larger boards than on smaller boards. Rémi I think rating differences are amplified on larger boards. This is easy to see if you think about it this way: somehow a 19x19 board is like 4 9x9 boards. Let us define a new game that I would call 4-Go, where instead of playing one game, you play 4 games simultaneously and determine the winner by calculating the sum of the scores of the four games. Certainly rating differences would be bigger with 4-Go than with Go (given the same two players). This explains why rating differences are bigger on 19x19 than 9x9. Ivan ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Simulation Balancing
A fair number of amateur moves are exactly what a pro would use; this is no great surprise, since the moves are actually copied from professional games. The weaker player can blindly emulate, but does not know why a pro plays A instead of B or C; does not know how to punish slack moves. I've been taking some classes from an 8d pro. Sadly, I'm at the bottom of the class. The top players in these classes are 8d amateurs; the pro can give them 3 stones and win. When going over their games, many of their moves are quite respectable, but the pro is able to find mistakes which can be exploited with devastating effect. You can't afford to be a little bit off against a pro, but weaker players find it hard to exploit slack moves. When explaining his moves, the pro often uses very deep reading. We've all seen simple cases, where move A is advisable if a ladder works, otherwise B is recommended. The pro knows six moves in advance that he'll need to read that ladder, before the ladder is actually on the board. These are the simplest expositions I've seen; things get more complex from there. Terry McIntyre terrymcint...@yahoo.com Government is an association of men who do violence to the rest of us. - Leo Tolstoy From: steve uurtamo uurt...@gmail.com To: computer-go computer-go@computer-go.org Sent: Tuesday, April 28, 2009 5:09:20 PM Subject: Re: [computer-go] Monte-Carlo Simulation Balancing also, i'm not sure that a lot of most amateurs' moves are very good. the spectrum of bad moves is wide, it's just that it takes someone many stones stronger to severely punish small differences between good and nearly-good moves. among players of relatively similar strength, these differences will go unnoticed and unpunished. s. 2009/4/28 Don Dailey dailey@gmail.com: A simplistic model that helps explain this is golf. On a single hole, even a casual golfer has a realistic chance of out-golfing Tiger Woods. Tiger occasionally shoots a 1 over par on some hole and even weak amateurs occasionally par or even birdie a hole. It's not going to happen a lot, but it's not ridiculous either. Years ago I taught a player how to golf, and on his third time out with me, he hit a hole in one on a short par 3. If Tiger Woods had been playing with us, he would have lost that hole to this beginner. But in a 9 hole match, the odds go down enormously - for all practical purposes there is no chance. I kind of think of GO like that, even though it's a pretty simplistic model. Each move is like a hole of golf, it can be a good shot or a bad one. With GO, however, probably a LOT of your moves are just as good as the moves of a good player. But it's the ones that fall short, that kill you. Go on a big board is like 18 holes of golf compared to just 1 or 2 holes of golf. The better player is far more likely to win the 18 hole match than the 1 hole match. - Don On Tue, Apr 28, 2009 at 1:53 PM, Ivan Dubois dubois_i...@yahoo.fr wrote: I noticed that, in general, changes in the playout policy have a much bigger impact on larger boards than on smaller boards. Rémi I think rating differences are amplified on larger boards. This is easy to see if you think about it this way: somehow a 19x19 board is like 4 9x9 boards. Let us define a new game that I would call 4-Go, where instead of playing one game, you play 4 games simultaneously and determine the winner by calculating the sum of the scores of the four games. Certainly rating differences would be bigger with 4-Go than with Go (given the same two players). 
This explains why rating differences are bigger on 19x19 than 9x9. Ivan ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Simulation Balancing
David Silver wrote: because the previous approaches were not optimized for such small boards. I'm not sure what you mean here? The supervised learning and reinforcement learning approaches that we compared against are both trained on the small boards, i.e. the pattern weights are specifically optimised for that size of board. Ah, sorry. I missed it. I agree that the handcrafted policy from Fuego was not optimised for small boards, which is why it performed poorly. But perhaps this is also interesting, i.e. it suggests that a handcrafted policy for 9x9 Go may be significantly suboptimal when used in 19x19 Go. So automatically learning a simulation policy may ultimately prove to be very beneficial. It is surprising that Fuego was far worse than uniform random on 5x5 and 6x6. These results show how particular very small boards are. I suppose 5x5 is very different from 9x9, but 19x19 is not so different from 9x9. I'm looking forward to your results on larger boards. Me too :-) Coming soon, will let you know how it goes. But I hope that others will try these ideas too, it's always much better to compare multiple implementations of the same algorithm. Could you give us the source code which you used? Your algorithm is too complicated, so it would be very helpful if possible. -- Yamato ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Simulation Balancing
But I'm only trying to make a point, not pin the analogy down perfectly. Naturally the stronger the player, the more likely his moves will conform to the level of the top players. The basic principle is that the longer the contest, the more opportunities a strong player has to demonstrate his superiority over an inferior player and/or recover from mistakes or surprises. In my golf analogy, my friend hits a hole in one. Let's say it was on the first hole of 9 against Tiger Woods. Tiger likely would have hit it in 2 strokes, and my beginner friend would be ahead by 1 stroke! But that 1 stroke lead would almost certainly disappear by the time the next hole was completed, as my friend was typically taking 7 or 8 shots on that par 3 course! Another way to look at this is that a random legal move generator has a chance to play a perfect game, although that chance is almost infinitesimal. But even that small chance can be an order of magnitude larger if you can shorten the contest even by 1 or 2 moves. - Don On Tue, Apr 28, 2009 at 8:09 PM, steve uurtamo uurt...@gmail.com wrote: also, i'm not sure that a lot of most amateurs' moves are very good. the spectrum of bad moves is wide, it's just that it takes someone many stones stronger to severely punish small differences between good and nearly-good moves. among players of relatively similar strength, these differences will go unnoticed and unpunished. s. 2009/4/28 Don Dailey dailey@gmail.com: A simplistic model that helps explain this is golf. On a single hole, even a casual golfer has a realistic chance of out-golfing Tiger Woods. Tiger occasionally shoots a 1 over par on some hole and even weak amateurs occasionally par or even birdie a hole. It's not going to happen a lot, but it's not ridiculous either. Years ago I taught a player how to golf, and on his third time out with me, he hit a hole in one on a short par 3. If Tiger Woods had been playing with us, he would have lost that hole to this beginner. But in a 9 hole match, the odds go down enormously - for all practical purposes there is no chance. I kind of think of GO like that, even though it's a pretty simplistic model. Each move is like a hole of golf, it can be a good shot or a bad one. With GO, however, probably a LOT of your moves are just as good as the moves of a good player. But it's the ones that fall short, that kill you. Go on a big board is like 18 holes of golf compared to just 1 or 2 holes of golf. The better player is far more likely to win the 18 hole match than the 1 hole match. - Don On Tue, Apr 28, 2009 at 1:53 PM, Ivan Dubois dubois_i...@yahoo.fr wrote: I noticed that, in general, changes in the playout policy have a much bigger impact on larger boards than on smaller boards. Rémi I think rating differences are amplified on larger boards. This is easy to see if you think about it this way: somehow a 19x19 board is like 4 9x9 boards. Let us define a new game that I would call 4-Go, where instead of playing one game, you play 4 games simultaneously and determine the winner by calculating the sum of the scores of the four games. Certainly rating differences would be bigger with 4-Go than with Go (given the same two players). This explains why rating differences are bigger on 19x19 than 9x9. 
Ivan ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Simulation Balancing
Finally! I guess you can add this technique to your list, Lukasz. David Silver wrote: Hi everyone, Please find attached my ICML paper with Gerry Tesauro on automatically learning a simulation policy for Monte-Carlo Go. Our preliminary results show a 200+ Elo improvement over previous approaches, although our experiments were restricted to simple Monte-Carlo search with no tree on small boards. Abstract In this paper we introduce the first algorithms for efficiently learning a simulation policy for Monte-Carlo search. Our main idea is to optimise the balance of a simulation policy, so that an accurate spread of simulation outcomes is maintained, rather than optimising the direct strength of the simulation policy. We develop two algorithms for balancing a simulation policy by gradient descent. The first algorithm optimises the balance of complete simulations, using a policy gradient algorithm; whereas the second algorithm optimises the balance over every two steps of simulation. We compare our algorithms to reinforcement learning and supervised learning algorithms for maximising the strength of the simulation policy. We test each algorithm in the domain of 5x5 and 6x6 Computer Go, using a softmax policy that is parameterised by weights for a hundred simple patterns. When used in a simple Monte-Carlo search, the policies learnt by simulation balancing achieved significantly better performance, with half the mean squared error of a uniform random policy, and equal overall performance to a sophisticated Go engine. -Dave ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Simulation Balancing
David Silver wrote: Hi everyone, Please find attached my ICML paper with Gerry Tesauro on automatically learning a simulation policy for Monte-Carlo Go. Our preliminary results show a 200+ Elo improvement over previous approaches, although our experiments were restricted to simple Monte-Carlo search with no tree on small boards. Very interesting, thanks. If I understand correctly, your method makes your program 250 Elo points stronger than my pattern-learning algorithm on 5x5 and 6x6, by just learning better weights. That is impressive. I had stopped working on my program for a very long time, but maybe I will go back to work and try your algorithm. Rémi ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
[computer-go] Monte-Carlo Simulation Balancing
Hi Remi, If I understand correctly, your method makes your program 250 Elo points stronger than my pattern-learning algorithm on 5x5 and 6x6, by just learning better weights. Yes, although this is just in a very simple MC setting. Also we did not compare directly to the algorithm you used (minorisation/maximisation), but to a different algorithm for maximising the likelihood of an expert policy. But as the softmax policy is a log transformation of the generalised Bradley-Terry model, I think this comparison is still valid. I had stopped working on my program for a very long time, but maybe I will go back to work and try your algorithm. I would be really interested to know how this works out! Please let me know. Best wishes -Dave ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
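For reference, the log transformation David mentions is straightforward (my notation, not from the paper): in the generalised Bradley-Terry model each pattern k has a rating \gamma_k and a move's strength is the product of the ratings of its matching patterns, so setting \theta_k = \log \gamma_k gives

\[
\Pr(a \mid s) \;=\; \frac{\prod_{k \in a} \gamma_k}{\sum_{b} \prod_{k \in b} \gamma_k}
\;=\; \frac{e^{\sum_{k \in a} \theta_k}}{\sum_{b} e^{\sum_{k \in b} \theta_k}},
\]

i.e. exactly the softmax policy over summed pattern weights. The two parameterisations describe the same family of policies, which is why the comparison remains fair.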
Re: [computer-go] Monte-Carlo Simulation Balancing
My favorite part: One natural idea is to use the learned simulation policy in Monte-Carlo search, and generate new deep search values, in an iterative cycle. But one thing confuses me: You are using the value from Fuego's 10k simulations as an approximation of the actual value of the position. But isn't the actual value of the position either a win or a loss? On such small boards, can't you assume that Fuego is able to correctly determine who is winning and round its evaluation to the nearest win/loss? i.e. if it evaluates the position to 0.674, that gets rounded to 1. If such an assumption about Fuego's ability to read the position on a small board is valid, then it should improve the results of your balanced simulation strategy, right? Or am I missing something? David Silver wrote: Hi everyone, Please find attached my ICML paper with Gerry Tesauro on automatically learning a simulation policy for Monte-Carlo Go. Our preliminary results show a 200+ Elo improvement over previous approaches, although our experiments were restricted to simple Monte-Carlo search with no tree on small boards. Abstract In this paper we introduce the first algorithms for efficiently learning a simulation policy for Monte-Carlo search. Our main idea is to optimise the balance of a simulation policy, so that an accurate spread of simulation outcomes is maintained, rather than optimising the direct strength of the simulation policy. We develop two algorithms for balancing a simulation policy by gradient descent. The first algorithm optimises the balance of complete simulations, using a policy gradient algorithm; whereas the second algorithm optimises the balance over every two steps of simulation. We compare our algorithms to reinforcement learning and supervised learning algorithms for maximising the strength of the simulation policy. We test each algorithm in the domain of 5x5 and 6x6 Computer Go, using a softmax policy that is parameterised by weights for a hundred simple patterns. When used in a simple Monte-Carlo search, the policies learnt by simulation balancing achieved significantly better performance, with half the mean squared error of a uniform random policy, and equal overall performance to a sophisticated Go engine. -Dave ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
[computer-go] Monte-Carlo Simulation Balancing
Hi Michael, But one thing confuses me: You are using the value from Fuego's 10k simulations as an approximation of the actual value of the position. But isn't the actual value of the position either a win or a loss? On such small boards, can't you assume that Fuego is able to correctly determine who is winning and round its evaluation to the nearest win/loss? i.e. if it evaluates the position to 0.674, that gets rounded to 1. If such an assumption about Fuego's ability to read the position on a small board is valid, then it should improve the results of your balanced simulation strategy, right? Or am I missing something? It's true that 5x5 Go is solved, so in principle we could have used the true minimax values. However we chose to use an approach that can scale to larger boards, which means that we should treat the expert evaluations as approximate. And in fact Fuego was not always accurate on 6x6 boards, as we used only 10k simulations in our training set. Also, I think it really helps to have soft rather than hard expert evaluations. We want a simulation policy that helps differentiate e.g. a 90% winning position from an 85% winning position. Rounding all the expert evaluations to 0 or 1 would lose much of this advantage. -Dave ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Simulation Balancing
David Silver wrote: Please find attached my ICML paper with Gerry Tesauro on automatically learning a simulation policy for Monte-Carlo Go. Our preliminary results show a 200+ Elo improvement over previous approaches, although our experiments were restricted to simple Monte-Carlo search with no tree on small boards. I like your idea, but why do you use only 5x5 and 6x6 Go? I don't think the 200+ Elo improvement is so impressive because the previous approaches were not optimized for such small boards. I'm looking forward to your results on larger boards. -- Yamato ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Tree Search reference bot
Finally got a window of time. It's called the epsilon trick. Here is a cut and paste of Lukasz' explanation from the archives, including how to calculate N. - Dave Hillis -- -Original Message- From: [EMAIL PROTECTED] To: computer-go@computer-go.org Sent: Sun, 14 Jan 2007 4:50 AM Subject: [computer-go] Fast UCT (epsilon trick) Hi, I've looked back into the article and I see that it is not trivial to understand how the epsilon trick (ET) works. Therefore I will describe it here using UCT as the example. The bottleneck of the UCT algorithm is choosing a move: the program has to iterate over all children, evaluating their UCB values. We want to execute this loop rarely. One way to do it is to group playouts together, i.e. play 2 or more MC simulations per tree traversal. But this is not good enough, as it doesn't distribute the playouts over the tree and some parts will not get enough of them. The ET works as follows. With each node of the tree the program keeps additional information: child_to_visit and visits_to_go. When we visit a node:

    if (node->visits_to_go > 0) {
        node->visits_to_go = node->visits_to_go - 1;
        node = node->child_to_visit;
    }

So we have a predetermined strategy to go to a particular child visits_to_go times. But if node->visits_to_go == 0, we have to set the strategy first:

    node->child_to_visit = UCT_find_child(node);
    node->visits_to_go = Epsilon * node->visited_count;  /* round up */

As we can see, when Epsilon is small or when a particular node has not been visited many times, UCT-ET is identical to UCT. But when a node has been visited more often (near the root, far from the leaves), UCT-ET may save some time and go to the precalculated child. That's it. I hope you like it :) Łukasz Lew PS. In Df-PN a good Epsilon was 1/4; I believe that here it will be much smaller, 1/64 maybe. Or one may even change the Epsilon equation to: node->visits_to_go = Epsilon * sqrt(node->visited_count). If someone does some experiments, I'm eager to hear the results. PS2

    float InvSqrt (float x) {
        float xhalf = 0.5f * x;
        int i = *(int*)&x;
        i = 0x5f3759df - (i >> 1);
        x = *(float*)&i;
        x = x * (1.5f - xhalf * x * x);
        return x;
    }

;-) ___
- Dave Hillis -Original Message- From: Mark Boon [EMAIL PROTECTED] To: computer-go computer-go@computer-go.org Sent: Tue, 2 Dec 2008 11:17 pm Subject: Re: [computer-go] Monte-Carlo Tree Search reference bot I have made some minor performance improvements and this is as far as I intend to take this particular project. I might make some small changes if necessary, but most likely I'll leave this largely unchanged from now. I had set myself as an arbitrary goal that it should do at least 20K playouts. But with real liberties, AMAF and a RAVE formula I got stuck in the 16K-17K range. According to my profiler that is mostly due to the expensive formula used to compare nodes, where it says it spends 25% of total time. The formula I'm using is:

beta * (virtual-win-ratio + RAVE) + (1-beta) * (win-ratio + UCT)
beta = 1 - log(parent-visits) / 20
UCT = exploration-factor * sqrt( log(parent-visits) / (nr-visits+1) )
RAVE = exploration-factor * sqrt( log(parent-visits) / (nr-virtual-visits+1) )

There are probably quite a few possibilities still to tune this program with regards to playing strength and performance. But I felt it doesn't help to obscure the implementation by too many details. The implementation of the search algorithm was entirely game-independent, until I introduced AMAF. I didn't see how to fix that, as there's no way getting around
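For concreteness, here is a compilable rendering of the epsilon-trick pseudocode from Łukasz's message above; the Node layout and UCT_find_child are assumptions for illustration, not his actual code.

    #include <math.h>

    typedef struct Node {
        struct Node *child_to_visit;  /* cached result of the expensive argmax */
        int visits_to_go;             /* how many more times to reuse it */
        int visited_count;            /* total visits to this node */
        /* ... move, win statistics, children ... */
    } Node;

    extern Node *UCT_find_child(Node *node);  /* full UCB loop over children */

    static const double Epsilon = 1.0 / 64.0; /* Lukasz's suggested order for UCT */

    /* One descent step: recompute the argmax only when the budget runs out.
       At low visit counts ceil(Epsilon * visited_count) is 0, so this
       degenerates to plain UCT, exactly as described. */
    Node *descend(Node *node)
    {
        if (node->visits_to_go == 0) {
            node->child_to_visit = UCT_find_child(node);
            node->visits_to_go = (int)ceil(Epsilon * node->visited_count);
        }
        if (node->visits_to_go > 0)
            node->visits_to_go--;
        node->visited_count++;
        return node->child_to_visit;
    }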
Re: [computer-go] Monte-Carlo Tree Search reference bot
What I did (was a long time ago, I don't know if it is still used in Mogo), is to compute the m best moves every so often and most of the time just do the max over those m moves. m was on the order of 5, and every so often was an increasing function like sqrt or log (I don't remember). That speeds up quite a lot (when the max becomes the bottleneck), and if you tune the parameters well you almost don't lose any strength at a fixed number of playouts. Sylvain 2008/12/3 [EMAIL PROTECTED]: There is another speedup trick that might interest you. IIRC Lukasz Lew came up with it, but I forget what he called it. After calculating the selected move for an internal node (going through UCT and RAVE and whatnot) store that move. Then, for the next N visits to that node (where N is 5 or 10 or so), just repeat that move without having to calculate what the move would be. - Dave Hillis -Original Message- From: Mark Boon [EMAIL PROTECTED] To: computer-go computer-go@computer-go.org Sent: Tue, 2 Dec 2008 11:17 pm Subject: Re: [computer-go] Monte-Carlo Tree Search reference bot I have made some minor performance improvements and this is as far as I intend to take this particular project. I might make some small changes if necessary, but most likely I'll leave this largely unchanged from now. I had set myself as an arbitrary goal that it should do at least 20K playouts. But with real liberties, AMAF and a RAVE formula I got stuck in the 16K-17K range. According to my profiler that is mostly due to the expensive formula used to compare nodes, where it says it spends 25% of total time. The formula I'm using is:

beta * (virtual-win-ratio + RAVE) + (1-beta) * (win-ratio + UCT)
beta = 1 - log(parent-visits) / 20
UCT = exploration-factor * sqrt( log(parent-visits) / (nr-visits+1) )
RAVE = exploration-factor * sqrt( log(parent-visits) / (nr-virtual-visits+1) )

There are probably quite a few possibilities still to tune this program with regards to playing strength and performance. But I felt it doesn't help to obscure the implementation by too many details. The implementation of the search algorithm was entirely game-independent, until I introduced AMAF. I didn't see how to fix that, as there's no way getting around that it's based on the fact that a move is represented by a single coordinate, which is mapped to an array to store the statistical values. But strip the AMAF part, which is a block of 30 odd lines of code, and I think it can be used for other games basically as-is. I did this not because I ever see myself program another game, but because in my experience in doing so I get a cleaner separation between modules. At 2,000 playouts, it's still quite a bit weaker than the plain MC-AMAF refbot. It wins only about 33%. But that's probably because the 1,000-2,000 playouts range is the sweet-spot for that particular type of playing engine. It doesn't scale from there, whereas the MCTS ref-bot only just gets warmed up with 2,000 playouts. This leads me to a question. I suppose it might be of some interest to put this bot up on CGOS. But what parameters to use? The main one being the number of playouts, naturally. Mark ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ 
___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
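A sketch of Sylvain's m-best trick under assumed data structures (rebuild_m_best and ucb_score are hypothetical helpers): keep the m best children, normally take the max over only those, and refresh the full list at an increasing interval.

    #include <math.h>

    #define M_BEST 5  /* Sylvain reports m on the order of 5 */

    typedef struct Node Node;
    extern double ucb_score(const Node *child);                    /* assumed scoring helper */
    extern void   rebuild_m_best(Node *node, Node *best[], int m); /* full scan over children */

    typedef struct {
        Node *best[M_BEST];       /* cached top-m children, NULL-padded */
        int   visits_at_refresh;  /* parent visit count when the list was built */
    } BestList;

    Node *select_child(Node *node, BestList *bl, int node_visits)
    {
        /* "Every so often" as an increasing function of visits, e.g. sqrt:
           refresh once visits have grown by about sqrt(visits). */
        if (node_visits - bl->visits_at_refresh >= (int)sqrt((double)node_visits) + 1) {
            rebuild_m_best(node, bl->best, M_BEST);
            bl->visits_at_refresh = node_visits;
        }
        /* Cheap max over the cached m candidates only. */
        Node *best = bl->best[0];
        for (int i = 1; i < M_BEST && bl->best[i] != NULL; i++)
            if (ucb_score(bl->best[i]) > ucb_score(best))
                best = bl->best[i];
        return best;
    }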
Re: [computer-go] Monte-Carlo Tree Search reference bot
On Wed, Dec 3, 2008 at 05:17, Mark Boon [EMAIL PROTECTED] wrote: I had set myself as an arbitrary goal that it should do at least 20K playouts. But with real liberties, AMAF and a RAVE formula I got stuck in the 16K-17K range. According to my profiler that is mostly due to the expensive formula used to compare nodes, where it says it spends 25% of total time. The formula I'm using is: beta * (virtual-win-ratio + RAVE) + (1-beta) * (win-ratio + UCT) beta = 1 - log(parent-visits) / 20 UCT = exploration-factor * sqrt( log(parent-visits) / (nr-visits+1) ) RAVE = exploration-factor * sqrt( log(parent-visits) / (nr-virtual-visits+1) ) Hi, This is probably something you already do, but just in case you don't, here comes a simple optimization trick: since parent-visits is an integer value in a somewhat bounded range (0..max_playouts), the logarithm can be precomputed in a table. Even sqrt(log()) can be precomputed. Similarly, a table of square roots of integers can be used for the other values. best regards, Vlad ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
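A sketch of Vlad's table idea, under the assumption that visit counts index the tables directly. Since sqrt(log(p)/(n+1)) = sqrt(log(p))/sqrt(n+1), one table of sqrt(log(n)) and one of sqrt(n) cover the whole UCT term.

    #include <math.h>
    #include <stdlib.h>

    static double *sqrt_log_tab;  /* sqrt_log_tab[n] = sqrt(log(n)), n >= 1 */
    static double *sqrt_tab;      /* sqrt_tab[n] = sqrt(n) */

    void init_tables(int max_playouts)
    {
        sqrt_log_tab = malloc((max_playouts + 1) * sizeof *sqrt_log_tab);
        sqrt_tab     = malloc((max_playouts + 1) * sizeof *sqrt_tab);
        sqrt_log_tab[0] = sqrt_tab[0] = 0.0;
        for (int n = 1; n <= max_playouts; n++) {
            sqrt_log_tab[n] = sqrt(log((double)n));
            sqrt_tab[n]     = sqrt((double)n);
        }
    }

    /* UCT term via two lookups and one division, no log()/sqrt() calls. */
    double uct_term(double exploration, int parent_visits, int nr_visits)
    {
        return exploration * sqrt_log_tab[parent_visits] / sqrt_tab[nr_visits + 1];
    }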
Re: [computer-go] Monte-Carlo Tree Search reference bot
On 3-dec-08, at 06:09, Sylvain Gelly wrote: What I did (was a long time ago, I don't know if it is still used in Mogo), is to compute the m best moves every so often and most of the time just do the max over those m moves. m was on the order of 5, and every so often was an increasing function like sqrt or log (I don't remember). That speeds up quite a lot (when the max becomes the bottleneck), and if you tune the parameters well you almost don't lose any strength at a fixed number of playouts. I thought about that, but I was afraid the code would become too obscure. After all, this is supposed to be a reference implementation. But maybe I should actually give it a try to see what it would look like. I also played with precomputing a table of the log() function but found little speedup. Instead I decided to (p)recompute the log(nr-parent-visits) only once in 8 times. According to my profiler that reduced the log() from using 8% to 1% of computation time. Another thing I did was replace log(nr-virtual-parent-visits) in the RAVE calculation by the same log(nr-parent-visits) used in the UCT calculation, cutting the number of times I need to call log in half and reducing the amount of storage in the node a little. After these modifications I didn't notice any degradation in play. In terms of memory use, I did a few obvious things to keep storage of the nodes to a minimum. I had seen people write about memory usage of the tree, but never understood the concerns. But that was because I always expanded the tree one node at a time. Of course, if you create all the children in one go like this reference implementation does, it's a little bit a different story. But still, I have to push my computer hard to run out of space before I run out of time so I didn't see much need to reduce memory too aggressively. Using just a moderately small number of playouts before expansion is already enough to never have memory problems. But if memory is a concern to anyone I see two obvious ways to make significant reductions. One is to flatten the data-structure, reducing the object overhead Java imposes. That could save up to a third. The other is to use smaller placeholders for nodes that have been created, but not yet visited. Once a node gets visited it can be replaced by a full-blown node, but until then all you need is the move-coordinate and the AMAF statistics. That should save up to 75% or so. Mark ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
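Mark's once-in-8 idea might look like this (field names are illustrative); the same cached value then feeds both the UCT and RAVE terms, as he describes.

    #include <math.h>

    typedef struct {
        int    visits;
        int    visits_at_log;  /* visit count when cached_log was computed */
        double cached_log;     /* log(visits) as of visits_at_log */
    } NodeStats;               /* hypothetical layout */

    double parent_log(NodeStats *n)
    {
        if (n->visits_at_log == 0 || n->visits - n->visits_at_log >= 8) {
            n->cached_log    = log((double)(n->visits > 0 ? n->visits : 1));
            n->visits_at_log = n->visits;
        }
        return n->cached_log;  /* reused by both the UCT and RAVE terms */
    }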
Re: [computer-go] Monte-Carlo Tree Search reference bot
On Wed, Dec 03, 2008 at 09:55:07AM -0200, Mark Boon wrote: I thought about that, but I was afraid the code would become too obscure. After all, this is supposed to be a reference implementation. But maybe I should actually give it a try to see what it would look like. I agree that the reference implementation should be maximally clear code. Performance is not all that important here, correctness is. Having said that, I can't help responding to one detail: I had seen people write about memory usage of the tree, but never understood the concerns. One thing to remember is that more memory use means more cache misses, and more access to the main memory. On modern computers, those can cost as much as executing a thousand instructions! So memory optimizing can often pay off, also with respect to time! Of course, Java does a lot of memory management behind the scenes, so optimizing such can be harder... But when writing in C or C++, it does make sense. -H -- Heikki Levanto In Murphy We Turst heikki (at) lsd (dot) dk ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Tree Search reference bot
On 3-dec-08, at 10:31, Heikki Levanto wrote: Having said that, I can't help responding to one detail: I had seen people write about memory usage of the tree, but never understood the concerns. One thing to remember is that more memory use means more cache misses, and more access to the main memory. On modern computers, those can cost as much as executing a thousand instructions! So memory optimizing can often pay off, also with respect to time! Of course, Java does a lot of memory management behind the scenes, so optimizing such can be harder... But when writing in C or C++, it does make sense. I've seen this stated often. But color me skeptical. If what you say is true, then the MC-AMAF bot, which hardly uses any memory at all and which fits in the secondary cache entirely, code and data combined, should show an enormous speed-up in terms of number of playouts per second. Instead it does only about twice the number of playouts per second as the tree-search version, which can be almost if not entirely explained by the extra work that needs to be done to traverse and maintain the tree. If this is because I use Java, then Don's concise C implementation of the MC-AMAF bot should be a lot faster than my bloated Java version. Which, ahem, is not the case at all. I think a well-crafted C program can be up to twice as fast as a similar Java program. But I doubt that has much to do with cache misses. I think in computer-Go, trying to control cache misses is futile no matter the language. Maybe you can do it for a relatively trivial, well-understood and stable sub-part. But then you risk losing the gain again as soon as it gets incorporated into a larger whole. So your efforts go to waste. So this is in the show me category for me. You can use C or C++ if you like, but show me a tree-search bot that speeds up the number of playouts significantly by reducing the node-size. Maybe it's possible, but I don't believe it until I see it. This could be different for Chess programs, at least I've seen it claimed often. But I don't envy them, it must be excruciatingly frustrating to deal with. Mark ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Tree Search reference bot
On Wed, Dec 03, 2008 at 11:24:22AM -0200, Mark Boon wrote: Heikki wrote: One thing to remember is that more memory use means more cache misses, and more access to the main memory. On modern computers, those can cost as much as executing a thousand instructions! So memory optimizing can often pay off, also with respect to time! I've seen this stated often. But color me skeptical. Maybe I was quoting common wisdom without having checked it personally. I only remember one case - not related to go at all - where optimizing the memory stuff made a noticeable difference. I would also like to see hard evidence. But not badly enough to put in the necessary work to get some ;-) - H -- Heikki Levanto In Murphy We Turst heikki (at) lsd (dot) dk ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Tree Search reference bot
I had a chess program years ago that was blindingly fast on some computers, very slow on others. It was all about the cache. The move generator was hard coded for each piece on each square. For instance a white pawn on d7 had its very own specialized move generator. There was a function for each piece of each color on each square. By doing this I avoided the majority of conditional tests. I did not have to test for board edge, pawn on 7th, pawn on the second rank, etc. All the logic was unrolled and I basically had to write a program to write the move generator. The machine I developed this program on had a lot of cache and it was not a problem. But when running on, for instance a laptop, it would slow down enormously - something like 4 to 1 where other cache friendly programs would slow down perhaps 25%. I don't keep up much with processor technologies now and I don't know how much data vs instruction cache modern setups have, but I know it is far better than it used to be. I bet that old program would run well on modern computers. One trick, which you are probably doing, is to keep data structures together in memory. For instance it is better to have children next to each other so that you are not accessing memory all over the place, perhaps incurring several cache misses instead of one. On Wed, 2008-12-03 at 11:24 -0200, Mark Boon wrote: On 3-dec-08, at 10:31, Heikki Levanto wrote: Having said that, I can't help responding to one detail: I had seen people write about memory usage of the tree, but never understood the concerns. One thing to remember is that more memory use means more cache misses, and more access to the main memory. On modern computers, those can cost as much as executing a thousand instructions! So memory optimizing can often pay off, also with respect to time! Of course, Java does a lot of memory management behind the scenes, so optimizing such can be harder... But when writing in C or C++, it does make sense. I've seen this stated often. But color me skeptical. If what you say is true, then the MC-AMAF bot, which hardly uses any memory at all and which fits in the secondary cache entirely, code and data combined, should show an enormous speed-up in terms of number of playouts per second. Instead it does only about twice the number of playouts per second as the tree-search version, which can be almost if not entirely explained by the extra work that needs to be done to traverse and maintain the tree. If this is because I use Java, then Don's concise C implementation of the MC-AMAF bot should be a lot faster than my bloated Java version. Which, ahem, is not the case at all. I think a well-crafted C program can be up to twice as fast as a similar Java program. But I doubt that has much to do with cache misses. I think in computer-Go, trying to control cache misses is futile no matter the language. Maybe you can do it for a relatively trivial, well-understood and stable sub-part. But then you risk losing the gain again as soon as it gets incorporated into a larger whole. So your efforts go to waste. So this is in the show me category for me. You can use C or C++ if you like, but show me a tree-search bot that speeds up the number of playouts significantly by reducing the node-size. Maybe it's possible, but I don't believe it until I see it. This could be different for Chess programs, at least I've seen it claimed often. But I don't envy them, it must be excruciatingly frustrating to deal with. 
Mark ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
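Don's scheme amounts to a dispatch table of specialised generators, roughly like the following (all types and names invented for illustration, not his actual program):

    typedef struct Board Board;
    typedef struct MoveList MoveList;
    typedef void (*MoveGen)(const Board *b, MoveList *out);

    /* One hand-unrolled generator per (piece, square). A white pawn on d7
       only ever generates promotion pushes and captures, so it needs no
       board-edge, rank, or promotion tests at run time. */
    static void white_pawn_d7(const Board *b, MoveList *out)
    {
        (void)b; (void)out;  /* ... unrolled moves would go here ... */
    }

    /* 12 piece types x 64 squares; the table itself is emitted by a
       generator-writing program, as Don describes. */
    extern MoveGen gen_table[12][64];

    void generate_for(const Board *b, int piece, int square, MoveList *out)
    {
        gen_table[piece][square](b, out);  /* one indirect call, no conditionals */
    }

The trade-off Don describes follows directly: the table of hundreds of unrolled functions bloats the instruction footprint, so the scheme flies on machines with a large instruction cache and crawls on machines without one.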
Re: [computer-go] Monte-Carlo Tree Search reference bot
On Wed, 2008-12-03 at 11:24 -0200, Mark Boon wrote: If this is because I use Java, then Don's concise C implementation of the MC-AMAF bot should be a lot faster than my bloated Java version. I don't remember the numbers, but my own Java and C implementations were written in the same style - same basic data structure and just about the only difference was language. I did do some kind of list trick using pointers that you cannot do in Java, at least not easily and efficiently, but I don't think it made that much of a difference. The Java version was not that bad for speed compared to C; I think Java ran at something like 2/3 of the speed of C, or close to that. But if you get into heavy duty data structures Java hurts - you won't get as big a tree in Java for instance and I'm sure caching will be a bigger issue. - Don ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Tree Search reference bot
I have made some minor performance improvements and this is as far as I intend to take this particular project. I might make some small changes if necessary, but most likely I'll leave this largely unchanged from now. I had set myself as an arbitrary goal that it should do at least 20K playouts. But with real liberties, AMAF and a RAVE formula I got stuck in the 16K-17K range. According to my profiler that is mostly due to the expensive formula used to compare nodes, where it says it spends 25% of total time. The formula I'm using is:

beta * (virtual-win-ratio + RAVE) + (1-beta) * (win-ratio + UCT)
beta = 1 - log(parent-visits) / 20
UCT = exploration-factor * sqrt( log(parent-visits) / (nr-visits+1) )
RAVE = exploration-factor * sqrt( log(parent-visits) / (nr-virtual-visits+1) )

There are probably quite a few possibilities still to tune this program with regards to playing strength and performance. But I felt it doesn't help to obscure the implementation by too many details. The implementation of the search algorithm was entirely game-independent, until I introduced AMAF. I didn't see how to fix that, as there's no way getting around that it's based on the fact that a move is represented by a single coordinate, which is mapped to an array to store the statistical values. But strip the AMAF part, which is a block of 30 odd lines of code, and I think it can be used for other games basically as-is. I did this not because I ever see myself program another game, but because in my experience in doing so I get a cleaner separation between modules. At 2,000 playouts, it's still quite a bit weaker than the plain MC-AMAF refbot. It wins only about 33%. But that's probably because the 1,000-2,000 playouts range is the sweet-spot for that particular type of playing engine. It doesn't scale from there, whereas the MCTS ref-bot only just gets warmed up with 2,000 playouts. This leads me to a question. I suppose it might be of some interest to put this bot up on CGOS. But what parameters to use? The main one being the number of playouts, naturally. Mark ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
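Mark's comparison formula transcribed directly into C (field names assumed). Note that beta = 1 - log(parent-visits)/20 only goes negative once parent-visits exceeds e^20, about 5*10^8, so in practice it stays in [0, 1].

    #include <math.h>

    double node_score(double win_ratio, double virtual_win_ratio,
                      int nr_visits, int nr_virtual_visits,
                      int parent_visits, double exploration_factor)
    {
        double log_pv = log((double)parent_visits);
        double beta   = 1.0 - log_pv / 20.0;                              /* RAVE weight */
        double uct    = exploration_factor * sqrt(log_pv / (nr_visits + 1));
        double rave   = exploration_factor * sqrt(log_pv / (nr_virtual_visits + 1));
        return beta * (virtual_win_ratio + rave) + (1.0 - beta) * (win_ratio + uct);
    }

The profiler result in the message (25% of total time in this comparison) is what motivates the log/sqrt table and caching tricks discussed in the replies above.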
Re: [computer-go] Monte-Carlo Tree Search reference bot
There is another speedup trick that might interest you. IIRC Lukasz Lew came up with it, but I forget what he called it. After calculating the selected move for an internal node (going through UCT and RAVE and whatnot) store that move. Then, for the next N visits to that node (where N is 5 or 10 or so), just repeat that move without having to calculate what the move would be. - Dave Hillis -Original Message- From: Mark Boon [EMAIL PROTECTED] To: computer-go computer-go@computer-go.org Sent: Tue, 2 Dec 2008 11:17 pm Subject: Re: [computer-go] Monte-Carlo Tree Search reference bot I have made some minor performance improvements and this is as far as I intend to take this particular project. I might make some small changes if necessary, but most likely I'll leave this largely unchanged from now. I had set myself as an arbitrary goal that it should do at least 20K playouts. But with real liberties, AMAF and a RAVE formula I got stuck in the 16K-17K range. According to my profiler that is mostly due to the expensive formula used to compare nodes, where it says it spends 25% of total time. The formula I'm using is:

beta * (virtual-win-ratio + RAVE) + (1-beta) * (win-ratio + UCT)
beta = 1 - log(parent-visits) / 20
UCT = exploration-factor * sqrt( log(parent-visits) / (nr-visits+1) )
RAVE = exploration-factor * sqrt( log(parent-visits) / (nr-virtual-visits+1) )

There are probably quite a few possibilities still to tune this program with regards to playing strength and performance. But I felt it doesn't help to obscure the implementation by too many details. The implementation of the search algorithm was entirely game-independent, until I introduced AMAF. I didn't see how to fix that, as there's no way getting around that it's based on the fact that a move is represented by a single coordinate, which is mapped to an array to store the statistical values. But strip the AMAF part, which is a block of 30 odd lines of code, and I think it can be used for other games basically as-is. I did this not because I ever see myself program another game, but because in my experience in doing so I get a cleaner separation between modules. At 2,000 playouts, it's still quite a bit weaker than the plain MC-AMAF refbot. It wins only about 33%. But that's probably because the 1,000-2,000 playouts range is the sweet-spot for that particular type of playing engine. It doesn't scale from there, whereas the MCTS ref-bot only just gets warmed up with 2,000 playouts. This leads me to a question. I suppose it might be of some interest to put this bot up on CGOS. But what parameters to use? The main one being the number of playouts, naturally. Mark ___ computer-go mailing list [EMAIL PROTECTED] http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Tree Search reference bot
That's going to repeat the same exact path through the tree three times, isn't it? If so, it seems like it would be more efficient to do N playouts from the leaf after each traversal of the tree.

[EMAIL PROTECTED] wrote:
There is another speedup trick that might interest you. IIRC Lukasz Lew came up with it, but I forget what he called it. After calculating the selected move for an internal node (going through UCT and RAVE and whatnot) store that move. Then, for the next N visits to that node (where N is 5 or 10 or so), just repeat that move without having to calculate what the move would be.
- Dave Hillis
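Michael's alternative would look roughly like this; Node and the helper calls are hypothetical names, not any particular bot's API:

    // One descent through the tree, then n playouts from the reached leaf,
    // all n results credited to the same path through the tree.
    void simulateBatch(Node root, int n, java.util.Random rng) {
        Node leaf = descendByUct(root);  // the expensive traversal, done once
        for (int i = 0; i < n; i++) {
            boolean win = randomPlayout(leaf, rng);
            backpropagate(leaf, win);    // update statistics along the path
        }
    }

The trade-off is the same in both variants: statistics arrive in coarser batches, so the selection formula reacts more slowly to fresh results.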
Re: [computer-go] Monte-Carlo Tree Search reference bot
Hmm... it could be that N is picked randomly each time... now I can't seem to find the description and my memory is not serving me well here. Perhaps someone else remembers? I'll try to track it down.

- Dave Hillis

-Original Message-
From: Michael Williams [EMAIL PROTECTED]
To: computer-go computer-go@computer-go.org
Sent: Wed, 3 Dec 2008 12:14 am
Subject: Re: [computer-go] Monte-Carlo Tree Search reference bot

That's going to repeat the same exact path through the tree three times, isn't it? If so, it seems like it would be more efficient to do N playouts from the leaf after each traversal of the tree.
RE: [computer-go] Monte-Carlo Tree Search reference bot
My own experiments showed a very big increase in strength, especially at about 1000 light playouts. I had many features and parameters which may make a difference, but my main hypothesis was that it shouldn't... I think that I weighted real simulations (aka the first move) 10 times more than virtual ones (aka AMAF moves).

I assume that by expansion you mean allocating data for every child of a node. So let's assume that your root node is called A; you would then allocate for every one of its children. Then when running a simulation from A, you would first apply the UCT formula to determine which first move to make... then you would both feed back the AMAF score for every move played during the simulation (virtual scoring) AND use the true first move for real scoring (the normal UCT part).

Of course, this shouldn't behave exactly like AMAF, because for one thing you can get a better judgment of the first move because of UCT, and you also make the first move weigh more. (Although you probably technically should also do so with plain AMAF, despite the first move being 100% random, which is part of what weighted AMAF does. But the first-move weight would be arbitrary, thus defeating the purpose of a standard.)

Last, you could try to decrease the weight of the virtual score as the number of simulations grows, with something like (if nbsim < 1000):

    winRatio = [realRatio * (1000 + nbsim)] / 3000 + [virtualRatio * (2000 - (1000 + nbsim))] / 3000

I'm not sure about the formula :) but you get the idea of taking the realRatio more and more into account. I think what I did was not take the realRatio into account at all at first when estimating the winRatio, and then increase its weight linearly up to the point where I wouldn't take the virtualRatio into account anymore.

The drawback with all that is that you will certainly have to make several versions, although they would share a lot of code. I don't think that trying to get one version to behave as several is a good idea. You should probably keep the code safe once you have done a lot of testing on it, for reference purposes. Then you'll be able to know exactly the strength of a particular piece of software, even two years from now. I didn't do that. I tried to make it evolve, and with poor svn versioning and poor management of the link between measurements and software, I'm now at a point where all I can do is make random guesses, although I made millions of experiments in the past years :) I just can't link those experiments back to a fixed piece of software.

Then, if by any means you think a piece of code is bug-free (which is hard to bet on), you really should put it somewhere, with all the proof (and experiments) you can associate with it. (Even on a web page if you ask me, or a thesis.) To me that is the most useful part of the standard ref bot: because a lot of people have tried it, you can be sure of the exact link between the software and the experimental results. It makes assertions reproducible, and that's really great.

From: [EMAIL PROTECTED]
Subject: Re: [computer-go] Monte-Carlo Tree Search reference bot
Date: Fri, 28 Nov 2008 01:48:09 -0200
To: computer-go@computer-go.org

On 27-nov-08, at 19:50, Denis fidaali wrote:

So, you use AMAF for simulating the first UCT evaluations? I thought the classical way to use AMAF was to affect only the win/lose ratio portion of the UCT equation. Obviously it should be allowed to use an arbitrarily large number of AMAF simulations, accumulating them for longer than it takes to expand a node.
I think that a classical way to affect the win ratio is to decrease the effect of the AMAF correction as the number of simulations grows. If you test with a very low number of simulations (in the 1000-3000 range), I think you should be able to get a very nice improvement out of the AMAF version. If you don't, I would think that something is wrong somewhere. What test process do you use for this version?

I tested it mostly doing 2,000 playouts. When AMAF is true I create a map of virtual-win values of all the moves played during a playout. These values get accumulated over all the playouts (not just the first ones). The 'virtual-value' of a move is calculated as follows:

    exploration-factor * UCT + ( (nr-wins*2 + nr-virtual-wins) / (nr-playouts*2 + nr-virtual-playouts) )

where the exploration-factor is currently sqrt(0.2) and UCT is sqrt( log(nr-parent-playouts) / (nr-playouts+1) ).

Like I said, I haven't had time to experiment much, so this formula may not be any good. I had also expected to see some positive effect of the virtual-win / virtual-playout ratio from AMAF, but I see none. Of course it's also possible I still have a different kind of bug. What happens in my 'formula' is that when it never expands beyond the first level, which is what happens if the number of simulations is equal to the number of simulations before expansion, the virtual-value becomes completely determined by nr-virtual-wins / nr-virtual-playouts, making it equivalent to the original ref-bot. In case it does expand further and creates a tree, the actual win-loss ratio is weighed twice as heavily as the virtual win-loss ratio. That seemed like a reasonable first try. I have tried a few others, but usually got not much different or much worse results.

Mark
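Mark's 'virtual-value' can be written out directly. A sketch, assuming the caller never evaluates a move with zero real and zero virtual playouts (otherwise the denominator would be zero):

    // The 'virtual-value' formula quoted above; parameter names mirror it.
    static double virtualValue(int wins, int playouts,
                               int virtualWins, int virtualPlayouts,
                               int parentPlayouts) {
        double explorationFactor = Math.sqrt(0.2);
        double uct = Math.sqrt(Math.log(parentPlayouts) / (playouts + 1));
        // Real results count twice as much as AMAF (virtual) results.
        double blendedRatio = (wins * 2.0 + virtualWins)
                            / (playouts * 2.0 + virtualPlayouts);
        return explorationFactor * uct + blendedRatio;
    }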
[computer-go] Monte-Carlo Tree Search reference bot
I have added an MCTS implementation to my reference-bot project. It allows you to specify how many nodes to search, after how many nodes to expand, and whether to use AMAF or not. If you specify the number of nodes before expansion to be the same as the total number of nodes to search and set AMAF to true, you get the equivalent of Don's original ref-bot. When you set AMAF to false and set the number of nodes before expansion to 1, you get my original UCT search algorithm. Between those extremes there might be a good formula to combine AMAF with tree search, but unfortunately I haven't had time lately to look for one. The few small attempts I made show no benefit from using AMAF in tree search, only when used on a single level. The contrast between the two extremes is very stark, so I'm actually convinced there should be a way to use both.

This implementation is also quite a bit slower than my original search algorithm, but I also haven't had time yet to trace why. It might simply be due to the different expansion method, which is much more expensive with a value of 1. Also, searching for the best UCT node gets more expensive with more (unused) nodes on each level. Using a higher expansion value may alleviate the performance hit. Anyway, I think this is a reasonable starting point.

At first I intended to create a different project for the search reference bot. But half of the code (the MC part) is the same, and I don't want to end up having to maintain the same code in two places. I also didn't want to separate out some code into a library, making the structure of the simple ref-bot more complicated. This organization may need some more thought though. I'll update the project pages tomorrow.

Mark
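The two extremes Mark describes can be pinned down as parameter settings. The SearchConfig class below is illustrative, not the project's actual API:

    // The two extreme configurations of the MCTS ref-bot.
    class SearchConfig {
        int totalPlayouts;         // simulations per move
        int playoutsBeforeExpand;  // visits a node needs before its children are created
        boolean useAmaf;

        // Equivalent of Don's original ref-bot: the tree never grows past
        // the root's children, and AMAF provides all move statistics.
        static SearchConfig refBotEquivalent(int n) {
            SearchConfig c = new SearchConfig();
            c.totalPlayouts = n;
            c.playoutsBeforeExpand = n;
            c.useAmaf = true;
            return c;
        }

        // Mark's original UCT search: expand on the first visit, no AMAF.
        static SearchConfig plainUct(int n) {
            SearchConfig c = new SearchConfig();
            c.totalPlayouts = n;
            c.playoutsBeforeExpand = 1;
            c.useAmaf = false;
            return c;
        }
    }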
RE: [computer-go] Monte-Carlo Tree Search reference bot
So, you use AMAF for simulating the first UCT evaluations? I thought the classical way to use AMAF was to affect only the win/lose ratio portion of the UCT equation. Obviously it should be allowed to use an arbitrarily large number of AMAF simulations, accumulating them for longer than it takes to expand a node.

I think that a classical way to affect the win ratio is to decrease the effect of the AMAF correction as the number of simulations grows. If you test with a very low number of simulations (in the 1000-3000 range), I think you should be able to get a very nice improvement out of the AMAF version. If you don't, I would think that something is wrong somewhere. What test process do you use for this version?
Re: [computer-go] FW: computer-go] Monte carlo play?
I use a method initially from the Mogo team that sort of randomizes the position before running the heavy playout: one simply plays uniformly random *non-contact* moves. The effect of this is that it preserves the shapes of stones on the board, but it prevents the heavy playouts from playing long sequences deterministically. I have not tested this, but I felt that the evaluation on large boards got less noisy and/or biased doing this.

Magnus

Quoting Michael Williams [EMAIL PROTECTED]:
It seems move selection in the playouts should be very random at first and more deterministic toward the end of the playout. Has anyone tried that?
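What Magnus describes amounts to a small pre-playout loop. A sketch, where the Board interface is an assumption standing in for whatever the real program uses:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Play a few uniformly random non-contact moves before the heavy playout.
    final class OpeningRandomizer {
        interface Board {
            Iterable<Integer> emptyPoints();
            boolean hasAdjacentStone(int point);
            void play(int point);
        }

        static void randomize(Board board, Random rng, int nMoves) {
            for (int i = 0; i < nMoves; i++) {
                List<Integer> nonContact = new ArrayList<>();
                for (int p : board.emptyPoints()) {
                    if (!board.hasAdjacentStone(p)) { // non-contact: existing shapes stay intact
                        nonContact.add(p);
                    }
                }
                if (nonContact.isEmpty()) return;     // board too crowded to randomize further
                board.play(nonContact.get(rng.nextInt(nonContact.size())));
            }
        }
    }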
Re: [computer-go] FW: computer-go] Monte carlo play?
I have to add that it is possible that a large part of the advantage of using heavy playouts in Valkyria comes from using the same code to bias the exploration part of MCTS. I could probably test it by relying completely on AMAF, with the proximity heuristic as the only bias. A second experiment would be to rewrite the playouts and make them superfast.

-Magnus

Quoting Mark Boon [EMAIL PROTECTED]:
On 17-nov-08, at 02:42, George Dahl wrote:
That almost suggests it might be fruitful to use the patterns in the tree, but keep lighter playouts.

That's exactly what it's suggesting. But as I said, I need to do some more testing to make a hard case for that.

Mark

--
Magnus Persson
Berlin, Germany
Re: [computer-go] FW: computer-go] Monte carlo play?
So you say that: '...I'm observing that most of the increase in level comes from the selection during exploration and only in small part from the selection during simulation.' ... That almost suggests it might be fruitful to use the patterns in the tree, but keep lighter playouts.

If I understand David Fotland [1] (and you) correctly, this is the same approach Many Faces uses: keep the playouts light, but it is worth spending effort to guide the MCTS tree. I also think Remi was saying the same in his paper [2] on using move patterns. E.g. in section 4.1 he says he just used a few features in the simulations, but used all the features to control the progressive widening of the Monte-Carlo search tree (section 4.2).

Darren

[1]: http://www.mail-archive.com/computer-go@computer-go.org/msg09759.html
[2]: http://remi.coulom.free.fr/Amsterdam2007/

--
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)
Re: [computer-go] FW: computer-go] Monte carlo play?
On Sat, Nov 15, 2008 at 11:38:34PM +0100, [EMAIL PROTECTED] wrote:
Being a computer scientist but new to go, I can grasp some of the theory. The question I was trying to get across was: in a game of self play, if both parties are employing only Monte Carlo, surely it's not a good conceptual representation of a human, and if the reinforcement learning is based on random simulations wouldn't it be very weak when playing a real human?

Here is another amateur answering. The way I understand it, modern Monte Carlo programs do not even try to emulate a human player with a random player - obviously that would not work. What they do is build a quite traditional search tree starting from the current position. They use a random playout as a crude way to evaluate a position. Based on this evaluation, they decide which branch of the tree to expand.

This is the way I understand the random playouts: if, in a given position, white is clearly ahead, he will win the game if both parties play perfect moves. He is also likely to win if both parties play reasonably good moves (say, like human amateurs), but there is a bit more of a chance that one player hits upon a good combination which the other misses, so the result is not quite as reliable. If the playouts are totally random, there is still a better chance for white to win, because both parties make equally bad moves. The results have much more variation, of course.

So far it does not sound like a very good proposal, but things change if you consider the facts that we don't have perfect oracles, and that good humans are slow to play out a position and cannot be integrated into a computer program. Whereas random playouts can be done awfully fast, tens of thousands of playouts in a second. Averaging the results gives a fair indication of who is more likely to win from that position, just what is needed to decide which part of the search tree to expand.

The 'random' playouts are not totally random: they include a minimum of tactical rules (do not fill own eyes, do not pass as long as there are valid moves). Even this little will produce a few blind spots, moves that the playouts cannot see, and systematically wrong results. Adding more go-specific knowledge can make the results much better (more likely to be right), but can also add some more blind spots. And it costs time, reducing the number of playouts the program can make.

Hope that explains something of the mystery.

Regards,
Heikki

--
Heikki Levanto   In Murphy We Turst   heikki (at) lsd (dot) dk
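Heikki's description boils down to a very small loop. A minimal sketch, where the Board methods are assumed names, not any particular program's API:

    // Estimate a position by averaging many fast, almost-random games.
    static double estimateWinRate(Board position, int playouts, java.util.Random rng) {
        int wins = 0;
        for (int i = 0; i < playouts; i++) {
            Board copy = position.copy();
            while (!copy.bothPlayersPassed()) {
                copy.playRandomMoveNotFillingOwnEye(rng); // the minimal tactical rules
            }
            if (copy.scoreForRootSideToMove() > 0) wins++;
        }
        return (double) wins / playouts; // crude probability that the side to move wins
    }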
Re: [computer-go] FW: computer-go] Monte carlo play?
On Sunday 16 November 2008, Heikki Levanto wrote:
Whereas random playouts can be done awfully fast, tens of thousands of playouts in a second. Averaging the results gives a fair indication of who is more likely to win from that position, just what is needed to decide which part of the search tree to expand.

Do you know what use (if any) is made of the standard deviation of the results?
[computer-go] Re: FW: computer-go] Monte carlo play?
In a game of self play, if both parties are employing only monte carlo, ... random simulations... wouldn't it be very weak...

... and some playing around I am clearly mistaken because it works quite well. Yes, it doesn't make sense, but it does indeed seem to work :-).

Plain Monte-Carlo bots *are* rather weak. But they work better than might be expected, and they are fairly simple, which is extremely motivating for potential Go programmers!-) Instead of having to learn all of Go and a lot of AI before being able to start building anything, one can start quickly, and then improve together with the bot. And the current top programs claim plain Monte-Carlo bots as their remote ancestors, so it isn't a dead end, either (though complete rewrites may be needed somewhere along the path;-).

Another way to think about playouts is by comparing them with a human player progressing through the ranks:

1 only knowing the rules makes for very inefficient/slow/buggy play

2 playing lots of games quickly has often been recommended to get a better feeling for the game; personally, I don't like fast games(*), but at least the worst inadequacies tend to show up even if we only play equally clueless opponents (probably because weaknesses aren't evenly distributed, so that peaks in understanding in one player happen to match weaknesses in understanding in the other)

3 human players learn over time, until all players in a group have reached an equal level (or have gone as far as they can, with the effort they can afford to put into the game)

4 now further progress depends on innovation in the existing pool of players, on joining a different player pool, or on bringing in other players, preferably players whose strengths expose weaknesses in the current pool (similar effects can be achieved by reading books or observing games)

If we allow the learning outcome in 2/3 to be represented as patterns (wait a minute, I've fallen into that trap before, so I shouldn't do this, I usually do this, but that would be faster - I wonder whether it'll work), then random playouts can emulate a weak form of this process. Instead of learning from lots of games over time, plain Monte-Carlo bots extrapolate from lots of dumb games, every time they consider their next move. Random games are really too dumb to be considered games, but bots can play lots of them in a short time, so they can at least see trends (a mixture of on-the-spot learning and very bad reading ahead).

If we assume that the quantity compensates somewhat for the quality, a Monte-Carlo bot is a bit like a beginner after lots of games who always forgets the lessons learned when the session ends. It may have an idea of who is ahead and what is alive, and be able to organise its games based on those ideas, but a stronger player (who remembers his lessons) may be able to rip those ideas apart (you might both think that group is alive, but what do you do if I play here? and who is ahead if you lose that group?). And then there is the scenario of the strong player who visits a club and humiliates the local champion (you might think that playing there kills that group, but what if they reply here?).

Adding patterns or other biases to make playouts heavier (both slower and more relevant for evaluation) is similar to improving reading skills, but with differences: bots already read to the end, so eliminating useless moves doesn't improve the playout depth; it improves at best the density of information to be gained from playouts (so you don't have to play as many of them, or can get more out of those you do play). But playouts are also still used for on-the-spot learning, and biases have limited scope for success, so one has to be careful not to ruin this aspect. Using a tree on top of the playouts is not just a way to represent the knowledge extracted from the playouts; it also helps to direct playout effort towards relevant situations, and gives a place where traditional Go knowledge can be built in without affecting the playouts directly (so they become more of an evaluation function than a candidate move generator).

One problem is that the top programs tend to be fairly close in strength. There are variations in method, so there is some further progress, but how much of the human going-through-the-ranks evolution has been captured in current top programs, I don't know. I suspect that it is less than their ranks would indicate. There is certainly reading going on, discussion, and inviting stronger players or different bots.

Oh, and if any of this seems to make sense, remember that I just made it all up!-) You might want to consult these slides for overviews from folks who have been in the business:

Old-fashioned Computer Go vs Monte-Carlo Go; by Bruno Bouzy, invited tutorial, IEEE 2007 Symposium on Computational Intelligence in Games, CIG 07, Hawaï, USA, 1st April 2007. Abstract:
Re: [computer-go] FW: computer-go] Monte carlo play?
Hello Heikki,

Heikki Levanto wrote:
The way I understand it, modern Monte Carlo programs do not even try to emulate a human player with a random player - obviously that would not work.

I believe CrazyStone's use of patterns does so, and it seems successful.

Hideki
--
[EMAIL PROTECTED] (Kato)
Re: [computer-go] FW: computer-go] Monte carlo play?
On Sun, Nov 16, 2008, D Gilder wrote:
Do you know what use (if any) is made of the standard deviation of the results?

Now statistics isn't my strong point, but one of the first and most successful systems was called UCT (Upper Confidence bounds applied to Trees). It calculated some sort of expected error and added that to the winning ratio. Then it expanded the branch that had the highest value. If that expansion was a win, the error margin would get smaller, but the average result would get higher. Perhaps some other branch would be tried next, but this one would still be a good candidate. If the result was a loss, the average would drop, and so would the error, so this move would become much less likely to be expanded.

I am sure someone who understands statistics will soon jump in and correct my explanation :-)

- Heikki

--
Heikki Levanto   In Murphy We Turst   heikki (at) lsd (dot) dk
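The rule Heikki is paraphrasing is usually written as the mean result plus an uncertainty bonus that shrinks as a child is visited. This is the generic textbook formula, not any particular program's code:

    // UCB1-style child selection: exploit good means, explore uncertain ones.
    static int selectChild(double[] wins, int[] visits, int parentVisits, double c) {
        int best = 0;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < visits.length; i++) {
            if (visits[i] == 0) return i;  // visit untried children first
            double value = wins[i] / visits[i]
                         + c * Math.sqrt(Math.log(parentVisits) / visits[i]);
            if (value > bestValue) { bestValue = value; best = i; }
        }
        return best;
    }

A win raises the mean, a loss lowers it, and either way the bonus term for that child shrinks, which is the behaviour Heikki describes.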
Re: [computer-go] FW: computer-go] Monte carlo play?
Quoting Hideki Kato [EMAIL PROTECTED]:
Heikki Levanto:
The way I understand it, modern Monte Carlo programs do not even try to emulate a human player with a random player - obviously that would not work.
I believe CrazyStone's use of patterns does so and it seems successful.

With Valkyria I try to follow a few principles in the heavy playouts.

1) In contact fights there are a lot of shapes that are played most of the time. Thus Valkyria checks each move played for an obvious local response to it. If there is one, it plays it deterministically. In many situations there are two or more such candidates, and then it plays one of those moves.

2) In many positions the last move played does not trigger any obvious response, and then a random move is chosen uniformly.

3) There are moves that are inferior 100% of the time, both locally and globally. These moves are pruned if they are selected, and a new random move is chosen as long as there are moves left to try.

I have hundreds of hand-coded patterns for both 1 and 3. It would be too time-consuming to test these patterns, so I use my knowledge and intuition (European 2 dan) to simply decide which patterns to include. So Valkyria has a lot of go knowledge, but mostly the kind of knowledge that all go players have up to some strength, perhaps 8-10 kyu. It has no knowledge about global matters. The beauty of MC evaluation is that globally strong moves are most of the time evaluated better than globally weak moves. Heavy playouts remove noise from MC evaluation and make it more sensitive to the true value of moves. Still, there are biases with all heavy playouts, but they are overcome with MC Tree Search (MCTS), which corrects mistakes in the evaluation recursively.

Here is my latest scaling experiment on 9x9 for Valkyria. Valkyria plays 1150 random games per second on my 4-year-old laptop. This test is against gnugo 3.7.10, assumed to be Elo 1800. Most data points are based on 500 games. "N sims" means Valkyria plays N heavy playouts per move played. Win rates are in %.

    N sims   WinRate   Elo (rel gnugo)
        47      7.4    1361
        94     22      1580
       188     37      1708
       375     53      1821
       750     69.9    1946
      1500     81.2    2054
      3000     88      2146
      6000     92.6    2239
     12000     94      2278
     24000     97.2    2416
     48000     97.4    2429

The heavy playouts of Valkyria need just 375 random games per move to match gnugo, using only 0.3 seconds per move. And even using only 47 simulations per move it can still win. So obviously the heavy playout code of Valkyria is much weaker (Elo 1361) than gnugo and most human opponents, but compared to CGOS a lot of programs with no knowledge are at about the same level, although they use 2000 simulations or more. Searched efficiently using MCTS with AMAF, it apparently can be made arbitrarily strong.

Hope this explains how both the nature of the playouts and the MCTS contribute to the playing strength of a program. Should one go heavy or light? I do not know. I feel that Valkyria is a little bit too slow on equivalent hardware against most top programs. On the other hand, I think it could be tweaked and improved upon. Perhaps it can even be made faster by removing code that does not improve playing strength. And there is probably still room for adding code that improves strength without a noticeable slowdown. I just know it is a lot of hard work doing it the way I did it.

Best,
Magnus
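Magnus's three-step policy can be sketched as follows; every helper name here is a hypothetical stand-in for his hand-coded pattern and pruning logic, not Valkyria's actual code:

    // Heavy playout move selection: urgent local reply if there is one,
    // otherwise uniform random with 100%-inferior moves rejected.
    static int selectPlayoutMove(Board board, java.util.Random rng) {
        // 1) Obvious local responses to the last move, one picked at random
        //    if there are several candidates.
        java.util.List<Integer> urgent = obviousLocalResponses(board, board.lastMove());
        if (!urgent.isEmpty()) return urgent.get(rng.nextInt(urgent.size()));

        // 2) Otherwise choose uniformly at random,
        // 3) rejecting moves the pattern code deems 100% inferior.
        java.util.List<Integer> candidates = new java.util.ArrayList<>(board.legalMoves());
        while (!candidates.isEmpty()) {
            int move = candidates.remove(rng.nextInt(candidates.size()));
            if (!isAlwaysInferior(board, move)) return move;
        }
        return PASS; // nothing playable left
    }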
Re: [computer-go] Re: FW: computer-go] Monte carlo play?
On Sun, 16 Nov 2008, Claus Reinke wrote:
... better feeling for the game; personally, I don't like fast games(*), but ...

But there is this saying: Play quick, lose quick, learn quick! :)

Thomas
Re: [computer-go] FW: computer-go] Monte carlo play?
The random playouts, or even heavy playouts, are not intended to emulate a human player. Heikki is exactly right: it's a crude measurement of how good the position is. The moves in a random playout are horrible, and so are the moves in a heavy playout. In fact, improving them arbitrarily will probably hurt the playouts.

Random playouts work reasonably well because, to a certain extent, bad moves cancel each other out. Chrilly called this mutual stupidity, I think. For example, a group of stones may be weak and ultimately lost, but it is still possible to defend that group if the attacker isn't diligent. What happens in the playouts is that the attacker is NOT diligent, but neither is the defender! So the result comes out correct anyway. Of course this is not reliable, but it's amazingly good at getting a reasonable picture of what is weak, what is strong and who owns what.

You can improve that by adding some knowledge to the playouts, but you must do this with great care. In my example above, let's say you add a piece of knowledge that causes the defender to succeed. You can argue that the playouts play better go now, but the conclusion that you come to for the group we are talking about is now wrong.

- Don
RE: [computer-go] FW: computer-go] Monte carlo play?
Yes, Valkyria does a lot of ladder reading as well. Actually, 'pattern matching' in the case of Valkyria is a little unclear: it is a decision tree where the leaves are often procedure calls that look at a larger portion of the board. The ladder code is called for various reasons in the tree.

Is 3.7.10 level 10 weaker than the default settings? I will run a test using 5000 playouts and 375 playouts to see if it makes a difference. Also, have you tried this version on CGOS 9x9? This version http://cgos.boardspace.net/9x9/cross/Valkyria3.2.6_C2.html used a machine similar to yours and reached 3000 playouts per second using two threads. Your code is then 8 times faster and should be much stronger.

-Magnus

Quoting David Fotland [EMAIL PROTECTED]:
I thought Valkyria does local search (ladders) during the playouts. Many Faces is lighter on the playouts. I have 17 local 3x3 patterns, then go to uniform random without filling eyes. Against Gnugo 3.7.10 level 10 on 9x9, with 5000 playouts, I win 92%, so our performance is similar. I'm doing 23K playouts per second on my 2.2 GHz Core 2 Duo, so my performance might be a little better, depending on the specs of your old machine.
RE: [computer-go] FW: computer-go] Monte carlo play?
I think I added a small capture bias, but it didn't make much difference. Sorry, I forgot that it isn't quite pure random: before the uniform random move, if there is an enemy one-liberty group on the board, with some small probability I capture it. A pattern includes don't-cares and is matched in any orientation. The 3x3 patterns are only matched adjacent to the last move (8 local places).

David

-Original Message-
From: Jason House
Sent: Sunday, November 16, 2008 5:06 PM
To: computer-go
Subject: Re: [computer-go] FW: computer-go] Monte carlo play?

On Nov 16, 2008, at 11:18 AM, David Fotland wrote:
I thought Valkyria does local search (ladders) during the playouts. Many Faces is lighter on the playouts. I have 17 local 3x3 patterns, then go to uniform random without filling eyes.

No capture bias in the playouts? I thought that was a big strength boost. Out of curiosity, how do you count your patterns? For example, is it still one pattern if it includes a don't-care? How about rotations/reflections of the same basic pattern?

Against Gnugo 3.7.10 level 10 on 9x9, with 5000 playouts, I win 92%, so our performance is similar. I'm doing 23K playouts per second on my 2.2 GHz Core 2 Duo, so my performance might be a little better, depending on the specs of your old machine.

David
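The capture bias David mentions might look roughly like this; the probability value and the Board method names are assumptions, not Many Faces code:

    // With some small probability, capture an enemy one-liberty group;
    // otherwise fall back to a uniform random move that doesn't fill an eye.
    static int selectMove(Board board, java.util.Random rng) {
        java.util.List<Integer> captures = board.movesCapturingEnemyGroupsInAtari();
        if (!captures.isEmpty() && rng.nextDouble() < 0.3) { // illustrative probability
            return captures.get(rng.nextInt(captures.size()));
        }
        return board.uniformRandomMoveNotFillingOwnEye(rng);
    }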
Re: [computer-go] FW: computer-go] Monte carlo play?
Some months ago I did several experiments with using tactics and patterns in playouts. Generally I found a big boost in strength using tactics. I also found a boost in strength using patterns, but with severely diminishing returns after a certain number, even becoming detrimental with large numbers of patterns (1,000s to 10,000s). Since I was using a generalized pattern matcher, using it slowed things down considerably. Although it played a lot better with the same number of playouts, if I compared MC playouts with patterns to MC playouts without patterns using the same amount of CPU time, the gain was not so obvious. Since most of the gain in strength came from just a few patterns, I concluded, just as David did, that it was probably better to use a handful of hard-coded patterns during playouts.

I only recently started to do real experiments with hard-coded patterns, and so far my results are rather inconclusive. I found when mixing different things it's not always clear what contributes to any increased strength observed. So I'm still in the process of trying to dissect what is actually contributing where. I found, for example, that a lot of the increased level of play using patterns does not come from using them during playouts, but comes from the effect they have on move exploration. I don't know if this is due to my particular way of implementing MC playouts in combination with UCT search, but moves matching a pattern (usually) automatically come first in the tree expansion as well, and generally I can say so far I'm observing that most of the increase in level comes from the selection during exploration and only in small part from the selection during simulation. For example, in one particular experiment using just 5 patterns I saw a win rate of 65% against the same program not using patterns (with the same number of playouts). But when not using the patterns during exploration, the win rate dropped to just 55%.

I still have a lot of testing to do and it's too early to draw any hard conclusions. But I think it's worthwhile trying to distinguish where the strength is actually gained. Better yet, finding out exactly 'why' it gained strength, because with MC playouts I often find results during testing highly counter-intuitive, occasionally to the point of being (seemingly) nonsensical.

I also think what Don was proposing with his reference bot could be interesting: trying to make it play around Elo 1700 on CGOS using just 5,000 (light) playouts. I don't know if it's possible, but I think it's a fruitful exercise. At a time when most people are looking at using more and more hardware to increase playing strength, knowing what plays best at the other end of the spectrum is valuable as well. By that I mean finding what plays best using severely constrained resources.

Mark
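The split Mark observes to matter most, patterns guiding tree expansion while the playouts stay light, could be sketched like this; matchesPattern is a hypothetical stand-in for his pattern matcher:

    // Order children at expansion so that pattern-matching moves are tried
    // first in the tree; the playout policy itself is left untouched.
    static java.util.List<Integer> orderChildrenForExpansion(Board board,
            java.util.List<Integer> legalMoves) {
        java.util.List<Integer> ordered = new java.util.ArrayList<>(legalMoves);
        // Pattern moves sort to the front, so the tree explores them first.
        ordered.sort((a, b) -> Boolean.compare(matchesPattern(board, b),
                                               matchesPattern(board, a)));
        return ordered;
    }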
Re: [computer-go] FW: computer-go] Monte carlo play?
So you say that '...I'm observing that most of the increase in level comes from the selection during exploration and only in small part from the selection during simulation.' Could you elaborate at all? This is very interesting. That almost suggests it might be fruitful to use the patterns in the tree, but keep lighter playouts.

- George
Re: [computer-go] FW: [computer-go] Monte carlo play?
On 17-nov-08, at 02:42, George Dahl wrote:
> So you say: "...I'm observing that most of the increase in level comes from the selection during exploration and only in small part from the selection during simulation." Could you elaborate at all? This is very interesting. That almost suggests it might be fruitful to use the patterns in the tree, but keep lighter playouts.

That's exactly what it's suggesting. But as I said, I need to do some more testing to make a hard case for it.

Mark
Re: [computer-go] FW: [computer-go] Monte carlo play?
I look forward to hearing more! Happy testing.

- George

On Sun, Nov 16, 2008 at 11:53 PM, Mark Boon [EMAIL PROTECTED] wrote:
> [quoted message snipped]
Re: [computer-go] FW: [computer-go] Monte carlo play?
It seems move selection in the playouts should be very random at first and more deterministic toward the end of the playout. Has anyone tried that?

Mark Boon wrote:
> [quoted message snipped]
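One simple way to get that behavior would be to anneal a temperature over the course of the playout, so early moves are sampled near-uniformly and late moves near-greedily. A minimal Java sketch, assuming hypothetical per-move weights and an expected game length; this is one possible scheme, not an established one:

import java.util.Random;

class AnnealedPlayoutPolicy {
    private final Random rng = new Random();

    // Returns the index of the chosen move. Early in a playout the
    // temperature t is near 1 (almost uniform choice); toward the end
    // it approaches 0 (almost greedy on the weights, assumed positive).
    int select(double[] weights, int movesPlayed, int expectedLength) {
        double t = Math.max(0.05, 1.0 - (double) movesPlayed / expectedLength);
        double[] p = new double[weights.length];
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            p[i] = Math.pow(weights[i], 1.0 / t); // sharpen as t shrinks
            sum += p[i];
        }
        double r = rng.nextDouble() * sum;
        for (int i = 0; i < p.length; i++) {
            r -= p[i];
            if (r <= 0.0) return i;
        }
        return p.length - 1; // guard against floating-point rounding
    }
}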
RE: [computer-go] Monte carlo play
Hello Tony,

I'm just speaking for myself here, but I suspect you may not get the answers you are looking for, because you're not asking the right questions. I get the impression that the situation is similar to this scenario: a young guy with a surfboard and a Hawaiian shirt turns up at base camp on Mount Everest and says: "Hey dudes, what's up! This morning I had this wonderful idea: I wanna climb Mount Everest! So here I am, and I even brought my hiking shoes! But can anyone give me directions to the nearest McDonald's, cuz I'm starving." Everybody in the tent stares at him speechlessly.

I may be wrong, but that's my impression. I'm not even sure whether I myself should be here at base camp, even though I am a 4d in Go and even though I have 25 years of programming experience in a couple of different languages (including the last 8 years as a full-time professional developer). It's only logical that ever since I learned Go 20 years ago I've had a special interest in computer-go, and I've spent some time over the years attempting to build a strong go-playing program myself (although not nearly as much time as some others here). I'm a mountaineer, but climbing Mount Everest is something else. I feel a bit like a rookie among all the seasoned mountaineers hanging around here at the Everest base camp. Most of the time I keep my mouth shut when one of them is talking, because I'm a bit apprehensive that I might say something that is utterly obvious to the others, or even outright stupid. And then this guy turns up in his Hawaiian shirt.

I hope you don't feel offended. You have indeed taken up a wonderful endeavour, but I sense that you're not quite ready to go for the summit today.

Best Regards,
Dave

From: [EMAIL PROTECTED] on behalf of tony tang
Sent: Thu 13-11-2008 22:03
To: go mailing list
Subject: RE: [computer-go] Monte carlo play

Thanks Darren,

After reading a couple of interesting papers in this field, I am applying pattern matching to limit the number of possibilities in the Monte Carlo tree (on 19x19 it appears to take a day to play). I was wondering if you could comment on whether my following concept of the evaluation part is right: the possible patterns for the current situation are passed to a class where self-play begins. Each pattern runs in its own instance, and at the end the results are written into a file. The game tree is then constructed using the stats from the file.

Thanks again,
Tony

Date: Wed, 12 Nov 2008 10:34:21 +0900
From: [EMAIL PROTECTED]
To: computer-go@computer-go.org
Subject: Re: [computer-go] Monte carlo play

> I am new to programming go, could someone explain to me how a Monte Carlo based evaluation manages to play random games by itself? i.e.: who/what is the opponent which supplies the opposing moves which allows another move to be randomly played after making the initial

It is self-play, so both players are random. I'm not aware of any programs that use different playout strategies for black/white, though it has been briefly mentioned before on this list, and I personally think it might be interesting as a way of discovering and more accurately evaluating the imbalances present in most go positions (i.e. usually one side has more strength and one side has more territory, and so to correctly understand the position one side has to play aggressive moves and the other has to play defensive moves).

> move at the root. I am implementing in Java; is there a package/framework which allows me to train my software?
This page, recently started by Eric Marchand, might be a good starting place: http://ricoh51.free.fr/go/engineeng.htm

Darren

--
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/blogs.html (My blogs and articles)
Re: [computer-go] Monte carlo play
In message [EMAIL PROTECTED], [EMAIL PROTECTED] writes:

> [Hawaiian shirt analogy snipped]
> I hope you don't feel offended. Indeed you took up a wonderful endeavour, but I sense that you're not quite ready to go for the summit today.

But Tony never expressed any interest in going for the summit. Maybe he just wants to write a simple MC implementation. Here is what he wrote:

> I am new to programming go, could someone explain to me how a Monte Carlo based evaluation manages to play random games by itself? i.e.: who/what is the opponent which supplies the opposing moves which allows another move to be randomly played after making the initial move at the root. I am implementing in Java; is there a package/framework which allows me to train my software?

I will try to answer these questions myself.

> who/what is the opponent which supplies the opposing moves

It is the same piece of your program that supplies your own moves. For each random game, or rollout, your program places both black and white stones, alternately, at random.

> is there a package/framework which allows me to train my software?

Nothing trains your software. This isn't a neural net. The way it works is:

  you write a program
A: it plays, badly
  you modify it so it plays differently, and with luck, better
  repeat from A

Nick
--
Nick Wedd    [EMAIL PROTECTED]
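To make Nick's answer concrete, a bare-bones rollout in which the same routine supplies the moves for both colors could look like this Java sketch. Board is a hypothetical interface, and legalMoves() is assumed to already exclude one-point eye fills and simple ko recaptures so that the loop terminates:

import java.util.List;
import java.util.Random;

// Hypothetical board interface, not from any particular program.
interface Board {
    List<Integer> legalMoves(); // assumed to exclude eye fills and simple ko
    void play(int point);       // plays for whoever is to move, then flips color
    void pass();
    boolean blackWins();        // count the final position
}

class LightPlayout {
    private final Random rng = new Random();

    // One random game: this single loop supplies the moves for BOTH
    // colors, alternating until two consecutive passes end the game.
    // Returns 1 if black wins, 0 otherwise.
    int run(Board board) {
        int passes = 0;
        while (passes < 2) {
            List<Integer> moves = board.legalMoves();
            if (moves.isEmpty()) {
                board.pass();
                passes++;
            } else {
                board.play(moves.get(rng.nextInt(moves.size())));
                passes = 0;
            }
        }
        return board.blackWins() ? 1 : 0;
    }
}

Averaging the returned result over many such rollouts from the same position gives the Monte Carlo evaluation discussed in this thread.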
RE: [computer-go] Monte carlo play
You are absolutely right. Tony did not imply he wanted to go for the summit today. Also, Tony's qualifications are surely a lot better than a Hawaiian shirt and a surfboard. My analogy was an exaggeration, with the intention of explaining to Tony that he has more catching up to do than he might have expected. By no means was my intention to put Tony down. I hope Tony understands this, and I sincerely hope that Tony won't give up on joining us in this wonderful endeavour, because we need all the help we can get.

Best Regards,
Dave

From: [EMAIL PROTECTED] on behalf of Nick Wedd
Sent: Sat 15-11-2008 15:48
To: computer-go
Subject: Re: [computer-go] Monte carlo play

[Nick's message, quoted in full above, snipped]
RE: [computer-go] Monte carlo play
Just a quick start here, and I'm not one of the real experts, so it may contain inaccuracies. Perhaps you already know all this, perhaps not.

The basis of many game-playing programs (like chess programs) is the minimax (or negamax) algorithm. It is a simple algorithm by itself, and you can find a lot of background and examples for chess or other games if you google it. Basically, minimax explores all possible variations stemming from the current position and then selects the best move. This algorithm works in principle, but there is a problem with it: for interesting games like chess and go, it will usually take the program billions of years to find the best move in a given position.

In the past decades, several adaptations of the minimax algorithm were invented to enable it to find a reasonably good move in a timespan that humans are prepared to wait for (evaluation functions, alpha-beta pruning and transposition tables). The program keeps investigating possibilities for some time and then plays the move that seems best so far. The longer the program is allowed to think (and the faster the computer), the better the move it will come up with. These adaptations work by reducing the number of possibilities the program explores, without weakening the program too much. The program must ignore many possibilities to save time (this is called pruning), but the more possibilities it ignores, the higher the risk that it accidentally ignores the best move. Also, there is always the danger of the horizon effect: the program misinterprets the situation because it simply doesn't get enough time to fully investigate every possibility. The challenge for the programmer is to strike an optimal balance between time consumption and losing by accidentally ignoring or misinterpreting the best move. Some game programs are even able to fine-tune themselves over time to optimize this balance (learning). A combination of these adaptations has proven good enough to enable chess programs on current hardware to beat any human within reasonable thinking time.

But computer-go is a harder problem than computer-chess. The adaptations that are successful in computer-chess are just not enough for computer-go, mainly because there are far more possibilities (and I mean FAR more) to explore in go. But now there is a recent adaptation that has proven fairly successful in computer-go: Monte-Carlo playouts. The program reduces the work to be done by gambling a number of times and measuring the average payoff. If the average payoff is good, the move is assumed to be good for the time being. Also, it doesn't automatically ignore moves with a low payoff (remember the risk of ignoring the best move). The program just postpones further investigation of seemingly bad moves until the moves that seemed good initially give more and more disappointing payoffs on further investigation. Although this invention has had only moderate success in computer-chess, it has proven fairly successful in computer-go. But at the heart of a Monte-Carlo go program, you can still recognize the minimax algorithm.

Best Regards,
Dave

From: [EMAIL PROTECTED] on behalf of tony tang
Sent: Wed 12-11-2008 2:23
To: computer-go@computer-go.org
Subject: [computer-go] Monte carlo play

Greetings all go gurus,

I am new to programming go, could someone explain to me how a Monte Carlo based evaluation manages to play random games by itself? i.e.: who/what is the opponent which supplies the opposing moves which allows another move to be randomly played after making the initial move at the root. I am implementing in Java; is there a package/framework which allows me to train my software?

Thank you,
Tony
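As a reference point for the minimax/negamax algorithm Dave describes, here is a bare Java sketch (GameState is a hypothetical interface). In go, the static evaluate() at the leaves is the part that is too slow or too weak, which is exactly what the averaged Monte-Carlo payoff replaces:

import java.util.List;

// Hypothetical game-state interface.
interface GameState {
    boolean isTerminal();
    int evaluate(); // score from the side to move; magnitude assumed small
    List<GameState> successors();
}

class Negamax {
    // Bare negamax: explore every variation to the given depth and
    // return the best achievable score for the side to move. Each
    // player's best move is the one minimizing the opponent's score,
    // hence the negation of the child's value.
    static int search(GameState s, int depth) {
        if (depth == 0 || s.isTerminal()) {
            return s.evaluate();
        }
        int best = -1_000_000_000; // assumes scores stay well inside this bound
        for (GameState child : s.successors()) {
            best = Math.max(best, -search(child, depth - 1));
        }
        return best;
    }
}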
[computer-go] FW: [computer-go] Monte carlo play?
From: tony tang [mailto:[EMAIL PROTECTED]]
Sent: Sat 15-11-2008 21:22
To: [EMAIL PROTECTED]
Subject: RE: [computer-go] Monte carlo play?

Hello Dave,

Thank you for a thorough introduction to the theory, and I sincerely hope I am not wasting your time with amateur questions. Because it was quite late when I sent the question, I think I didn't word it correctly and mis-communicated my question. Being a computer scientist but new to go, I can grasp some of the theory. The question I was trying to get across was: in a game of self-play, if both parties are employing only Monte Carlo, surely that is not a good conceptual representation of a human, and if the reinforcement learning is based on random simulations, wouldn't it be very weak when playing a real human? According to Nick Wedd, and some playing around, I am clearly mistaken, because it works quite well. Is it because they are training (not in the neural-network sense) their Monte Carlo based engine against another engine with prior knowledge, or something of that nature? I have seen papers implementing patterns and prior knowledge with Monte Carlo (dated 2006); has that become standard now, so that when people refer to Monte Carlo they don't mean absolutely random?

Thank you,
Tony
[computer-go] FW: [computer-go] Monte carlo play?
From: [EMAIL PROTECTED]
Sent: Sat 15-11-2008 22:44
To: tony tang
Subject: RE: [computer-go] Monte carlo play?

Hello Tony,

Ok, so you know about the minimax algorithm and the like. My impression was wrong and I'm very sorry for my analogy. I'm no expert on Monte Carlo, so I can only say what I know. But most of the real experts have been rather quiet during the last week, so perhaps I can act as a stand-in.

The basic idea is indeed absolutely random play (alternating turns and removing captured stones as in normal play) until the game finishes. The only exceptions to random play are occupied intersections, simple ko recaptures, suicide moves (I'm not sure all programs prune suicide if the rules allow it) and moves that fill one-point eyes of the color who is to play (there may be slight differences in the exact definition of a one-point eye). Without the one-eye pruning, random games would never finish, because both random players would keep on killing their own stones and refilling the empty space after the opponent has captured. Even with eye detection and simple ko detection, the game could occasionally end up in a cycle because of triple ko or more pathological situations, so one should probably also limit the number of moves and discard the result if that limit is exceeded. This is probably a much cheaper solution to the cycle problem than more sophisticated detection.

A purely random game is played out until both players are forced to pass. This is called a light playout. Claus Reinke recently dubbed the term "random fill-in" for this process (http://www.mail-archive.com/computer-go@computer-go.org/msg09749.html). I like this term better than "light playout", but it doesn't seem to catch on here. The board is then counted to determine the winner, and this binary result is added to a statistic for the move that is currently being evaluated. The accumulated statistic is effectively the evaluation value for the position. It is then backtracked to the game tree root in a minimax way.

In go it is hard to quickly evaluate a position statically (or even slowly, for that matter). So it is nice to have a dynamic evaluation that becomes more accurate as you spend more CPU cycles on it. You can see that it's important that random fill-ins are fast, but it's still a rather large investment to do a lot of playouts when you compare this with a static evaluation function. So it is even more important to avoid wasting effort in evaluating positions by random fill-ins. The trick is to use crude statistics (few fill-ins) to decide which statistic to refine by investing more fill-ins. This is where the UCT value comes in (http://senseis.xmp.net/?UCT). It is derived from the statistic, and it decides which position to investigate further, preferring good statistics over bad ones, but also preferring crude statistics over accurate ones.

There is no training involved, and there is no a priori knowledge involved here except for the rules and eye detection. The program has absolutely no idea about go concepts. Would you call this process learning? Programs using this algorithm are called pure MC programs, and although they get better with more CPU cycles, they aren't very strong with normal time limits on current hardware (or hardware in the near future). You are right: by itself, pure MC is just not effective enough. So programmers started to build in more go knowledge to improve on this. This makes the fill-in process slower, so these are called heavy playouts. And this is also where my knowledge ends.
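The UCT value Dave points to combines exactly the two preferences he names: good statistics and crude (low-visit) statistics. A minimal Java sketch of the standard selection rule, with a hypothetical per-move statistics record:

class MoveStats {
    int wins;
    int visits;
}

class Uct {
    static final double C = 1.0; // exploration constant, tuned per program

    // Pick the child maximizing
    //   winRate + C * sqrt(ln(parentVisits) / childVisits).
    // The first term prefers moves with good statistics; the second
    // term grows for moves that have been visited only a few times.
    static int selectChild(MoveStats[] children, int parentVisits) {
        int best = 0;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < children.length; i++) {
            MoveStats s = children[i];
            double value = (s.visits == 0)
                    ? Double.POSITIVE_INFINITY // try every move once first
                    : (double) s.wins / s.visits
                      + C * Math.sqrt(Math.log(parentVisits) / s.visits);
            if (value > bestValue) {
                bestValue = value;
                best = i;
            }
        }
        return best;
    }
}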
I may guess that instead of selecting moves in a uniform random way, moves are weighted by a priori go knowledge built in by the programmer, but I know no details and I don't know if there have been publications on this. Perhaps there aren't any, because it is also a competitive field and commercial interests might be involved too. Other contributors can probably tell you more about this.

Best Regards,
Dave

From: tony tang [mailto:[EMAIL PROTECTED]]
Sent: Sat 15-11-2008 21:22
To: [EMAIL PROTECTED]
Subject: RE: [computer-go] Monte carlo play?

[Tony's message, quoted in full above, snipped]
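Dave's guess corresponds to what is often called roulette-wheel (weighted random) selection. A minimal Java sketch, purely illustrative: the weights are assumed to come from whatever go knowledge the programmer builds in (e.g. small patterns), and uniform weights recover the pure-random light playout:

import java.util.Random;

class WeightedPlayoutPolicy {
    private final Random rng = new Random();

    // Returns the index of the chosen move. weights[i] > 0 encodes how
    // much the built-in go knowledge likes move i; a move with twice
    // the weight is chosen twice as often.
    int select(double[] weights) {
        double total = 0.0;
        for (double w : weights) total += w;
        double r = rng.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r <= 0.0) return i;
        }
        return weights.length - 1; // guard against floating-point rounding
    }
}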