Re: [Computer-go] Teaching Deep Convolutional Neural Networks to Play Go
Hi Oliver,

Reinforcement learning is different from unsupervised learning. We used reinforcement learning to train the Atari-playing networks. We also published a more recent paper (www.nature.com/articles/nature14236) that applied the same network to 50 different Atari games (achieving human level in around half).

Similar neural network architectures can indeed be applied to Go (indeed, that was one of the motivations for our recent ICLR paper). However, training by reinforcement learning from self-play is perhaps more challenging than for Atari: our method (DQN) was applied to single-player Atari games, whereas in Go there is also an opponent. I could not guarantee that DQN would be stable in this setting.

Cheers
Dave

On 16 March 2015 at 22:21, Oliver Lewis ojfle...@yahoo.co.uk wrote:
> Can you say anything about whether you think their approach to unsupervised learning could be applied to networks similar to those you trained? Any practical or theoretical constraints we should be aware of?
>
> On Monday, 16 March 2015, Aja Huang ajahu...@gmail.com wrote:
>> Hello Oliver,
>>
>> 2015-03-16 11:58 GMT+00:00 Oliver Lewis ojfle...@yahoo.co.uk:
>>> It's impressive that the same network learned to play seven games with just a win/lose signal. It's also interesting that both these teams are in different parts of Google. I assume they are aware of each other's work, but maybe Aja can confirm.
>>
>> The authors are my colleagues at Google DeepMind; they list DeepMind as their affiliation on the paper. Yes, we are aware of each other's work.
>>
>> Aja

___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
Re: [Computer-go] Move Evaluation in Go Using Deep Convolutional Neural Networks
Hi Martin,

> - Would you be willing to share some of the sgf game records played by your network with the community? I tried to replay the game record in your paper, but got stuck since it does not show any of the moves that got captured.

Sorry about that, we will correct the figure and repost. In the meantime, Aja will post the .sgf for that game. Also, thanks for noticing that we tested against a stronger version of Fuego than Clark and Storkey; we'll evaluate against Fuego 1.1 and post the results. Unfortunately, we only have approval to release the material in the paper, so we can't really give any further data :-(

One more thing: Aja said he was tired when he measured his own performance on KGS predictions (working too hard on this paper!). So it would be great to get better statistics on how humans really do at predicting the next move. Does anyone want to measure their own performance, say on 200 randomly chosen positions from the KGS data?

> - Do you know how large the effect is from using the extra features that are not in the paper by Clark and Storkey, i.e. the last-move info and the extra tactics? As a related question, would you get an OK result if you just zeroed out some inputs in the existing net, or would you need to re-train a new network from fewer inputs?

We trained our networks before we knew about Clark and Storkey's results, so we haven't had a chance to evaluate the differences between the approaches. But it's well known that last-move info makes a big difference to predictive performance, so I'd guess they would already be close to 50% predictive accuracy if they included those features.

> - Is there a way to simplify the final network so that it is faster to compute and/or easier to understand? Is there something computed, maybe on an intermediate layer, that would be usable as a new feature in itself?

This is an interesting idea, but so far we have only focused on building a network large and deep enough to represent Go knowledge at all.
Cheers
Dave
Re: [computer-go] David Silver's Rave formula
Hi Lars,

> Is there anyone who can repost the pdf (rave.pdf?) the following mails are talking about? http://computer-go.org/pipermail/computer-go/2008-February/014095.html

I think you can still find the original attachment here:
http://computer-go.org/pipermail/computer-go/attachments/20080208/6519e9c5/rave.pdf

However, if you can wait a few weeks, I will be publishing a clearer (I hope!) explanation of how to combine UCT and RAVE in my PhD thesis.

Cheers
-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi,

We used alpha = 0.1. There may well be a better setting of alpha, but this appeared to work nicely in our experiments.

-Dave

On 3-May-09, at 2:01 AM, elife wrote:
> Hi Dave,
>
> In your experiments, what's the constant value alpha you set? Thanks.
>
> 2009/5/1 David Silver sil...@cs.ualberta.ca:
>> Yes, in our experiments they were just constant numbers M=N=100.
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Yamato,

> If M and N are the same, is there any reason to run M simulations and N simulations separately? What happens if you combine them and calculate V and g in a single loop?

I think it gives the wrong answer to do it in a single loop. Note that the simulation outcomes z are used to compute both V and g, so they are quite strongly correlated. In general, if random variables X and Y are correlated then E[X]E[Y] != E[XY].

A simple example of this problem would be estimating E[X]E[X] for some random variable X. Let's say X is actually just noise with mean zero. If you take just one sample x1, then x1*x1 is always +ve, and you would estimate E[X]E[X] = E[X^2] > 0. But if you take two independent samples x1 and x2, then x1*x2 would be +ve half the time and -ve half the time, and you would correctly estimate E[X]E[X] = 0.

-Dave
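Dave's point about correlated samples is easy to check numerically. This is my own toy sketch (with X = ±1 equiprobable, not anything from the paper), comparing the single-sample estimate of E[X]E[X] against the two-independent-samples estimate:

```python
import random

random.seed(0)
n = 100_000

# X is pure noise with mean zero: +1 or -1 with equal probability
def sample():
    return random.choice([-1.0, 1.0])

# Correlated: reuse one sample for both factors -> estimates E[X^2], not E[X]E[X]
same = sum(sample() ** 2 for _ in range(n)) / n

# Independent: two fresh samples per trial -> correctly estimates E[X]E[X] = 0
indep = sum(sample() * sample() for _ in range(n)) / n

print(same)    # exactly 1.0 here, since (+/-1)^2 = 1
print(indep)   # close to 0.0
```

The first estimator is biased no matter how many samples you draw, which is exactly why V and g need separate batches of simulations.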
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Remi,

> This is strange: you do not take lost playouts into consideration. I believe there is a problem with your estimation of the gradient. Suppose for instance that you count z = +1 for a win, and z = -1 for a loss. Then you would take lost playouts into consideration. This makes me a little suspicious.

Sorry, I should have made it clear that this assumes we are treating black wins as z=1 and white wins as z=0. In this special case, the gradient is the average over games in which black won. But yes, more generally you need to include games won by both sides. The algorithms in the paper still cover this case - I was just trying to simplify their description to make the ideas easy to understand.

> The fundamental problem here may be that your estimate of the gradient is biased by the playout policy. You should probably sample X(s) uniformly at random to have an unbiased estimator. Maybe this can be fixed with importance sampling, and then you may get a formula that is symmetrical regarding wins and losses. I don't have time to do it now, but it may be worth taking a look.

The gradient already compensates for the playout policy (equation 9), so in fact it would bias the gradient to sample uniformly at random! In equation 9, the gradient is taken with respect to the playout policy parameters. Using the product rule (third line), the gradient is equal to the playout policy probabilities, multiplied by the sum of likelihood ratios, multiplied by the simulation outcomes z. This gradient can be computed by sampling playouts instead of multiplying by the playout policy probabilities. This is also why games with outcomes of z=0 can be ignored - they don't affect this gradient computation.

> More precisely: you should estimate the value of N playouts as Sum p_i z_i / Sum p_i instead of Sum z_i. Then, take the gradient of Sum p_i z_i / Sum p_i. This would be better. Maybe Sum p_i z_i / Sum p_i would be better for MCTS, too?

I think a similar point applies here.
We care about the expected value of the playout policy, which can be sampled directly from playouts, instead of multiplying by the policy probabilities. You would only need importance sampling if you were using a different playout policy from the one you are evaluating. But I guess I'm not seeing any good reason to do this?

-Dave
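For readers following the equation 9 discussion, here is a toy sketch of the likelihood-ratio idea Dave describes: the gradient of an expected outcome under a softmax policy can be estimated by sampling from the policy itself, with no uniform sampling or importance weights. The two actions, outcomes and parameters below are invented for illustration, not taken from the paper:

```python
import math, random

random.seed(1)

# Hypothetical two-action softmax policy and a fixed outcome z per action
theta = [0.3, -0.2]
z = [1.0, 0.0]

e = [math.exp(t) for t in theta]
p = [x / sum(e) for x in e]

# Analytic gradient of E[z] w.r.t. theta[0] for a softmax policy:
# sum_a z[a] * pi(a) * (1[a==0] - pi(0))
analytic = sum(z[a] * p[a] * ((1 if a == 0 else 0) - p[0]) for a in range(2))

# Likelihood-ratio estimate: sample actions from pi itself and average
# z * d(log pi)/d(theta[0]); sampling already weights by pi, so no
# importance correction is needed.
trials, est = 200_000, 0.0
for _ in range(trials):
    a = 0 if random.random() < p[0] else 1
    score = (1 if a == 0 else 0) - p[0]
    est += z[a] * score
est /= trials

print(analytic, est)   # the sampled estimate matches the analytic gradient
```

Note also that every sampled term with z=0 contributes nothing, matching Dave's remark that such games can be ignored under the {0,1} convention.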
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Remi,

> I understood this. What I find strange is that using -1/1 should be equivalent to using 0/1, but your algorithm behaves differently: it ignores lost games with 0/1, and uses them with -1/1. Imagine you add a big constant to z. One million, say. This does not change the problem. You get either 1,000,000 or 1,000,001 as the outcome of a playout. But then, your estimate of the gradient becomes complete noise. So maybe using -1/1 is better than 0/1? Since your algorithm depends so much on the definition of the reward, there must be an optimal way to set the reward. Or there must be a better way to define an algorithm that would not depend on an offset in the reward. There is still something wrong that I don't understand. There may be a way to quantify the amount of noise in the unbiased gradient estimate, and it would depend on the average reward. Probably setting the average reward to zero is what would minimize noise in the gradient estimate. This is just an intuitive guess.

Okay, now I understand your point :-) It's a good question - and I think you're right. In REINFORCE, any baseline can be subtracted from the reward without affecting the expected gradient, but possibly reducing its variance. The baseline leading to the best estimate is indeed the average reward. So it should be the case that {-1,+1} would estimate the gradient g more efficiently than {0,1}, assuming that we see similar numbers of black wins and white wins across the training set.

So to answer your question, we can safely modify the algorithm to use (z-b) instead of z, where b is the average reward. This would then make the {0,1} and {-1,+1} cases equivalent (with appropriate scaling of the step-size). I don't think this would have affected the results we presented (because all of the learning algorithms converged anyway, at least approximately, during training), but it could be an important modification for larger boards.
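The baseline effect Dave describes can be demonstrated in a few lines. This sketch uses a made-up two-action bandit under a uniform policy (none of these numbers are from the paper): both estimators have the same expected gradient, but subtracting the average reward b shrinks the per-sample variance:

```python
import random, statistics

random.seed(2)

# Made-up two-action bandit, uniform softmax policy (theta = 0)
p_win = [0.7, 0.4]
baseline_b = sum(p_win) / 2        # average reward under this policy

def sample_grad(b):
    # one-sample REINFORCE estimate of dE[z]/dtheta0, with reward (z - b)
    a = 0 if random.random() < 0.5 else 1
    z = 1.0 if random.random() < p_win[a] else 0.0
    score = (1 if a == 0 else 0) - 0.5   # d log pi(a) / d theta0 for softmax
    return (z - b) * score

trials = 50_000
no_base = [sample_grad(0.0) for _ in range(trials)]
with_base = [sample_grad(baseline_b) for _ in range(trials)]

# Same expected gradient either way; lower variance with the baseline
print(statistics.mean(no_base), statistics.mean(with_base))
print(statistics.variance(no_base), statistics.variance(with_base))
```

This is exactly why {-1,+1} outcomes behave better than {0,1} when wins and losses are roughly balanced: they amount to a built-in baseline near the average reward.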
-Dave
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Yamato,

> Thanks for the detailed explanation. M, N and alpha are constant numbers, right? What did you set them to?

You're welcome! Yes, in our experiments they were just constant numbers M=N=100.

>> The feature vector is the set of patterns you use, with value 1 if a pattern is matched and 0 otherwise. The simulation policy selects actions in proportion to the exponentiated, weighted sum of all matching patterns. For example, let's say move a matches patterns 1 and 2, move b matches patterns 1 and 3, and move c matches patterns 2 and 4. Then move a would be selected with probability e^(theta1 + theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 + theta4)). The theta values are the weights on the patterns, which we would like to learn. They are the log of the Elo ratings in Remi Coulom's approach.
>
> OK, I guess it is formula 5 in the paper.

Yes, exactly.

>> The only tricky part is computing the vector psi(s,a). Each component of psi(s,a) corresponds to a particular pattern, and is the difference between the observed feature (i.e. whether the pattern actually occurred after move a in position s) and the expected feature (the average value of the pattern, weighted by the probability of selecting each action).
>
> I still don't understand this. Is it formula 6? Could you please give me an example like the above?

Yes, that's right, this is equation 6. Okay, let's continue the example above. Let's say that in position s, using the current theta, moves a, b and c will be selected with probability 0.5, 0.3 and 0.2 respectively. Let's say that move a was actually selected. Now consider pattern 1: this is matched after a, but the probability of matching was (0.5*1 + 0.3*1 + 0.2*0) = 0.8, so psi_1(s,a) = 1 - 0.8 = 0.2. For pattern 2, psi_2(s,a) = 1 - (0.5*1 + 0.3*0 + 0.2*1) = 0.3, etc. So in this example the vector psi(s,a) = [0.2, 0.3, -0.3, -0.2]. In other words, psi tells us whether each pattern was actually matched more or less than we could have expected. Hope this helps.
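Dave's worked example can be written out directly. This little sketch just recomputes psi(s,a) from the stated match sets and move probabilities:

```python
# Worked psi(s, a) computation from the example: three legal moves (a, b, c),
# four patterns; matches[move] lists which patterns that move matches.
matches = {"a": {1, 2}, "b": {1, 3}, "c": {2, 4}}
prob = {"a": 0.5, "b": 0.3, "c": 0.2}   # current policy's move probabilities
chosen = "a"                            # the move actually selected

psi = []
for pattern in (1, 2, 3, 4):
    observed = 1.0 if pattern in matches[chosen] else 0.0
    expected = sum(p for m, p in prob.items() if pattern in matches[m])
    psi.append(round(observed - expected, 10))

print(psi)  # [0.2, 0.3, -0.3, -0.2], as in the example
```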
-Dave
Re: [computer-go] Monte-Carlo Simulation Balancing
On 30-Apr-09, at 1:59 PM, Michael Williams wrote:
> I wish I was smart :(

IMO, other people's equations/code/ideas/papers always seem smarter than your own. The stuff you understand and do yourself just seems like common sense, and the stuff you don't always has a mystical air of complexity, at least until you understand it too :-)
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Yamato,

> Could you give us the source code which you used? Your algorithm is too complicated, so it would be very helpful if possible.

Actually, I think the source code would be much harder to understand! It is written inside RLGO, and makes use of a substantial existing framework that would take a lot of effort to understand. (On a separate note, I am considering making RLGO open source at some point, but I'd prefer to spend some effort cleaning it up before making it public.)

But I think maybe Algorithm 1 is much easier than you think:

A: Estimate the value V* of every position in a training set, using deep rollouts.
B: Repeat, for each position in the training set:
  1. Run M simulations, and estimate the value of the position (call this V).
  2. Run another N simulations, and average the value of psi(s,a) over all positions and moves in games that black won (call this g).
  3. Adjust the parameters by alpha * (V* - V) * g.

The feature vector is the set of patterns you use, with value 1 if a pattern is matched and 0 otherwise. The simulation policy selects actions in proportion to the exponentiated, weighted sum of all matching patterns. For example, let's say move a matches patterns 1 and 2, move b matches patterns 1 and 3, and move c matches patterns 2 and 4. Then move a would be selected with probability e^(theta1 + theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 + theta4)). The theta values are the weights on the patterns, which we would like to learn. They are the log of the Elo ratings in Remi Coulom's approach.

The only tricky part is computing the vector psi(s,a). Each component of psi(s,a) corresponds to a particular pattern, and is the difference between the observed feature (i.e. whether the pattern actually occurred after move a in position s) and the expected feature (the average value of the pattern, weighted by the probability of selecting each action).
It's also very important to be careful about signs and the colour to play - it's easy to make a mistake and follow the gradient in the wrong direction.

Is that any clearer?
-Dave
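To make the steps of Algorithm 1 concrete, here is a self-contained toy sketch. Everything about the domain is invented (a one-move "game" with three moves, four binary patterns, and fixed win rates per move - nothing like real Go or RLGO); only the structure follows the outline above: estimate V from M simulations, estimate g from N further simulations using psi on games black won, then update by alpha * (V* - V) * g:

```python
import math, random

random.seed(3)

# Invented toy domain: one "position", three legal moves, four patterns,
# and a game lasting exactly one move; black's win chance depends on the move.
matches = {"a": {0, 1}, "b": {0, 2}, "c": {1, 3}}
p_win   = {"a": 0.9, "b": 0.5, "c": 0.1}
V_star  = 0.7          # stand-in for the deep-rollout estimate V*
theta   = [0.0] * 4    # pattern weights to be learned
alpha, M, N = 0.1, 100, 100

def policy(theta):
    # softmax over summed weights of matching patterns (formula 5)
    w = {m: math.exp(sum(theta[k] for k in ks)) for m, ks in matches.items()}
    s = sum(w.values())
    return {m: wm / s for m, wm in w.items()}

def psi(move, pi):
    # observed minus expected pattern activation (formula 6)
    return [(1.0 if k in matches[move] else 0.0)
            - sum(pi[m] for m in matches if k in matches[m])
            for k in range(4)]

def simulate(pi):
    r, acc = random.random(), 0.0
    move = "c"                      # fallback against float round-off
    for m, p in pi.items():
        acc += p
        if r < acc:
            move = m
            break
    z = 1.0 if random.random() < p_win[move] else 0.0
    return move, z

for _ in range(2000):
    pi = policy(theta)
    V = sum(simulate(pi)[1] for _ in range(M)) / M          # step 1: estimate V
    g = [0.0] * 4                                           # step 2: estimate g
    for _ in range(N):
        m, z = simulate(pi)
        if z == 1.0:
            g = [gi + pk / N for gi, pk in zip(g, psi(m, pi))]
    theta = [t + alpha * (V_star - V) * gi                  # step 3: update
             for t, gi in zip(theta, g)]

pi = policy(theta)
V_final = sum(pi[m] * p_win[m] for m in pi)
print(round(V_final, 2))
```

Note that V and g come from separate batches of simulations, as discussed elsewhere in the thread; the learned policy's value should settle near the V* = 0.7 target rather than simply maximising wins.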
[computer-go] Re: [computer go] Monte-Carlo Simulation Balancing
Hi Remi,

> What komi did you use for 5x5 and 6x6?

I used 7.5 komi for both board sizes.

> I find it strange that you get only 70 Elo points from supervised learning over uniform random. Don't you have any feature for atari extension? This one alone should improve strength immensely (extend string in atari because of the previous move).

Actually, no. The features are very simple, and know how to capture but not how to defend ataris. I'm sure that a better set of features could improve by more than 70 Elo, but I expect we would still see a benefit to balancing the weights correctly. For example, the Fuego policy defends ataris and follows several other common-sense rules, but the results in 5x5 and 6x6 show that it is not well balanced on small boards.

> Let us extrapolate: I got more than 200 Elo points of improvement from my patterns in 9x9 over uniform random (I never really measured that, it may be even more than 200).

I guess you got more than 200 Elo on 9x9: in MoGo (Gelly et al. 2006), the improvement from uniform random was at least 400 Elo, and I think your simulation policy is probably at least as strong. By the way, I was sad to hear you're not working on Crazy Stone any more. Is it because you are busy with other projects?

> So maybe I could get 600 more Elo points with your method. And even more on 19x19. I noticed that, in general, changes in the playout policy have a much bigger impact on larger boards than on smaller boards.

As to whether we can extrapolate, I hope so :-) I share the feeling that improving the simulation policy will have more impact on bigger boards with longer simulations. On the other hand, I've been surprised by many things in MC Go, so nothing is certain until we try it!

-Dave
[computer-go] Monte-Carlo Simulation Balancing
Hi Yamato,

> I like your idea, but why do you use only 5x5 and 6x6 Go?

1. Our second algorithm, two-ply simulation balancing, requires a training set of two-ply rollouts. Rolling out every position from a complete two-ply search is very expensive on larger board sizes, so we would probably have to consider some subset of the leaf positions. We wanted to analyse the full algorithm first, before we started making approximations.
2. We can generate a lot more data on small boards, giving high confidence in the results we report.
3. IMO it's important to do the science and understand the underlying principles first, and then scale up to bigger boards, more complex Monte-Carlo searches, etc.

> I don't think the 200+ Elo improvement is so impressive,

I agree that it would be much more impressive to report positive results on larger boards. But perhaps it is already interesting that tuning the balance of the simulation policy can make a big difference on small boards? Also, larger boards mean longer simulations, so the importance of simulation balancing should become even more exaggerated.

> because the previous approaches were not optimized for such small boards.

I'm not sure what you mean here? The supervised learning and reinforcement learning approaches that we compared against were both trained on the small boards, i.e. the pattern weights were specifically optimised for that size of board. I agree that the handcrafted policy from Fuego was not optimised for small boards, which is why it performed poorly. But perhaps this is also interesting, i.e. it suggests that a handcrafted policy for 9x9 Go may be significantly suboptimal when used in 19x19 Go. So automatically learning a simulation policy may ultimately prove to be very beneficial.

> I'm looking forward to your results on larger boards.

Me too :-) Coming soon, I will let you know how it goes. But I hope that others will try these ideas too; it's always much better to compare multiple implementations of the same algorithm.
-Dave
[computer-go] Monte-Carlo Simulation Balancing
Hi Remi,

> If I understand correctly, your method makes your program 250 Elo points stronger than my pattern-learning algorithm on 5x5 and 6x6, by just learning better weights.

Yes, although this is just in a very simple MC setting. Also, we did not compare directly to the algorithm you used (minorisation-maximisation), but to a different algorithm for maximising the likelihood of an expert policy. But as the softmax policy is a log transformation of the generalised Bradley-Terry model, I think this comparison is still valid.

> I had stopped working on my program for a very long time, but maybe I will go back to work and try your algorithm.

I would be really interested to know how this works out! Please let me know.

Best wishes
-Dave
[computer-go] Monte-Carlo Simulation Balancing
Hi Michael,

> But one thing confuses me: you are using the value from Fuego's 10k simulations as an approximation of the actual value of the position. But isn't the actual value of the position either a win or a loss? On such small boards, can't you assume that Fuego is able to correctly determine who is winning, and round its evaluation to the nearest win/loss? i.e. if it evaluates the position to 0.674, that gets rounded to 1. If such an assumption about Fuego's ability to read the position on a small board is valid, then it should improve the results of your balanced simulation strategy, right? Or am I missing something?

It's true that 5x5 Go is solved, so in principle we could have used the true minimax values. However, we chose to use an approach that can scale to larger boards, which means that we should treat the expert evaluations as approximate. And in fact Fuego was not always accurate on 6x6 boards, as we used only 10k simulations in our training set.

Also, I think it really helps to have soft rather than hard expert evaluations. We want a simulation policy that helps differentiate e.g. a 90% winning position from an 85% winning position. Rounding all the expert evaluations to 0 or 1 would lose much of this advantage.

-Dave
[computer-go] Re: RAVE formula of David Silver (reposted)
> This document is confusing, but here is my interpretation of it. And it works well for Valkyria. I would really want to see a pseudocode version of it. I might post the code I use for Valkyria, but it is probably not the same thing, so I would probably just increase the confusion if I did... The virtual win-visits (which I think you meant and not 'win/loss') ratios *are* what is computed in Equation 12. Equation 13 is standard UCT. You use equation 14 instead of equation 13 to select the move to search. For moves that are searched a lot, Eq 14 will finally approach Eq 13, since Beta should go towards 0.

Thanks for clarifying this, Magnus. In fact, the first idea (combining the values of UCT and RAVE, eqns 1-11) is reasonably well justified, and works well in practice. The second idea (combining the upper confidence bounds, eqns 12-14) is not so well justified, and I believe several people have found better ways to do this.

> I think the term RAVE is often used in a confusing manner. Sometimes it just means AMAF or, as I prefer, virtual win-visit ratios, and sometimes RAVE seems to mean the algorithm that mixes the AMAF values with normal UCT values, as described in the PDF.

I'd like to suggest the following use of terminology:

AMAF: the idea of estimating the value of an immediate move from the value of that move played at any time.
RAVE: the idea of building a tree of AMAF values, one for each position and move in the tree.
UCT-RAVE: the idea of combining a RAVE value with a UCT value, for each position and move in the tree.

The way I think of it, RAVE is to AMAF as UCT is to UCB.

-Dave
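A minimal sketch of the UCT-RAVE idea in this terminology: each move keeps both UCT and RAVE (AMAF) statistics, and selection mixes them with a weight beta that decays as real visits accumulate, so heavily searched moves fall back to plain UCT. The statistics layout and the particular beta schedule below are illustrative assumptions, not necessarily those from the PDF:

```python
import math

def select_move_uct_rave(node_visits, move_stats, c=0.5, k=1000.0):
    """Pick a move by mixing UCT and RAVE values.

    move_stats maps move -> (n, wins, n_rave, wins_rave); this layout and
    the beta schedule below are illustrative, not the PDF's exact formula.
    beta decays toward 0 as the node's visit count grows, so the combined
    value approaches plain UCT for heavily searched nodes.
    """
    best_move, best_value = None, float("-inf")
    for move, (n, wins, n_rave, wins_rave) in move_stats.items():
        q_uct = wins / n if n else 0.5
        q_rave = wins_rave / n_rave if n_rave else 0.5
        beta = math.sqrt(k / (3 * node_visits + k))
        explore = c * math.sqrt(math.log(max(node_visits, 2)) / n) if n else float("inf")
        value = (1 - beta) * q_uct + beta * q_rave + explore
        if value > best_value:
            best_move, best_value = move, value
    return best_move

# A rarely visited move with a promising RAVE value can still be preferred:
stats = {"d3": (200, 120, 800, 430), "c4": (5, 2, 300, 190)}
print(select_move_uct_rave(1000, stats))
```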
[computer-go] Paper for AAAI
Hi Petr,

Thanks for the great comments, and sorry to be so slow in getting back to you (on vacation/workshop...).

> On Sun, Apr 06, 2008 at 08:55:26PM -0600, David Silver wrote:
>> Here is a draft of the paper, any feedback would be very welcome :-)
>> http://www.cs.ualberta.ca/~silver/research/publications/files/MoGoNectar.pdf
>
> You are saying that in minimax, opponent moves are selected by minimizing the lower confidence bound - this seems novel, is that so? I always got the impression that for the opponent moves, you reverse the mean value but still use UCB.

I think this just depends on how you formulate the problem. If you use minimax, then the opponent moves should minimise a lower confidence bound. If you use negamax, then the opponent moves should maximise an upper confidence bound.

> I see three unclear details about the RAVE algorithm: are only nodes in the tree considered for inclusion, or also moves in the following random playout?

Also moves in the following random playout. Will clarify, thanks.

> And one of the sentences hints that there is a separate period of playouts purely seeding the RAVE value before the UCT-RAVE linear combination takes over - is that so?

No, there is no separate period; it is always a linear combination (with decaying weight).

> And it does not follow from the paper that the UCB formula is used for the RAVE value as well, while the ICML paper states that.

Unfortunately, we did not have space for this discussion :-( In any case, later versions of MoGo used (still use?) an exploration constant of zero, which means that the UCB formula is no longer used.

> I am surprised at the bad effect of the grandfather heuristic and the good effect of the even-game heuristic. I assume that the effect of the heuristics should accumulate when several of them are combined in the prior value?

Yes, I agree it is surprising! Combining sources of prior knowledge may well be a good idea; I don't believe much work has been done on this topic.
> The paper looks very nice otherwise.

Thanks!
-Dave

PS: Apologies for the pdf problems that some people encountered; I'll post a new version on my website shortly.
[computer-go] Paper for AAAI
Hi everyone,

Sylvain and I have had a paper accepted for the Nectar track at the 23rd Conference on Artificial Intelligence (AAAI-08). The idea of this track is to summarise previously published results from a specific field for a wider audience interested in general AI. Please bear in mind the following:

1. There is no new material in this paper that we have not already published.
2. There is a 4-page limit, and so many details have been skipped over in this paper.
3. The learned heuristic, using local shape features, is not used in the competitive version of MoGo (including the MPI MoGo that played a professional player last week).

Several people on the mailing list expressed disappointment last year that they found our ICML paper inaccessible. This paper is intended to be read by a wider audience, and should hopefully be more accessible. Having said that, this is still an academic paper for a technical AI conference, and so some background knowledge is assumed :-)

Here is a draft of the paper; any feedback would be very welcome :-)
http://www.cs.ualberta.ca/~silver/research/publications/files/MoGoNectar.pdf

David and Sylvain
[computer-go] New UCT-RAVE formula (was Re: computer-go Digest, Vol 43, Issue 8)
> I am very confused about the new UCT-RAVE formula. Equation 9 seems to mean: variance_u = value_ur * (1 - value_ur) / n. Is it wrong? If correct, why is it the variance? I think that the variance of the UCT value should be: variance_u = value_u * (1 - value_u).

Hi Yamato,

There are two differences between your suggestion and the original formula, so I'll try to address both:

1. Your formula gives the variance of a single simulation, with probability value_u. But the more simulations you see, the more you reduce the uncertainty, so you must divide by n. In general, the variance of a single coin-flip (with probability p of heads) is p(1-p). The variance of the sum of n coin-flips is np(1-p). The variance of the average of n coin-flips is p(1-p)/n. This is what we want!

2. The variance of the estimate is actually given by: variance_u = true_value_u * (1 - true_value_u) / n, where true_value_u is the real probability of winning a simulation (for the current policy), if we had access to an oracle. Unfortunately, we don't - so we use the best available estimate. If we have seen a large number of simulations, then you are right that value_u is the best estimate. But if we have only seen a few simulations, then value_r gives a better estimate (this is the point of RAVE!) The whole point of this approach is to form the best possible estimate of true_value_u, by combining these two estimates together. In a way this is somewhat circular: we use the best estimate so far to compute the best new estimate. But I don't think that is unreasonable in this case.

-Dave
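The p(1-p)/n claim in point 1 can be checked empirically in a few lines (the values of p and n here are arbitrary):

```python
import random, statistics

random.seed(4)
p, n, trials = 0.3, 50, 20_000

# Measure the variance of the *average* of n coin-flips across many trials,
# and compare it with the p(1-p)/n formula.
averages = [sum(random.random() < p for _ in range(n)) / n for _ in range(trials)]

print(statistics.variance(averages))   # formula predicts 0.3 * 0.7 / 50 = 0.0042
```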
[computer-go] New UCT-RAVE formula (was Re: computer-go Digest, Vol 43, Issue 8)
Hi Erik,

Thanks for the thought-provoking response!

> Yes, but why add upper confidence bounds to the RAVE values at all? If they really go down that fast, does it make much of a difference?

According to the recent experiments in MoGo, you are right :-) However, I've seen slightly different results in other strong UCT-RAVE programs. I think you are right that the exploration interacts heavily with the playout strategy. If the playouts heavily favour a particular response (e.g. black always connects at d3), then using a UCB on RAVE is one way to ensure that the opposite result is also tried (e.g. what happens if white cuts at d3?).

> I did of course test this, and my program actually became weaker when I added UCBs to the RAVE values!

Interesting! Did you use prior knowledge in your RAVE values? (See below.)

> The way I originally understood it, the purpose of RAVE was to play stronger when the number of simulations is low. This can already be achieved by adding a greedy component to emphasize moves that are likely to be winning based on their RAVE value alone.

Agreed, this is certainly the most important aspect of RAVE.

> The purpose of UCB in UCT is clear (to ensure sufficient exploration when the tree grows large). UCB in RAVE does not really do the same.

Agreed.

> I think its two main effects are: (1) it emphasizes the inverse order in which the moves were added,

Not sure what you mean by this.

> and (2) it emphasizes exploration of moves that are infrequently selected in the playouts.

Agreed.

> I think neither of them is a good thing. The better the move ordering or the knowledge in the playouts, the more it hurts...

Not if you also encode your playout knowledge as prior knowledge for the RAVE algorithm. This way, you can specify your confidence in the choices made by the playouts. The UCB is always relative to this confidence level. Having said that, there may already be sufficient exploration without the UCB bonuses.
Perhaps all we can say is that RAVE, prior knowledge and exploration all interact heavily, and that the best level of exploration depends on the exact choices made.

-Dave
[computer-go] New UCT-RAVE formula (was Re: computer-go Digest, Vol 43, Issue 8)
> David Silver wrote:
>> BTW, if anyone just wants the formula and doesn't care about the derivation - then just use equations 11-14.
>
> Yes, I just want to use the formula. But I don't know what the bias is... How can I get the value of br?

Sorry for the slow reply... The simplest answer is that the bias is just a constant, estimating how wrong the RAVE estimates typically are. One way to get this number is trial and error. Another way would be to look at the average absolute difference between the RAVE and UCT estimates in nodes with many simulations.

-Dave
[computer-go] Re: Explanation to MoGo paper wanted.
> In other words, UCT works well when the evaluation/playouts are strong. I believe there are still improvements possible to the UCT algorithm, as shown by the recent papers by the MoGo and CrazyStone authors, but what will really make a difference is the quality of the playouts. Sylvain said that good moves in the playouts do not always improve the performance of UCT. What do you think about this claim?

> I believe this claim is true in two senses: 1) If the computation necessary to find better moves is too expensive, performing many dumb playouts may be a better investment.

Sure, this is true. But even with the same number of simulations, stronger playouts do not necessarily perform better than dumb playouts. This is the real mystery!

> 2) If the playouts are too deterministic, and the moves are merely pretty good, the program may avoid an important move and thus misjudge the value of a position. We tried the whole spectrum from completely random to completely deterministic playouts, but we never came close to the performance of the dumb playouts!

We have seen a similar effect many times in MoGo. Often we try something that seems like it should improve the quality of the simulation player, but it makes the overall performance worse. It is frustrating and surprising! Has anyone else encountered this? -Dave
[computer-go] Re: Explanation to MoGo paper wanted.
> Seems like it should be up to the person in the other environment to adapt your successful algorithm (and notation/terminology) to their environment.

But how do the other people in other environments find out about the algorithm? And find out that it is something they could use in their own environment? I think we can help with both, by presenting our work in a more general way. -Dave
[computer-go] Re: Explanation to MoGo paper wanted. (BackGammon Code)
>> It's because Go is not the only game in the world, and certainly not the only reinforcement learning problem. They are using a widely accepted terminology.

> But a very inappropriate one. I have read Sutton's book, and all the things I know (e.g. TD-Gammon) are completely obfuscated.

Really? I think it is a wonderful example of clear thinking and clear writing - I couldn't put the book down. It is the reason I chose to study RL, and to come study with Rich Sutton.

> It's maybe suitable to present general concepts, but it is extremely complicated to formulate an algorithm in this framework.

Of course everyone hopes that ideas will be presented to them in their personal terminology, as it saves them some effort. But we make progress in science by unifying and identifying the general concepts.

> But the main point is: I think game programmers should be more proud of their work and should present their results in the language of game programming. We are the ones who make progress, not these paper tigers.

Isn't there room for both? Shouldn't we present our work within our own community, but also make efforts to share our ideas with others? -Dave
[computer-go] Re: Explanation to MoGo paper wanted. (BackGammon Code)
>> It's because Go is not the only game in the world, and certainly not the only reinforcement learning problem. They are using a widely accepted terminology.

> But a very inappropriate one. I have read Sutton's book, and all the things I know (e.g. TD-Gammon) are completely obfuscated. It's maybe suitable to present general concepts, but it is extremely complicated to formulate an algorithm in this framework.

Here is a quick and dirty RL-to-Computer-Go translation kit to try and help bridge the gap!

RL terminology                   Go terminology
--------------                   --------------
State                            Position
Action                           Move
Reward                           Win/Loss
Return                           Win/Loss
Episode                          Game
Time-step                        One move
Agent                            Program
Value function                   Evaluation function
Policy                           Player
Default policy                   Simulation player
Uniform random policy            Light simulation player
Other stochastic policy          Heavy simulation player
Greedy policy                    1-ply search player
Epsilon-greedy policy            1-ply search player with some random moves
Feature                          Factor used for position evaluation
Weight                           Weight of each factor in evaluation function
Tabular representation           One weight for each complete position
Partial tabular representation   UCT tree
State abstraction                One weight for many positions
Linear value function            Evaluation function approximation using weighted sum of various factors
Feature discovery                Learning new factors for the evaluation function
Sample-based search              Simulation (Monte-Carlo methods, etc.)
Transition function              Rules of the game
Environment                      Rules of the game + opponent
Trajectory                       Move sequence
Online                           During actual play
Offline                          Before/after actual play (e.g. preprocessing)
On-policy                        If both players play as normal
Off-policy                       If either player behaves differently

-Dave
[computer-go] Re: Amsterdam 2007 paper
On 5/18/07, Rémi Coulom [EMAIL PROTECTED] wrote:
>> My idea was very similar to what you describe. The program built a collection of rules of the kind "if condition, then move". The condition could be anything from a tree-search rule of the kind "in this particular position, play x", to a general rule such as "in atari, extend". It could also be anything in between, such as a miai specific to the current position. The strengths of moves were updated with an incremental Elo-rating algorithm, from the outcomes of random simulations.

> The obvious way to update weights is to reward all the rules that fired for the winning side, and penalize all the rules that fired for the losing side, with rewards and penalties decaying toward the end of the playout. But this is not quite Elo-like, since it doesn't consider rules beating each other. So one could make the reward depend on the relative weight of the chosen rule versus all alternatives, increasing the reward if the alternatives carried a lot of weight. Is that how your ratings worked? I'm not sure how that compares with TD learning. Maybe someone more familiar with the latter can point out the differences.

TD learning (with linear function approximation) uses a gradient descent rule to update weights. The simplest gradient descent rule, LMS or Widrow-Hoff, does something like you describe: rules that are followed by positive reward (win) are increased in weight, and rules that are followed by negative reward (loss) are decreased. The exact update depends on the set of rules firing, and is proportional to the error between the estimated reward (based on all rules) and the actual reward. In other words, each weight is updated a little towards the value which would have made a correct overall prediction. TD learning is similar, except that it updates weights towards a subsequent prediction of the reward (e.g. on the next move), instead of the actual reward.
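The difference between the two updates described above can be sketched in a few lines for a linear value function over binary features ("rules that fired"). The dictionary-of-weights representation and function names here are illustrative assumptions, not anyone's actual implementation.

```python
def lms_update(weights, features, reward, alpha=0.01):
    """LMS / Widrow-Hoff: move each active weight toward the actual reward.

    weights: dict feature -> weight; features: the rules that fired;
    reward: e.g. 1.0 for a win, 0.0 for a loss.
    """
    prediction = sum(weights.get(f, 0.0) for f in features)
    error = reward - prediction  # proportional to overall prediction error
    for f in features:
        weights[f] = weights.get(f, 0.0) + alpha * error
    return weights

def td_update(weights, features, next_features, alpha=0.01):
    """TD(0): same rule, but the target is the *next* prediction
    (e.g. after the next move) rather than the final reward."""
    prediction = sum(weights.get(f, 0.0) for f in features)
    next_prediction = sum(weights.get(f, 0.0) for f in next_features)
    error = next_prediction - prediction
    for f in features:
        weights[f] = weights.get(f, 0.0) + alpha * error
    return weights
```

Note that both rules spread the same scalar error across all active features, which is the sense in which the exact update "depends on the set of rules firing".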
Rich Sutton gives a much better explanation than me: http://www.cs.ualberta.ca/%7Esutton/book/ebook/the-book.html
[computer-go] Re: computer-go Digest, Vol 34, Issue 15
Very interesting paper! I have one question. The assumption in your paper is that increasing the performance of the simulation player will increase the performance of Monte-Carlo methods that use that simulation player. However, we found in MoGo that this is not necessarily the case! Do you think there is some property of your learning algorithm that makes it particularly suitable for Monte-Carlo methods? Thanks! Dave

On Thu, 17 May 2007, Rémi Coulom wrote:
> Hi, I first thought I would keep my ideas secret until the Amsterdam tournament, but now that I have submitted my paper, I cannot wait to share it. So, here it is: http://remi.coulom.free.fr/Amsterdam2007/ Comments and questions are very welcome. Rémi
[computer-go] Re: Amsterdam 2007 paper
I also use an online learning algorithm in RLGO to adjust feature weights during the game. I use around a million features (all possible patterns from 1x1 up to 3x3 at all locations on the board) and update the weights online from simulated games using temporal difference learning. I also use the sum of feature weights to estimate the value of a move, rather than a multiplicative estimate. The learning signal is only win/lose at the end of each simulation, rather than a supervised signal as in Rémi's approach. The results are encouraging (currently ~1820 Elo on CGOS, based on 5000 simulations per move) for a program that does not use UCT or Monte-Carlo Tree Search in any way. Like Álvaro and Rémi, I am convinced that online learning during simulation has great potential. In fact, I would go even further, and suggest that Monte-Carlo Tree Search and UCT are at one end of a large spectrum of promising sample-based search algorithms. We don't need to restrict ourselves to a tabular representation: state abstraction (for example, using pattern features) provides a way to generalise between positions and learn more efficiently. Nor do we need to restrict our learning algorithm to Monte-Carlo: any online reinforcement learning algorithm can be applied to learn from simulated experience. In classical game programming terms, we can think of this approach as dynamically learning an evaluation function. Instead of learning a single evaluation function that applies to all positions, we tailor the evaluation function to the current position, by learning from simulated experience. Rather than thinking of learning as an offline procedure that takes months to train weights, let's think of it as an online procedure that can run in a few seconds. Once we learn the evaluation function, it can be used in any way we please - for example alpha-beta search (RLGO 2.1 uses a full width 5-ply search) or as a heuristic function for UCT (current research in MoGo).
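The idea above can be sketched as follows: evaluate a position as the sum of the weights of its active pattern features (squashed through a logistic), and update the weights by TD from nothing but the win/loss outcome of each simulated game. This is a minimal illustration under assumed names and constants, not RLGO's actual code.

```python
import math

def evaluate(position_features, weights):
    """Value of a position: logistic of the summed weights of its
    active pattern features (e.g. all 1x1..3x3 patterns present)."""
    s = sum(weights.get(f, 0.0) for f in position_features)
    return 1.0 / (1.0 + math.exp(-s))

def td_train_from_simulation(episode, weights, alpha=0.1):
    """One TD(0) backup per position of a simulated game.

    episode: (states, outcome) where states is a list of feature sets,
    one per position, and outcome is 1.0 (win) or 0.0 (loss) - the only
    learning signal, as described in the post.
    """
    states, outcome = episode
    for t in range(len(states)):
        v = evaluate(states[t], weights)
        # Target: the next position's value, or the final result at the end.
        target = outcome if t == len(states) - 1 else evaluate(states[t + 1], weights)
        delta = target - v
        grad = v * (1.0 - v)  # derivative of the logistic squash
        for f in states[t]:
            weights[f] = weights.get(f, 0.0) + alpha * delta * grad
    return weights
```

Once trained from a batch of simulations, `evaluate` can be plugged into any search - alpha-beta or as a UCT heuristic - which is the point of the "spectrum" argument.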
I would be very interested to hear more about Dimwit's approach, and also Rémi's experiments with online learning in CrazyStone. -Dave

PS Apologies if you receive this twice, I had some difficulties posting.

Álvaro wrote:
> Rémi: John and I are planning a similar usage of patterns and features in the MC simulations, although we were thinking of a very different method to update strengths. Although we derived our formulas from an extension of logistic regression instead of Bradley-Terry models, we arrived at very similar expressions. The main difference is that what we have been calling "score" seems to be the log of your strength, and instead of multiplying strengths when considering teams, we just add the scores of the different features. We use an online learning algorithm to adjust the scores during the game, which severely limits the kind of algorithms we can use, but also allows us to learn things that apply specifically to the situations appearing on the board in this game. Now we only have to make dimwit 400 points stronger to show the world that our approach works. :) There are many things in the paper that we had never thought of, like considering the distance to the penultimate move. Most of those things won't be directly applicable to dimwit, but it gives us many suggestions of things to try. Thanks for this important contribution to computer go. Álvaro.
[computer-go] Re: Amsterdam 2007 paper
> Thanks for the great paper. And thanks for sharing it before it's published. Now I know what directions to take my engine in next. Time for Team MoGo to share some more secrets :)

We are publishing MoGo's secrets at ICML 2007, in just over a month. So not long to wait now! -Dave