Re: [computer-go] Monte-Carlo Simulation Balancing
David Silver wrote:
> Yes, in our experiments they were just constant numbers M=N=100.

If M and N are the same, is there any reason to run M simulations and N
simulations separately? What happens if you combine them and calculate V
and g in a single loop?

> Okay, let's continue the example above. Let's say that in position s,
> using the current theta, moves a, b and c will be selected with
> probability 0.5, 0.3 and 0.2 respectively. Let's say that move a was
> actually selected. Now consider pattern 1: it is matched after a, but
> the probability of matching was (0.5*1 + 0.3*1 + 0.2*0) = 0.8. So
> psi_1(s,a) = 1 - 0.8 = 0.2. For pattern 2,
> psi_2(s,a) = 1 - (0.5*1 + 0.3*0 + 0.2*1) = 0.3, etc. So in this example
> the vector psi(s,a) = [0.2, 0.3, -0.3, -0.2].
>
> In other words, psi tells us whether each pattern was actually matched
> more or less than we could have expected.

I understood what psi was. I am not sure how it works, but anyway I can
see your algorithm now. Thanks.

--
Yamato
Re: [computer-go] Monte-Carlo Simulation Balancing
IMO other people's equations/code/ideas/papers always seem smarter than
your own. The stuff you understand and do yourself just seems like common
sense, and the stuff you don't always has a mystical air of complexity, at
least until you understand it too :-)

On 30-Apr-09, at 1:59 PM, Michael Williams wrote:
> I wish I was smart :(
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Yamato,

> Thanks for the detailed explanation.
> M, N and alpha are constant numbers, right? What did you set them to?

You're welcome! Yes, in our experiments they were just constant numbers
M=N=100.

>> The feature vector is the set of patterns you use, with value 1 if a
>> pattern is matched and 0 otherwise. The simulation policy selects
>> actions in proportion to the exponentiated, weighted sum of all
>> matching patterns. For example, let's say move a matches patterns 1
>> and 2, move b matches patterns 1 and 3, and move c matches patterns 2
>> and 4. Then move a would be selected with probability
>> e^(theta1 + theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) +
>> e^(theta2 + theta4)). The theta values are the weights on the patterns
>> which we would like to learn. They are the log of the Elo ratings in
>> Remi Coulom's approach.
>
> OK, I guess it is the formula 5 in the paper.

Yes, exactly.

>> The only tricky part is computing the vector psi(s,a). Each component
>> of psi(s,a) corresponds to a particular pattern, and is the difference
>> between the observed feature (i.e. whether the pattern actually
>> occurred after move a in position s) and the expected feature (the
>> average value of the pattern, weighted by the probability of selecting
>> each action).
>
> I still don't understand this. Is it the formula 6? Could you please
> give me an example like the above?

Yes, that's right, this is equation 6. Okay, let's continue the example
above. Let's say that in position s, using the current theta, moves a, b
and c will be selected with probability 0.5, 0.3 and 0.2 respectively.
Let's say that move a was actually selected. Now consider pattern 1: it is
matched after a, but the probability of matching was
(0.5*1 + 0.3*1 + 0.2*0) = 0.8. So psi_1(s,a) = 1 - 0.8 = 0.2. For
pattern 2, psi_2(s,a) = 1 - (0.5*1 + 0.3*0 + 0.2*1) = 0.3, etc. So in this
example the vector psi(s,a) = [0.2, 0.3, -0.3, -0.2].

In other words, psi tells us whether each pattern was actually matched
more or less than we could have expected.

Hope this helps.
-Dave
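To make the two formulas concrete, here is a toy Python sketch of the
calculation described above. It is illustrative only, not code from the
paper: the pattern layout, the theta values and the array representation
are all invented for the example.

    import numpy as np

    # phi[m] is the binary pattern vector of move m in position s:
    # move a matches patterns 1 and 2, b matches 1 and 3, c matches 2 and 4.
    phi = np.array([[1., 1., 0., 0.],   # move a
                    [1., 0., 1., 0.],   # move b
                    [0., 1., 0., 1.]])  # move c

    def policy(theta):
        """Formula 5: pi(a|s) proportional to exp(theta . phi(s,a))."""
        scores = phi @ theta
        p = np.exp(scores - scores.max())
        return p / p.sum()

    def psi(pi, chosen):
        """Formula 6: observed features minus their policy-weighted mean."""
        return phi[chosen] - pi @ phi

    print(policy(np.array([0.2, -0.1, 0.4, 0.0])))   # made-up theta values

    # Dave's worked example: selection probabilities 0.5/0.3/0.2, move a played.
    print(psi(np.array([0.5, 0.3, 0.2]), chosen=0))  # -> [ 0.2  0.3 -0.3 -0.2]

Running the last line reproduces the psi(s,a) = [0.2, 0.3, -0.3, -0.2]
vector from the worked example.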
Re: [computer-go] Monte-Carlo Simulation Balancing
I wish I was smart :(

David Silver wrote:
> Hi Remi,
>
>> I understood this. What I find strange is that using -1/1 should be
>> equivalent to using 0/1, but your algorithm behaves differently: it
>> ignores lost games with 0/1, and uses them with -1/1.
>> Imagine you add a big constant to z. One million, say. This does not
>> change the problem. You get either 1,000,000 or 1,000,001 as the
>> outcome of a playout. But then, your estimate of the gradient becomes
>> complete noise. So maybe using -1/1 is better than 0/1?
>> Since your algorithm depends so much on the definition of the reward,
>> there must be an optimal way to set the reward. Or there must be a
>> better way to define an algorithm that would not depend on an offset
>> in the reward. There is still something wrong that I don't understand.
>> There may be a way to quantify the amount of noise in the unbiased
>> gradient estimate, and it would depend on the average reward. Probably
>> setting the average reward to zero is what would minimize noise in the
>> gradient estimate. This is just an intuitive guess.
>
> Okay, now I understand your point :-) It's a good question - and I
> think you're right. In REINFORCE any baseline can be subtracted from
> the reward without affecting the expected gradient, but possibly
> reducing its variance. The baseline leading to the best estimate is
> indeed the average reward. So it should be the case that {-1,+1} would
> estimate the gradient g more efficiently than {0,1}, assuming that we
> see similar numbers of black wins and white wins across the training
> set.
>
> So to answer your question, we can safely modify the algorithm to use
> (z-b) instead of z, where b is the average reward. This would then make
> the {0,1} and {-1,+1} cases equivalent (with appropriate scaling of the
> step-size). I don't think this would have affected the results we
> presented (because all of the learning algorithms converged anyway, at
> least approximately, during training), but it could be an important
> modification for larger boards.
>
> -Dave
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Remi,

> I understood this. What I find strange is that using -1/1 should be
> equivalent to using 0/1, but your algorithm behaves differently: it
> ignores lost games with 0/1, and uses them with -1/1.
> Imagine you add a big constant to z. One million, say. This does not
> change the problem. You get either 1,000,000 or 1,000,001 as the
> outcome of a playout. But then, your estimate of the gradient becomes
> complete noise. So maybe using -1/1 is better than 0/1?
> Since your algorithm depends so much on the definition of the reward,
> there must be an optimal way to set the reward. Or there must be a
> better way to define an algorithm that would not depend on an offset
> in the reward. There is still something wrong that I don't understand.
> There may be a way to quantify the amount of noise in the unbiased
> gradient estimate, and it would depend on the average reward. Probably
> setting the average reward to zero is what would minimize noise in the
> gradient estimate. This is just an intuitive guess.

Okay, now I understand your point :-) It's a good question - and I think
you're right. In REINFORCE any baseline can be subtracted from the reward
without affecting the expected gradient, but possibly reducing its
variance. The baseline leading to the best estimate is indeed the average
reward. So it should be the case that {-1,+1} would estimate the gradient
g more efficiently than {0,1}, assuming that we see similar numbers of
black wins and white wins across the training set.

So to answer your question, we can safely modify the algorithm to use
(z-b) instead of z, where b is the average reward. This would then make
the {0,1} and {-1,+1} cases equivalent (with appropriate scaling of the
step-size). I don't think this would have affected the results we
presented (because all of the learning algorithms converged anyway, at
least approximately, during training), but it could be an important
modification for larger boards.

-Dave
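As a small illustration of the (z - b) modification, here is a sketch on
synthetic placeholder data. The per-playout score sums and outcomes below
are random arrays standing in for real playouts; only the shape of the
estimator is the point.

    import numpy as np

    rng = np.random.default_rng(0)
    n_playouts, n_features = 100, 4
    # Stand-ins for sum_t psi(s_t, a_t) of each playout and its outcome z.
    score_sums = rng.normal(size=(n_playouts, n_features))
    z = rng.integers(0, 2, size=n_playouts).astype(float)

    b = z.mean()                                               # baseline = average reward
    g_plain    = (z[:, None] * score_sums).mean(axis=0)        # raw estimate
    g_baseline = ((z - b)[:, None] * score_sums).mean(axis=0)  # baselined estimate

    # Under the true playout distribution the expected score vector is zero,
    # so subtracting b changes nothing in expectation; it only reduces the
    # variance of the estimate.
    print(g_plain)
    print(g_baseline)

With b = 1/2 for {0,1} outcomes, the baselined estimate is (up to a factor
of two) the same estimate that {-1,+1} outcomes would give.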
Re: [computer-go] Monte-Carlo Simulation Balancing
David Silver wrote:
> Sorry, I should have made it clear that this assumes that we are
> treating black wins as z=1 and white wins as z=0. In this special case,
> the gradient is the average over games in which black won. But yes,
> more generally you need to include games won by both sides. The
> algorithms in the paper still cover this case - I was just trying to
> simplify their description to make it easy to understand the ideas.

I understood this. What I find strange is that using -1/1 should be
equivalent to using 0/1, but your algorithm behaves differently: it
ignores lost games with 0/1, and uses them with -1/1.

Imagine you add a big constant to z. One million, say. This does not
change the problem. You get either 1,000,000 or 1,000,001 as the outcome
of a playout. But then, your estimate of the gradient becomes complete
noise. So maybe using -1/1 is better than 0/1?

Since your algorithm depends so much on the definition of the reward,
there must be an optimal way to set the reward. Or there must be a better
way to define an algorithm that would not depend on an offset in the
reward.

> The gradient already compensates for the playout policy (equation 9),
> so sampling uniformly at random would in fact bias the gradient!

Yes, you are right. There is still something wrong that I don't
understand. There may be a way to quantify the amount of noise in the
unbiased gradient estimate, and it would depend on the average reward.
Probably setting the average reward to zero is what would minimize noise
in the gradient estimate. This is just an intuitive guess.

Rémi
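The offset argument can be checked numerically with a toy model. Everything
below is synthetic, not the paper's setup: a single scalar "score" per
playout stands in for the summed score vector, and the outcome z is {0,1}.

    import numpy as np

    rng = np.random.default_rng(1)
    scores = rng.normal(size=1000)
    z = rng.integers(0, 2, size=1000).astype(float)

    for offset in (0.0, -0.5, 1e6):      # -0.5 centres the {0,1} rewards at zero
        # 10 small-sample gradient estimates of 100 playouts each.
        batches = ((z + offset) * scores).reshape(10, 100)
        spread = batches.mean(axis=1).std()
        print(f"offset {offset:>10}: spread of estimates = {spread:.4f}")

The spread of the estimates grows roughly in proportion to the offset,
which is the "complete noise" effect described above, and is smallest when
the average reward is near zero.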
Re: [computer-go] Monte-Carlo Simulation Balancing
Hi Remi,

> This is strange: you do not take lost playouts into consideration. I
> believe there is a problem with your estimation of the gradient.
> Suppose for instance that you count z = +1 for a win, and z = -1 for a
> loss. Then you would take lost playouts into consideration. This makes
> me a little suspicious.

Sorry, I should have made it clear that this assumes that we are treating
black wins as z=1 and white wins as z=0. In this special case, the
gradient is the average over games in which black won. But yes, more
generally you need to include games won by both sides. The algorithms in
the paper still cover this case - I was just trying to simplify their
description to make it easy to understand the ideas.

> The fundamental problem here may be that your estimate of the gradient
> is biased by the playout policy. You should probably sample X(s)
> uniformly at random to have an unbiased estimator. Maybe this can be
> fixed with importance sampling, and then you may get a formula that is
> symmetrical regarding wins and losses. I don't have time to do it now,
> but it may be worth taking a look.

The gradient already compensates for the playout policy (equation 9), so
sampling uniformly at random would in fact bias the gradient! In
equation 9, the gradient is taken with respect to the playout policy
parameters. Using the product rule (third line), the gradient is equal to
the playout policy probabilities, multiplied by the sum of the likelihood
ratios, multiplied by the simulation outcomes z. This gradient can be
computed by sampling playouts instead of multiplying by the playout policy
probabilities. This is also why games with outcomes of z=0 can be ignored
- they don't affect this gradient computation.

> More precisely: you should estimate the value of N playouts as
> Sum p_i z_i / Sum p_i instead of Sum z_i. Then, take the gradient of
> Sum p_i z_i / Sum p_i. This would be better.
> Maybe Sum p_i z_i / Sum p_i would be better for MCTS, too?

I think a similar point applies here. We care about the expected value of
the playout policy, which can be sampled directly from playouts, instead
of multiplying by the policy probabilities. You would only need importance
sampling if you were using a different playout policy to the one which you
are evaluating. But I guess I'm not seeing any good reason to do this?

-Dave
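In generic policy-gradient notation (the symbols here are not necessarily
the paper's), the identity being described is the standard likelihood-ratio
trick:

    \nabla_\theta V_\theta(s)
      = \nabla_\theta \sum_{\xi} P_\theta(\xi \mid s)\, z(\xi)
      = \sum_{\xi} P_\theta(\xi \mid s)
        \Big( \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big) z(\xi)
      = \mathbb{E}_{\xi \sim P_\theta}
        \Big[ z(\xi) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]

where \xi ranges over complete playouts from s. Because the expectation is
already taken with respect to the playout policy, it can be estimated by
averaging over sampled playouts with no importance correction, and any
playout with z = 0 contributes nothing to the average. For the softmax
policy of formula 5, \nabla_\theta \log \pi_\theta(a \mid s) is exactly the
psi(s,a) vector from the earlier example, so each sampled term is just the
playout's summed psi vector weighted by its outcome.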
[computer-go] Today's "Guardian"
Today's "Guardian" newspaper has, on the front page of its technology supplement, an article about recent developments in computer Go. You can read it at http://www.guardian.co.uk/technology/2009/apr/30/games-software-mogo Unlike the last newspaper article on computer Go that I saw, this one is intelligent and well-informed. Nick -- Nick Weddn...@maproom.co.uk ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Monte-Carlo Simulation Balancing
Rémi Coulom wrote:
> The fundamental problem here may be that your estimate of the gradient
> is biased by the playout policy. You should probably sample X(s)
> uniformly at random to have an unbiased estimator. Maybe this can be
> fixed with importance sampling, and then you may get a formula that is
> symmetrical regarding wins and losses. I don't have time to do it now,
> but it may be worth taking a look.
>
> Rémi

More precisely: you should estimate the value of N playouts as
Sum p_i z_i / Sum p_i instead of Sum z_i. Then, take the gradient of
Sum p_i z_i / Sum p_i. This would be better.

Maybe Sum p_i z_i / Sum p_i would be better for MCTS, too?

Rémi
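One possible reading of this suggestion, sketched in Python; the
assumptions are mine, not Rémi's. Suppose the playouts are drawn uniformly
at random and p_i is the probability the current playout policy would
assign to playout i; the value is then estimated by the self-normalised
weighted average Sum p_i z_i / Sum p_i rather than the plain mean of the
z_i.

    import numpy as np

    rng = np.random.default_rng(2)
    p = rng.random(1000)                             # placeholder playout probabilities
    z = rng.integers(0, 2, size=1000).astype(float)  # playout outcomes

    v_plain    = z.mean()                 # unweighted mean of the outcomes
    v_weighted = (p * z).sum() / p.sum()  # Sum p_i z_i / Sum p_i
    print(v_plain, v_weighted)

The weighted form is a standard self-normalised importance-sampling
estimate, which is one way to make the formula symmetrical in wins and
losses.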
Re: [computer-go] Monte-Carlo Simulation Balancing
David Silver wrote:
> 2. Run another N simulations, average the value of psi(s,a) over all
> positions and moves in games that black won (call this g)

This is strange: you do not take lost playouts into consideration. I
believe there is a problem with your estimation of the gradient. Suppose
for instance that you count z = +1 for a win, and z = -1 for a loss. Then
you would take lost playouts into consideration. This makes me a little
suspicious.

The fundamental problem here may be that your estimate of the gradient is
biased by the playout policy. You should probably sample X(s) uniformly at
random to have an unbiased estimator. Maybe this can be fixed with
importance sampling, and then you may get a formula that is symmetrical
regarding wins and losses. I don't have time to do it now, but it may be
worth taking a look.

Rémi