On Jan 23, 2008 3:11 AM, Hiroshi Yamashita <[EMAIL PROTECTED]> wrote:
> 2. Change UCT exploration parameter
>   exp_param = sqrt(2.0);
>   uct = exp_param * sqrt( log(sum of all children playout) /
>                           (number of child playout) );
>   uct_value = (child winning rate) + uct;
>
> I changed exp_param sqrt(2.0) to sqrt(0.1).

2.0 to 0.1? That's a pretty big step. I did notice that libego worked
better with lower constants than the formula recommends: in libego, the
original UCB1 formula corresponds to an explore_rate coefficient of 8,
but the default explore_rate coefficient of 1 seemed to be about the
best.

I am curious whether any of you who have heavy-playout programs would
find a benefit from the following modification:

>   exp_param = sqrt(0.2);  // sqrt(2) times the original parameter value.
>   uct = exp_param * sqrt( log(sum of all children playout)
>                           * (child-win-rate-2) /
>                           (number of child playout) );
>   uct_value = (child winning rate) + uct;

where child-win-rate-2 is defined as (#wins + 1) / (#wins + #losses + 2).
If you already do the equivalent of initializing #wins to 1 and #losses
to 1 (in libego, setting the initial bias to 2), then you can just use
your original (child winning rate) value.

(W+1)/(W+L+2) is the mean of a beta(W+1,L+1) random variable, which is
what you get when you start with a uniform(0,1) distribution and
condition it on the observation of a W-L record. The explore parameter
inside the square root is doubled (0.1 to 0.2) so that when
child-win-rate-2 has an average value of 0.5, the formula returns the
same value as before. (Using an initial bias of 2.0 seems like a good
thing anyhow.)

Adding this extra term seemed to help a bit (57% +/- 4.5% win rate over
the unmodified program) when the basic playouts were uniform. This
change tends to make the formula stick with proven winners more than
before, so you might need to increase the explore rate by a little more
than sqrt(2). Your mileage may vary. I'd also like to know if you try
this change and find it unhelpful.

The justification is that the standard deviation of a beta(1,21)
variable is lower than the standard deviation of a beta(11,11) variable.
The error term of a beta(21,1) is lower too, but applying the
(L+1)/(W+L+2) term to the UCB1 formula can yield nonsensical results in
which, when two moves have the same bias, the one with the worse
win-loss record is assigned the higher UCB1 value.
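
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the formulas discussed above. It is not taken from libego or from any actual program. The names uct_value, beta_std and the term argument are made up for illustration, "sum of all children playout" is passed in as parent_playouts, and "same bias" is assumed to mean the same number of playouts. The exp_param of 2.0 used in the comparison below is likewise just an illustrative choice.

```python
import math


def beta_std(a, b):
    """Standard deviation of a beta(a, b) random variable."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


def uct_value(wins, losses, parent_playouts, exp_param, term="none"):
    """UCT value of a child with a wins-losses record (hypothetical helper).

    term="none": plain formula from the quoted message.
    term="win":  exploration scaled by (W+1)/(W+L+2), i.e. child-win-rate-2.
    term="loss": exploration scaled by (L+1)/(W+L+2), the variant warned about.
    """
    n = wins + losses  # number of child playouts, assumed to be > 0
    if term == "win":
        scale = (wins + 1) / (n + 2)
    elif term == "loss":
        scale = (losses + 1) / (n + 2)
    else:
        scale = 1.0
    explore = exp_param * math.sqrt(math.log(parent_playouts) * scale / n)
    return wins / n + explore
```

With a 10-10 record, child-win-rate-2 is exactly 0.5, so uct_value(10, 10, 200, math.sqrt(0.2), "win") matches uct_value(10, 10, 200, math.sqrt(0.1)) up to rounding. beta_std(1, 21) and beta_std(21, 1) are equal, and both are smaller than beta_std(11, 11). With the "loss" term and exp_param = 2.0, a 19-1 move scores higher than a 20-0 move at 200 parent playouts, which is the kind of inversion described above; with the "win" term the 20-0 move stays ahead. Whether the inversion shows up depends on the explore rate and the playout counts.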