On Jan 23, 2008 3:11 AM, Hiroshi Yamashita <[EMAIL PROTECTED]> wrote:
> 2. Change UCT exploration parameter
>   exp_param = sqrt(2.0);
>   uct = exp_param * sqrt( log(sum of all children playout) /
>                         (number of child playout) );
>   uct_value = (child winning rate) + uct;
>
>  I changed exp_param sqrt(2.0) to sqrt(0.1).

2.0 to 0.1? That's a pretty big step. I did notice that libego worked
better with lower constants than the formula recommends: in libego,
the original UCB1 formula corresponds to an explore_rate coefficient
of 8, but the default explore_rate coefficient of 1 seemed to be about
the best.

I am curious if any of those of you who have heavy-playout programs
would find a benefit from the following modification:

>   exp_param = sqrt(0.2); // sqrt(2) times the original parameter value.
>   uct = exp_param * sqrt( log(sum of all children playout)
>                           * (child-win-rate-2) /
>                         (number of child playout) );
>   uct_value = (child winning rate) + uct;

where child-win-rate-2 is defined as

(#wins + 1) / (#wins + #losses + 2)

If it happens that you do the equivalent of initializing #wins to 1
and #losses to 1 (in libego, setting the initial bias to 2), then you
can just use your original (child winning rate) value. (W+1)/(W+L+2)
is the mean of a beta(W+1,L+1) random variable, which is what you get
when you start with a uniform(0,1) prior on the win rate and condition
it on the observation of a W-L record. The value inside the square
root of the explore parameter is doubled (0.1 becomes 0.2) so that
when child-win-rate-2 is 0.5, the formula returns the same value as
before. (Using an initial bias of 2.0 seems like a good thing anyhow.)
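
To make the comparison concrete, here is a minimal Python sketch of the
two formulas side by side. The function names and the example counts are
made up for illustration, and the defaults sqrt(0.1) and sqrt(0.2) are
simply the values quoted above; this is not libego code.

```python
import math


def child_win_rate_2(wins, losses):
    # Mean of a beta(wins + 1, losses + 1) variable: (W+1)/(W+L+2).
    return (wins + 1) / (wins + losses + 2)


def uct_value_original(wins, losses, parent_playouts, exp_param=math.sqrt(0.1)):
    # (child winning rate) + exp_param * sqrt(log(parent) / child playouts)
    n = wins + losses
    return wins / n + exp_param * math.sqrt(math.log(parent_playouts) / n)


def uct_value_modified(wins, losses, parent_playouts, exp_param=math.sqrt(0.2)):
    # Same as above, with child-win-rate-2 multiplied in under the square root.
    n = wins + losses
    rate2 = child_win_rate_2(wins, losses)
    return wins / n + exp_param * math.sqrt(math.log(parent_playouts) * rate2 / n)


if __name__ == "__main__":
    # With an even record, child-win-rate-2 is 0.5 and both formulas agree.
    print(uct_value_original(10, 10, 100), uct_value_modified(10, 10, 100))
    # With a lopsided record, the modified bonus is larger for the winner.
    print(uct_value_modified(18, 2, 100), uct_value_modified(2, 18, 100))
```

With an even record the factor of 0.5 cancels the doubling, so the two
values match; away from 0.5, the exploration bonus grows for moves with
good records and shrinks for moves with bad ones.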

Adding this extra term seemed to help a bit (57% +/- 4.5% win rate
against the unmodified program) when the basic playouts were uniform.

This change tends to make the formula stick with proven winners more
than before, so you might need to increase the explore rate by a
little more than sqrt(2).

Your mileage may vary -- I'd also like to know if you try this change
and find it unhelpful.

The justification is that the standard deviation of a beta(1,21)
variable is lower than the standard deviation of a beta(11,11)
variable. The error term of a beta(21,1) is lower too, but also
applying the (L+1)/(W+L+2) term to the UCB1 formula can yield
nonsensical results: when two moves have the same bias, the one with
the worse win-loss record can be assigned the higher UCB1 value.
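
A quick numerical check of both claims, as a hedged sketch: beta_sd uses
the standard formula for the standard deviation of a beta(a,b) variable,
while ucb_value, the constant c = sqrt(2), the 3-1 and 1-3 records, and
the parent count of 1000 are assumptions chosen only to illustrate the
effect.

```python
import math


def beta_sd(a, b):
    # Standard deviation of a beta(a, b) random variable.
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))


def ucb_value(wins, losses, parent_playouts, c, use_loss_term=False):
    # Illustrative UCB-style value using the smoothed mean (W+1)/(W+L+2).
    # use_loss_term=True puts (L+1)/(W+L+2) under the square root instead.
    n = wins + losses
    mean = (wins + 1) / (n + 2)
    term = (1 - mean) if use_loss_term else mean
    return mean + c * math.sqrt(math.log(parent_playouts) * term / n)


if __name__ == "__main__":
    print(beta_sd(1, 21))   # about 0.043
    print(beta_sd(11, 11))  # about 0.104
    print(beta_sd(21, 1))   # same as beta_sd(1, 21) by symmetry

    c = math.sqrt(2.0)  # assumed constant, for illustration only
    # With the loss term, the 1-3 move outranks the 3-1 move.
    print(ucb_value(3, 1, 1000, c, use_loss_term=True),
          ucb_value(1, 3, 1000, c, use_loss_term=True))
    # With the win term, the better record stays on top.
    print(ucb_value(3, 1, 1000, c), ucb_value(1, 3, 1000, c))
```

Whether the inversion shows up depends on the constant and the counts;
with a small enough explore rate the better record can still come out
ahead even with the loss term.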