My hand-wavy argument succumbs to experimental data. And to a better argument. :)
I stand corrected. Thanks, Álvaro.

On Wed, Dec 6, 2017 at 8:52 AM, Gian-Carlo Pascutto <[email protected]> wrote:

> On 06-12-17 11:47, Aja Huang wrote:
> > All I can say is that first-play-urgency is not a significant
> > technical detail, and that's why we didn't specify it in the paper.
>
> I will have to disagree here. Of course, it's always possible I'm
> misunderstanding something, or I have a program bug that I'm mixing up
> with this.
>
> Or maybe you mean that you expect the program to improve regardless of
> this setting. In any case, I've now seen people state here twice that
> this is a detail that doesn't matter. But practical results suggest
> otherwise.
>
> For a strong supervised network, FPU=0 (i.e. not exploring all successor
> nodes for a longer time, relying strongly on policy priors) is much
> stronger. I've seen this in Leela Zero after we tested it, and I've
> known it to be true from regular Leela for a long time. IIRC, the strong
> open source Go bots also use some form of progressive widening, which
> produces the same effect.
>
> For a weak RL network without much useful policy priors, FPU>1 is much
> stronger than FPU=0.
>
> Now these are relative scores of course, so one could argue they don't
> affect the learning process. But they actually do that as well!
>
> The new AZ paper uses MCTS playouts = 800, and plays proportionally
> according to MCTS output. (Previous AGZ had playouts = 1600,
> proportional for the first 30 moves.)
>
> Consider what this means for the search probability outputs, exactly the
> thing the policy network has to learn. With FPU=1, the move
> probabilities are much more uniform, and the moves played are
> consequently much more likely to be bad or even blunders, because there
> are fewer playouts that can be spent on the best move, even if it was
> found.
>
> > The initial value of Q is not very important because Q+U is
> > dominated by the U piece when the number of visits is small.
>
> a = Q(s, a) + coeff * P(s, a) * (sqrt(parent->visits) / (1.0f +
> child->visits()));
>
> Assume parent->visits = 100, sqrt = 10
> Assume child->visits = 0
> Assume P(s, a) = 0.0027 (near uniform prior for a "weak" network)
>
> The rightmost part of this (the U term) is ~1. This clearly does not
> dominate the Q term. If Q > 1 (classic FPU) then every child node will
> get expanded. If Q = 0 (Q(s, a) = 0) then the first picked child
> (largest policy prior) will get something like 10 expansions before
> another child gets picked. That's a massive difference in search tree
> shape, *especially* with only 800 total playouts.
>
> --
> GCP
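For anyone who wants to see the effect GCP describes, below is a minimal, self-contained C++ sketch of the selection rule he quotes. The constants are illustrative assumptions, not the actual Leela Zero or AlphaZero settings: coeff = 1.0, FPU values of 0.0 and 1.1, a flat 0.5 playout result, and 361 near-uniform priors with one slightly preferred move.

// Toy sketch of the PUCT selection rule quoted above.  coeff, the FPU
// values, the flat 0.5 playout result and the priors are placeholders,
// not the real engine settings.
#include <cmath>
#include <cstdio>
#include <vector>

struct Child {
    double prior;   // P(s,a) from the policy network
    int    visits;  // N(s,a)
    double q;       // running mean value once visited
};

// Q + U for one child; an unvisited child uses the first-play-urgency value.
double puct_score(const Child& c, int parent_visits, double coeff, double fpu) {
    double q = (c.visits > 0) ? c.q : fpu;
    double u = coeff * c.prior * std::sqrt((double)parent_visits)
                     / (1.0 + c.visits);
    return q + u;
}

int main() {
    const double coeff = 1.0;                // placeholder exploration constant
    const double fpu_values[] = {0.0, 1.1};  // FPU=0 vs. classic FPU>1

    for (double fpu : fpu_values) {
        // 361 moves with a near-uniform prior; one move slightly preferred.
        std::vector<Child> kids(361, {1.0 / 361.0, 0, 0.0});
        kids[0].prior = 0.01;

        int expanded = 0;
        for (int parent_visits = 1; parent_visits <= 100; ++parent_visits) {
            int best = 0;
            double best_score = -1e9;
            for (int i = 0; i < (int)kids.size(); ++i) {
                double s = puct_score(kids[i], parent_visits, coeff, fpu);
                if (s > best_score) { best_score = s; best = i; }
            }
            if (kids[best].visits == 0) ++expanded;
            kids[best].visits++;
            kids[best].q = 0.5;              // pretend every playout returns 0.5
        }
        std::printf("FPU=%.1f: %3d distinct children expanded in 100 playouts, "
                    "top-prior child got %d visits\n",
                    fpu, expanded, kids[0].visits);
    }
    return 0;
}

Under these toy assumptions, FPU=0 sends essentially every playout to the top-prior child, while FPU>1 expands a new child on every playout. That is exactly the difference in tree shape, and therefore in the visit-count targets the policy network trains on, that GCP describes.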
