Re: [Computer-go] mini-max with Policy and Value network

Erik van der Werf Tue, 23 May 2017 05:15:28 -0700

On Mon, May 22, 2017 at 4:54 PM, Gian-Carlo Pascutto <g...@sjeng.org> wrote:

> On 22-05-17 15:46, Erik van der Werf wrote:
> > Anyway, LMR seems like a good idea, but last time I tried it (in Migos)
> > it did not help. In Magog I had some good results with fractional depth
> > reductions (like in Realization Probability Search), but it's a long
> > time ago and the engines were much weaker then...
>
> What was generating your probabilities, though? A strong policy DCNN or
> something weaker?
>

Nothing deep. Back then for the move predictor I don't think I ever tried
more than two hidden layers (and it was only used near the root of the
search tree). RPS was even simpler (so it could be used with fast deep
searches). In hindsight I traded way too much accuracy for speed, but
coming from standard a AlphaBeta framework it still was a big improvement.

ERPS (LMR with fractional reductions based on move probabilities) with
> alpha-beta seems very similar to having MCTS with the policy prior being
> a factor in the UCT formula.

In the sense of the shape of the tree, possibly yes, but I have the
impression that AlphaBeta and similar search algorithms are more brittle
when working with high-variance (noisy) evaluations. In chess-like games it
may be less of an issue due to the implicit mobility feature that it adds,
but for Go mobility seems to be mostly irrelevant. The MCTS backup
(averaging evaluations) seems to reduce the variance much better than a
minimax backup.

Using a value net instead of raw Monte Carlo evaluation also reduces
variance (a lot), so trying out AlphaBeta with DCNN evaluations definitely
seems like an interesting experiment.

> This is what AlphaGo did according to their
> 2015 paper, so it can't be terrible, but it does mean that you are 100%
> blind to something the policy network doesn't see, which seems
> worrisome. I think I asked Aja once about what they do with first play
> urgency given that the paper doesn't address it - he politely ignored
> the question :-)
>

I don't think anyone has had good results with high FPU; it seems in Go we
simply cannot afford a very wide search (except perhaps near the root or on
the PV). I'm not sure if they still use an UCB term (which would ensure
some exploration of unlikely candidates). I think at some point David and
others argued against it, but in my own experiments it was always helpful,
and I think Aja may have found the same in Erica. Nevertheless, even
without it I think an argument can be made that the minimax result can
eventually be found.

I have an idea on what's causing the problems in Leela (and how you could
fix it), but I'll hold of on further commenting until I have some more time
to look at the examples.

Best,
Erik

_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] mini-max with Policy and Value network

Reply via email to