Re: [Computer-go] Accelerating Self-Play Learning in Go

Brian Sheppard via Computer-go Fri, 08 Mar 2019 10:52:51 -0800

>> contrary to intuition built up from earlier-generation MCTS programs in Go, 
>> putting significant weight on score maximization rather than only 
>> win/loss seems to help.


This narrative glosses over important nuances.

Collectively we are trying to find the golden mean of cost efficiency... The 
"Zero" community has proven that you can succeed without knowing anything about 
Go. The rest of us are trying to discover benefits of *some* prior knowledge, 
perhaps because we don't have cloud datacenters with TPUs. 😊

A lot of the KataGo paper shows that techniques that are useful to MCTS/rollout 
programs are also useful for NN programs.

Brief historical summary...

MCTS/rollout programs that maximized win/loss outperformed programs that 
maximized point differential, because occasionally winning by a lot does not 
compensate for a lot of small losses. Rollouts are stochastic, so every 
position has opportunities to win/lose by a lot.

This result has been widely quoted along the lines of, "Go programs should only 
use wins and losses and not use point differential." This has long been known 
to be overstated.

Using only win/loss has a core problem: a pure win/loss program is 
content to bleed points, which sometimes results in unnecessary losses. 
Fundamentally, it should be easier to win games where the theoretical point 
differential is larger, so losing points contributes to difficulties in 
distinguishing wins from losses using rollouts.

There were two responses to this: dynamic komi and point diff as tiebreaker. 
Dynamic komi adjusts komi when the rollout winning percentage falls outside of 
a range like [40%,60%]. Point-diff-as-tiebreaker reserves 90% of the result for 
the win/loss value of a rollout and 10% for a sigmoid function of the final 
point differential.
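
To make the two responses concrete, here is a minimal sketch in Python. It is not taken from Pachi or any other program: the function names, the sigmoid scale of 10 points, the 1-point komi step, and the komi sign convention (komi counts against the program, so raising it makes the program's task harder) are all assumptions for illustration. Only the 90%/10% split and the [40%,60%] band come from the description above.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def rollout_value(point_diff: float, win_weight: float = 0.9,
                  scale: float = 10.0) -> float:
    """Point-diff-as-tiebreaker value of one finished rollout.

    point_diff is the final margin from the program's point of view, with
    komi already applied. `scale` (assumed) controls how quickly the
    sigmoid saturates.
    """
    win = 1.0 if point_diff > 0 else 0.0
    return win_weight * win + (1.0 - win_weight) * sigmoid(point_diff / scale)


def adjust_komi(komi: float, win_rate: float, low: float = 0.4,
                high: float = 0.6, step: float = 1.0) -> float:
    """Dynamic komi: nudge komi when the rollout win rate leaves [low, high].

    Assumed convention: komi counts against the program, so increasing it
    makes winning harder and pulls an overly high win rate back down.
    """
    if win_rate > high:
        return komi + step
    if win_rate < low:
        return komi - step
    return komi
```

With these assumed defaults, flipping a half-point loss into a half-point win changes the rollout value by roughly 0.9, while extending a won game by another 30 points changes it by less than 0.1. Point differential therefore only separates outcomes that have the same win/loss result.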

IIRC, Pachi invented point-diff-as-tiebreaker. The technique worked in my 
program as well, and it should work in a lot of MCTS/rollout programs.

KataGo is using point-diff-as-tiebreaker. That is, the invention is to adapt the 
existing idea to the NN framework.

Did the KataGo paper mention dynamic komi? KataGo can use that too, because its 
komi is a settable input parameter.


>Score maximization in self-play means it is encouraged to play more 
>aggressively/dangerously, by creating life/death problems on the board.

Point-diff-as-tiebreaker is *risk-averse*. The purpose is to keep all of the 
points, not to engage in risks to earn more. There is a graph in the KataGo paper 
that will help to visualize how 0.9 * winning + 0.1 * sigmoid(point-diff) 
trades off gains against losses.
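
A small worked example of the risk-aversion claim, as a sketch only. The outcome probabilities, the margins, and the sigmoid scale of 10 points are made up for illustration, and `blended_value` and `expected_value` are hypothetical helper names, not anything from the paper.

```python
import math


def blended_value(point_diff: float, win_weight: float = 0.9,
                  scale: float = 10.0) -> float:
    """0.9 * winning + 0.1 * sigmoid(point-diff), with an assumed scale."""
    win = 1.0 if point_diff > 0 else 0.0
    squashed = 1.0 / (1.0 + math.exp(-point_diff / scale))
    return win_weight * win + (1.0 - win_weight) * squashed


def expected_value(outcomes) -> float:
    """Expected blended value over (probability, point_diff) pairs."""
    return sum(p * blended_value(d) for p, d in outcomes)


def expected_margin(outcomes) -> float:
    """Plain expected point differential over the same pairs."""
    return sum(p * d for p, d in outcomes)


# Hypothetical choices facing a program that is ahead:
safe = [(1.0, 5.0)]                  # keep a sure 5-point win
gamble = [(0.8, 25.0), (0.2, -5.0)]  # usually win big, sometimes lose
sloppy = [(1.0, 2.0)]                # still a sure win, but 3 points bled

print(expected_margin(safe), expected_margin(gamble))
print(expected_value(safe), expected_value(gamble), expected_value(sloppy))
```

Under these assumptions the gamble has the larger expected margin, yet the safe line has the higher blended value, so a pure score maximizer and this objective disagree. The safe line also beats the sloppy one, which is the "keep all of the points" behavior: among sure wins, the larger margin is preferred.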

Best,
Brian


_______________________________________________
Computer-go mailing list
[email protected]
http://computer-go.org/mailman/listinfo/computer-go
