>> contrary to intuition built up from earlier-generation MCTS programs in Go,
>> putting significant weight on score maximization rather than only
>> win/loss seems to help.
This narrative glosses over important nuances.

Collectively we are trying to find the golden mean of cost efficiency... The "Zero" community has proven that you can succeed without knowing anything about Go. The rest of us are trying to discover the benefits of *some* prior knowledge, perhaps because we don't have cloud datacenters with TPUs. 😊

A lot of the Kago paper shows that techniques that are useful to MCTS/rollout programs are also useful for NN programs.

Brief historical summary:

MCTS/rollout programs that maximized win/loss outperformed programs that maximized point differential, because occasionally winning by a lot does not compensate for a lot of small losses. Rollouts are stochastic, so every position has opportunities to win or lose by a lot.

This result has been widely quoted along the lines of, "Go programs should only use wins and losses and not use point differential."

This has long been known to be overstated, because using only win/loss has a core problem: a pure win/loss program is content to bleed points, which sometimes results in unnecessary losses. Fundamentally, it should be easier to win games where the theoretical point differential is larger, so losing points makes it harder to distinguish wins from losses using rollouts.

There were two responses to this: dynamic komi and point-diff-as-tiebreaker.

Dynamic komi adjusts komi when the rollout winning percentage falls outside of a range like [40%,60%].

Point-diff-as-tiebreaker reserves 90% of the result for the win/loss value of a rollout and 10% for a sigmoid function of the final point differential.

IIRC, Pachi invented point-diff-as-tiebreaker. The technique worked in my program as well, and it should work in a lot of MCTS/rollout programs.

Kago is using point-diff-as-tiebreaker. That is, the invention is to adapt the existing idea to the NN framework.

Did the Kago paper mention dynamic komi? Kago can use that too, because its komi is a settable input parameter.
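For concreteness, here is a minimal Python sketch of the two techniques as I described them above. It is not taken from Pachi, my program, or the Kago paper. The function names, the sigmoid scale of 10 points, the komi step of 1 point, and the sign convention for komi (points the evaluated player must overcome) are all assumptions made for illustration.

```python
import math


def blended_value(won: bool, point_diff: float,
                  win_weight: float = 0.9, scale: float = 10.0) -> float:
    """Point-diff-as-tiebreaker value of one finished rollout.

    point_diff is the final margin from the evaluated player's side.
    scale is a hypothetical constant controlling how quickly the
    sigmoid saturates; the source does not specify one.
    """
    win_term = 1.0 if won else 0.0
    score_term = 1.0 / (1.0 + math.exp(-point_diff / scale))
    return win_weight * win_term + (1.0 - win_weight) * score_term


def adjust_komi(komi: float, win_rate: float,
                low: float = 0.40, high: float = 0.60,
                step: float = 1.0) -> float:
    """Dynamic komi: nudge komi when the rollout win rate leaves [low, high].

    Assumed convention: komi counts against the evaluated player, so
    raising it makes the search's task harder and lowering it easier.
    """
    if win_rate > high:
        return komi + step   # winning too comfortably: demand more points
    if win_rate < low:
        return komi - step   # losing badly: aim for a smaller deficit
    return komi              # inside the band: leave komi alone


if __name__ == "__main__":
    # A sure half-point win versus a coin flip between a big win and a big loss.
    sure = blended_value(True, 0.5)
    gamble = 0.5 * blended_value(True, 20.5) + 0.5 * blended_value(False, -19.5)
    print(f"sure win: {sure:.3f}  gamble: {gamble:.3f}")
    print("komi after a 70% win rate:", adjust_komi(7.5, 0.70))
```

With these weights every win scores above 0.9 and every loss below 0.1, so the score term only orders outcomes within wins or within losses. It never makes a loss look better than a win.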
>Score maximization in self-play means it is encouraged to play more
>aggressively/dangerously, by creating life/death problems on the board.

Point-diff-as-tiebreaker is *risk-averse*. The purpose is to keep all of the points, not to take risks to earn more. There is a graph in the Kago paper that helps to visualize how 0.9 * winning + 0.1 * sigmoid(point-diff) trades off gains against losses.

Best,
Brian