I admit that it's difficult for me to include such a deterministic default
policy. :-)
With a softmax policy, using the information of the "last-LOST-reply" may be a
good direction; a rough sketch of what I mean follows.
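As a minimal sketch (in Python, with illustrative names such as base_weight
and last_lost_reply that are not Erica's actual code), the usual softmax
selection is kept, but the weight of the reply that lost the previous
simulation is scaled down:

    import random

    def softmax_pick(moves, base_weight, last_lost_reply, prev_move, penalty=0.1):
        # base_weight(m): the policy's usual weight for move m (patterns, captures, ...)
        # last_lost_reply[prev_move]: the reply that was tried and lost last time
        weights = [base_weight(m) * (penalty if last_lost_reply.get(prev_move) == m
                                     else 1.0)
                   for m in moves]
        total = sum(weights)
        r = random.uniform(0.0, total)
        for m, w in zip(moves, weights):
            r -= w
            if r <= 0.0:
                return m
        return moves[-1]  # guard against floating-point drift

A lost reply thus remains playable, just much less likely, so the playout can
explore an alternative without hard-coding anything.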
Aja
----- Original Message -----
From: Peter Drake
To: [email protected]
Sent: Wednesday, January 26, 2011 5:48 AM
Subject: Re: [Computer-go] The heuristic "last good reply"
I'm all for a learning policy, if you can figure out how to do it. :-)
Peter Drake
http://www.lclark.edu/~drake/
On Jan 25, 2011, at 11:31 AM, Aja wrote:
Hi Professor Drake,
I will try with more playouts. Thanks for the reminder.
Here is an example to illustrate my view: the default policy should also be
included in the learning. Suppose there are several decisive life-and-death
fights or semeai in a position; the tree search cannot reach and clarify every
one of them.
In this example (see the attached default_policy.sgf), Black's L2 and L4
will cause White to play L3 to capture under the default policy (which is
completely bad). Black may then quickly learn, via "last good reply", to atari
immediately and kill White's whole group to win. The problem is that White
cannot learn the correct answer, H1 or H2, because the reply is fixed by the
default policy.
Across the playouts, the configuration of such a big semeai is likely to be
very similar. This kind of evaluation bias is exactly an issue that we can fix
by learning.
By working with probabilities, I can fix this problem by increasing the
probability of the "last good reply" H1 or H2, without the tree's aid, as in
the sketch below.
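Continuing the sketch above (names again illustrative, not Erica's actual
code): instead of playing the stored reply deterministically, its softmax
weight is multiplied by a bonus, so a learned reply like H1 or H2 becomes very
likely but is never forced:

    def boosted_weight(move, prev_move, base_weight, last_good_reply, bonus=10.0):
        # Boost the stored "last good reply" instead of forcing it.
        w = base_weight(move)
        if last_good_reply.get(prev_move) == move:
            w *= bonus
        return w

The default policy stays intact; the learned reply simply wins the softmax
most of the time.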
Every program's playout implementation is somewhat different, but I think
excluding the default policy from the learning might limit the full power of
"last good reply".
Aja
----- Original Message -----
From: Peter Drake
To: [email protected]
Sent: Wednesday, January 26, 2011 2:27 AM
Subject: Re: [Computer-go] The heuristic "last good reply"
On Jan 25, 2011, at 10:19 AM, Aja wrote:
Dear all,
Today I tried Professor Drake's "last good reply" in Erica. So far, I have
gained at most 20-30 Elo from it.
I tested by self-play, with 3000 playouts/move on 19x19. That number of
playouts might be too small, but I would like to test with more playouts if
the playing strength is not weaker at 3000 playouts.
Yes -- the smallest experiments in the paper were with 8k playouts per
move. There may not be time to fill up the reply tables with only 3k.
From these preliminary experiments with 3000 playouts, I have some
observations:
1. In Erica, it's better to apply this heuristic probabilistically.
2. In Prof. Drake's implementation, there is a weakness in the learning. I
think the main problem is that for a reply that is played deterministically by
the default policy, there is no room to learn a new reply. For example, if
"save by capture" produces a lost game, then in the next simulation it will
still play "save by capture" by default policy. If I am wrong on this point, I
am glad to be corrected.
This is true, but only if the previous move (or previous two moves) comes
up again in exactly the same board configuration. When the configuration is
exactly the same, we are probably still in the search tree, which overrides the
policy. If we are beyond the tree, the configuration is almost always different.
3. This heuristic has the potential to perform better in Erica. I hope this
brief result encourages other authors to try it.
It's reassuring to see that you got some strength improvement out of it!
Thanks,
Peter Drake
http://www.lclark.edu/~drake/
<default_policy.sgf>
_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go