On 20-11-16 11:16, Detlef Schmicker wrote:
> Hi Hiroshi,
> 
>> Now I'm making 13x13 self-play games as in the AlphaGo paper: 1. make
>> a position by sampling with the Policy(SL) probabilities from the
>> initial position. 2. play one move uniformly at random from the
>> available moves. 3. play the remaining moves with Policy(RL) to the
>> end. Step (2) usually means playing a very bad move. Maybe the point
>> is to reach a completely different position? I don't understand why
>> this step (2) is needed.
> 
> That is not how I read the AlphaGo paper.
> 
> I read it as using the RL policy in the "usual" way (I would say that
> means something like randomizing over the net probabilities for the
> best 5 moves or so),
> 
> but randomizing the opponent uniformly, meaning the opponent's net is
> taken from an earlier step in the reinforcement learning.
> 
> Meaning e.g.
> 
> step 10000 playing against step 7645 in the reinforcement history?
> 
> Or did I understand you wrong?
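
What Detlef describes here is the opponent randomization of the
policy-network RL stage: the current network plays against an opponent
drawn uniformly from a pool of earlier iterations. A minimal sketch of
that sampling, with load_net and play_game as hypothetical engine hooks:

    import random

    def pick_rl_opponent(pool_steps):
        """Uniformly choose an earlier RL iteration as the opponent,
        e.g. step 10000 playing against step 7645."""
        return random.choice(pool_steps)

    def rl_selfplay_game(current_step, pool_steps, load_net, play_game):
        """One policy-RL self-play game: the current net vs. a randomly
        drawn earlier checkpoint."""
        opponent_step = pick_rl_opponent(pool_steps)
        return play_game(load_net(current_step), load_net(opponent_step))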

You are confusing the Policy Network RL procedure with the Value Network
data production.

For the Value Network the procedure is indeed as described, with the one
move at time U being sampled uniformly from {1, ..., 361} until a legal
move is found. I think that's because we're not interested (only) in
playing good moves, but also in analyzing positions that are as diverse
as possible, to learn whether they're won or lost. Throwing in one
totally random move vastly increases the diversity and the number of odd
positions the network sees, while still not leading to totally
nonsensical positions.
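
To make that recipe concrete, here is a minimal sketch of the data
generation described above. The board interface and the sl_policy_move /
rl_policy_move callbacks are hypothetical stand-ins for whatever the
engine provides; 361 and the 450-move cap follow the paper's 19x19
{1, ..., 361} and U ~ unif{1, 450} sampling.

    import random

    def generate_value_example(board, sl_policy_move, rl_policy_move,
                               max_moves=450):
        """One (position, outcome) pair for the value network: SL policy
        up to move U-1, one uniformly random legal move at U, then the
        RL policy to the end of the game."""
        U = random.randint(1, max_moves)

        for _ in range(U - 1):              # moves 1..U-1 from the SL policy
            if board.game_over():
                break
            board.play(sl_policy_move(board))

        while not board.game_over():        # the single random move at time U
            move = random.randint(1, 361)   # sampled from {1, ..., 361}
            if board.is_legal(move):
                board.play(move)
                break

        position = board.copy()             # state after the random move (s_{U+1})

        while not board.game_over():        # remaining moves from the RL policy
            board.play(rl_policy_move(board))

        return position, board.result()     # result() = final outcome z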

-- 
GCP
