Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread Yamato
David Silver wrote:
>Yes, in our experiments they were just constant numbers M=N=100.

If M and N are the same, is there any reason to run M simulations and
N simulations separately?  What happens if you combine them and calculate
V and g in a single loop?
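
To check that I am reading the algorithm correctly, here is a rough
Python sketch of the two-pass version, assuming the update has the form
theta += alpha * (V_star - V) * g as I understand the paper
(run_playout and sum_psi are hypothetical helpers, not anything from
the paper):

import numpy as np

def balancing_update(s, theta, V_star, run_playout, sum_psi,
                     M=100, N=100, alpha=0.01):
    # run_playout(s, theta) -> (z, playout)  : hypothetical simulator
    # sum_psi(playout, theta) -> vector      : sum of psi(s_t, a_t) over
    #                                          the playout's moves
    # Pass 1: M playouts to estimate the value V of the current policy.
    V = 0.0
    for _ in range(M):
        z, _ = run_playout(s, theta)
        V += z / M
    # Pass 2: N more playouts to estimate the gradient direction g.
    g = np.zeros_like(theta)
    for _ in range(N):
        z, playout = run_playout(s, theta)
        g += z * sum_psi(playout, theta) / N
    # Move theta so that the simulated value approaches V_star.
    return theta + alpha * (V_star - V) * g

The single-loop variant I have in mind would just compute V and g from
the same set of playouts.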

>Okay, let's continue the example above. Let's say that in position s,
>using the current theta, moves a, b and c will be selected with
>probability 0.5, 0.3 and 0.2 respectively. Let's say that move a was
>actually selected. Now consider pattern 1: it is matched after a, but
>the probability of matching was (0.5*1 + 0.3*1 + 0.2*0) = 0.8, so
>psi_1(s,a) = 1 - 0.8 = 0.2. For pattern 2, psi_2(s,a) =
>1 - (0.5*1 + 0.3*0 + 0.2*1) = 0.3, etc. So in this example the vector
>psi(s,a) = [0.2, 0.3, -0.3, -0.2].
>
>In other words, psi tells us whether each pattern was actually matched  
>more or less than we could have expected.

I understood what psi was. I am not sure how it works, but anyway I can
see your algorithm now. Thanks.

--
Yamato


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread David Silver
IMO other people's equations/code/ideas/papers always seem smarter  
than your own. The stuff you understand and do yourself just seems  
like common sense, and the stuff you don't always has a mystical air  
of complexity, at least until you understand it too :-)


On 30-Apr-09, at 1:59 PM, Michael Williams wrote:


I wish I was smart   :(




Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread David Silver

Hi Yamato,


Thanks for the detailed explanation.
M, N and alpha are constant numbers, right?  What did you set them to?


You're welcome!
Yes, in our experiments they were just constant numbers M=N=100.


The feature vector is the set of patterns you use, with value 1 if a
pattern is matched and 0 otherwise. The simulation policy selects
actions in proportion to the exponentiated, weighted sum of all
matching patterns. For example, let's say move a matches patterns 1 and
2, move b matches patterns 1 and 3, and move c matches patterns 2 and
4. Then move a would be selected with probability e^(theta1 +
theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 +
theta4)). The theta values are the weights on the patterns which we
would like to learn. They are the log of the Elo ratings in Remi
Coulom's approach.
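
Here is a tiny Python sketch of that selection rule, just to make the
toy example concrete (the theta values below are made up, and the
pattern lists are the ones from the example; this is not our actual
implementation):

import numpy as np

theta = np.array([0.1, 0.4, -0.2, 0.3])            # made-up pattern weights
matched = {"a": [0, 1], "b": [0, 2], "c": [1, 3]}  # move -> matched patterns

def move_probabilities(theta, matched):
    # pi(move) is proportional to exp(sum of theta over matched patterns).
    scores = {m: np.exp(sum(theta[i] for i in idx))
              for m, idx in matched.items()}
    total = sum(scores.values())
    return {m: score / total for m, score in scores.items()}

print(move_probabilities(theta, matched))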


OK, I guess it is formula 5 in the paper.


Yes, exactly.


The only tricky part is computing the vector psi(s,a). Each component
of psi(s,a) corresponds to a particular pattern, and is the difference
between the observed feature (i.e. whether the pattern actually
occurred after move a in position s) and the expected feature (the
average value of the pattern, weighted by the probability of selecting
each action).


I still don't understand this. Is it formula 6?
Could you please give me an example like the one above?


Yes that's right, this is equation 6.

Okay, let's continue the example above. Let's say that in position s,
using the current theta, moves a, b and c will be selected with
probability 0.5, 0.3 and 0.2 respectively. Let's say that move a was
actually selected. Now consider pattern 1: it is matched after a, but
the probability of matching was (0.5*1 + 0.3*1 + 0.2*0) = 0.8, so
psi_1(s,a) = 1 - 0.8 = 0.2. For pattern 2, psi_2(s,a) =
1 - (0.5*1 + 0.3*0 + 0.2*1) = 0.3, etc. So in this example the vector
psi(s,a) = [0.2, 0.3, -0.3, -0.2].


In other words, psi tells us whether each pattern was actually matched  
more or less than we could have expected.
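
If it helps, here is the same calculation as a small Python sketch (the
numbers are just the toy example above, nothing from our code):

import numpy as np

probs   = {"a": 0.5, "b": 0.3, "c": 0.2}           # policy in position s
matched = {"a": [0, 1], "b": [0, 2], "c": [1, 3]}  # move -> matched patterns

def psi(chosen, probs, matched, num_patterns=4):
    # Observed features: 1 for each pattern matched by the chosen move.
    observed = np.zeros(num_patterns)
    observed[matched[chosen]] = 1.0
    # Expected features: probability of each pattern being matched,
    # averaged over the moves the policy might have chosen (equation 6).
    expected = np.zeros(num_patterns)
    for move, p in probs.items():
        expected[matched[move]] += p
    return observed - expected

print(psi("a", probs, matched))   # approximately [0.2, 0.3, -0.3, -0.2]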


Hope this helps.
-Dave


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread Michael Williams

I wish I was smart   :(


David Silver wrote:

Hi Remi,

I understood this. What I find strange is that using -1/1 should be 
equivalent to using 0/1, but your algorithm behaves differently: it 
ignores lost games with 0/1, and uses them with -1/1.


Imagine you add a big constant to z. One million, say. This does not
change the problem. You get either 1000000 or 1000001 as the outcome of a
playout. But then, your estimate of the gradient becomes complete noise.


So maybe using -1/1 is better than 0/1? Since your algorithm depends
so much on the definition of the reward, there must be an optimal way
to set the reward. Or there must be a better way to define an algorithm
that would not depend on an offset in the reward.


There is still something wrong that I don't understand. There may be a 
way to quantify the amount of noise in the unbiased gradient estimate, 
and it would depend on the average reward. Probably setting the 
average reward to zero is what would minimize noise in the gradient 
estimate. This is just an intuitive guess.


Okay, now I understand your point :-) It's a good question - and I think 
you're right. In REINFORCE any baseline can be subtracted from the 
reward, without affecting the expected gradient, but possibly reducing 
its variance. The baseline leading to the best estimate is indeed the 
average reward.  So it should be the case that  {-1,+1} would estimate 
the gradient g more efficiently than {0,1}, assuming that we see similar 
numbers of black wins as white wins across the training set.


So to answer your question, we can safely modify the algorithm to use 
(z-b) instead of z, where b is the average reward. This would then make 
the {0,1} and {-1,+1} cases equivalent (with appropriate scaling of 
step-size). I don't think this would have affected the results we 
presented (because all of the learning algorithms converged anyway, at 
least approximately, during training) but it could be an important 
modification for larger boards.


-Dave



Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread David Silver

Hi Remi,

I understood this. What I find strange is that using -1/1 should be  
equivalent to using 0/1, but your algorithm behaves differently: it  
ignores lost games with 0/1, and uses them with -1/1.


Imagine you add a big constant to z. One million, say. This does not
change the problem. You get either 1000000 or 1000001 as the outcome of
a playout. But then, your estimate of the gradient becomes complete
noise.


So maybe using -1/1 is better than 0/1? Since your algorithm
depends so much on the definition of the reward, there must be an
optimal way to set the reward. Or there must be a better way to define
an algorithm that would not depend on an offset in the reward.


There is still something wrong that I don't understand. There may be  
a way to quantify the amount of noise in the unbiased gradient  
estimate, and it would depend on the average reward. Probably  
setting the average reward to zero is what would minimize noise in  
the gradient estimate. This is just an intuitive guess.


Okay, now I understand your point :-) It's a good question - and I  
think you're right. In REINFORCE any baseline can be subtracted from  
the reward, without affecting the expected gradient, but possibly  
reducing its variance. The baseline leading to the best estimate is  
indeed the average reward.  So it should be the case that  {-1,+1}  
would estimate the gradient g more efficiently than {0,1}, assuming  
that we see similar numbers of black wins as white wins across the  
training set.
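
To spell out the standard REINFORCE argument in LaTeX notation (this is
generic policy-gradient reasoning, not a quote from our paper): for any
constant baseline b,

  E[(z - b) \nabla_\theta \log \pi_\theta(a|s)]
    = E[z \nabla_\theta \log \pi_\theta(a|s)]
      - b \sum_a \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s)
    = E[z \nabla_\theta \log \pi_\theta(a|s)] - b \nabla_\theta \sum_a \pi_\theta(a|s)
    = E[z \nabla_\theta \log \pi_\theta(a|s)],

because \sum_a \pi_\theta(a|s) = 1. So the baseline changes only the
variance of the estimate, never its expectation.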


So to answer your question, we can safely modify the algorithm to use  
(z-b) instead of z, where b is the average reward. This would then  
make the {0,1} and {-1,+1} cases equivalent (with appropriate scaling  
of step-size). I don't think this would have affected the results we  
presented (because all of the learning algorithms converged anyway, at  
least approximately, during training) but it could be an important  
modification for larger boards.
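
As a rough Python sketch of that modification (sum_psi is a
hypothetical helper that accumulates psi, i.e. \nabla_\theta \log \pi,
over a playout; this is not our actual code):

import numpy as np

def gradient_estimate(playouts, theta, sum_psi, baseline=0.0):
    # playouts: list of (z, record) pairs sampled from the current policy.
    # Using (z - baseline) leaves the expected gradient unchanged but can
    # reduce its variance; with baseline = average reward, the {0,1} and
    # {-1,+1} reward schemes become equivalent up to a step-size rescaling.
    g = np.zeros_like(theta)
    for z, record in playouts:
        g += (z - baseline) * sum_psi(record, theta)
    return g / len(playouts)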


-Dave



Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread Rémi Coulom

David Silver wrote:


Sorry, I should have made it clear that this assumes that we are 
treating black wins as z=1 and white wins as z=0.
In this special case, the gradient estimate is just the average of psi over
the games in which black won.
But yes, more generally you need to include games won by both sides. 
The algorithms in the paper still cover this case - I was just trying 
to simplify their description to make it easy to understand the ideas.


I understood this. What I find strange is that using -1/1 should be 
equivalent to using 0/1, but your algorithm behaves differently: it 
ignores lost games with 0/1, and uses them with -1/1.


Imagine you add a big constant to z. One million, say. This does not
change the problem. You get either 1000000 or 1000001 as the outcome of a
playout. But then, your estimate of the gradient becomes complete noise.


So maybe using -1/1 is better than 0/1? Since your algorithm depends so
much on the definition of the reward, there must be an optimal way to
set the reward. Or there must be a better way to define an algorithm that
would not depend on an offset in the reward.




The gradient already compensates for the playout policy (equation 9), 
so in fact it would bias the gradient to sample uniformly at random!

Yes, you are right.

There is still something wrong that I don't understand. There may be a 
way to quantify the amount of noise in the unbiased gradient estimate, 
and it would depend on the average reward. Probably setting the average 
reward to zero is what would minimize noise in the gradient estimate. 
This is just an intuitive guess.


Rémi


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread David Silver

Hi Remi,


This is strange: you do not take lost playouts into consideration.

I believe there is a problem with your estimation of the gradient.  
Suppose for instance that you count z = +1 for a win, and z = -1 for  
a loss. Then you would take lost playouts into consideration. This  
makes me a little suspicious.


Sorry, I should have made it clear that this assumes that we are  
treating black wins as z=1 and white wins as z=0.
In this special case, the gradient estimate is just the average of psi over
the games in which black won.
But yes, more generally you need to include games won by both sides.  
The algorithms in the paper still cover this case - I was just trying  
to simplify their description to make it easy to understand the ideas.


The fundamental problem here may be that your estimate of the  
gradient is biased by the playout policy. You should probably sample  
X(s) uniformly at random to have an unbiased estimator. Maybe this  
can be fixed with importance sampling, and then you may get a  
formula that is symmetrical regarding wins and losses. I don't have  
time to do it now, but it may be worth taking a look.


The gradient already compensates for the playout policy (equation 9),  
so in fact it would bias the gradient to sample uniformly at random!


In equation 9, the gradient is taken with respect to the playout
policy parameters. Using the product rule (third line), the gradient
is equal to the playout policy probabilities multiplied by the summed
likelihood ratios multiplied by the simulation outcomes z. This
gradient can be computed by sampling playouts instead of multiplying
by the playout policy probabilities. This is also why games with
outcomes of z=0 can be ignored - they don't affect this gradient
computation.
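
In LaTeX notation (my paraphrase of the idea, not the paper's exact
equation 9): for a playout a_1, ..., a_T with outcome z,

  \nabla_\theta E[z]
    = \sum_{\text{playouts}} \Big( \prod_t \pi_\theta(a_t|s_t) \Big)
      \, z \, \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)
    = E\Big[ z \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \Big],

so the gradient can be estimated by sampling playouts from \pi_\theta
itself and averaging z \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t).
For the softmax policy above, \nabla_\theta \log \pi_\theta(a|s) is
exactly the vector psi(s,a), and with z \in \{0,1\} every lost playout
contributes a zero term, which is why those games drop out.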


More precisely: you should estimate the value of N playouts as Sum  
p_i z_i / Sum p_i instead of Sum z_i. Then, take the gradient of Sum  
p_i z_i / Sum p_i. This would be better.


Maybe Sum p_i z_i / Sum p_i would be better for MCTS, too?


I think a similar point applies here. We care about the expected value  
of the playout policy, which can be sampled directly from playouts,  
instead of multiplying by the policy probabilities. You would only  
need importance sampling if you were using a different playout policy  
to the one which you are evaluating. But I guess I'm not seeing any  
good reason to do this?


-Dave



[computer-go] Today's "Guardian"

2009-04-30 Thread Nick Wedd
Today's "Guardian" newspaper has, on the front page of its technology 
supplement, an article about recent developments in computer Go.  You 
can read it at

http://www.guardian.co.uk/technology/2009/apr/30/games-software-mogo

Unlike the last newspaper article on computer Go that I saw, this one is 
intelligent and well-informed.


Nick
--
Nick Wedd    n...@maproom.co.uk


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread Rémi Coulom

Rémi Coulom wrote:
The fundamental problem here may be that your estimate of the gradient 
is biased by the playout policy. You should probably sample X(s) 
uniformly at random to have an unbiased estimator. Maybe this can be 
fixed with importance sampling, and then you may get a formula that is 
symmetrical regarding wins and losses. I don't have time to do it now, 
but it may be worth taking a look.


Rémi 


More precisely: you should estimate the value of N playouts as Sum p_i 
z_i / Sum p_i instead of Sum z_i. Then, take the gradient of Sum p_i z_i 
/ Sum p_i. This would be better.


Maybe Sum p_i z_i / Sum p_i would be better for MCTS, too?
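
To make that concrete, in LaTeX notation (assuming p_i denotes the
probability of playout i under the current parameters theta):

  \hat{V} = \frac{\sum_i p_i z_i}{\sum_i p_i},
  \qquad
  \nabla_\theta \hat{V}
    = \frac{\sum_i p_i (z_i - \hat{V}) \nabla_\theta \log p_i}{\sum_i p_i},

which is symmetrical in wins and losses, since each playout enters only
through (z_i - \hat{V}).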

Rémi


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread Rémi Coulom

David Silver wrote:


2. Run another N simulations, average the value of psi(s,a) over 
all positions and moves in games that black won (call this g) 


This is strange: you do not take lost playouts into consideration.

I believe there is a problem with your estimation of the gradient. 
Suppose for instance that you count z = +1 for a win, and z = -1 for a 
loss. Then you would take lost playouts into consideration. This makes 
me a little suspicious.


The fundamental problem here may be that your estimate of the gradient 
is biased by the playout policy. You should probably sample X(s) 
uniformly at random to have an unbiased estimator. Maybe this can be 
fixed with importance sampling, and then you may get a formula that is 
symmetrical regarding wins and losses. I don't have time to do it now, 
but it may be worth taking a look.


Rémi