Re: [Computer-go] Teaching Deep Convolutional Neural Networks to Play Go

2015-03-17 Thread David Silver
Hi Oliver

Reinforcement learning is different from unsupervised learning. We used
reinforcement learning to train the network that plays Atari games. We also
published a more recent paper (www.nature.com/articles/nature14236) that
applied the same network to 50 different Atari games (achieving human-level
performance in around half of them).

Similar neural network architectures can indeed be applied to Go (in fact
that was one of the motivations for our recent ICLR paper). However,
training by reinforcement learning from self-play is perhaps more
challenging than for Atari: our method (DQN) was applied to single-player
Atari games, whereas in Go there is also an opponent. I could not guarantee
that DQN would be stable in this setting.

Cheers
Dave


On 16 March 2015 at 22:21, Oliver Lewis ojfle...@yahoo.co.uk wrote:

 Can you say anything about whether you think their approach to
 unsupervised learning could be applied to networks similar to those you
 trained? Any practical or theoretical constraints we should be aware of?


 On Monday, 16 March 2015, Aja Huang ajahu...@gmail.com wrote:

 Hello Oliver,

 2015-03-16 11:58 GMT+00:00 Oliver Lewis ojfle...@yahoo.co.uk:

 It's impressive that the same network learned to play seven games with
 just a win/lose signal.  It's also interesting that both these teams are in
 different parts of Google. I assume they are aware of each other's work,
 but maybe Aja can confirm.


 The authors are my colleagues at Google DeepMind, as they list DeepMind as
 their affiliation on the paper. Yes, we are aware of each other's work.

 Aja



___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Move Evaluation in Go Using Deep Convolutional Neural Networks

2014-12-22 Thread David Silver
Hi Martin

- Would you be willing to share some of the sgf game records played by your
 network with the community? I tried to replay the game record in your
 paper, but got stuck since it does not show any of the moves that got
 captured.


Sorry about that, we will correct the figure and repost. In the meantime
Aja will post the .sgf for that game. Also, thanks for noticing that we
tested against a stronger version of Fuego than Clark and Storkey; we'll
evaluate against Fuego 1.1 and post the results. Unfortunately, we only
have approval to release the material in the paper, so we can't really give
any further data :-(

One more thing: Aja said he was tired when he measured his own performance
on KGS predictions (he had been working too hard on this paper!), so it would
be great to get better statistics on how well humans really do at predicting
the next move. Does anyone want to measure their own performance, say on 200
randomly chosen positions from the KGS data?
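In case it helps anyone who wants to try, here is a rough Python sketch of how
one might run such a self-test. It assumes a local directory of KGS .sgf files;
the directory name, the regex-based parsing and the terminal prompt are just
illustrative choices (not anything from our setup), and it ignores handicap
stones and passes, so treat it only as a starting point.

import glob
import random
import re

# Minimal SGF move extractor: pulls ";B[dd]" / ";W[qd]" tokens in order.
MOVE_RE = re.compile(r";([BW])\[([a-s]{2})\]")

def moves_of(path):
    with open(path, encoding="utf-8", errors="ignore") as f:
        return MOVE_RE.findall(f.read())

def sample_positions(sgf_dir, n_positions=200, seed=0):
    """Pick random (game, move index) pairs from a directory of .sgf files."""
    rng = random.Random(seed)
    files = glob.glob(sgf_dir + "/*.sgf")
    samples = []
    while len(samples) < n_positions:
        path = rng.choice(files)
        moves = moves_of(path)
        if len(moves) < 30:                    # skip very short games
            continue
        k = rng.randrange(20, len(moves))      # hide move k, show moves before it
        samples.append((path, k, moves[:k], moves[k]))
    return samples

if __name__ == "__main__":
    correct = 0
    samples = sample_positions("kgs_games")    # directory name is just a placeholder
    for path, k, context, (colour, answer) in samples:
        print(" ".join(c + "[" + m + "]" for c, m in context))
        guess = input(f"{path}, move {k} ({colour} to play), your guess (e.g. dd): ")
        correct += (guess.strip().lower() == answer)
    print(f"accuracy: {correct}/{len(samples)}")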

- Do you know how large the effect is from using the extra features that
 are not in the paper by Clark and Storkey, i.e. the last-move info and the
 extra tactics? As a related question, would you get an OK result if you
 just zeroed out some inputs in the existing net, or would you need to
 re-train a new network from fewer inputs?


We trained our networks before we knew about Clark and Storkey's results,
so we haven't had a chance to evaluate the differences between the
approaches. But it's well known that last move info makes a big difference
to predictive performance, so I'd guess they would already be close to 50%
predictive accuracy if they included those features.


 - Is there a way to simplify the final network so that it is faster to
 compute and/or easier to understand? Is there something computed, maybe on
 an intermediate layer, that would be usable as a new feature in itself?


This is an interesting idea, but so far we have focused only on building a
network large and deep enough to represent Go knowledge at all.

Cheers
Dave
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [computer-go] David Silvers Rave formula

2009-05-07 Thread David Silver

Hi Lars,


is there anyone who can repost the pdf (rave.pdf?) the following mails
are talking about?

http://computer-go.org/pipermail/computer-go/2008-February/014095.html


I think you can still find the original attachment here:

http://computer-go.org/pipermail/computer-go/attachments/20080208/6519e9c5/rave.pdf

However, if you can wait a few weeks I will be publishing a clearer (I  
hope!) explanation of how to combine UCT and RAVE in my PhD thesis.


Cheers
-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-05-04 Thread David Silver

Hi,

We used alpha=0.1. There may well be a better setting of alpha, but  
this appeared to work nicely in our experiments.

-Dave

On 3-May-09, at 2:01 AM, elife wrote:


Hi Dave,

 In your experiments what's the constant value alpha you set?

Thanks.

2009/5/1 David Silver sil...@cs.ualberta.ca:


Yes, in our experiments they were just constant numbers M=N=100.

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-05-01 Thread David Silver

Hi Yamato,


If M and N are the same, is there any reason to run M simulations and
N simulations separately?  What happens if you combine them and calculate
V and g in the single loop?

I think it gives the wrong answer to do it in a single loop. Note that  
the simulation outcomes z are used to compute both V and g, so that  
they are quite strongly correlated. In general, if random variables X  
and Y are correlated then E[X]E[Y] != E[XY].


A simple example of this problem would be sampling E[X]E[X] for some
random variable X. Let's say X is actually just noise with mean zero.
If you just take one sample x1, then x1*x1 is always +ve, and you
would estimate E[X]E[X] = E[X^2] > 0. But if you take two independent
samples x1 and x2, then x1*x2 would be +ve half the time and -ve half
the time, and you would correctly estimate E[X]E[X] = 0.
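To make this concrete, here is a tiny numerical check (just an illustration,
not anything from the paper): with X drawn as zero-mean noise, re-using one
sample estimates E[X^2], while two independent samples estimate E[X]E[X].

import random

random.seed(1)
N = 100000

# X is pure noise with mean zero; we try to estimate E[X]E[X] = 0.
shared, independent = 0.0, 0.0
for _ in range(N):
    x1 = random.gauss(0.0, 1.0)
    x2 = random.gauss(0.0, 1.0)
    shared += x1 * x1        # re-using the same sample really estimates E[X^2] = 1
    independent += x1 * x2   # two independent samples estimate E[X]E[X] = 0

print("shared sample      :", shared / N)       # ~1.0 (biased)
print("independent samples:", independent / N)  # ~0.0 (correct)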


-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread David Silver

Hi Remi,


This is strange: you do not take lost playouts into consideration.

I believe there is a problem with your estimation of the gradient.  
Suppose for instance that you count z = +1 for a win, and z = -1 for  
a loss. Then you would take lost playouts into consideration. This  
makes me a little suspicious.


Sorry, I should have made it clear that this assumes that we are  
treating black wins as z=1 and white wins as z=0.
In this special case, the gradient is the average of psi over the games in
which black won.
But yes, more generally you need to include games won by both sides.  
The algorithms in the paper still cover this case - I was just trying  
to simplify their description to make it easy to understand the ideas.


The fundamental problem here may be that your estimate of the  
gradient is biased by the playout policy. You should probably sample  
X(s) uniformly at random to have an unbiased estimator. Maybe this  
can be fixed with importance sampling, and then you may get a  
formula that is symmetrical regarding wins and losses. I don't have  
time to do it now, but it may be worth taking a look.


The gradient already compensates for the playout policy (equation 9),  
so in fact it would bias the gradient to sample uniformly at random!


In equation 9, the gradient is taken with respect to the playout
policy parameters. Using the product rule (third line), the gradient
is equal to the playout policy probabilities multiplied by the sum of the
likelihood ratios, multiplied by the simulation outcomes z. This
gradient can be computed by sampling playouts instead of multiplying
by the playout policy probabilities. This is also why games with
outcomes of z=0 can be ignored - they don't affect this gradient
computation.


More precisely: you should estimate the value of N playouts as Sum  
p_i z_i / Sum p_i instead of Sum z_i. Then, take the gradient of Sum  
p_i z_i / Sum p_i. This would be better.


Maybe Sum p_i z_i / Sum p_i would be better for MCTS, too ?


I think a similar point applies here. We care about the expected value  
of the playout policy, which can be sampled directly from playouts,  
instead of multiplying by the policy probabilities. You would only  
need importance sampling if you were using a different playout policy  
to the one which you are evaluating. But I guess I'm not seeing any  
good reason to do this?
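For what it's worth, here is a toy sketch of the case where importance
sampling *would* be needed, i.e. evaluating one playout policy from
simulations generated by another. The single-coin "playout" and all the
numbers are made up purely for illustration, and I'm reading the p_i in the
Sum p_i z_i / Sum p_i form as target/behaviour probability ratios, which is
my interpretation rather than exactly what Remi wrote.

import random

def toy_playout(policy_bias, rng):
    """One 'playout' reduced to a single biased coin: returns the outcome z
    and the probability of that outcome under the given policy."""
    win = rng.random() < policy_bias
    z = 1.0 if win else 0.0
    return z, (policy_bias if win else 1.0 - policy_bias)

behaviour, target = 0.3, 0.6     # two different playout policies (toy numbers)
rng = random.Random(0)

num = den = plain = 0.0
N = 200000
for _ in range(N):
    z, p_behaviour = toy_playout(behaviour, rng)
    # Probability of the same outcome under the policy we want to evaluate.
    p_target = target if z == 1.0 else 1.0 - target
    w = p_target / p_behaviour   # importance weight
    num += w * z
    den += w
    plain += z

print("naive mean under behaviour policy:", plain / N)   # ~0.3
print("self-normalised IS estimate      :", num / den)   # ~0.6, the target value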


-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread David Silver

Hi Remi,

I understood this. What I find strange is that using -1/1 should be  
equivalent to using 0/1, but your algorithm behaves differently: it  
ignores lost games with 0/1, and uses them with -1/1.


Imagine you add a big constant to z. One million, say. This does not
change the problem. You get either 1,000,000 or 1,000,001 as the outcome of
a playout. But then, your estimate of the gradient becomes complete
noise.


So maybe using -1/1 is better than 0/1 ? Since your algorithm
depends so much on the definition of the reward, there must be an
optimal way to set the reward. Or there must be a better way to define
an algorithm that would not depend on an offset in the reward.


There is still something wrong that I don't understand. There may be  
a way to quantify the amount of noise in the unbiased gradient  
estimate, and it would depend on the average reward. Probably  
setting the average reward to zero is what would minimize noise in  
the gradient estimate. This is just an intuitive guess.


Okay, now I understand your point :-) It's a good question - and I  
think you're right. In REINFORCE any baseline can be subtracted from  
the reward, without affecting the expected gradient, but possibly  
reducing its variance. The baseline leading to the best estimate is  
indeed the average reward.  So it should be the case that  {-1,+1}  
would estimate the gradient g more efficiently than {0,1}, assuming  
that we see similar numbers of black wins as white wins across the  
training set.


So to answer your question, we can safely modify the algorithm to use  
(z-b) instead of z, where b is the average reward. This would then  
make the {0,1} and {-1,+1} cases equivalent (with appropriate scaling  
of step-size). I don't think this would have affected the results we  
presented (because all of the learning algorithms converged anyway, at  
least approximately, during training) but it could be an important  
modification for larger boards.
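Here is a tiny illustration of the baseline argument (a made-up one-step
example, not the setting of the paper): the gradient estimate stays unbiased
for any baseline, but its spread depends on it. In this degenerate one-step
toy the centred reward removes the variance entirely; in real playouts it
would only reduce it, and {-1,+1} is just the centred {0,1} case scaled by 2.

import math
import random
import statistics

def one_playout(theta, rng):
    """Toy one-parameter policy: win probability sigmoid(theta).
    Returns the outcome z in {0,1} and d/dtheta log pi(outcome)."""
    p_win = 1.0 / (1.0 + math.exp(-theta))
    win = rng.random() < p_win
    grad_logp = (1.0 - p_win) if win else -p_win
    return (1.0 if win else 0.0), grad_logp

def gradient_samples(reward_offset, baseline, n=20000, theta=0.0):
    rng = random.Random(0)
    return [(z + reward_offset - baseline) * g
            for z, g in (one_playout(theta, rng) for _ in range(n))]

for offset, b, label in [(0.0, 0.0, "rewards {0,1},       no baseline "),
                         (0.0, 0.5, "rewards {0,1},       baseline 0.5"),
                         (-0.5, 0.0, "rewards {-0.5,+0.5}, no baseline ")]:
    s = gradient_samples(offset, b)
    print(f"{label}: mean {statistics.mean(s):+.3f}  stdev {statistics.pstdev(s):.3f}")

# All three means agree (~0.25, the true gradient); only the spread changes.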


-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread David Silver

Hi Yamato,


Thanks for the detailed explanation.
M, N and alpha are constant numbers, right?  What did you set them to?


You're welcome!
Yes, in our experiments they were just constant numbers M=N=100.


The feature vector is the set of patterns you use, with value 1 if a
pattern is matched and 0 otherwise. The simulation policy selects
actions in proportion to the exponentiated, weighted sum of all
matching patterns. For example, let's say move a matches patterns 1 and
2, move b matches patterns 1 and 3, and move c matches patterns 2 and
4. Then move a would be selected with probability e^(theta1 +
theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 +
theta4)). The theta values are the weights on the patterns which we
would like to learn. They are the log of the Elo ratings in Remi
Coulom's approach.


OK, I guess it is the formula 5 in the paper.


Yes, exactly.


The only tricky part is computing the vector psi(s,a). Each component
of psi(s,a) corresponds to a particular pattern, and is the difference
between the observed feature (i.e. whether the pattern actually
occurred after move a in position s) and the expected feature (the
average value of the pattern, weighted by the probability of selecting
each action).


I still don't understand this. Is it the formula 6?
Could you please give me an example like the above?


Yes that's right, this is equation 6.

Okay, let's continue the example above. Let's say that in position s,
using the current theta, moves a, b and c will be selected with
probability 0.5, 0.3 and 0.2 respectively. Let's say that move a was
actually selected. Now consider pattern 1: this is matched after a,
but the probability of matching was (0.5*1 + 0.3*1 + 0.2*0) = 0.8. So
psi_1(s,a) = 1 - 0.8 = 0.2. For pattern 2, psi_2(s,a) = 1 -
(0.5*1 + 0.3*0 + 0.2*1) = 0.3, etc. So in this example the vector
psi(s,a) = [0.2, 0.3, -0.3, -0.2].


In other words, psi tells us whether each pattern was actually matched  
more or less than we could have expected.
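In case it helps, here is the same worked example in a few lines of Python
(the moves, probabilities and pattern matches are copied straight from the
text above, nothing else is assumed):

# Three legal moves, their selection probabilities, and the patterns each matches.
probs = {"a": 0.5, "b": 0.3, "c": 0.2}
matches = {"a": {1, 2}, "b": {1, 3}, "c": {2, 4}}
chosen = "a"

psi = []
for pattern in (1, 2, 3, 4):
    observed = 1.0 if pattern in matches[chosen] else 0.0
    expected = sum(p for move, p in probs.items() if pattern in matches[move])
    psi.append(round(observed - expected, 10))

print(psi)   # [0.2, 0.3, -0.3, -0.2], matching the example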


Hope this helps.
-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-30 Thread David Silver
IMO other people's equations/code/ideas/papers always seem smarter  
than your own. The stuff you understand and do yourself just seems  
like common sense, and the stuff you don't always has a mystical air  
of complexity, at least until you understand it too :-)


On 30-Apr-09, at 1:59 PM, Michael Williams wrote:


I wish I was smart   :(


___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


Re: [computer-go] Monte-Carlo Simulation Balancing

2009-04-29 Thread David Silver

Hi Yamato,


Could you give us the source code which you used?  Your algorithm is
too complicated, so it would be very helpful if possible.


Actually I think the source code would be much harder to understand!  
It is written inside RLGO, and makes use of a substantial existing  
framework that would take a lot of effort to understand. (On a  
separate note I am considering making RLGO open source at some point,  
but I'd prefer to spend some effort cleaning it up before making it  
public).


But I think maybe Algorithm 1 is much easier than you think:

A: Estimate the value V* of every position in a training set, using deep
rollouts.

B: Repeat, for each position in the training set:
  1. Run M simulations, estimate the value of the position (call this V)
  2. Run another N simulations, average the value of psi(s,a) over all
     positions and moves in games that black won (call this g)
  3. Adjust parameters by alpha * (V* - V) * g

The feature vector is the set of patterns you use, with value 1 if a  
pattern is matched and 0 otherwise. The simulation policy selects  
actions in proportion to the exponentiated, weighted sum of all  
matching patterns. For example let's say move a matches patterns 1 and  
2, move b matches patterns 1 and 3, and move c matches patterns 2 and  
4. Then move a would be selected with probability e^(theta1 +  
theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 +  
theta4)). The theta values are the weights on the patterns which we  
would like to learn. They are the log of the Elo ratings in Remi  
Coulom's approach.


The only tricky part is computing the vector psi(s,a). Each component  
of psi(s,a) corresponds to a particular pattern, and is the difference  
between the observed feature (i.e. whether the pattern actually  
occurred after move a in position s) and the expected feature (the  
average value of the pattern, weighted by the probability of selecting  
each action).
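If it is useful, here is a rough Python sketch of how I would write the whole
procedure down, with the board, playout and pattern-matching machinery stubbed
out. The names simulate and match are placeholders, and normalising g by N
rather than by the number of black wins is just my reading of the description
above; treat it as pseudocode that happens to run, not as RLGO's actual code.

import math

def policy_probs(theta, legal_moves, match):
    """Softmax policy: pi(a) is proportional to exp(sum of matched pattern weights)."""
    scores = {a: math.exp(sum(theta[k] for k in match(a))) for a in legal_moves}
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}

def psi(theta, legal_moves, match, chosen, n_patterns):
    """Observed minus expected pattern features for the chosen move."""
    probs = policy_probs(theta, legal_moves, match)
    result = []
    for k in range(n_patterns):
        observed = 1.0 if k in match(chosen) else 0.0
        expected = sum(p for a, p in probs.items() if k in match(a))
        result.append(observed - expected)
    return result

def balance(theta, training_set, simulate, match, n_patterns,
            M=100, N=100, alpha=0.1):
    """training_set: list of (position, V_star) pairs, V_star from deep rollouts.
    simulate(position, theta) -> (z, decisions), where z is 1 if black won and
    decisions is the list of (legal_moves, chosen_move) pairs in that playout."""
    for position, V_star in training_set:
        # Step 1: estimate the value V of the position under the current policy.
        V = sum(simulate(position, theta)[0] for _ in range(M)) / M
        # Step 2: estimate g by averaging psi over playouts that black won.
        g = [0.0] * n_patterns
        for _ in range(N):
            z, decisions = simulate(position, theta)
            if z == 1:                          # losses contribute nothing here
                for legal_moves, chosen in decisions:
                    for k, v in enumerate(psi(theta, legal_moves, match,
                                              chosen, n_patterns)):
                        g[k] += v / N
        # Step 3: gradient step on the pattern weights.
        for k in range(n_patterns):
            theta[k] += alpha * (V_star - V) * g[k]
    return theta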


It's also very important to be careful about signs and the colour to  
play - it's easy to make a mistake and follow the gradient in the  
wrong direction.


Is that any clearer?
-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


[computer-go] Re: [computer go] Monte-Carlo Simulation Balancing

2009-04-29 Thread David Silver

Hi Remi,


What komi did you use for 5x5 and 6x6 ?

I used 7.5 komi for both board sizes.


I find it strange that you get only 70 Elo points from supervised
learning over uniform random. Don't you have any feature for atari
extension ? This one alone should improve strength immensely (extend
string in atari because of the previous move).

Actually no. The features are very simple, and know how to capture but
not how to defend ataris. I'm sure that a better set of features could  
improve by more than 70 Elo, but I expect we would still see a benefit  
to balancing the weights correctly. For example, the Fuego policy  
defends ataris and follows several other common-sense rules, but the  
results in 5x5 and 6x6 show that it is not well balanced on small  
boards.


Let us extrapolate: I got more than 200 Elo points of improvement from
my patterns in 9x9 over uniform random (I never really measured that, it
may be even more than 200).

I guess you did get more than 200 Elo on 9x9: in MoGo (Gelly et al. 2006)
the improvement from uniform random was at least 400 Elo, and I think
your simulation policy is probably at least as strong.


By the way, I was sad to hear you're not working on Crazystone any
more. Is it because you are busy with other projects?



So maybe I could get 600 more Elo points
with your method. And even more on 19x19.
I noticed that, in general, changes in the playout policy have a much
bigger impact on larger boards than on smaller boards.


As to whether we can extrapolate, I hope so :-)
I share the same feeling that improving the simulation policy will be  
more impactful on bigger boards with longer simulations.
On the other hand I've been surprised by many things in MC Go, so  
nothing is certain until we try it!


-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


[computer-go] Monte-Carlo Simulation Balancing

2009-04-28 Thread David Silver

Hi Yamato,


I like your idea, but why do you use only 5x5 and 6x6 Go?


1. Our second algorithm, two-ply simulation balancing, requires a  
training set of two-ply rollouts. Rolling out every position from a  
complete two-ply search is very expensive on larger board sizes, so we  
would probably have to consider some subset of leaf positions. We  
wanted to analyse the full algorithm first, before we started making  
approximations.
2. We can generate a lot more data on small boards, to give high  
confidence on the results we report.
3. IMO it's important to do the science to understand underlying  
principles first, and then scale up to bigger boards, more complex  
Monte-Carlo searches, etc.



I don't think the 200+ Elo improvement is so impressive


I agree that it would be much more impressive to report positive  
results on larger boards. But perhaps it is already interesting that  
tuning the balance of the simulation policy can make a big difference  
on small boards? Also, larger boards mean longer simulations, and so  
the importance of simulation balancing should become even more
pronounced.


because the previous approaches were not optimized for such small
boards.


I'm not sure what you mean here? The supervised learning and  
reinforcement learning approaches that we compared against are both  
trained on the small boards, i.e. the pattern weights are specifically  
optimised for that size of board.


I agree that the handcrafted policy from Fuego was not optimised for  
small boards, which is why it performed poorly. But perhaps this is  
also interesting, i.e. it suggests that a handcrafted policy for 9x9  
Go may be significantly suboptimal when used in 19x19 Go. So  
automatically learning a simulation policy may ultimately prove to be  
very beneficial.



I'm looking forward to your results on larger boards.


Me too :-)
Coming soon, will let you know how it goes.
But I hope that others will try these ideas too, it's always much  
better to compare multiple implementations of the same algorithm.


-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


[computer-go] Monte-Carlo Simulation Balancing

2009-04-27 Thread David Silver

Hi Remi,

If I understand correctly, your method makes your program 250 Elo points
stronger than my pattern-learning algorithm on 5x5 and 6x6, by just
learning better weights.


Yes, although this is just in a very simple MC setting.

Also we did not compare directly to the algorithm you used  
(minorisation/maximisation), but to a different algorithm for  
maximising the likelihood of an expert policy. But as the softmax  
policy is a log transformation of the generalised Bradley-Terry model,  
I think this comparison is still valid.


I had stopped working on my program for a very long time, but maybe
I will go back to work and try your algorithm.


I would be really interested to know how this works out! Please let me  
know.


Best wishes
-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


[computer-go] Monte-Carlo Simulation Balancing

2009-04-27 Thread David Silver

Hi Michael,

But one thing confuses me: you are using the value from Fuego's 10k
simulations as an approximation of the actual value of the position. But
isn't the actual value of the position either a win or a loss? On such
small boards, can't you assume that Fuego is able to correctly determine
who is winning, and round its evaluation to the nearest win/loss? i.e. if
it evaluates the position to 0.674, that gets rounded to 1. If such an
assumption about Fuego's ability to read the position on a small board is
valid, then it should improve the results of your balanced simulation
strategy, right? Or am I missing something?


It's true that 5x5 Go is solved, so in principle we could have used  
the true minimax values. However we chose to use an approach that can  
scale to larger boards, which means that we should treat the expert  
evaluations as approximate. And in fact Fuego was not always accurate  
on 6x6 boards, as we used only 10k simulations in our training set.


Also, I think it really helps to have soft rather than hard expert  
evaluations. We want a simulation policy that helps differentiate e.g.  
a 90% winning position from an 85% winning position. Rounding all the  
expert evaluations to 0 or 1 would lose much of this advantage.


-Dave


___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


[computer-go] Re: RAVE formula of David Silver (reposted)

2008-11-28 Thread David Silver

This document is confusing, but here is my interpretation of it. And
it works well for Valkyria. I would really want to see a pseudocode
version of it. I might post the code I use for Valkyria, but it is
probably not the same thing so I would probably just increase the
confusion if I did...

The virtual win-visits (which I think you meant and not 'win/loss')
ratios *are* what is computed in Equation 12. Equation 13 is standard
UCT. You use equation 14 instead of equation 13 to select the move to
search. For moves that are searched a lot Eq14 will finally approach
Eq13, since Beta should go towards 0.

Thanks for clarifying this Magnus.
In fact the first idea (combining the values of UCT and RAVE, eqns  
1-11) is reasonably well justified, and works well in practice. The  
second idea (combining the upper confidence bounds, eqns 12-14) is not  
so well justified, and I believe several people have found better ways  
to do this.

I think the term RAVE is often used in a confusing manner. Sometimes
it just means AMAF or as I prefer virtual win-visit ratios, and
sometimes RAVE seems to be that the algorithm that mixes the AMAF
values with normal UCT-values as described in the PDF.

I'd like to suggest the following use of terminology:
AMAF: the idea of estimating the value of an immediate move, from the  
value of that move played at any time.
RAVE: the idea of building a tree of AMAF values, one for each  
position and move in the tree.
UCT-RAVE: the idea of combining a RAVE value with a UCT value, for  
each position and move in the tree.

The way I think of it, RAVE is to AMAF as UCT is to UCB.
-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

[computer-go] Paper for AAAI

2008-04-17 Thread David Silver

Hi Petr,
Thanks for the great comments, sorry to be so slow in getting back to  
you (on vacation/workshop...)

Hello,

On Sun, Apr 06, 2008 at 08:55:26PM -0600, David Silver wrote:
 Here is a draft of the paper, any feedback would be very welcome :-)

 http://www.cs.ualberta.ca/~silver/research/publications/files/MoGoNectar.pdf

  you are saying that in minimax, opponent moves are selected by
minimizing the lower confidence bound - this seems novel, is that so? I
always got the impression that for the opponent moves, you reverse the
mean value but still use UCB.

I think this just depends on how you formulate the problem. If you use
minimax then the opponent moves should minimise a lower confidence  
bound. If you use negamax then the opponent moves should maximise an  
upper confidence bound.
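A tiny numerical check of that equivalence (my own illustration, with made-up
statistics and an arbitrary exploration constant): minimising a lower
confidence bound over the opponent's values, seen from our point of view,
selects the same child as maximising an upper confidence bound on the negated,
negamax-style values.

import math

def lcb(mean, visits, parent_visits, c=1.0):
    return mean - c * math.sqrt(math.log(parent_visits) / visits)

def ucb(mean, visits, parent_visits, c=1.0):
    return mean + c * math.sqrt(math.log(parent_visits) / visits)

# Children of an opponent node: (mean value from our point of view, visit count).
children = [(0.55, 120), (0.48, 40), (0.60, 10)]
parent = sum(visits for _, visits in children)

minimax_pick = min(range(len(children)),
                   key=lambda i: lcb(children[i][0], children[i][1], parent))
negamax_pick = max(range(len(children)),
                   key=lambda i: ucb(-children[i][0], children[i][1], parent))
assert minimax_pick == negamax_pick
print("both formulations select child", minimax_pick)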


  I see three unclear details about the RAVE algorithm: Are only nodes
in the tree considered for the inclusion, or also moves in the following
random playout?

Also moves in the following random playout. Will clarify, thanks.

And one of the sentences hints that there is a separate
period of playouts purely seeding the RAVE value before the UCT-RAVE
linear combination takes over - is that so?

No, there is no separate period, it is always a linear combination
(with decaying weight).

And it does not follow from
the paper that the UCB formula is used for the RAVE value as well,
while the ICML paper states that.

Unfortunately we did not have space for this discussion :-(
In any case, later versions of MoGo used (still use?) an exploration  
constant of zero, which means that the UCB formula is no longer used.


  I am surprised by the bad effect of the grandfather heuristic and the
good effect of the even-game heuristic. I assume that the effect of the
heuristics should accumulate when several of them are combined in the
prior value?

Yes, I agree it is surprising! Combining sources of prior knowledge
may well be a good idea, I don't believe much work has been done on  
this topic.




  The paper looks very nice otherwise.

Thanks!
-Dave
PS Apologies for the pdf problems that some people encountered, I'll  
post a new version on my website shortly.
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

[computer-go] Paper for AAAI

2008-04-06 Thread David Silver

Hi everyone,

Sylvain and myself have had a paper accepted for the Nectar track at  
the 23rd Conference on Artificial Intelligence (AAAI-08). The idea of  
this track is to summarise previously published results from a  
specific field to a wider audience interested in general AI.


Please bear in mind the following:

1. There is no new material in this paper that we have not already  
published.
2. There is a 4 page limit, and so there are many details that have  
been skipped over in this paper.
3. The learned heuristic, using local shape features, is not used in  
the competitive version of MoGo (including the MPI MoGo that played a  
professional player last week).


Several people on the mailing list expressed disappointment last year  
that they found our ICML paper inaccessible. This paper is intended to  
be read by a wider audience, and should hopefully be more accessible.  
Having said that, this is still an academic paper for a technical AI  
conference, and so some background knowledge is assumed :-)


Here is a draft of the paper, any feedback would be very welcome :-)

http://www.cs.ualberta.ca/~silver/research/publications/files/MoGoNectar.pdf

David and Sylvain
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


[computer-go] New UCT-RAVE formula (was Re: computer-go Digest, Vol 43, Issue 8)

2008-02-16 Thread David Silver

I am very confused about the new UCT-RAVE formula.
The equation 9 seems to mean:

variance_u = value_ur * (1 - value_ur) / n.

Is it wrong?  If correct, why is it the variance?
I think that the variance of the UCT should be:

variance_u = value_u * (1 - value_u).

Hi Yamato,

There are two differences between your suggestion and the original  
formula, so I'll try and address both:


1. Your formula gives the variance of a single simulation, with  
probability value_u. But the more simulations you see, the more you  
reduce the uncertainty, so you must divide by n.


In general, the variance of a single coin-flip (with probability p of  
heads) is p(1-p).

The variance of the sum of n coin-flips is np(1-p).
The variance of the average of n coin-flips is p(1-p)/n. This is what  
we want!


2. The variance of the estimate is actually given by: variance_u =  
true_value_u * (1 - true_value_u) / n, where true_value_u is the real  
probability of winning a simulation (for the current policy), if we  
had access to an oracle. Unfortunately, we don't - so we use the best  
available estimate. If we have seen a large number of simulations,  
then you are right that value_u is the best estimate. But if we have  
only seen a few simulations, then value_r gives a better estimate  
(this is the point of RAVE!)   The whole point of this approach is to  
form the best possible estimate of true_value_u, by combining these  
two estimates together. In a way this is somewhat circular: we use the  
best estimate so far to compute the best new estimate. But I don't  
think that is unreasonable in this case.
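To give a feel for the idea without reproducing equations 11-14 here, below
is only a generic sketch of combining two Bernoulli estimates of the same
value by their estimated inverse variances. The rave_bias constant and the
use of each estimate's own mean in its variance (rather than the combined
estimate, as discussed above) are simplifications of mine, not the formula
in rave.pdf.

def combined_value(wins_u, n_u, wins_r, n_r, rave_bias=0.1, eps=1e-9):
    """Blend the direct (UCT) and RAVE estimates of a move's value, weighting
    each by its estimated inverse variance. The RAVE estimate carries an extra
    squared-bias penalty, so it is trusted less as direct evidence accumulates."""
    value_u = wins_u / max(n_u, 1)
    value_r = wins_r / max(n_r, 1)
    var_u = value_u * (1.0 - value_u) / max(n_u, 1) + eps
    var_r = value_r * (1.0 - value_r) / max(n_r, 1) + rave_bias ** 2 + eps
    beta = var_u / (var_u + var_r)          # weight given to the RAVE estimate
    return beta * value_r + (1.0 - beta) * value_u

# With few direct simulations the plentiful RAVE statistics dominate...
print(combined_value(wins_u=1, n_u=2, wins_r=20, n_r=60))        # close to 20/60
# ...and with many direct simulations the UCT statistics take over.
print(combined_value(wins_u=300, n_u=600, wins_r=400, n_r=700))  # close to 0.5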


-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

[computer-go] New UCT-RAVE formula (was Re: computer-go Digest, Vol 43, Issue 8)

2008-02-16 Thread David Silver

Hi Erik,
Thanks for the thought-provoking response!

Yes, but why add upper confidence bounds to the rave values at all? If
they really go down that fast, does it make much of a difference?

According to the recent experiments in MoGo, you are right :-)
However, I've seen slightly different results in other strong UCT-RAVE  
programs.
I think you are right that the exploration interacts heavily with the  
playout strategy. If the playouts heavily favour a particular response  
(e.g. black always connects at d3), then using an UCB on RAVE is one  
way to ensure that the opposite result is also tried (e.g. what  
happens if white cuts at d3?).

I did of course test this and my program actually became weaker when I
added UCBs to the rave values!

Interesting! Did you use prior knowledge in your RAVE values? (See
below)


The way I originally understood it the purpose of rave was to play
stronger when the number of simulations is low. This can already be
achieved by adding a greedy component to emphasize moves that are
likely to be winning based on their rave-value alone.

Agreed, this is certainly the most important aspect of RAVE.
The purpose of UCB in UCT is clear (to ensure sufficient exploration
when the tree grows large).  UCB in rave does not really do the same.

Agreed.

I think its
two main effects are: (1) it emphasizes the inverse order in which the
moves were added,

Not sure what you mean by this.

and (2) it emphasizes exploration of moves that are
infrequently selected in the playouts.

Agreed.

I think neither of them is a
good thing. The better the move ordering or the knowledge in the
playouts, the more it hurts...



Not if you also encode your playout knowledge as prior knowledge for  
the RAVE algorithm. This way, you can specify your confidence in the  
choices made by the playouts. The UCB is always relative to this  
confidence level.
Having said that, there may already be sufficient exploration without  
the UCB bonuses. Perhaps all we can say is that RAVE, prior knowledge  
and exploration all interact heavily, and that the best level of  
exploration depends on the exact choices made.

-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

[computer-go] New UCT-RAVE formula (was Re: computer-go Digest, Vol 43, Issue 8)

2008-02-16 Thread David Silver

David Silver wrote:
BTW if anyone just wants the formula, and doesn't care about the
derivation - then just use equations 11-14.

Yes, I just want to use the formula.
But I don't know what the bias is...
How can I get the value of br?

Sorry for the slow reply...

The simplest answer is that the bias is just a constant, estimating  
how wrong the RAVE estimates typically are. One way to get this number  
is trial and error. Another way would be to look at the average  
absolute difference between RAVE and UCT estimates in nodes with many  
simulations.


-Dave
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

[computer-go] Re: Explanation to MoGo paper wanted.

2007-07-05 Thread David Silver

 In other words UCT works well when the evaluation/playouts is/are
 strong. I believe there are still improvements possible to the UCT
 algorithm, as shown by the recent papers by the MoGo and CrazyStone
 authors, but what will really make a difference is the quality of
 the playouts.

 Sylvain said that good moves in the playouts do not always improve
 the performance of UCT. What do you think about this claim?


I believe this claim is true in two senses:

1) If the computation necessary to find better moves is too
expensive, performing many dumb playouts may be a better investment.



Sure, this is true. But even with the same number of simulations,  
stronger playouts do not necessarily perform better than dumb
playouts. This is the real mystery!



2) If the playouts are too deterministic, and the moves are merely
pretty good, the program may avoid an important move and thus
misjudge the value of a position.


We tried the whole spectrum from completely random to completely  
deterministic playouts, but we never came close to the performance of  
the dumb playouts!


We have seen a similar effect many times in MoGo. Often we try  
something that seems like it should improve the quality of the  
simulation player, but it makes the overall performance worse. It is  
frustrating and surprising! Has anyone else encountered this?


-Dave


___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

[computer-go] Re: Explanation to MoGo paper wanted.

2007-07-05 Thread David Silver
Seems like it should be up to the person in the other environment to adapt
your successful algorithm (and notation/terminology) to their environment.


But how do the other people in other environments find out about the  
algorithm? And find out that it is something they could use in their  
own environment? I think we can help with both, by presenting our  
work in a more general way.


-Dave___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

[computer-go] Re: Explanation to MoGo paper wanted. (BackGammon Code)

2007-07-03 Thread David Silver


  It's because Go is not the only game in the world, and certainly not the
  only reinforcement learning problem. They are using a widely accepted
  terminology.

But a very inappropriate one. I have read Sutton's book and all the things
I know (e.g. TD-Gammon) are completely obfuscated.


Really? I think it is a wonderful example of clear thinking and clear  
writing - I couldn't put the book down. It is the reason I chose to  
study RL, and to come study with Rich Sutton.



It's maybe suitable for presenting general concepts, but it is extremely
complicated to formulate an algorithm in this framework.


Of course everyone hopes that ideas will be presented to them in  
their personal terminology, as it saves them some effort. But we make  
progress in science by unifying and identifying the general concepts.


But the main point is: I think game programmers should be more proud of
their work and should present their results in the language of game
programming. We are the ones who make progress, not these paper tigers.


Isn't there room for both? Shouldn't we present our work within our  
own community, but also make efforts to share our ideas with others?


-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

[computer-go] Re: Explanation to MoGo paper wanted. (BackGammon Code)

2007-07-03 Thread David Silver

  It's because Go is not the only game in the world, and certainly not the
  only reinforcement learning problem. They are using a widely accepted
  terminology.

But a very inappropriate one. I have read Sutton's book and all the things
I know (e.g. TD-Gammon) are completely obfuscated. It's maybe suitable for
presenting general concepts, but it is extremely complicated to formulate
an algorithm in this framework.


Here is a quick and dirty RL-to-Computer-Go translation kit, to try and
help bridge the gap!


RL terminology                       Go terminology

State                                Position
Action                               Move
Reward                               Win/Loss
Return                               Win/Loss
Episode                              Game
Time-step                            One move
Agent                                Program
Value function                       Evaluation function
Policy                               Player
Default policy                       Simulation player
Uniform random policy                Light simulation player
Other stochastic policy              Heavy simulation player
Greedy policy                        1-ply search player
Epsilon-greedy policy                1-ply search player with some random moves
Feature                              Factor used for position evaluation
Weight                               Weight of each factor in evaluation function
Tabular representation               One weight for each complete position
Partial tabular representation       UCT tree
State abstraction                    One weight for many positions
Linear value function approximation  Evaluation function using weighted sum
                                     of various factors
Feature discovery                    Learning new factors for the evaluation function
Sample-based search                  Simulation (Monte-Carlo methods, etc.)
Transition function                  Rules of the game
Environment                          Rules of the game + opponent
Trajectory                           Move sequence
Online                               During actual play
Offline                              Before/after actual play (e.g. preprocessing)
On-policy                            If both players play as normal
Off-policy                           If either player behaves differently

-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

[computer-go] Re: Amsterdam 2007 paper

2007-05-21 Thread David Silver


On 5/18/07, Rémi Coulom [EMAIL PROTECTED] wrote:


My idea was very similar to what you describe. The program built a
collection of rules of the kind "if condition then move". Condition
could be anything from a tree-search rule of the kind "in this
particular position play x", or a general rule such as "in atari,
extend". It could also be anything in-between, such as a miai specific
to the current position. The strengths of moves were updated with an
incremental Elo-rating algorithm, from the outcomes of random
simulations.


The obvious way to update weights is to reward all the rules that fired
for the winning side, and penalize all rules that fired for the losing
side, with rewards and penalties decaying toward the end of the playout.
But this is not quite Elo-like, since it doesn't consider rules beating
each other. So one could make the reward dependent on the relative weight
of the chosen rule versus all alternatives, increasing the reward if the
alternatives carried a lot of weight.
Is that how your ratings worked?

I'm not sure how that compares with TD learning. Maybe someone more
familiar with the latter can point out the differences.


TD learning (with linear function approximation) uses a gradient  
descent rule to update weights. The simplest gradient descent rule,  
LMS or Widrow-Hoff, does something like you describe: rules that are  
followed by positive reward (win) are increased in weight, and rules  
that are followed by negative reward (loss) are decreased. The exact  
update depends on the set of rules firing, and is proportional to the  
error between the estimated reward (based on all rules) and the  
actual reward. In other words, each weight is updated a little  
towards the value which would have made a correct overall prediction.  
TD learning is similar, except that it updates weights towards a  
subsequent prediction of the reward (e.g. on the next move), instead  
of the actual reward. Rich Sutton gives a much better explanation  
than me: http://www.cs.ualberta.ca/%7Esutton/book/ebook/the-book.html
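To make the distinction concrete, here is a minimal TD(0) sketch with linear
function approximation on Sutton's little random-walk example (nothing
Go-specific; tabular indicator features are just the simplest linear case).
The key line is the update toward reward plus the next prediction, where
LMS/Monte-Carlo would instead use the final outcome of the episode.

import random

random.seed(3)
N_STATES = 5          # a tiny random walk; terminal states lie beyond both ends

def features(state):
    x = [0.0] * N_STATES
    x[state] = 1.0    # indicator features: the tabular special case of linear FA
    return x

def value(w, state):
    return sum(wi * xi for wi, xi in zip(w, features(state)))

def td0_episode(w, alpha=0.1):
    state = N_STATES // 2
    while True:
        next_state = state + random.choice((-1, 1))
        if next_state < 0:
            reward, done, v_next = 0.0, True, 0.0      # left terminal: loss
        elif next_state >= N_STATES:
            reward, done, v_next = 1.0, True, 0.0      # right terminal: win
        else:
            reward, done, v_next = 0.0, False, value(w, next_state)
        # TD(0): move the prediction toward reward + next prediction.
        # LMS / Monte-Carlo would instead wait for the episode's final outcome.
        delta = reward + v_next - value(w, state)
        for i, xi in enumerate(features(state)):
            w[i] += alpha * delta * xi
        if done:
            return
        state = next_state

w = [0.0] * N_STATES
for _ in range(5000):
    td0_episode(w)
print([round(v, 2) for v in w])   # approaches [1/6, 2/6, 3/6, 4/6, 5/6]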


___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


[computer-go] Re: computer-go Digest, Vol 34, Issue 15

2007-05-18 Thread David Silver

Very interesting paper!

I have one question. The assumption in your paper is that increasing  
the performance of the simulation player will increase the  
performance of Monte-Carlo methods that use that simulation player.  
However, we found in MoGo that this is not necessarily the case! Do  
you think there is some property of your learning algorithm that  
makes it particularly suitable for Monte-Carlo methods?


Thanks!
Dave


Date: Thu, 17 May 2007 01:17:55 +0200
From: Rémi Coulom [EMAIL PROTECTED]
Subject: [computer-go] Amsterdam 2007 paper
To: computer-go computer-go@computer-go.org

Hi,

I first thought I would keep my ideas secret until the Amsterdam
tournament, but now that I have submitted my paper, I cannot wait to
share it. So, here it is:

http://remi.coulom.free.fr/Amsterdam2007/

Comments and questions are very welcome.

Rémi




___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/


[computer-go] Re: Amsterdam 2007 paper

2007-05-18 Thread David Silver
I also use an online learning algorithm in RLGO to adjust feature  
weights during the game. I use around a million features (all  
possible patterns from 1x1 up to 3x3 at all locations on the board)  
and update the weights online from simulated games using temporal  
difference learning. I also use the sum of feature weights to  
estimate the value of a move, rather than a multiplicative estimate.  
The learning signal is only win/lose at the end of each simulation,  
rather than supervised learning like Remi. The results are  
encouraging (currently ~1820 Elo on CGOS, based on 5000 simulations  
per move) for a program that does not use UCT or Monte-Carlo Tree  
Search in any way.


Like Alvaro and Remi, I am convinced that online learning during  
simulation has great potential. In fact, I would go even further, and  
suggest that Monte-Carlo Tree Search and UCT are at one end of a  
large spectrum of promising sample-based search algorithms. We don't  
need to restrict ourselves to a tabular representation: state  
abstraction (for example, using pattern features) provides a way to  
generalise between positions and learn more efficiently. Nor do we  
need to restrict our learning algorithm to Monte-Carlo: any online  
reinforcement learning algorithm can be applied to learn from  
simulated experience.


In classical game programming terms, we can think of this approach as  
dynamically learning an evaluation function. Instead of learning a  
single evaluation function that applies to all positions, we tailor  
the evaluation function to the current position, by learning from  
simulated experience. Rather than thinking of learning as an offline  
procedure that takes months to train weights, let's think of it as an  
online procedure that can run in a few seconds. Once we learn the  
evaluation function, it can be used in any way we please - for  
example alpha-beta search (RLGO 2.1 uses a full width 5-ply search)  
or as a heuristic function for UCT (current research in MoGo).


I would be very interested to hear more about Dimwit's approach, and  
also Remi's experiments with online learning in CrazyStone.


-Dave

PS Apologies if you receive this twice, I had some difficulties posting.


Rémi:

John and I are planning a similar usage of patterns and features in
the MC simulations, although we were thinking of a very different
method to update strengths. Although we derived our formulas from an
extension of the logistic regression instead of Bradley-Terry models,
we arrived at very similar expressions. The main difference is that
what we have been calling score seems to be the log of your
strength, and instead of multiplying strengths when considering
teams, we just add the scores of different features. We use an
online learning algorithm to adjust the scores during the game, which
severely limits the kind of algorithms we can use, but also allows us
to learn things that apply specifically to the situations that are
appearing on the board in this game.

Now we only have to make dimwit 400 points stronger to show the world
that our approach works. :)

There are many things in the paper that we had never thought of, like
considering the distance to the penultimate move. Most of those things
won't be directly applicable to dimwit, but it gives us many
suggestions of things to try.

Thanks for this important contribution to computer go.


Álvaro.


___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

[computer-go] Re: Amsterdam 2007 paper

2007-05-18 Thread David Silver



Thanks for the great paper.  And thanks for sharing it before it's  
published.


Now I know what directions to take my engine in next.

Time for Team MoGo to share some more secrets  :)


We are publishing MoGo's secrets at ICML 2007, in just over a month.
So not long to wait now!

-Dave


___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/