Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Andy
I made a pull request to Leela and put some data in there. It shows that
the details of how Q is initialized are actually important:
https://github.com/gcp/leela-zero/pull/238



Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Álvaro Begué
You are asking about the selection of the move that goes to a leaf. When
the node before the move was expanded (in a previous playout), the value of
Q(s,a) for that move was initialized to 0.

The UCB-style formula they use in the tree part of the playout is such that
the first few visits will follow the probability distribution from the
policy output of the network, and over time it converges to using primarily
the moves that have the best results. So the details of how Q is initialized
are not very relevant.
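
For reference, the selection rule from the paper's Methods ("Select" step),
lightly paraphrased, with c_puct an exploration constant:

\[
a_t = \arg\max_a \left( Q(s_t,a) + U(s_t,a) \right), \qquad
U(s,a) = c_{\mathrm{puct}} \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} .
\]

For an edge with N(s,a) = 0 the U term is proportional to the prior P(s,a),
which is why the earliest visits track the policy distribution.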



Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Andy
Álvaro, you are quoting from "Expand and evaluate (Figure 2b)". But my
question is about the section before that, "Select (Figure 2a)". So the node
has not been expanded and initialized yet.

As Brian Lee mentioned, his MuGo uses the parent's value, which assumes that,
absent further information, the child's value should be close to the
parent's.

LeelaZ uses 1.1 for a "first play urgency", which assumes you should
prioritize getting at least one evaluation from the NN for each node.
https://github.com/gcp/leela-zero/blob/master/src/UCTNode.cpp#L323

Finally, using a value of 0 would seem to place extra confidence in the
policy net's prior values.
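
A minimal sketch of the three initializations being compared (the helper and
its names are illustrative, not MuGo's or Leela's actual code):

def initial_q(parent_q, scheme, fpu_value=1.1):
    # Q to assume for an edge with N(s,a) == 0 under each scheme
    if scheme == "zero":     # AGZ paper: edges are created with Q = 0
        return 0.0
    if scheme == "parent":   # MuGo-style: assume the child is about as good as its parent
        return parent_q
    if scheme == "fpu":      # LeelaZ-style first play urgency: hurry one NN eval per child
        return fpu_value
    raise ValueError(scheme)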

I feel like MuGo's implementation makes sense, but I'm trying to get some
experimental evidence showing the impact before suggesting it to Leela's
author. So far my self-play tests with different settings do not show a big
impact, but I am changing other variables at the same time.

- Andy




Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Álvaro Begué
The initial value of Q is not very important because Q+U is dominated by
the U term when the number of visits is small.

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Brian Lee
It should default to the Q of the parent node. Otherwise, let's say that
the root node is a losing position. Upon choosing a follow-up move, its Q
will be updated to a very negative value, and that node won't get explored
again - at least until all 362 top-level children have been explored and
revealed to have negative values. So without initializing Q to the parent's
Q, you would end up wasting 362 MCTS iterations.

Brian
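
A toy illustration of that argument (assumed numbers and a hypothetical
one-level PUCT loop, not MuGo's or Leela's actual code): with a peaked prior
and every leaf returning the same bad value, zero-initialized Q keeps cycling
through fresh children, while parent-initialized Q keeps the search on the
moves the policy already likes.

import math

def distinct_children_tried(q_init, playouts=100, children=362,
                            c_puct=1.0, leaf_value=-0.8):
    # a peaked prior: the policy likes a few moves, the rest share the remainder
    P = [0.4, 0.2, 0.1] + [0.3 / (children - 3)] * (children - 3)
    N = [0] * children
    W = [0.0] * children
    visited = set()
    for _ in range(playouts):
        total = sum(N)
        def score(a):
            q = W[a] / N[a] if N[a] > 0 else q_init       # Q for an unvisited edge
            u = c_puct * P[a] * math.sqrt(total) / (1 + N[a])
            return q + u
        a = max(range(children), key=score)
        visited.add(a)
        N[a] += 1
        W[a] += leaf_value      # every continuation is equally bad in this toy position
    return len(visited)

print(distinct_children_tried(q_init=0.0))    # roughly 96: almost every playout tries a new move
print(distinct_children_tried(q_init=-0.8))   # 3: the search stays on the few high-prior moves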

Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Álvaro Begué
The text in the appendix has the answer, in a paragraph titled "Expand and
evaluate (Fig. 2b)":
  "[...] The leaf node is expanded and each edge (s_t, a) is
initialized to {N(s_t, a) = 0, W(s_t, a) = 0, Q(s_t, a) = 0, P(s_t, a) =
p_a}; [...]"
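
A minimal sketch of the expansion step that quoted text describes (the Edge
and Node types are illustrative, not AlphaGo Zero's or Leela's actual data
structures):

from dataclasses import dataclass, field

@dataclass
class Edge:
    P: float        # prior probability p_a from the policy head
    N: int = 0      # visit count
    W: float = 0.0  # total action value
    Q: float = 0.0  # mean action value -- the quantity under discussion

@dataclass
class Node:
    edges: dict = field(default_factory=dict)   # move -> Edge

def expand(node, policy_priors):
    # create every edge of the newly expanded leaf with N = W = Q = 0, P = p_a
    for move, p_a in policy_priors.items():
        node.edges[move] = Edge(P=p_a)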




Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Rémi Coulom
They have a Q(s,a) term in their node-selection formula, but they don't say
what value they give to an action that has not yet been visited. Maybe Aja
can tell us.


Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Andy
Figure 2a shows two bolded Q+U max values. The second one is going to a
leaf that doesn't exist yet, i.e. not expanded yet. Where do they get that
Q value from?

The associated text doesn't clarify the situation: "Figure 2: Monte-Carlo
tree search in AlphaGo Zero. a Each simulation traverses the tree by
selecting the edge with maximum action-value Q, plus an upper confidence
bound U that depends on a stored prior probability P and visit count N for
that edge (which is incremented once traversed). b The leaf node is
expanded..."







Re: [Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Álvaro Begué
I am not sure where in the paper you think they use Q(s,a) for a node s
that hasn't been expanded yet. Q(s,a) is a property of an edge of the
graph. At a leaf they only use the `value' output of the neural network.

If this doesn't match your understanding of the paper, please point to the
specific paragraph that you are having trouble with.

Álvaro.




[Computer-go] action-value Q for unexpanded nodes

2017-12-03 Thread Andy
I don't see the AGZ paper explain what the mean action-value Q(s,a) should
be for a node that hasn't been expanded yet. The equation for Q(s,a) has
the term 1/N(s,a) in it because it's supposed to average over N(s,a)
visits. But in this case N(s,a)=0 so that won't work.

Does anyone know how this is supposed to work? Or is it another detail AGZ
didn't spell out?
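
For context, the backup step from the paper's Methods (paraphrased), which is
where the 1/N(s,a) averaging comes from:

\[
N(s,a) \leftarrow N(s,a) + 1, \qquad
W(s,a) \leftarrow W(s,a) + v, \qquad
Q(s,a) = \frac{W(s,a)}{N(s,a)} ,
\]

so Q(s,a) is the mean of the leaf values v backed up through edge (s,a), and
the formula leaves Q undefined while N(s,a) = 0, which is the gap this thread
is about.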

Re: [Computer-go] Significance of resignation in AGZ

2017-12-03 Thread Brian Sheppard via Computer-go
I have been interested in a different approach, and it had some elements in 
common with AGZ, so AGZ gave me the confidence to try it.

 


[Computer-go] What happens if you only feed the current board position to AGZ?

2017-12-03 Thread Imran Hendley
AlphaGo Zero's neural network takes a 19x19x17 input representing the
current and seven previous board positions (two stone planes per position)
plus the side to play. What if you were to give it only the current board
position and side to play, and you handled all illegal ko moves only in the
tree?

So obviously the network cannot distinguish between two identical positions,
one where there is an illegal ko move and one where there is not. But after
running MCTS long enough and expanding the tree, AGZ should understand what
is going on, right?

Does this just make it require more time to find the best move, or is it
somehow fundamentally broken?

The only thing I can think of is that ko threats might sometimes linger for
a very long time, so maybe this is a big problem, but my understanding of
Go is limited.

For comparison, the original AlphaGo used a feature plane of ones and zeros
to indicate legal and illegal moves.
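
A sketch of the two input encodings being compared (the plane layout follows
the paper's description as I understand it; the helper names and the history
format are assumptions):

import numpy as np

def agz_planes(history, black_to_play):
    # `history` is a list of 19x19 int8 boards, most recent last, with
    # 1 = current player's stones, -1 = opponent's stones, 0 = empty.
    planes = []
    for t in range(8):                            # current and 7 previous positions
        if t < len(history):
            board = history[-1 - t]
        else:
            board = np.zeros((19, 19), np.int8)   # pad before move 8
        planes.append((board == 1).astype(np.float32))    # own stones at step t
        planes.append((board == -1).astype(np.float32))   # opponent stones at step t
    planes.append(np.full((19, 19), float(black_to_play), np.float32))  # colour plane
    return np.stack(planes)                       # shape (17, 19, 19)

def current_position_planes(history, black_to_play):
    # the reduced input proposed above: current position + side to play only,
    # with ko legality handled purely in the search tree
    board = history[-1]
    return np.stack([
        (board == 1).astype(np.float32),
        (board == -1).astype(np.float32),
        np.full((19, 19), float(black_to_play), np.float32),
    ])                                            # shape (3, 19, 19)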

Re: [Computer-go] Significance of resignation in AGZ

2017-12-03 Thread Chaz G.
Hi Brian,

Thanks for sharing your genuinely interesting result. One question, though:
why would you train a non-"zero" program? Do you think your program, as a
result of your rules, will perform better than zero, or is imitating the
best-known algorithm inconvenient for your purposes?

Best,
-Chaz

On Sat, Dec 2, 2017 at 7:31 PM, Brian Sheppard via Computer-go <
computer-go@computer-go.org> wrote:

> I implemented the ad hoc rule of not training on positions after the first
> pass, and my program is basically playing moves until the first pass is
> forced. (It is not a “zero” program, so I don’t mind ad hoc rules like
> this.)
>
>
>
> From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of Xavier Combelle
> Sent: Saturday, December 2, 2017 12:36 PM
> To: computer-go@computer-go.org
> Subject: Re: [Computer-go] Significance of resignation in AGZ
>
>
>
> It might make sense to enable the resignation threshold even at a stupid
> level. That way the first thing the network would learn is not to resign
> too early (even before it learns not to pass).
>
>
>
> On 02/12/2017 at 18:17, Brian Sheppard via Computer-go wrote:
>
> I have some hard data now. My network’s initial training reached the same
> performance in half the iterations. That is, the steepness of skill gain in
> the first day of training was twice as great when I avoided training on
> fill-ins.
>
>
>
> That has all the usual caveats: only one run before/after, YMMV, etc.
>
>
>
> From: Brian Sheppard [mailto:sheppar...@aol.com]
> Sent: Friday, December 1, 2017 5:39 PM
> To: 'computer-go'
> Subject: RE: [Computer-go] Significance of resignation in AGZ
>
>
>
> I didn’t measure precisely because as soon as I saw the training artifacts
> I changed the code. And I am not doing an AGZ-style experiment, so there
> are differences for sure. So I will give you a swag…
>
>
>
> Speed difference is maybe 20%-ish for 9x9 games.
>
>
>
> A frequentist approach will overstate the frequency of fill-in plays by a
> pretty large factor, because fill-in plays are guaranteed to occur in every
> game but are not best in the competitive part of the game. This will affect
> the speed of learning in the early going.
>
>
>
> The network will use some fraction (almost certainly <= 20%) of its
> capacity to improve accuracy on positions that will not contribute to its
> ultimate strength. This applies to both ordering and evaluation aspects.
>
> From: Andy [mailto:andy.olsen...@gmail.com]
> Sent: Friday, December 1, 2017 4:55 PM
> To: Brian Sheppard; computer-go
> Subject: Re: [Computer-go] Significance of resignation in AGZ
>
>
>
> Brian, do you have any experiments showing what kind of impact it has? It
> sounds like you have tried both with and without your ad hoc first pass
> approach?
>
> 2017-12-01 15:29 GMT-06:00 Brian Sheppard via Computer-go <
> computer-go@computer-go.org>:
>
> I have concluded that AGZ's policy of resigning "lost" games early is
> somewhat significant. Not as significant as using residual networks, for
> sure, but you wouldn't want to go without these advantages.
>
> The benefit cited in the paper is speed. Certainly a factor. I see two
> other advantages.
>
> First is that training does not include the "fill in" portion of the game,
> where every move is low value. I see a specific effect on the move ordering
> system, since it is based on frequency. By eliminating training on
> fill-ins, the prioritization function will not be biased toward moves that
> are not relevant to strong play. (That is, there are a lot of fill-in
> moves, which are usually not best in the interesting portion of the game,
> but occur a lot if the game is played out to the end, and therefore the
> move prioritization system would predict them more often.) My ad hoc
> alternative is to not train on positions after the first pass in a game.
> (Note that this does not qualify as "zero knowledge", but that is OK with
> me since I am not trying to reproduce AGZ.)
>
> Second is that the positional evaluation is not trained on situations where
> everything is decided, so less of the NN's capacity is devoted to situations
> in which nothing can be gained.
>
> As always, YMMV.
>
> Best,
> Brian
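
A minimal sketch of the "don't train on positions after the first pass" rule
described in the quoted message (the game record format is assumed; names are
illustrative):

def training_positions(game_record):
    # game_record: list of (position, move, outcome) tuples in play order,
    # with move == "pass" marking a pass
    kept = []
    for position, move, outcome in game_record:
        kept.append((position, move, outcome))
        if move == "pass":
            break            # keep nothing after the first pass
    return kept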