Re: [Computer-go] Standard Computer Go Datasets - Proposal

2015-11-13 Thread Erik van der Werf
On Fri, Nov 13, 2015 at 10:46 AM, Darren Cook  wrote:
>
> The advantages of storing games:
>   * accountability/traceability
>   * for programs who want to learn sequences of moves.
>

Another advantage of storing games is that it is much more efficient; you
only have to encode one move per position.
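
As a rough illustration of the saving, here is a back-of-the-envelope sketch. The board size, the 2-bits-per-point encoding and the 200-move game length are assumptions made for the example, not figures from this thread.

```python
import math

BOARD_POINTS = 19 * 19   # assumed 19x19 board
BITS_PER_POINT = 2       # empty / black / white fits in 2 bits (assumed encoding)
MOVES_PER_GAME = 200     # assumed game length, for illustration only


def bits_as_positions(n_moves: int) -> int:
    """Bits needed if every position of the game is stored as a full board."""
    return n_moves * BOARD_POINTS * BITS_PER_POINT


def bits_as_moves(n_moves: int) -> int:
    """Bits needed if the game is stored as a move list (board point or pass)."""
    bits_per_move = math.ceil(math.log2(BOARD_POINTS + 1))  # +1 for a pass
    return n_moves * bits_per_move


print(bits_as_positions(MOVES_PER_GAME) // 8, "bytes as full positions")
print(math.ceil(bits_as_moves(MOVES_PER_GAME) / 8), "bytes as a move list")
```

A position database could of course be compressed, so this only shows the order of magnitude of the difference.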

Erik
___
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go

Re: [Computer-go] Standard Computer Go Datasets - Proposal

2015-11-13 Thread Gonçalo Mendes Ferreira
I think if you start calculating the Zobrist hashes and scraping 
features yourself you will have a neverending variety of datasets.


I would prefer datasets of whole, high-quality games without SGF errors, 
perhaps cleaned of identifying information. Parsing an SGF is already 
trivial. I personally divide them by:


- Handicap used or not
- Komi normal (5.5 - 7.5) or not; this disqualifies some older games
- Rules used
- Board size
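
The four criteria above can be read from the root node of an SGF file. The following is a minimal sketch, not a full SGF parser: the function names are made up for illustration, and the default board size of 19 when SZ is missing is an assumption. HA, KM, RU and SZ are the standard SGF property names.

```python
import re

# A property such as SZ[19]; the value may contain an escaped "\]".
PROP_RE = re.compile(r"([A-Z]+)\[((?:\\.|[^\]\\])*)\]", re.DOTALL)


def root_node(sgf_text):
    """Return the text of the first node (everything up to the next node)."""
    start = sgf_text.find(";")
    if start < 0:
        return ""
    in_value = False
    i = start + 1
    while i < len(sgf_text):
        ch = sgf_text[i]
        if in_value:
            if ch == "\\":
                i += 1            # skip the escaped character
            elif ch == "]":
                in_value = False
        elif ch == "[":
            in_value = True
        elif ch in ";()":
            break                 # next node or variation starts here
        i += 1
    return sgf_text[start + 1:i]


def _number(text, default=None):
    try:
        return float(text)
    except (TypeError, ValueError):
        return default


def classify(sgf_text):
    """Bucket a game by handicap, komi, rules and board size."""
    props = dict(PROP_RE.findall(root_node(sgf_text)))
    komi = _number(props.get("KM"))
    return {
        "handicap": (_number(props.get("HA"), 0) or 0) > 0,
        "normal_komi": komi is not None and 5.5 <= komi <= 7.5,
        "rules": props.get("RU", "unknown").strip().lower(),
        "board_size": int(props.get("SZ", "19").split(":")[0]),
    }


print(classify("(;GM[1]FF[4]SZ[9]KM[7.5]RU[Chinese];B[ee];W[cc])"))
```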

Following the idea of having more information instead of very specific 
features already extracted, it would be interesting to also have the 
playing times, although I don't know where you'd get that from.


You'd be an angel if you could provide a large dataset of matches with 
Chinese rules, especially on board sizes other than 19x19.


It would of course also have to be completely free for any use. I 
personally only use the KGS 6d+ and a collection of 70k pro games whose 
origin I don't know. The GoGoD is proprietary. :)


Gonçalo F.

On 11/13/2015 08:39 AM, Josef Moudrik wrote:

Hello List,

There has been some debate in science about making the research more
reproducible and open. Recently, I have been thinking about making a
standard public fixed dataset of Go games, mainly to ease comparison of
different methods, to make results more reproducible and maybe free the
authors of the burden of composing a dataset. I think that the current
practice can be improved a lot.

Since the success of this endeavor crucially depends on how many authors
use the dataset, I would like to ask You (potential authors) a few
questions:

1) Would this be welcomed and used? Would You personally use it? (Am I not
reinventing the wheel?)

2) What parameters should the dataset have? The number of dataset variants
(if any) should be in my opinion kept at bare minimum to reduce
"fragmentation".

2a) Size: My current view is that at least 2 sizes are necessary: small
(1000-2000 games?) and large dataset (5-6 games).
2b) Strength & year span: Currently I am thinking about including modern
professional games only (1970-2015)

3) Do you have any other comments, requirements for the dataset and ideas?


Thanks for Your attention,
Kind regards
Josef Moudrik




Re: [Computer-go] Standard Computer Go Datasets - Proposal

2015-11-13 Thread Gonçalo Mendes Ferreira
At least in the past, some DCNNs made use of the players' ranks, so it 
would be best to keep them.


On 11/13/2015 10:27 AM, Josef Moudrik wrote:

On Fri, Nov 13, 2015 at 11:16 AM Erik van der Werf 
wrote:


On Fri, Nov 13, 2015 at 10:46 AM, Darren Cook  wrote:


The advantages of storing games:
   * accountability/traceability
   * for programs who want to learn sequences of moves.



Another advantage of storing games is that it is much more efficient; you
only have to encode one move per position.

Erik



Yes,
I think that having full games would be much more useful. The anonymization
I had in mind would include hiding information not important for
computer processing such as file-names, player names, dates, ranks,
comments (given that the dataset would ensure consistent "balanced"
distribution). Like this, the database would have no (or much less) use for
human study.




[Computer-go] Standard Computer Go Datasets - Proposal

2015-11-13 Thread Josef Moudrik
Hello List,

There has been some debate in science about making the research more
reproducible and open. Recently, I have been thinking about making a
standard public fixed dataset of Go games, mainly to ease comparison of
different methods, to make results more reproducible and maybe free the
authors of the burden of composing a dataset. I think that the current
practice can be improved a lot.

Since the success of this endeavor crucially depends on how many authors
use the dataset, I would like to ask You (potential authors) a few
questions:

1) Would this be welcomed and used? Would You personally use it? (Am I not
reinventing the wheel?)

2) What parameters should the dataset have? The number of dataset variants
(if any) should be in my opinion kept at bare minimum to reduce
"fragmentation".

2a) Size: My current view is that at least 2 sizes are necessary: small
(1000-2000 games?) and large dataset (5-6 games).
2b) Strength & year span: Currently I am thinking about including modern
professional games only (1970-2015)

3) Do you have any other comments, requirements for the dataset and ideas?


Thanks for Your attention,
Kind regards
Josef Moudrik

Re: [Computer-go] Standard Computer Go Datasets - Proposal

2015-11-13 Thread Petr Baudis
  Hi!

On Fri, Nov 13, 2015 at 08:39:20AM +, Josef Moudrik wrote:
> There has been some debate in science about making the research more
> reproducible and open. Recently, I have been thinking about making a
> standard public fixed dataset of Go games, mainly to ease comparison of
> different methods, to make results more reproducible and maybe free the
> authors of the burden of composing a dataset. I think that the current
> practice can be improved a lot.

  I think the current de facto standard dataset is GoGoD (some year, not
quite fixed).  So I think it's useful to differentiate your proposal
against this dataset - what are the current problems and what will be
the advantage?

  One advantage would be of course if the dataset is freely available.
But it's not clear how to achieve that, i.e. where to get a large
professional game collection without copyright protection.

> 2a) Size: My current view is that at least 2 sizes are necessary: small
> (1000-2000 games?) and large dataset (5-6 games).

  What's the usecase for a small dataset?

-- 
Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton

Re: [Computer-go] Standard Computer Go Datasets - Proposal

2015-11-13 Thread Darren Cook
> standard public fixed dataset of Go games, mainly to ease comparison of
> different methods, to make results more reproducible and maybe free the
> authors of the burden of composing a dataset. 

Maybe the first question should be whether people want a database of
*positions* or *games*.

I imagine a position database to be a set of board descriptions, with
each pro move marked on it. Ideally each move would say not just the
number of times it was chosen, but break it down by rank of player.

Each would have a Zobrist hash calculated for all 8 symmetries (rotations
and reflections), and the lowest chosen. This handles rotations and
duplicates. If there is a ko-illegal point on the board, that needs to be
stored and also be part of the Zobrist hash.
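
A minimal sketch of that canonical hash, under some assumptions: the board is 19x19, the random key table is generated from a fixed seed (any published table would do), and the function and variable names are invented for the example. Side to move is left out for brevity.

```python
import random

SIZE = 19
_rng = random.Random(2015)  # fixed seed so the key table is reproducible

# One 64-bit key per (colour, x, y), plus one key per possible ko point.
STONE_KEYS = {(c, x, y): _rng.getrandbits(64)
              for c in "BW" for x in range(SIZE) for y in range(SIZE)}
KO_KEYS = {(x, y): _rng.getrandbits(64)
           for x in range(SIZE) for y in range(SIZE)}


def symmetries(x, y, n=SIZE):
    """The 8 images of (x, y) under the board's rotations and reflections."""
    m = n - 1
    return [(x, y), (m - x, y), (x, m - y), (m - x, m - y),
            (y, x), (m - y, x), (y, m - x), (m - y, m - x)]


def canonical_hash(stones, ko=None):
    """stones: {(x, y): 'B' or 'W'}; ko: optional ko-illegal point (x, y).

    Computes one Zobrist hash per symmetry and returns the lowest, so all
    8 orientations of the same position map to the same value.
    """
    hashes = [0] * 8
    for (x, y), colour in stones.items():
        for i, (sx, sy) in enumerate(symmetries(x, y)):
            hashes[i] ^= STONE_KEYS[(colour, sx, sy)]
    if ko is not None:
        for i, (sx, sy) in enumerate(symmetries(*ko)):
            hashes[i] ^= KO_KEYS[(sx, sy)]
    return min(hashes)


position = {(2, 3): "B", (15, 16): "W"}
rotated = {(15, 2): "B", (2, 15): "W"}  # same position turned by 90 degrees
print(canonical_hash(position) == canonical_hash(rotated))
```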


A database of positions has some advantages:
  * No licensing issues (*)
  * Rotational duplicates already removed
  * Ready-to-go with the information (most) programs want to learn.


The advantages of storing games:
  * accountability/traceability
  * for programs who want to learn sequences of moves.

Darren


*: At least that was my conclusion when I looked into this before. Game
collections can be copyrighted; moves cannot. A database of moves can be
freely distributed, even if it was generated from copyrighted game
collections, as long as there exists no way to regenerate the game
collection from it.

Text corpora (used in machine translation studies, for instance) follow
the same idea: if you split the corpora into sentences, then shuffle
them up randomly, you can distribute the set of sentences.

(I did wonder about storing player ranks, e.g. if a given position has a
move chosen by only a single 9p, and you can then extract each follow-up
position, you could extract a game. But, IMHO, you cannot regenerate any
particular game collection this way. If it is a concern, it can be
solved by only using a random 80% of moves from games.)


Re: [Computer-go] Standard Computer Go Datasets - Proposal

2015-11-13 Thread fotland

I would only use it if it is licensed for commercial use.
 
David

On Fri, 13 Nov 2015 08:39:20 +, Josef Moudrik  wrote:

  Hello List, 
There has been some debate in science about making the research more
reproducible and open. Recently, I have been thinking about making a
standard public fixed dataset of Go games, mainly to ease comparison of
different methods, to make results more reproducible and maybe free the
authors of the burden of composing a dataset. I think that the current
practice can be improved a lot.
 
Since the success of this endeavor crucially depends on how many
authors use the dataset, I would like to ask You (potential authors) a
few questions:
 
1) Would this be welcomed and used? Would You personally use it? (Am I
not reinventing the wheel?)
 
2) What parameters should the dataset have? The number of dataset
variants (if any) should be in my opinion kept at bare minimum to
reduce "fragmentation".
 
2a) Size: My current view is that at least 2 sizes are necessary: small
(1000-2000 games?) and large dataset (5-6 games). 
2b) Strength & year span: Currently I am thinking about including
modern professional games only (1970-2015)
 
3) Do you have any other comments, requirements for the dataset and ideas?
 
 
Thanks for Your attention,
Kind regards
Josef Moudrik
 


Re: [Computer-go] Standard Computer Go Datasets - Proposal

2015-11-13 Thread Josef Moudrik
Hello,

On Fri, Nov 13, 2015 at 10:13 AM  wrote:
> I would only use it if it is licensed for commercial use.

Yes, I would like to licence this as such, please see below.

On Fri, Nov 13, 2015 at 10:23 AM Petr Baudis  wrote:

> I think the current de facto standard dataset is GoGoD (some year, not
> quite fixed). So I think it's useful to differentiate your proposal
> against this dataset - what are the current problems and what will be
> the advantage?

Yes, I know GoGoD is used frequently, but I think that the lack of
"precise" specification is the problem. There are many choices an author
has to make when using the GoGoD database: year of release, year span,
handicap games?, amateur/professional? (how to tell? pro rank is d not p).
A related issue is that some of the games (if I remember my experience
correctly) cannot be parsed by some libraries, in which case they are
usually skipped. All of these are branching points that make "precise"
replication of results hard.

> One advantage would be of course if the dataset is freely available.
> But it's not clear how to achieve that, i.e. where to get a large
> professional game collection without copyright protection.

I consider this "negotiation" the hardest work I will have to do, but
before I start, I want to find out whether the dataset would even be used. From
the point of view of copyright law, I believe that what is protected is the
"collection of games" and "additional materials" (comments, etc.), not the
actual individual games themselves (which, as records of historical events,
afaik cannot be copyrighted). The "collection of games" and "additional
materials" rights of the current collection owners could be protected by
anonymization of the records and mixing of different databases, if the
current owners agree.

From the licensing point of view, again given that the owners agree, I would
like to release the dataset under something like a
free-for-all-purposes-with-attribution license. I have yet to research this.

> What's the usecase for a small dataset?

I had prototype testing in mind, so that authors can say "our method is slow,
so we only tested on the SmallGoDataset" instead of "we randomly took 1000
games from the BigGoDataset", but I assume there would be other use cases as
well. Anyway, I think the big and small datasets would not cause much
use-fragmentation, because the use cases for big vs. small would be
different. But maybe I am overthinking things and this would not be used
much.


Regards,
Josef

Re: [Computer-go] Standard Computer Go Datasets - Proposal

2015-11-13 Thread Dave Dyer

I was recently working on assigning final scores to completed games, using
the large data set from Badukmovies.com.  

My observation is that the size of the data set (50,000 games) is not
large enough to get good coverage of unusual situations occurring in real
games.

There's a definite need for a curated collection of atypical but
interesting games, probably manipulated to explore the boundaries
between interesting and normal.

Re: [Computer-go] Standard Computer Go Datasets - Proposal

2015-11-13 Thread Steven Clark
To answer the original question: yes, the curation of a dataset like this
would be hugely beneficial to the community. Look at what ImageNet has done
for computer vision. In fact, it might be good to emulate ImageNet further
and pre-split the dataset into a publicly-available training set, and a
hidden testing set, for truly objective comparisons between move-prediction
algorithms.
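
One way such a fixed split could be made reproducible is to derive it from a hash of a stable game identifier, so that no list of assignments has to be maintained by hand. This is only a sketch: the identifier format, the 10% test fraction and the function name are assumptions for illustration.

```python
import hashlib


def assign_split(game_id: str, test_fraction: float = 0.1) -> str:
    """Deterministically assign a game to 'train' or 'test' from its id."""
    digest = hashlib.sha256(game_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


# The same id always lands in the same split, on any machine.
for gid in ("game-000001", "game-000002", "game-000003"):
    print(gid, assign_split(gid))
```

For a hidden test set, the maintainer would publish only the games assigned to "train" and keep the rest private.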

If you undertake this, many thanks in advance!

On Fri, Nov 13, 2015 at 1:20 PM, Dave Dyer  wrote:

>
> I was recently working on assigning final scores to completed games, using
> the large data set from Badukmovies.com.
>
> My observation is that the size of the data set (50,000 games) is not
> large enough to get good coverage of unusual situations occurring in real
> games.
>
> There's a definite need for a curated collection of atypical but
> interesting games, probably manipulated to explore the boundaries
> between interesting and normal.

Re: [Computer-go] Standard Computer Go Datasets - Proposal

2015-11-13 Thread Petr Baudis
  Hi!

On Fri, Nov 13, 2015 at 09:46:54AM +, Darren Cook wrote:
> (I did wonder about storing player ranks, e.g. if a given position has a
> move chosen by only a single 9p, and you can then extract each follow-up
> position, you could extract a game. But, IMHO, you cannot regenerate any
> particular game collection this way. If it is a concern, it can be
> solved by only using a random 80% of moves from games.)

  Dropping player names and some positions is a nice idea - especially,
from a moral standpoint, if the collection includes a prominent notice
encouraging voluntary donations by the users to the source collection,
e.g. GoGoD.

  (A technical note: you want info about the last and second-to-last move
in the position, as that's a feature that's often used in patterns.
Plus, bridging over just 1-3 moves seems pretty easy to do by brute
force.  A better scheme might be to drop, say, a block of 20 moves
starting at move 40-80 at random.)
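
  The block-dropping scheme could be sketched as follows. The function name and the representation of a game as a plain list of moves are assumptions for the example; the returned pairs are the (move number, move) entries for which position records would still be emitted.

```python
import random


def drop_block(moves, block_len=20, start_min=40, start_max=80, rng=None):
    """Drop one block of consecutive moves starting at a random move number.

    Move numbers are 1-based. Games shorter than the chosen start lose
    nothing, and games ending inside the block lose fewer moves.
    """
    rng = rng or random.Random()
    start = rng.randint(start_min, start_max)  # inclusive on both ends
    kept = [(n, mv) for n, mv in enumerate(moves, start=1)
            if not (start <= n < start + block_len)]
    return start, kept


game = ["move%d" % n for n in range(1, 201)]   # stand-in for a real game
start, kept = drop_block(game, rng=random.Random(0))
print("dropped moves", start, "to", start + 19, "and kept", len(kept))
```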

  I think a good question is what other uses besides learning move
patterns do people envision.

-- 
Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton

Re: [Computer-go] Standard Computer Go Datasets - Proposal

2015-11-13 Thread Josef Moudrik
On Fri, Nov 13, 2015 at 11:16 AM Erik van der Werf 
wrote:

> On Fri, Nov 13, 2015 at 10:46 AM, Darren Cook  wrote:
>>
>> The advantages of storing games:
>>   * accountability/traceability
>>   * for programs who want to learn sequences of moves.
>>
>
> Another advantage of storing games is that it is much more efficient; you
> only have to encode one move per position.
>
> Erik
>

Yes,
I think that having full games would be much more useful. The anonymization
I had in mind would include hiding information not important for
computer processing such as file-names, player names, dates, ranks,
comments (given that the dataset would ensure consistent "balanced"
distribution). Like this, the database would have no (or much less) use for
human study.
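
A rough sketch of what such anonymization could look like for a single SGF record. The list of properties to strip is an assumption: PB/PW, BR/WR, DT and C cover player names, ranks, dates and comments as listed above, while the remaining entries (event, place, game name and so on) are my guess at other identifying fields. File names would have to be handled separately, for example by renaming files to a running number.

```python
import re

# Standard SGF property names assumed to carry identifying information.
IDENTIFYING = {
    "PB", "PW",              # player names
    "BR", "WR",              # ranks
    "DT",                    # date
    "C", "GC",               # comments
    "BT", "WT", "EV", "RO",  # teams, event, round (assumed identifying)
    "PC", "GN", "SO", "US", "AN", "CP",
}

# A property identifier followed by one or more bracketed values;
# values may contain an escaped "\]".
PROP_RE = re.compile(r"([A-Z]+)((?:\s*\[(?:\\.|[^\]\\])*\])+)", re.DOTALL)


def anonymize(sgf_text: str) -> str:
    """Remove identifying properties, keeping moves and game settings."""
    def _strip(match):
        return "" if match.group(1) in IDENTIFYING else match.group(0)
    return PROP_RE.sub(_strip, sgf_text)


record = "(;GM[1]SZ[19]PB[Alice]PW[Bob]BR[9p]DT[2015-11-13]KM[6.5];B[pd];W[dp])"
print(anonymize(record))
```

This relies on the input being well-formed SGF (closing brackets inside values escaped), so records with SGF errors would need to be fixed or rejected first.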