On 1 Oct 2015, at 21:10, Stéphane Corlosquet <scorlosq...@gmail.com> wrote:
Hi Denny,
This is great work! Who is Tpt?
Steph.
On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić <vrande...@google.com> wrote:
Hi all,
as you know, Tpt has been working as an intern this summer at Google. He
finished his work a few weeks ago and I am happy to announce today the
publication of all scripts and the resulting data he has been working on.
Additionally, we publish a few novel visualizations of the data in Wikidata and
Freebase. We are still working on the actual report summarizing the effort and
providing numbers on its effectiveness and progress. This will take another few
weeks.
First, thanks to Tpt for his amazing work! I had not expected to see such rich
results. He exceeded my expectations by far, and produced much more
transferable data than I expected. Additionally, he also worked on the
primary sources tool directly and helped Marco Fossati to upload a second,
sports-related dataset (you can select that by clicking on the gears icon next
to the Freebase item link in the sidebar on Wikidata, when you switch on the
Primary Sources tool).
The scripts that were created and used can be found here:
https://github.com/google/freebase-wikidata-converter
All scripts are released under the Apache license v2.
The following data files are also released. All data is released under the CC0
license. To make this explicit, a comment stating the copyright and the license
has been added to the start of each file; if any script dealing with the files
hiccups due to that line, simply remove it.
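As a concrete illustration, here is a minimal Python sketch for reading one of the gzipped files while skipping that leading license comment (the exact comment marker is an assumption; check the first line of the file you download):

```python
import gzip

def read_data_lines(path):
    """Yield data lines from a gzipped TSV, skipping leading '#' comment
    lines such as the copyright/license header mentioned above."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue  # license/copyright comment line
            yield line.rstrip("\n")
```

This avoids having to modify the downloaded files in place.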
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-missing.tsv.gz
This file contains the actual missing statements, including URLs for sources.
It was filtered against statements already existing in Wikidata, and the
statements are mapped to Wikidata IDs. It contains about 14.3M statements
(214MB gzipped, 831MB unzipped). These were created using the mappings below in
addition to the mappings already in Wikidata. The quality of these statements
is rather mixed.
Additional datasets that we know meet a higher quality bar have been previously
released and uploaded directly to Wikidata by Tpt, following community
consultation.
https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.pairs.gz
Contains additional mappings between Freebase MIDs and Wikidata QIDs which are
not available in Wikidata. These mappings are based on statistical methods and
single interwiki links. Unlike the first set of mappings we created and
published previously (which required at least multiple interwiki links), these
mappings are expected to be of lower quality - sufficient for a manual
process, but probably not sufficient for an automatic upload. This file
contains about 3.4M mappings (30MB gzipped, 64MB unzipped).
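Loading these pairs into memory is straightforward. A hedged sketch, assuming one tab-separated MID/QID pair per line (the example MID and QID in the test are illustrative, not taken from the file):

```python
import gzip

def load_mappings(path):
    """Load Freebase-MID -> Wikidata-QID pairs from a gzipped file.
    Assumes one tab-separated pair per line; lines starting with '#'
    (the license header) and blank lines are skipped."""
    mapping = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            mid, qid = line.rstrip("\n").split("\t")[:2]
            mapping[mid] = qid
    return mapping
```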
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels.tsv.gz
This file includes labels and aliases that seem to be currently missing from
Wikidata items. The quality of these labels is undetermined. The file
contains about 860k labels in about 160 languages, with 33 languages having
more than 10k labels each (14MB gzipped, 32MB unzipped).
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-missing.tsv.gz
This is an interesting file, as it includes a quality signal for the statements
in Freebase. It contains ordered pairs of Freebase MIDs and properties, each
indicating that the given pair went through a review process and is therefore
likely of higher quality on average. Only pairs that are missing from Wikidata
are included. The file includes about 1.4M pairs, and it can be used for
importing part of the data directly (6MB gzipped, 52MB unzipped).
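As a sketch of that use, one could restrict the big statements file to the reviewed pairs. The tuple layout below is an assumption for illustration, not the actual column format of the released TSVs:

```python
def filter_reviewed(statements, reviewed_pairs):
    """Keep only statements whose (subject MID, property) pair went
    through the review process.

    `statements` is an iterable of (mid, prop, value) tuples;
    `reviewed_pairs` is a set of (mid, prop) tuples."""
    return [s for s in statements if (s[0], s[1]) in reviewed_pairs]
```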
Now anyone can take the statements, analyse them, slice and dice them, upload
them, use them for their own tools and games, etc. They remain available through
the primary sources tool as well, which has already led to several thousand new
statements in the last few weeks.
Additionally, in the last few days of his internship, Tpt and I created a few
visualizations of the current data in Wikidata and in Freebase.
First, the following is a visualization of the whole of Wikidata:
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-color.png
The visualization needs a bit of explanation, I guess. The y-axis (up/down)
represents time, the x-axis (left/right) represents space / geolocation. The
further down, the closer you are to the present; the further up, the further
back you go in the past. Time is given on a non-linear scale - the 20th century
gets much more space than the 1st century. The x-axis represents longitude,
with the prime meridian in the center of the image.
Every item is placed at its longitude (averaged, if there are several) and at
the earliest point in time mentioned on the item. For items lacking either
value, neighbouring items propagate their values to them (averaging where
necessary). This is repeated until the values are saturated.
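The propagation step can be sketched roughly like this - a simplified Python sketch of the idea, not the actual implementation, which lives in the repository linked above:

```python
def propagate(values, neighbours):
    """Iteratively fill missing item values by averaging the values of
    neighbouring items, repeating until no further item can be filled.

    `values`: dict mapping item -> float (known values only);
    `neighbours`: dict mapping item -> list of adjacent items."""
    values = dict(values)  # do not mutate the caller's dict
    changed = True
    while changed:
        changed = False
        for item, adj in neighbours.items():
            if item in values:
                continue  # value already known or already propagated
            known = [values[n] for n in adj if n in values]
            if known:
                values[item] = sum(known) / len(known)
                changed = True
    return values
```

In the real visualization this is done independently for longitude and for time, so every item ends up with a position on both axes.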
To understand this a bit better, the following image offers a supporting grid:
each horizontal line represents a century (back to the first century), and each
vertical line represents a meridian (with London in the middle of the graph).
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-grid-color.png
The same visualizations have also been created for Freebase:
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-color.png
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-grid-color.png
In order to compare the two graphs, we also overlaid them on each other. I
will leave the interpretation to you, but you can easily see the strengths and
weaknesses of both knowledge bases.
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-red-freebase-green.png
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-red-wikidata-green.png
The programs for creating the visualizations are all available in the GitHub
repository mentioned above (plenty of RAM is recommended to run them).
Enjoy the visualizations, the data and the scripts! Tpt and I are available to
answer questions. I hope this will help with understanding and analysing some
of the results of the work that we did this summer.
Cheers,
Denny
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
--
Steph.