Hi Denny,

This is great work! Who is Tpt?

Steph.

On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić <vrande...@google.com>
wrote:

> Hi all,
>
> as you know, Tpt has been working as an intern this summer at Google. He
> finished his work a few weeks ago and I am happy to announce today the
> publication of all scripts and the resulting data he has been working on.
> Additionally, we publish a few novel visualizations of the data in Wikidata
> and Freebase. We are still working on the actual report summarizing the
> effort and providing numbers on its effectiveness and progress. This will
> take another few weeks.
>
> First, thanks to Tpt for his amazing work! I had not expected to see such
> rich results. He has exceeded my expectations by far, and produced much
> more transferable data than I expected. Additionally, he also worked
> directly on the Primary Sources tool and helped Marco Fossati to upload a
> second, sports-related dataset (you can select that by clicking on the
> gears icon next to the Freebase item link in the sidebar on Wikidata, when
> you switch on the Primary Sources tool).
>
> The scripts that were created and used can be found here:
>
> https://github.com/google/freebase-wikidata-converter
>
> All scripts are released under the Apache license v2.
>
> The following data files are also released. All data is released under the
> CC0 license (in order to make this explicit, a comment has been added to
> the start of each file, stating the copyright and the license. If any
> script dealing with the files hiccups due to that line, simply remove the
> first line).
>
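
A minimal sketch of reading one of these files while dropping the license comment, as suggested above. It assumes the files are gzipped, tab-separated text (inferred from the .tsv.gz extension); the function name iter_rows and its skip_first_line parameter are made up for illustration and are not part of the released scripts.

```python
import csv
import gzip
from typing import Iterator, List


def iter_rows(path: str, skip_first_line: bool = True) -> Iterator[List[str]]:
    """Yield tab-separated rows from a gzipped file, optionally dropping line 1."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        if skip_first_line:
            # Assumption: the first line is the copyright/license comment.
            next(handle, None)
        # QUOTE_NONE keeps quote characters inside values untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

Passing skip_first_line=False returns the comment line as an ordinary row, which may be handy for checking what the header actually looks like before discarding it.
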
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-missing.tsv.gz
> The actual missing statements, including URLs for sources, are in this
> file. This was filtered against statements already existing in Wikidata,
> and the statements are mapped to Wikidata IDs. This contains about 14.3M
> statements (214MB gzipped, 831MB unzipped). These are created using the
> mappings below in addition to the mappings already in Wikidata. The quality
> of these statements is rather mixed.
>
> Additional datasets that we know meet a higher quality bar have been
> previously released and uploaded directly to Wikidata by Tpt, following
> community consultation.
>
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.pairs.gz
> Contains additional mappings between Freebase MIDs and Wikidata QIDs,
> which are not available in Wikidata. These are mappings based on
> statistical methods and single interwiki links. Unlike the first set of
> mappings we had created and published previously (which required multiple
> interwiki links at least), these mappings are expected to have a lower
> quality - sufficient for a manual process, but probably not sufficient for
> an automatic upload. This contains about 3.4M mappings (30 MB gzipped, 64MB
> unzipped).
>
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels.tsv.gz
> This file includes labels and aliases for Wikidata items which seem to be
> currently missing. The quality of these labels is undetermined. The file
> contains about 860k labels in about 160 languages, with 33 languages having
> more than 10k labels each (14MB gzipped, 32MB unzipped).
>
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-missing.tsv.gz
> This is an interesting file as it includes a quality signal for the
> statements in Freebase. What you will find here are ordered pairs of
> Freebase MIDs and properties, each indicating that the given pair went
> through a review process and likely has a higher quality on average.
> This is only for those pairs that are missing from Wikidata. The file
> includes about 1.4M pairs, and this can be used for importing part of the
> data directly (6MB gzipped, 52MB unzipped).
>
> Now anyone can take the statements, analyse them, slice and dice them,
> upload them, use them for your own tools and games, etc. They remain
> available through the primary sources tool as well, which has already led
> to several thousand new statements in the last few weeks.
>
> Additionally, Tpt and I created in the last few days of his internship a
> few visualizations of the current data in Wikidata and in Freebase.
>
> First, the following is a visualization of the whole of Wikidata:
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-color.png
>
> The visualization needs a bit of explanation, I guess. The y-axis
> (up/down) represents time, the x-axis (left/right) represents space /
> geolocation. The further down, the closer you are to the present, the
> further up the more you go in the past. Time is given in a rational scale -
> the 20th century gets much more space than the 1st century. The x-axis
> represents longitude, with the prime meridian in the center of the image.
>
> Every item is placed at its longitude (averaged, if several) and at its
> earliest point of time mentioned on the item. For items without either,
> neighbouring items propagate their value to them (averaging, if necessary).
> This is done repeatedly until the items are saturated.
>
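
The propagation step described above could look roughly like the following. This is only a sketch under assumptions, not the code from the repository: it treats "neighbouring items" as items linked to each other, handles a single coordinate (longitude or time) at a time, and averages plainly without accounting for longitude wrap-around at 180 degrees. The names propagate, values and neighbours are made up for illustration.

```python
from typing import Dict, List, Optional


def propagate(values: Dict[str, Optional[float]],
              neighbours: Dict[str, List[str]]) -> Dict[str, Optional[float]]:
    """Fill missing values with the average of the neighbours' known values,
    repeating until a full pass fills nothing new (saturation)."""
    filled = dict(values)
    while True:
        updates = {}
        for item, value in filled.items():
            if value is not None:
                continue  # already has its own or a propagated value
            known = [filled[n] for n in neighbours.get(item, [])
                     if filled.get(n) is not None]
            if known:
                updates[item] = sum(known) / len(known)
        if not updates:
            return filled  # saturated; unreachable items stay None
        # Apply the whole round at once so each pass only sees values
        # that were known at its start.
        filled.update(updates)


example = propagate(
    {"a": 10.0, "b": None, "c": 30.0, "d": None, "e": None},
    {"b": ["a", "c"], "d": ["b"], "e": []},
)
```

Here "b" is filled from "a" and "c" in the first round, "d" picks up the value of "b" in the second, and "e" has no neighbours and stays empty.
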
> In order to understand that a bit better, the following image offers a
> supporting grid: each line from left to right represents a century (up to
> the first century), and each line from top to bottom represents a meridian
> (with London in the middle of the graph).
>
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-grid-color.png
>
> The same visualizations have also been created for Freebase:
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-color.png
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-grid-color.png
>
> In order to compare the two graphs, we also overlaid them over each other.
> I will leave the interpretation to you, but you can easily see the
> strengths and weaknesses of both knowledge bases.
>
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-red-freebase-green.png
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-red-wikidata-green.png
>
> The programs for creating the visualizations are all available in the
> GitHub repository mentioned above (plenty of RAM is recommended to run them).
>
> Enjoy the visualizations, the data and the scripts! Tpt and I are available
> to answer questions. I hope this will help with understanding and analysing
> some of the results of the work that we did this summer.
>
> Cheers,
> Denny
>
> _______________________________________________
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>


-- 
Steph.