Hi Denny,

This is great work! Who is Tpt?

Steph.

On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić <vrande...@google.com> wrote:
> Hi all,
>
> as you know, Tpt has been working as an intern this summer at Google. He
> finished his work a few weeks ago and I am happy to announce today the
> publication of all scripts and the resulting data he has been working on.
> Additionally, we publish a few novel visualizations of the data in Wikidata
> and Freebase. We are still working on the actual report summarizing the
> effort and providing numbers on its effectiveness and progress. This will
> take another few weeks.
>
> First, thanks to Tpt for his amazing work! I did not expect to see such
> rich results. He has exceeded my expectations by far, and produced much
> more transferable data than I expected. Additionally, he also worked
> on the primary sources tool directly and helped Marco Fossati to upload a
> second, sports-related dataset (you can select that by clicking on the
> gears icon next to the Freebase item link in the sidebar on Wikidata, when
> you switch on the Primary Sources tool).
>
> The scripts that were created and used can be found here:
>
> https://github.com/google/freebase-wikidata-converter
>
> All scripts are released under the Apache license v2.
>
> The following data files are also released. All data is released under the
> CC0 license (in order to make this explicit, a comment has been added to
> the start of each file, stating the copyright and the license. If any
> script dealing with the files hiccups due to that line, simply remove the
> first line).
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-missing.tsv.gz
> The actual missing statements, including URLs for sources, are in this
> file. This was filtered against statements already existing in Wikidata,
> and the statements are mapped to Wikidata IDs. This contains about 14.3M
> statements (214MB gzipped, 831MB unzipped).
> These are created using the mappings below in addition to the mappings
> already in Wikidata. The quality of these statements is rather mixed.
>
> Additional datasets that we know meet a higher quality bar have been
> previously released and uploaded directly to Wikidata by Tpt, following
> community consultation.
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.pairs.gz
> Contains additional mappings between Freebase MIDs and Wikidata QIDs,
> which are not available in Wikidata. These are mappings based on
> statistical methods and single interwiki links. Unlike the first set of
> mappings we had created and published previously (which required multiple
> interwiki links at least), these mappings are expected to have a lower
> quality - sufficient for a manual process, but probably not sufficient for
> an automatic upload. This contains about 3.4M mappings (30MB gzipped, 64MB
> unzipped).
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels.tsv.gz
> This file includes labels and aliases for Wikidata items which seem to be
> currently missing. The quality of these labels is undetermined. The file
> contains about 860k labels in about 160 languages, with 33 languages having
> more than 10k labels each (14MB gzipped, 32MB unzipped).
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-missing.tsv.gz
> This is an interesting file as it includes a quality signal for the
> statements in Freebase. What you will find here are ordered pairs of
> Freebase MIDs and properties, each indicating that the given pair went
> through a review process and likely has a higher quality on average.
> This is only for those pairs that are missing from Wikidata. The file
> includes about 1.4M pairs, and this can be used for importing part of the
> data directly (6MB gzipped, 52MB unzipped).
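
For anyone scripting against the gzipped files listed above, here is a minimal Python sketch of reading one while dropping the license comment that the announcement says sits on the first line. The function name `read_rows`, the UTF-8 encoding, and the assumption that every remaining line is a plain tab-separated record are illustrative guesses; the message does not describe the column layout, so rows are returned as raw lists of strings.

```python
import gzip
from typing import Iterator, List


def read_rows(path: str, skip_first_line: bool = True) -> Iterator[List[str]]:
    """Yield each non-empty line of a gzipped file as a list of tab-separated fields.

    Assumption: the file is UTF-8 text and its first line is the
    copyright/license comment mentioned in the announcement.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        if skip_first_line:
            next(handle, None)  # discard the license comment line
        for line in handle:
            line = line.rstrip("\n")
            if line:  # ignore blank lines
                yield line.split("\t")
```

After downloading a file, something like `for row in read_rows("freebase-mapped-missing.tsv.gz"): ...` streams it row by row, which avoids loading the 831MB of uncompressed data into memory at once.
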
>
> Now anyone can take the statements, analyse them, slice and dice them,
> upload them, use them for your own tools and games, etc. They remain
> available through the primary sources tool as well, which has already led
> to several thousand new statements in the last few weeks.
>
> Additionally, Tpt and I created in the last few days of his internship a
> few visualizations of the current data in Wikidata and in Freebase.
>
> First, the following is a visualization of the whole of Wikidata:
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-color.png
>
> The visualization needs a bit of explanation, I guess. The y-axis
> (up/down) represents time, the x-axis (left/right) represents space /
> geolocation. The further down, the closer you are to the present; the
> further up, the further you go into the past. Time is given in a rational
> scale - the 20th century gets much more space than the 1st century. The
> x-axis represents longitude, with the prime meridian in the center of the
> image.
>
> Every item is put at its longitude (averaged, if several) and at its
> earliest point of time mentioned on the item. For items without either,
> neighbouring items propagate their value to them (averaging, if necessary).
> This is done repeatedly until the items are saturated.
>
> In order to understand that a bit better, the following image offers a
> supporting grid: each line from left to right represents a century (up to
> the first century), and each line from top to bottom represents a meridian
> (with London in the middle of the graph).
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-grid-color.png
>
> The same visualizations have also been created for Freebase:
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-color.png
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-grid-color.png
>
> In order to compare the two graphs, we also overlaid them over each other.
> I will leave the interpretation to you, but you can easily see the
> strengths and weaknesses of both knowledge bases.
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-red-freebase-green.png
>
> https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-red-wikidata-green.png
>
> The programs for creating the visualizations are all available in the
> GitHub repository mentioned above (plenty of RAM is recommended to run them).
>
> Enjoy the visualizations, the data and the scripts! Tpt and I are available
> to answer questions. I hope this will help with understanding and analysing
> some of the results of the work that we did this summer.
>
> Cheers,
> Denny
>
> _______________________________________________
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>

--
Steph.
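
The placement rule described in the quoted message (items lacking a longitude or date receive the average of their neighbours' values, repeated until saturation) can be sketched roughly as follows. This is not the code from the repository; it is a minimal illustration that assumes "neighbouring" means items linked to each other, that values are propagated in synchronous rounds, and that a value stays fixed once assigned. The names `propagate`, `neighbours` and `known` are hypothetical.

```python
from typing import Dict, Iterable


def propagate(neighbours: Dict[str, Iterable[str]],
              known: Dict[str, float]) -> Dict[str, float]:
    """Fill in missing values by averaging over neighbours that already have one.

    Assumption: rounds are synchronous, so values assigned in a round are
    only visible to other items from the next round on. Items that can
    never be reached from a known value stay absent from the result.
    """
    values = dict(known)
    while True:
        updates = {}
        for item, linked in neighbours.items():
            if item in values:
                continue  # already placed, keep its value
            available = [values[n] for n in linked if n in values]
            if available:
                updates[item] = sum(available) / len(available)
        if not updates:
            return values  # saturated: no further item can be placed
        values.update(updates)
```

A plain arithmetic mean is used here for simplicity; for longitudes it ignores wrap-around at the 180th meridian, which a real implementation might need to handle.
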